Target-dependent method for authorship attribution

doi:10.11772/j.issn.1001-9081.2019101768

Abstract

Authorship attribution is the task of determining the author of a given document. Traditional authorship attribution methods, however, are target-independent: they predict authorship without taking any constraint into account, which does not match real-world scenarios. To address this issue, a Target-Dependent method for Authorship Attribution (TDAA) was proposed. Firstly, the product ID corresponding to each user review was chosen as the constraint information. Secondly, Bidirectional Encoder Representations from Transformers (BERT) was used to extract pre-trained features of the review text, making the text modeling process more general. Thirdly, a Convolutional Neural Network (CNN) was used to extract deep features of the text. Finally, two fusion methods were proposed to combine the two kinds of information. Experimental results on the Amazon Movie_and_TV and CDs_and_Vinyl_5 datasets show that the proposed method improves accuracy by 4 to 5 percentage points over the comparison methods.

Key words: authorship attribution, target-dependent, Convolutional Neural Network (CNN), information fusion, pre-trained language model

CLC Number:

TP391.1

Yang LI, Wei ZHANG, Chen PENG. Target-dependent method for authorship attribution[J]. Journal of Computer Applications, 2020, 40(2): 473-478.

Figures/Tables 11

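Tab. 1 Symbol definition

Symbol	Description	Symbol	Description
$L$	Maximum text length	$B \in \mathbb{R}^{|C| \times d}$	Word embedding table
$m$	Number of convolution kernels	$E \in \mathbb{R}^{s \times L \times d}$	Document vector
$\sigma$	Activation function	$P \in \mathbb{R}^{b \times d}$	Product ID embedding table
$\mathrm{Conv2D}$	Two-dimensional convolution operation	$p$	Product ID vector
$d$	Word vector dimension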

Fig. 1 Pre-trained document feature extraction

Fig. 2 CNN based on document vector

Fig. 3 Earlier-stage fusion model

Fig. 4 Later-stage fusion model
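
The two fusion models named in Fig. 3 and Fig. 4 can be illustrated with a minimal sketch. The page gives no equations for them, so the definitions below are assumptions: earlier-stage fusion is taken to mean that the product ID vector $p$ joins the document vector before feature extraction, and later-stage fusion that $p$ is concatenated with the extracted text feature before classification. The helper names (`early_fusion`, `late_fusion`, `max_pool`) are hypothetical, and `max_pool` is only a stand-in for the CNN.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def early_fusion(doc: Matrix, p: Vector) -> Matrix:
    """Assumed earlier-stage fusion: prepend p as an extra row of the document."""
    if any(len(row) != len(p) for row in doc):
        raise ValueError("token vectors and product vector must share dimension d")
    return [list(p)] + [list(row) for row in doc]


def max_pool(doc: Matrix) -> Vector:
    """Stand-in for the CNN (an assumption): column-wise max over token positions."""
    return [max(column) for column in zip(*doc)]


def late_fusion(doc: Matrix, p: Vector) -> Vector:
    """Assumed later-stage fusion: concatenate the pooled text feature with p."""
    return max_pool(doc) + list(p)


# Toy document with L = 3 tokens and d = 2, plus a toy product ID vector.
doc = [[0.1, 0.4], [0.3, 0.2], [0.0, 0.5]]
p = [0.9, -0.1]

early = early_fusion(doc, p)  # (L + 1) x d matrix, passed on to feature extraction
late = late_fusion(doc, p)    # vector of length 2d, passed on to the classifier
```

In this sketch the earlier-stage variant lets the feature extractor see the product information, whereas the later-stage variant keeps text features and product features separate until the final layer.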

Tab. 2 Dataset statistics

Dataset	Number of products	Number of users	Reviews per user	Reviews per product	Total reviews
Movie reviews	250	610	37.37	91.17	22 793
CD reviews	600	800	51.27	38.45	30 763

Tab. 3 Neural network architecture and hyperparameters

Name	Number of layers	Value
Maximum length $L$	—	1 000
Vector dimension $d$	—	300
Convolution	3	$m = 300, w = [1, 2, 3]$
Fully connected	1	Number of classes
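
The hyperparameters in Tab. 3 determine the layer sizes, as the following sketch works out. It assumes a Kim-style text CNN: each of the three convolution layers uses one kernel width $w$ with $m$ kernels, slides over the $L$ token positions without padding, and is followed by max-over-time pooling. The page states none of these details, and ReLU is assumed for the activation $\sigma$. All function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def conv_output_length(L: int, w: int) -> int:
    """Number of positions a width-w kernel visits over L tokens (no padding)."""
    return L - w + 1


def pooled_feature_size(m: int, widths: Sequence[int]) -> int:
    """Size of the concatenated feature after max pooling, m kernels per width."""
    return m * len(widths)


def conv_max_pool(doc: Matrix, kernel: Matrix) -> float:
    """Apply one kernel of width w to an L x d document, then ReLU and max pooling."""
    w = len(kernel)
    scores = []
    for i in range(len(doc) - w + 1):
        s = sum(
            doc[i + j][k] * kernel[j][k]
            for j in range(w)
            for k in range(len(kernel[j]))
        )
        scores.append(max(0.0, s))  # ReLU is an assumption for sigma
    return max(scores)


# Sizes implied by Tab. 3 under the assumptions above.
lengths = [conv_output_length(1000, w) for w in (1, 2, 3)]
feature_size = pooled_feature_size(300, (1, 2, 3))

# Toy check with L = 3, d = 2 and a single kernel of width 2.
toy_doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_kernel = [[1.0, 0.0], [0.0, 1.0]]
toy_feature = conv_max_pool(toy_doc, toy_kernel)
```

Under these assumptions the fully connected layer receives a 900-dimensional vector and maps it to one score per candidate author.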

Tab. 4 Comparison of evaluation results of different methods on two datasets

Method	Movie review dataset			CD review dataset
Method	Acc	R_macro	F1_macro	Acc	R_macro	F1_macro
CNN-2	0.519	0.411	0.415	0.683	0.581	0.579
LSTM-1	0.363	0.262	0.259	0.464	0.362	0.363
SVM	0.452	0.354	0.351	0.619	0.523	0.521
RF	0.307	0.209	0.205	0.492	0.401	0.399
Syntax-CNN	0.505	0.401	0.405	0.656	0.566	0.565
LDA-S	0.285	0.188	0.186	0.349	0.251	0.252
CNN product	0.018	0.006	0.003	0.012	0.003	0.004
Earlier-stage fusion	0.556	0.449	0.443	0.708	0.612	0.608
Later-stage fusion	0.569	0.467	0.465	0.725	0.621	0.622
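
The three metrics reported in Tab. 4 can be computed as in the sketch below, which uses the standard definitions of accuracy, macro-averaged recall, and macro-averaged F1. That the authors computed them exactly this way is an assumption, as is averaging over the union of true and predicted labels. The function name `macro_metrics` and the toy labels are hypothetical.

```python
from typing import Hashable, Sequence, Tuple


def macro_metrics(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Tuple[float, float, float]:
    """Return (accuracy, macro recall, macro F1) for single-label predictions."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    acc = sum(t == p for t, p in pairs) / len(pairs)
    recalls, f1s = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1s.append(f1)
    # Macro averaging weights every author (class) equally.
    return acc, sum(recalls) / len(labels), sum(f1s) / len(labels)


# Toy example with three authors and six reviews.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
acc, r_macro, f1_macro = macro_metrics(y_true, y_pred)
```

Because macro averaging gives every author the same weight regardless of how many reviews they wrote, R_macro and F1_macro can fall well below Acc when some authors are rarely predicted correctly.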

Tab. 5 Impact of target-dependence information on Acc based on n-gram feature

Method	Movie reviews	CD reviews
CNN-2	0.519	0.682
Earlier-stage fusion	0.522	0.686
Later-stage fusion	0.540	0.706

Tab. 6 Impact of target-dependence information on Acc based on pre-trained feature

Method	Movie reviews	CD reviews
CNN-2	0.548	0.703
Earlier-stage fusion	0.554	0.710
Later-stage fusion	0.568	0.725

Fig. 5 Impact of n-gram length on Acc on the two datasets
