Authorship identification of text based on attention mechanism

doi:10.11772/j.issn.1001-9081.2020101528

Abstract

The accuracy of authorship identification based on deep neural networks drops significantly when the number of candidate authors is large. To improve accuracy, a neural network consisting of fast text classification (fastText) and an attention layer was proposed and combined with continuous Part-Of-Speech (POS) n-gram features for authorship identification of Chinese novels. In experiments comparing it with Text Convolutional Neural Network (TextCNN), Text Recurrent Neural Network (TextRNN), Long Short-Term Memory (LSTM) network and fastText, the proposed model achieved the highest classification accuracy. Relative to the fastText model, introducing the attention mechanism raised the accuracy obtained with the different POS n-gram features by 2.14 percentage points on average. The model also retains the speed and efficiency of fastText, and the text features it uses can be applied to other languages.

Key words: authorship identification, Part-Of-Speech (POS) n-gram, neural network, fast text classification (fastText), attention mechanism
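
As a rough illustration of the idea summarized in the abstract, the sketch below extracts POS n-grams from a tag sequence and pools their embeddings with attention weights rather than the uniform average that fastText uses. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the attention formulation, so the dot-product scoring, the function names `pos_ngrams` and `attention_pool`, the POS tags, and the randomly initialised parameters are all hypothetical. The n-grams are taken here as contiguous tag sequences, which is also an assumption.

```python
import math
import random


def pos_ngrams(tags, n):
    """Return the contiguous POS n-grams of a tag sequence as joined strings."""
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(embeddings, query):
    """Sum the embeddings, weighting each by softmax(embedding . query).

    fastText averages its input embeddings uniformly; here each n-gram gets
    its own weight. Dot-product scoring against a single query vector is an
    assumption made for illustration only.
    """
    scores = [sum(e_d * q_d for e_d, q_d in zip(e, query)) for e in embeddings]
    weights = softmax(scores)
    dim = len(query)
    pooled = [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
    return pooled, weights


# Hypothetical POS tags for one segmented sentence (the tagset is illustrative).
tags = ["r", "v", "n", "u", "n"]
bigrams = pos_ngrams(tags, 2)

# Random values stand in for a trained embedding table and attention query.
rng = random.Random(0)
dim = 4
table = {g: [rng.uniform(-1, 1) for _ in range(dim)] for g in set(bigrams)}
query = [rng.uniform(-1, 1) for _ in range(dim)]

pooled, weights = attention_pool([table[g] for g in bigrams], query)
```

In a full model, the pooled vector would feed a linear softmax layer over the candidate authors, and the embedding table and query would be trained jointly with it. With an all-zero query the weights become uniform, so the pooling reduces to the plain average.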

CLC Number: TP391

ZHANG Yang, JIANG Minghu. Authorship identification of text based on attention mechanism[J]. Journal of Computer Applications, 2021, 41(7): 1897-1901.