Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (7): 1897-1901.DOI: 10.11772/j.issn.1001-9081.2020101528

Special Issue: Artificial intelligence

• Artificial intelligence •

Authorship identification of text based on attention mechanism

ZHANG Yang, JIANG Minghu   

  1. School of Humanities, Tsinghua University, Beijing 100084, China
  • Received: 2020-10-08 Revised: 2020-12-15 Online: 2021-07-10 Published: 2021-01-27
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (62036001).

基于注意力机制的文本作者识别

张洋, 江铭虎   

  1. 清华大学 人文学院, 北京 100084
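  • Corresponding author: JIANG Minghu
  • About the authors: ZHANG Yang (born 1990), male, from Jinan, Shandong, Ph.D. candidate; main research interests: authorship identification, text classification, and sentiment analysis. JIANG Minghu (born 1962), male, from Suzhou, Jiangsu, professor, Ph.D.; main research interests: natural language processing, neural network language processing, pattern recognition, and artificial intelligence.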

Abstract: The accuracy of authorship identification based on deep neural networks decreases significantly when the number of candidate authors is large. To improve accuracy, a neural network consisting of fast text classification (fastText) and an attention layer was proposed and combined with continuous Part-Of-Speech (POS) n-gram features for authorship identification of Chinese novels. Experimental results show that the proposed model achieves the highest classification accuracy among the compared models: Text Convolutional Neural Network (TextCNN), Text Recurrent Neural Network (TextRNN), Long Short-Term Memory (LSTM) network and fastText. Compared with the fastText model, introducing the attention mechanism increases the accuracy obtained with the different POS n-gram features by 2.14 percentage points on average. The model also retains the speed and efficiency of fastText, and the text features it uses can be applied to other languages.

Key words: authorship identification, Part-Of-Speech (POS) n-gram, neural network, fast text classification (fastText), attention mechanism
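
The two ingredients named in the abstract, continuous POS n-gram features and an attention layer on top of a fastText-style averaged representation, can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`pos_ngrams`, `attention_pool`), the dot-product scoring against a single query vector, and the sample tag sequence are all assumptions made for illustration, since the abstract does not specify the exact attention form.

```python
import math


def pos_ngrams(tags, n):
    """Return the contiguous n-grams of a POS-tag sequence as tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(embeddings, query):
    """Pool feature embeddings into one text vector.

    Weights are softmax(dot(embedding, query)). This scoring function is an
    assumption; with an all-zero query it reduces to the plain average that
    fastText uses for its text representation.
    """
    scores = [sum(e_d * q_d for e_d, q_d in zip(e, query)) for e in embeddings]
    weights = softmax(scores)
    pooled = [
        sum(w * e[d] for w, e in zip(weights, embeddings))
        for d in range(len(query))
    ]
    return pooled, weights


# Hypothetical POS tags for one segmented sentence (tag set is illustrative).
tags = ["r", "v", "n", "u", "n"]
bigrams = pos_ngrams(tags, 2)

# Toy 2-dimensional embeddings, one per bigram, and a toy query vector.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
text_vector, attn_weights = attention_pool(toy_embeddings, [1.0, 0.0])
```

In a trained model the embeddings and the query vector would be learned parameters, and the pooled vector would feed a linear classifier over the candidate authors.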

