Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (8): 2171-2181. DOI: 10.11772/j.issn.1001-9081.2019122204

• Artificial Intelligence •

Stylistic multiple features mining based on attention network

WU Haiyan, LIU Ying

  1. School of Humanities, Tsinghua University, Beijing 100084, China
  • Received: 2020-01-02 Revised: 2020-03-16 Online: 2020-08-10 Published: 2020-08-21
  • Corresponding author: LIU Ying (born 1969), female, from Chifeng, Inner Mongolia; professor, Ph.D., CCF member. Research interests: corpus linguistics, computational linguistics, and machine translation. yingliu@mail.tsinghua.edu.cn
  • About the author: WU Haiyan (born 1985), female, from Yan'an, Shaanxi; Ph.D. candidate, CCF member. Research interest: natural language processing.
  • Supported by:
    This work is partially supported by the 2018 National Major Program of Philosophy and Social Science Fund (18ZDA238), the Ministry of Education of China Project of Humanities and Social Sciences (17YJAZH056), and the Beijing Social Science Fund (16YYB021).

Abstract: Mining the features that characterize different registers in a large-scale corpus is difficult and demands considerable expert knowledge and manual effort. To address this problem, a method was proposed for automatically mining the features that distinguish registers. First, each register was represented by words, parts of speech, punctuation marks, their bigrams, syntactic structures, and multiple combinations of these features. Then, a model combining an attention mechanism with a Multi-Layer Perceptron (MLP), i.e. an attention network, was used to classify texts into three registers (novel, news, and textbook), and the important features that help to distinguish the registers were extracted automatically in the process. Finally, further analysis of these features yielded the characteristics of the different registers and some linguistic conclusions. Experimental results show that novels, news, and textbooks differ significantly in words, topic words, word dependencies, parts of speech, punctuation marks, and syntactic structures. This further indicates that people's use of vocabulary, parts of speech, punctuation, and syntax naturally varies with the audience, purpose, content, and setting of communication.

Key words: stylistic feature mining, discrimination measure of stylistic features, attention mechanism, Multi-Layer Perceptron (MLP)
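
The idea of classifying a text with an attention layer followed by an MLP, and reading the attention weights as a rough indicator of which features matter, can be illustrated with a minimal sketch. The abstract does not give the architecture, so everything here is an assumption made for illustration: the class name `AttentionMLP`, the single query vector used for scoring, the layer sizes, the feature names, and the use of untrained random weights. It shows the data flow only, not the authors' model.

```python
import numpy as np

REGISTERS = ["novel", "news", "textbook"]


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


class AttentionMLP:
    """Toy attention pooling + one-hidden-layer MLP (random, untrained weights)."""

    def __init__(self, vocab_size, embed_dim=16, hidden_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
        # Hypothetical attention parameter: one learned query vector.
        self.query = rng.normal(scale=0.1, size=embed_dim)
        self.w1 = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(scale=0.1, size=(hidden_dim, len(REGISTERS)))
        self.b2 = np.zeros(len(REGISTERS))

    def forward(self, feature_ids):
        """Return (class probabilities, attention weight per input feature)."""
        vectors = self.embed[feature_ids]             # (n_features, embed_dim)
        weights = softmax(vectors @ self.query)       # one weight per feature
        pooled = weights @ vectors                    # attention-weighted sum
        hidden = np.maximum(0.0, pooled @ self.w1 + self.b1)  # ReLU layer
        probs = softmax(hidden @ self.w2 + self.b2)
        return probs, weights


def top_features(feature_names, weights, k=3):
    """Rank the input features of one text by attention weight, highest first."""
    order = np.argsort(weights)[::-1][:k]
    return [(feature_names[i], float(weights[i])) for i in order]


# Illustrative feature inventory mixing words, POS tags, punctuation and bigrams.
vocab = ["word:said", "pos:NN", "punct:!", "bigram:NN_VV", "dep:nsubj"]
model = AttentionMLP(vocab_size=len(vocab))

text_ids = np.array([0, 2, 4])                        # features seen in one text
probs, weights = model.forward(text_ids)
predicted = REGISTERS[int(np.argmax(probs))]
ranking = top_features([vocab[i] for i in text_ids], weights, k=2)
```

After training, features that repeatedly receive high attention weight for texts of one register would be the candidates for the kind of linguistic analysis the abstract describes; with the random weights used here the ranking carries no meaning.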

CLC number: