Stylistic multiple features mining based on attention network

doi:10.11772/j.issn.1001-9081.2019122204

Journal of Computer Applications, 2020, Vol. 40, Issue (8): 2171-2181. DOI: 10.11772/j.issn.1001-9081.2019122204

Section: Artificial Intelligence

WU Haiyan, LIU Ying

School of Humanities, Tsinghua University, Beijing 100084, China

Received: 2020-01-02 Revised: 2020-03-16 Online: 2020-08-10 Published: 2020-08-21
Corresponding author: LIU Ying (born 1969), female, from Chifeng, Inner Mongolia; professor, Ph.D., CCF member. Research interests: corpus linguistics, computational linguistics, machine translation. yingliu@mail.tsinghua.edu.cn
About the author: WU Haiyan (born 1985), female, from Yan'an, Shaanxi; Ph.D. candidate, CCF member. Research interest: natural language processing.
Supported by:
This work is partially supported by the 2018 National Major Program of Philosophy and Social Science Fund (18ZDA238), the Ministry of Education of China Project of Humanities and Social Sciences (17YJAZH056), and the Beijing Social Science Fund (16YYB021).

Abstract

Abstract: Mining the features that distinguish registers in a large-scale corpus is difficult and demands considerable expert knowledge and manual effort. To address this, a method is proposed that automatically mines features capable of distinguishing different registers. First, each register is represented by words, parts of speech, punctuation marks, the bigrams of each of these, syntactic structures, and several combined features. Then, a model combining an attention mechanism with a Multi-Layer Perceptron (MLP), i.e. an attention network, classifies texts into three registers (novel, news, and textbook), and the important features that help distinguish the registers are extracted automatically in the process. Finally, further analysis of these features yields the characteristics of each register and some linguistic conclusions. Experimental results show that novels, news, and textbooks differ significantly in words, topic words, word dependencies, parts of speech, punctuation, and syntactic structures. This further indicates that people's use of vocabulary, parts of speech, punctuation, and syntax naturally varies with the audience, purpose, content, and setting of communication.

Key words: stylistic feature mining, discrimination measure of stylistic feature, attention mechanism, Multi-Layer Perceptron (MLP)
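The pipeline outlined in the abstract can be illustrated in miniature. The Python sketch below is not the authors' implementation; it only shows the general shape of the idea under several assumptions. The feature-name prefixes, the `PU` punctuation tag, the vector sizes, the tanh-based attention score, and the random untrained weights are all hypothetical, and syntactic-structure features are left out. It counts unigram and bigram features over words and part-of-speech tags, assigns each feature an attention weight, feeds the attention-weighted sum into a small MLP over the three registers, and ranks features by attention weight.

```python
import math
import random
from collections import Counter

LABELS = ["novel", "news", "textbook"]  # the three registers named in the abstract


def extract_features(tagged_tokens):
    """Count unigram and bigram features from (word, pos) pairs.

    Assumption: punctuation carries the hypothetical POS tag "PU".
    """
    feats = Counter()
    for word, pos in tagged_tokens:
        kind = "punct" if pos == "PU" else "word"
        feats[f"{kind}:{word}"] += 1
        feats[f"pos:{pos}"] += 1
    # Bigrams over adjacent tokens (punctuation included) and adjacent POS tags.
    for (w1, p1), (w2, p2) in zip(tagged_tokens, tagged_tokens[1:]):
        feats[f"word2:{w1}_{w2}"] += 1
        feats[f"pos2:{p1}_{p2}"] += 1
    return feats


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class AttentionNetwork:
    """Untrained attention + MLP forward pass (random weights, illustration only)."""

    def __init__(self, vocab, dim=8, hidden=16, seed=0):
        rng = random.Random(seed)

        def rand_vec(n):
            return [rng.uniform(-0.5, 0.5) for _ in range(n)]

        self.emb = {name: rand_vec(dim) for name in vocab}  # one vector per feature
        self.query = rand_vec(dim)                          # attention query vector
        self.w1 = [rand_vec(dim) for _ in range(hidden)]    # MLP hidden layer
        self.w2 = [rand_vec(hidden) for _ in LABELS]        # MLP output layer

    def forward(self, feats):
        names = [name for name in feats if name in self.emb]
        if not names:
            raise ValueError("no known features in input")
        # Attention weight per feature: softmax over tanh(query . embedding).
        attn = softmax([math.tanh(dot(self.query, self.emb[n])) for n in names])
        # Document vector: attention-weighted sum of count-scaled embeddings.
        doc = [
            sum(a * math.log1p(feats[n]) * self.emb[n][k] for a, n in zip(attn, names))
            for k in range(len(self.query))
        ]
        hid = [math.tanh(dot(row, doc)) for row in self.w1]
        probs = softmax([dot(row, hid) for row in self.w2])
        return dict(zip(LABELS, probs)), dict(zip(names, attn))


# Toy input; a real corpus would supply segmented, POS-tagged text.
sample = [("he", "PN"), ("laughed", "VV"), (".", "PU")]
feats = extract_features(sample)
net = AttentionNetwork(vocab=feats)
probs, attention = net.forward(feats)
top_features = sorted(attention, key=attention.get, reverse=True)[:3]
print(top_features)
```

Because the weights here are random, the ranking printed at the end carries no linguistic meaning. In a trained model, features with consistently high attention weights would be the candidates for register-distinguishing features.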

CLC Number: TP391.1

Cite this article: WU Haiyan, LIU Ying. Stylistic multiple features mining based on attention network[J]. Journal of Computer Applications, 2020, 40(8): 2171-2181.

