Automatic Chinese sentences group method based on multiple discriminant analysis

doi:10.11772/j.issn.1001-9081.2015.05.1314

Journal of Computer Applications, 2015, Vol. 35, Issue (5): 1314-1319. DOI: 10.11772/j.issn.1001-9081.2015.05.1314

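WANG Rongbo¹, LI Jie¹, HUANG Xiaoxi¹, ZHOU Changle¹,², CHEN Zhiqun¹, WANG Xiaohua¹

1. Institute of Cognitive and Intelligent Computing, Hangzhou Dianzi University, Hangzhou, Zhejiang 310018, China;
2. Department of Intelligent Science and Technology, Xiamen University, Xiamen, Fujian 361005, China

Received: 2014-12-05 Revised: 2014-12-24 Online: 2015-05-10 Published: 2015-05-14
Corresponding author: LI Jie
About the authors: WANG Rongbo (born 1978), male, from Yiwu, Zhejiang, associate professor, Ph.D., CCF member; research interests: natural language processing, discourse analysis. LI Jie (born 1989), male, from Wenzhou, Zhejiang, M.S. candidate; research interest: Chinese information processing. HUANG Xiaoxi (born 1979), male, from Wenzhou, Zhejiang, lecturer, Ph.D.; research interests: natural language processing, cognitive logic. ZHOU Changle (born 1959), male, from Taicang, Suzhou, professor, Ph.D.; research interests: artificial intelligence, Chinese information processing. CHEN Zhiqun (born 1973), male, from Nanchang, Jiangxi, associate professor, M.S.; research interests: Chinese information processing, language networks. WANG Xiaohua (born 1961), male, from Wenzhou, Zhejiang, professor; research interests: natural language processing, pattern recognition.
Supported by:
National Natural Science Foundation of China (61202281, 61103101); Youth Fund of the Humanities and Social Sciences Research Project of the Ministry of Education of China (10YJCZH052, 12YJCZH201).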

Abstract:

Existing work on Chinese sentence group segmentation lacks support from computational linguistic data and ignores discourse connectives, and current discourse analysis rarely treats the sentence group as a grammatical unit. To address these problems, this paper proposes an automatic Chinese sentence group segmentation method. Guided by Chinese sentence group theory, an annotated evaluation corpus for sentence group segmentation is constructed, and a set of evaluation functions J is designed on the basis of Multiple Discriminant Analysis (MDA) to segment sentence groups automatically. The experimental results show that introducing the segment length factor and the discourse connective factor improves segmentation performance, and that the Skip-Gram model works better than the traditional Vector Space Model (VSM): the correct segmentation rate P_μ reaches 85.37% and the segmentation error rate WindowDiff falls to 24.08%. The proposed method also segments sentence groups better than the original MDA method.

Key words: Chinese sentence group segmentation, Multiple Discriminant Analysis (MDA), discourse analysis, Skip-Gram model, discourse cohesion
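
The abstract reports results with window-based segmentation metrics. As a rough illustration of how such metrics are usually computed, the sketch below implements WindowDiff and the related P_k error in their commonly published form. It is not the authors' evaluation code: the boundary encoding (one 0/1 flag per gap between adjacent sentences), the function names, and the window size `k` are assumptions made for this example, and the abstract does not state how P_μ is derived or which window size was used.

```python
from typing import Sequence


def _validate(ref: Sequence[int], hyp: Sequence[int], k: int) -> None:
    """Check that both segmentations cover the same gaps and k fits."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same length")
    if not 1 <= k <= len(ref):
        raise ValueError("k must satisfy 1 <= k <= number of gaps")


def window_diff(ref: Sequence[int], hyp: Sequence[int], k: int) -> float:
    """WindowDiff: share of windows whose boundary counts differ.

    ref[i] == 1 means the reference places a boundary in the gap
    after sentence i (hypothetical encoding chosen for this sketch).
    """
    _validate(ref, hyp, k)
    n_windows = len(ref) - k + 1  # equals (number of sentences) - k
    errors = sum(
        1
        for i in range(n_windows)
        if sum(ref[i:i + k]) != sum(hyp[i:i + k])
    )
    return errors / n_windows


def p_k(ref: Sequence[int], hyp: Sequence[int], k: int) -> float:
    """P_k: share of windows where ref and hyp disagree on whether the
    two sentences k apart belong to the same segment."""
    _validate(ref, hyp, k)
    n_windows = len(ref) - k + 1
    errors = sum(
        1
        for i in range(n_windows)
        if (sum(ref[i:i + k]) == 0) != (sum(hyp[i:i + k]) == 0)
    )
    return errors / n_windows


# Nine sentences in three groups of three; the hypothesis adds one
# spurious boundary after the fourth sentence.
reference = [0, 0, 1, 0, 0, 1, 0, 0]
hypothesis = [0, 0, 1, 1, 0, 1, 0, 0]
wd_score = window_diff(reference, hypothesis, k=2)
pk_score = p_k(reference, hypothesis, k=2)
```

In this toy case WindowDiff flags two of the seven windows while P_k flags only one, because P_k does not count a window that contains the wrong number of boundaries as long as both segmentations agree that the window is split. A common convention, again an assumption here, is to set `k` to half the average reference segment length.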

CLC number:

TP391

WANG Rongbo, LI Jie, HUANG Xiaoxi, ZHOU Changle, CHEN Zhiqun, WANG Xiaohua. Automatic Chinese sentences group method based on multiple discriminant analysis[J]. Journal of Computer Applications, 2015, 35(5): 1314-1319.
