Journal of Computer Applications (计算机应用), 2015, Vol. 35, Issue (10): 2838-2842. DOI: 10.11772/j.issn.1001-9081.2015.10.2838

• Papers of the 15th China Conference on Machine Learning (CCML2015) •

Slight-pause marks boundary identification based on conditional random field

MO Yiwen (莫怡文)1, JI Donghong (姬东鸿)2, HUANG Jiangping (黄江平)2

  1. College of Chinese Language and Literature, Wuhan University, Wuhan Hubei 430072, China;
  2. Computer School, Wuhan University, Wuhan Hubei 430072, China
  • Received: 2015-05-29 Revised: 2015-08-08 Online: 2015-10-10 Published: 2015-10-14
  • Corresponding author: MO Yiwen (born 1978), female, from Wuhan, Hubei, Ph.D. candidate; main research interest: Chinese information processing; mofan@whu.edu.cn
  • About the authors: JI Donghong (born 1967), male, from Wuhan, Hubei, professor, Ph.D., CCF member; main research interests: natural language processing and machine learning. HUANG Jiangping (born 1985), male, from Wuhan, Hubei, Ph.D. candidate; main research interest: natural language processing.
  • Supported by:
    National Natural Science Foundation of China (61133012, 61373108).

Abstract: Boundary identification for punctuation marks is an important research topic in natural language processing and underlies applications such as word segmentation and phrase chunking. To identify the boundaries of the Chinese slight-pause mark (、), which separates coordinate words and phrases, the Conditional Random Field (CRF) method for sequence segmentation and labeling was adopted. First, the slight-pause mark boundary identification task was described in two forms; then the annotation method and process for the slight-pause mark corpus, together with feature selection, were studied. Boundary identification experiments were carried out under two data set partitioning schemes: the corpus-recommended split and ten-fold cross validation. The experimental results show that the CRF method, combined with the selected boundary identification features, can identify slight-pause mark boundaries: the F-measure of boundary identification increased by 10.57% over the baseline experiment, and the F-measure for identifying the words separated by slight-pause marks reached 85.24%.

Key words: Conditional Random Field (CRF), slight-pause mark, boundary identification, feature selection
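
The F-measure reported in the abstract is the harmonic mean of precision P and recall R, F = 2PR / (P + R). The following is a minimal Python sketch of how such a score could be computed for boundary identification when it is framed as sequence labeling. The B/I/O tag scheme, the exact-match scoring of spans, the helper names `spans_from_bio` and `f_measure`, and the toy sentence are all assumptions made for illustration; the abstract does not specify the tag set or the scoring procedure actually used.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def spans_from_bio(tags: List[str]) -> Set[Span]:
    """Collect (start, end) spans from a B/I/O tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new span begins; close any span that is still open.
            if start is not None:
                spans.add((start, i))
            start = i
        elif tag == "O":
            # Outside any span; close the open span, if there is one.
            if start is not None:
                spans.add((start, i))
                start = None
        # tag == "I" continues the current span; a stray "I" is ignored.
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def f_measure(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F) with exact-match span scoring."""
    gold_spans = spans_from_bio(gold)
    pred_spans = spans_from_bio(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy sentence: "他 买 了 红 苹果 、 香蕉".
# Gold: "红 苹果" and "香蕉" are the items separated by the slight-pause mark.
tokens = ["他", "买", "了", "红", "苹果", "、", "香蕉"]
gold = ["O", "O", "O", "B", "I", "O", "B"]
# Prediction misses the left boundary of the first item.
pred = ["O", "O", "O", "O", "B", "O", "B"]

p, r, f = f_measure(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # prints "P=0.50 R=0.50 F=0.50"
```

In this toy case one of the two predicted spans matches a gold span exactly, so a single misplaced boundary halves both precision and recall. Under ten-fold cross validation, a score of this kind would typically be computed on each held-out fold and then averaged.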

CLC Number: