Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (7): 1965-1968. DOI: 10.11772/j.issn.1001-9081.2015.07.1965

• Artificial Intelligence •

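Weakly-supervised training method about Chinese spoken language understanding

LI Yanling1,2, YAN Yonghong2

  1. College of Computer and Information Engineering, Inner Mongolia Normal University, Hohhot, Nei Mongol 010022, China;
  2. Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences (Institute of Acoustics, Chinese Academy of Sciences), Beijing 100190, China
  • Received:2015-01-23 Revised:2015-03-19 Online:2015-07-10 Published:2015-07-17
  • Corresponding author: LI Yanling (born 1978), female, from Hohhot, Inner Mongolia; lecturer, Ph.D., CCF member. Main research interests: natural language processing, machine learning, signal processing. liyanling7871397@163.com
  • About the author: YAN Yonghong (born 1967), male, from Wuxi, Jiangsu; research fellow and doctoral supervisor. Main research interests: speech signal processing, spoken language systems and multimodal systems, human-computer interfaces.
  • Supported by:

    National Natural Science Foundation of China (10925419, 90920302, 61072124, 11074275, 11161140319, 91120001, 61271426); Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030100, XDA06030500); National High Technology Research and Development Program of China (863 Program) (2012AA012503); Key Deployment Program of the Chinese Academy of Sciences (KGZD-EW-103-2); "Ten, Hundred, Thousand" Talent Training Project of Inner Mongolia Normal University; General Program of the Natural Science Foundation of Inner Mongolia (2012MS0930, 2013MS0912); Scientific Research Project of Higher Education Institutions of Inner Mongolia Autonomous Region (NJZY12032, NJZY028); Research Start-up Fund for Introduced High-level Talents of Inner Mongolia Normal University (2014YJRC036).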

Abstract:

Obtaining annotated data is a persistent difficulty for supervised methods. For the intention recognition task in Chinese spoken language understanding, two weakly supervised training approaches were studied: one combines active learning with self-training, and the other is co-training. Within a cascaded framework, a method was proposed for obtaining the two "views" needed for co-training: a subset of semantic class features taken from key semantic concept recognition, and a subset of character features taken from the sentence itself. Experiments on a Chinese spoken language corpus show that, compared with passive learning and with active learning alone, combining active learning with self-training reduces the amount of manual annotation the most. Co-training, starting from very little initial annotated data and using the two feature subsets, reduced the classification error rate obtained with the character feature subset alone by 0.52% on average.

Key words: intention recognition, spoken language understanding, weakly-supervised training, co-training, active learning
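The two-view co-training idea described in the abstract can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the classifier, so a small naive Bayes model (`TinyNB`) stands in for it; the rule of moving the single most confident prediction per view into the labeled set is an assumed selection rule; and the toy sentences, intent labels (`weather`, `flight`) and semantic class tags (`CITY`, `TICKET`, and so on) are all hypothetical. View 0 holds the characters of the sentence and view 1 holds semantic class tags such as a concept recognizer might emit.

```python
import math
from collections import Counter, defaultdict


class TinyNB:
    """Multinomial naive Bayes with add-one smoothing (assumed stand-in classifier)."""

    def fit(self, docs, labels):
        self.prior = Counter(labels)
        self.counts = defaultdict(Counter)
        for feats, label in zip(docs, labels):
            self.counts[label].update(feats)
        self.vocab = {f for c in self.counts.values() for f in c}
        return self

    def predict_proba(self, feats):
        n_docs = sum(self.prior.values())
        log_p = {}
        for label in self.prior:
            # The extra +1 reserves probability mass for unseen features.
            denom = sum(self.counts[label].values()) + len(self.vocab) + 1
            lp = math.log(self.prior[label] / n_docs)
            for f in feats:
                lp += math.log((self.counts[label][f] + 1) / denom)
            log_p[label] = lp
        top = max(log_p.values())
        z = sum(math.exp(v - top) for v in log_p.values())
        return {label: math.exp(v - top) / z for label, v in log_p.items()}

    def predict(self, feats):
        proba = self.predict_proba(feats)
        return max(proba, key=proba.get)


def train_view(labeled, view):
    """Train a classifier on one view (0 = characters, 1 = semantic classes)."""
    return TinyNB().fit([x[view] for x, _ in labeled], [y for _, y in labeled])


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """Each view's classifier labels its most confident pool items for both views."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            clf = train_view(labeled, view)
            scored = []
            for x in pool:
                proba = clf.predict_proba(x[view])
                label = max(proba, key=proba.get)
                scored.append((proba[label], label, x))
            scored.sort(key=lambda t: t[0], reverse=True)
            for _conf, label, x in scored[:per_round]:
                labeled.append((x, label))  # pseudo-label joins the shared training set
                pool.remove(x)
    return train_view(labeled, 0), train_view(labeled, 1), labeled


# Hypothetical toy data: (character view, semantic class view), intent label.
seed = [
    ((list("北京天气"), ["CITY", "WEATHER_WORD"]), "weather"),
    ((list("订机票"), ["BOOK_WORD", "TICKET"]), "flight"),
]
pool = [
    (list("上海天气"), ["CITY", "WEATHER_WORD"]),
    (list("订上海机票"), ["BOOK_WORD", "CITY", "TICKET"]),
]
char_clf, sem_clf, grown = co_train(seed, pool)
print(char_clf.predict(list("广州天气")))
```

In this sketch the character-view classifier is the one evaluated at the end, mirroring the abstract's comparison against the character feature subset used alone; a real experiment would need a held-out test set and far more data than the four toy sentences here.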

CLC Number: