Journal of Computer Applications (计算机应用), 2016, Vol. 36, Issue (1): 154-157. DOI: 10.11772/j.issn.1001-9081.2016.01.0154

• Artificial Intelligence •

Advertising phrase correlation based on recursive autoencoders (基于递归自编码器的广告短语相关性)

HU Qinghui (胡庆辉)1,2, WEI Shiwei (魏士伟)2, XIE Zhongqian (解忠乾)1, REN Yafeng (任亚峰)1

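  1. School of Computer, Wuhan University, Wuhan 430072;
    2. Key Laboratory Breeding Base of Robot and Welding Technology of Guangxi Colleges and Universities, Guilin University of Aerospace Technology, Guilin, Guangxi 541004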
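  • Received: 2015-07-27 Revised: 2015-08-31 Online: 2016-01-10 Published: 2016-01-09
  • Corresponding author: WEI Shiwei (born 1979), male, from Hebi, Henan; lecturer, M.S.; main research interest: natural language processing
  • About the authors: HU Qinghui (born 1976), male, from Kaixian, Chongqing; associate professor, Ph.D. candidate; main research interests: multiple kernel learning, supervised learning, semi-supervised learning, natural language processing. XIE Zhongqian (born 1989), male, from Heze, Shandong; M.S., CCF member; main research interest: natural language processing. REN Yafeng (born 1985), male, from Qinyang, Henan; Ph.D.; main research interests: data mining, natural language processing.
  • Supported by:
    National Natural Science Foundation of China (11301106); Natural Science Foundation of Guangxi (2014GXNSFAA1183105); Guangxi Higher Education Scientific Research Project (ZD2014147, YB2014431).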

Correlation between phrases in advertisement based on recursive autoencoder

HU Qinghui1,2, WEI Shiwei2, XIE Zhongqian1, REN Yafeng1   

  1. School of Computer, Wuhan University, Wuhan, Hubei 430072, China;
    2. Key Laboratory Breeding Base of Robot and Welding Technology of Guangxi Colleges and Universities, Guilin University of Aerospace Technology, Guilin, Guangxi 541004, China
  • Received: 2015-07-27 Revised: 2015-08-31 Online: 2016-01-10 Published: 2016-01-09
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (11301106), the Natural Science Foundation of Guangxi (2014GXNSFAA1183105), and the Higher Education Science Research Project of Guangxi (ZD2014147, YB2014431).

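Abstract (translated from the Chinese): Existing studies of advertising phrase correlation mostly rely on literal matching and ignore the deep semantic information contained in the phrases, which limits task performance. To address this, a deep learning approach to advertising phrase correlation is proposed: a recursive autoencoder (RAE) performs deep structural analysis of each phrase so that the phrase vector carries deep semantic information, and a method for computing phrase correlation in the advertising context is built on that basis. Specifically, given a sequence of several words, each pair of adjacent elements in the sequence is tentatively merged to produce a reconstruction error; the pair with the smallest reconstruction error is merged in each pass, forming a phrase tree with a Huffman-tree-like structure. Gradient descent is used to minimize the reconstruction error of the phrase tree, and cosine distance is used to measure the correlation between phrases. Experimental results show that introducing word weight information increases the amount of information that important words contribute to the final phrase vector representation, making the RAE better suited to phrase computation. Compared with the traditional LDA and BM25 algorithms, at 50% recall the proposed algorithm improves accuracy by 4.59 and 3.21 percentage points respectively, which demonstrates its effectiveness.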

Keywords (translated from the Chinese): deep learning, recursive autoencoder, word vector, computational advertising, search engine

Abstract: Most existing research on the correlation between advertising phrases stays at the level of literal matching and cannot exploit the deep semantic information of the phrases, which limits task performance. To address this issue, a method based on deep learning was proposed to calculate the correlation between phrases. A Recursive AutoEncoder (RAE) was used to analyze the deep structure of each phrase, so that the phrase vector contains deep semantic information, and a correlation measure for the advertising setting was built on these vectors. Specifically, for a given sequence of several words, a reconstruction error was produced by tentatively merging each pair of adjacent elements. A phrase tree, similar in structure to a Huffman tree, was then built by repeatedly merging the two adjacent elements with the smallest reconstruction error. Gradient descent was used to minimize the reconstruction error of the phrase tree, and cosine distance was used to measure the correlation between phrases. The experimental results show that introducing word weight information increases the contribution of important words to the final phrase vector, which makes RAE more suitable for phrase calculation. Compared with the LDA (Latent Dirichlet Allocation) and BM25 algorithms, the proposed method increases accuracy by 4.59 and 3.21 percentage points respectively at a recall rate of 50%, which demonstrates its effectiveness.

Key words: deep learning, Recursive AutoEncoder (RAE), word vector, computational advertising, search engine
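
The greedy merging procedure described in the abstract can be illustrated with a short sketch. The abstract gives no equations, so everything here is an assumption for illustration: a single `tanh` encoder layer, a linear decoder, squared-error reconstruction, and randomly initialized (untrained) weights. The gradient-descent training step and the word-weighting scheme are not specified in the abstract and are left out. The names `make_params`, `build_phrase_vector` and `cosine_similarity` are hypothetical.

```python
import numpy as np


def make_params(dim, seed=0):
    """Random, untrained autoencoder weights (assumption: one shared layer)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(2 * dim)
    return {
        "W_enc": rng.normal(0.0, scale, (dim, 2 * dim)),
        "b_enc": np.zeros(dim),
        "W_dec": rng.normal(0.0, scale, (2 * dim, dim)),
        "b_dec": np.zeros(2 * dim),
    }


def merge_pair(left, right, params):
    """Encode two children into a parent; return (reconstruction error, parent)."""
    children = np.concatenate([left, right])
    parent = np.tanh(params["W_enc"] @ children + params["b_enc"])
    parent = parent / (np.linalg.norm(parent) + 1e-12)  # keep vectors on a common scale
    rebuilt = params["W_dec"] @ parent + params["b_dec"]
    error = float(np.sum((children - rebuilt) ** 2))
    return error, parent


def build_phrase_vector(word_vectors, params):
    """Greedily merge the adjacent pair with the smallest reconstruction error."""
    nodes = [np.asarray(v, dtype=float) for v in word_vectors]
    merges = []        # position of the left child chosen at each step
    total_error = 0.0  # summed reconstruction error over the resulting tree
    while len(nodes) > 1:
        candidates = [merge_pair(nodes[i], nodes[i + 1], params)
                      for i in range(len(nodes) - 1)]
        best = min(range(len(candidates)), key=lambda i: candidates[i][0])
        error, parent = candidates[best]
        nodes[best:best + 2] = [parent]  # replace the pair by its parent node
        merges.append(best)
        total_error += error
    return nodes[0], merges, total_error


def cosine_similarity(a, b):
    """Cosine of the angle between two phrase vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    params = make_params(dim=8)
    phrase_a = rng.normal(size=(4, 8))  # stand-ins for four word vectors
    phrase_b = rng.normal(size=(3, 8))  # stand-ins for three word vectors
    vec_a, merges_a, _ = build_phrase_vector(phrase_a, params)
    vec_b, _, _ = build_phrase_vector(phrase_b, params)
    print("merge order for phrase A:", merges_a)
    print("cosine similarity:", cosine_similarity(vec_a, vec_b))
```

In a trained model, the summed reconstruction error returned by `build_phrase_vector` would be the quantity minimized by gradient descent; with random weights the output only demonstrates the control flow, not a meaningful relevance score.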

CLC number: