Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (11): 3151-3155. DOI: 10.11772/j.issn.1001-9081.2020122032

• Artificial intelligence •

Text feature selection method based on Word2Vec word embedding and genetic algorithm for biomarker selection in high-dimensional omics

Yang ZHANG1, Xiaoning WANG1,2

  1. School of Data Science and Intelligent Media, Communication University of China, Beijing 100024, China
  2. State Key Laboratory of Media Convergence and Communication (Communication University of China), Beijing 100024, China
  • Received: 2020-12-24 Revised: 2021-07-30 Accepted: 2021-08-04 Online: 2021-07-20 Published: 2021-11-10
  • Contact: Xiaoning WANG
  • About author: ZHANG Yang, born in 1996, M. S. candidate. His research interests include text mining.
    WANG Xiaoning, born in 1989, Ph. D., lecturer. His research interests include text mining and machine learning.
  • Supported by:
    the Surface Program of Beijing Municipal Natural Science Foundation (9202018); the Fundamental Research Funds for the Central Universities (CUC200F08)

Text feature selection method based on Word2Vec word embedding and genetic algorithm for biomarker selection in high-dimensional omics

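Yang ZHANG1, Xiaoning WANG1,2

  1. School of Data Science and Intelligent Media, Communication University of China, Beijing 100024
  2. State Key Laboratory of Media Convergence and Communication (Communication University of China), Beijing 100024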
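  • Contact: Xiaoning WANG
  • About author: ZHANG Yang (born in 1996), male, from Handan, Hebei, M. S. candidate. His research interests include text mining.
    WANG Xiaoning (born in 1989), male, from Dezhou, Shandong, lecturer, Ph. D., CCF member. His research interests include text mining and machine learning.
  • Supported by:
    the Surface Program of Beijing Municipal Natural Science Foundation (9202018); the Fundamental Research Funds for the Central Universities (CUC200F08)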

Abstract:

Text features are a key component of natural language processing. To address the high dimensionality and sparseness of text features, a text feature selection method based on Word2Vec word embedding and the Genetic AlgoRithm for Biomarker selection in high-dimensional Omics (GARBO) was proposed to facilitate subsequent text classification tasks. Firstly, the data input form was optimized, and the Word2Vec word embedding method was used to transform the text into word vectors analogous to gene expression data. Then, the high-dimensional word vectors, treated as simulated gene expression, were iteratively evolved. Finally, a random forest classifier was used to classify the text after feature selection. Experiments were conducted on a Chinese comment dataset to verify the proposed method. The results show that the optimized GARBO feature selection method is effective for text feature selection: it reduced the 300-dimensional features to 50 more informative dimensions and reached a classification accuracy of 88%. Compared with other filter-type text feature selection methods, the proposed method effectively reduces the dimension of text features and improves text classification performance.

Key words: text classification, genetic algorithm, feature dimensionality reduction, Word2Vec, text feature
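
The idea of evolving a feature subset with a genetic algorithm can be illustrated with a minimal, dependency-free Python sketch. It is not the GARBO algorithm and not the authors' implementation: the synthetic data stands in for Word2Vec-derived document vectors, a nearest-centroid classifier replaces the random forest to avoid external libraries, and all names and parameters (`make_data`, `centroid_accuracy`, `fitness`, `select_features`, `penalty`, population size, mutation rate) are hypothetical choices made for illustration.

```python
import random

def make_data(n_samples=120, n_features=20, n_informative=4, seed=0):
    """Synthetic stand-in for document vectors: only the first
    n_informative dimensions carry class signal (an assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += 2.0 if label else -2.0
        X.append(row)
        y.append(label)
    return X, y

def centroid_accuracy(X, y, mask):
    """Held-out accuracy of a nearest-centroid classifier that only
    sees the dimensions switched on in mask."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    half = len(X) // 2  # first half trains, second half tests
    centroids = {}
    for label in (0, 1):
        rows = [X[i] for i in range(half) if y[i] == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for i in range(half, len(X)):
        dists = {label: sum((X[i][j] - c) ** 2 for j, c in zip(cols, cen))
                 for label, cen in centroids.items()}
        correct += min(dists, key=dists.get) == y[i]
    return correct / (len(X) - half)

def fitness(X, y, mask, penalty=0.01):
    """Accuracy minus a small cost per selected dimension."""
    return centroid_accuracy(X, y, mask) - penalty * sum(mask)

def select_features(X, y, pop_size=30, generations=25,
                    mutation_rate=0.05, seed=1):
    """Generic GA: truncation selection, one-point crossover,
    bit-flip mutation. Returns the best boolean mask found."""
    rng = random.Random(seed)
    n = len(X[0])
    population = [[rng.random() < 0.5 for _ in range(n)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(X, y, m),
                        reverse=True)
        parents = ranked[: pop_size // 2]  # survivors are kept unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=lambda m: fitness(X, y, m))

X, y = make_data()
best = select_features(X, y)
print("selected dimensions:", [j for j, keep in enumerate(best) if keep])
print("held-out accuracy:", centroid_accuracy(X, y, best))
```

The per-dimension penalty in `fitness` is what pushes the search toward small subsets; without it, the GA has no reason to drop uninformative dimensions. In a setting closer to the one described in the abstract, the rows of `X` would be 300-dimensional document vectors and the fitness would come from a stronger classifier.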

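Abstract:

Text features are a key part of natural language processing. To address the high dimensionality and sparsity of current text features, a text feature selection method based on Word2Vec word embedding and the Genetic AlgoRithm for Biomarker selection in high-dimensional Omics (GARBO) was proposed to facilitate subsequent text classification tasks. First, the data input form was optimized, and the Word2Vec word embedding method was used to convert the text into word vectors resembling gene representations. Then, the high-dimensional word vectors were iteratively evolved in the manner of gene expression. Finally, a random forest classifier was used to classify the text after feature selection. Experiments on a Chinese comment dataset demonstrated the effectiveness of the optimized GARBO feature selection method for text feature selection: the method reduced the 300-dimensional features to 50 more valuable dimensions and reached a classification accuracy of 88%. Compared with other filter-type text feature selection methods, it effectively reduces the dimension of text features and improves text classification performance.

Key words: text classification, genetic algorithm, feature dimensionality reduction, Word2Vec, text feature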

CLC Number: