Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (11): 3151-3155. DOI: 10.11772/j.issn.1001-9081.2020122032

• Artificial intelligence

Text feature selection method based on Word2Vec word embedding and genetic algorithm for biomarker selection in high-dimensional omics

Yang ZHANG1, Xiaoning WANG1,2

  1. School of Data Science and Intelligent Media, Communication University of China, Beijing 100024, China
  2. State Key Laboratory of Media Convergence and Communication (Communication University of China), Beijing 100024, China
  • Received: 2020-12-24  Revised: 2021-07-30  Accepted: 2021-08-04  Online: 2021-07-20  Published: 2021-11-10
  • Contact: Xiaoning WANG
  • About author: ZHANG Yang, born in 1996, M.S. candidate. His research interests include text mining.
    WANG Xiaoning, born in 1989, Ph.D., lecturer. His research interests include text mining and machine learning.
  • Supported by:
    the Surface Program of Beijing Municipal Natural Science Foundation (9202018); the Fundamental Research Funds for the Central Universities (CUC200F08)


  • About author: ZHANG Yang (born 1996), male, native of Handan, Hebei, M.S. candidate; main research interest: text mining


Text features are a key part of natural language processing. To address the high dimensionality and sparseness of text features, a text feature selection method based on Word2Vec word embedding and the Genetic AlgoRithm for Biomarker selection in high-dimensional Omics (GARBO) was proposed to support subsequent text classification tasks. Firstly, the form of the input data was optimized: the Word2Vec word embedding method was used to transform the text into word vectors analogous to gene expression data. Then, the gene expression data simulated by these high-dimensional word vectors were evolved iteratively by the genetic algorithm. Finally, a random forest classifier was used to classify the text after feature selection. Experiments were conducted on a Chinese comment dataset to verify the proposed method. The experimental results show that the optimized GARBO feature selection method is effective for text feature selection: it reduced the 300-dimensional features to 50 more informative dimensions and reached a classification accuracy of 88%. Compared with other filter-type text feature selection methods, the proposed method effectively reduces the dimension of text features and improves text classification performance.
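The selection step summarized above can be illustrated with a small, generic genetic algorithm over feature subsets. This is only a sketch under stated assumptions, not the GARBO implementation or the authors' code: the data are synthetic stand-ins for document vectors (the abstract uses Word2Vec embeddings), a nearest-centroid classifier replaces the random forest to keep the example dependency-free, and the names `make_data`, `centroid_accuracy`, and `select_features` as well as all parameter values are hypothetical.

```python
import random

def make_data(n_docs=120, dim=30, informative=(0, 1, 2, 3, 4), seed=0):
    """Synthetic stand-in for document vectors (assumption: not real Word2Vec output)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_docs):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for j in informative:
            row[j] += 2.0 * label  # only these dimensions carry class signal
        X.append(row)
        y.append(label)
    return X, y

def centroid_accuracy(X, y, cols, train_idx, test_idx):
    """Accuracy of a nearest-centroid classifier using only the dimensions in cols."""
    centroids = {}
    for label in set(y[i] for i in train_idx):
        rows = [X[i] for i in train_idx if y[i] == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for i in test_idx:
        pred = min(
            centroids,
            key=lambda c: sum((X[i][j] - m) ** 2 for j, m in zip(cols, centroids[c])),
        )
        correct += pred == y[i]
    return correct / len(test_idx)

def select_features(X, y, k=5, pop_size=20, generations=15, mutation_rate=0.2, seed=0):
    """Evolve fixed-size feature subsets; fitness is validation accuracy."""
    rng = random.Random(seed)
    dim = len(X[0])
    idx = list(range(len(X)))
    rng.shuffle(idx)
    split = len(idx) // 2
    train_idx, val_idx = idx[:split], idx[split:]

    def fitness(subset):
        return centroid_accuracy(X, y, subset, train_idx, val_idx)

    # Each chromosome is a sorted tuple of k distinct feature indices.
    population = [tuple(sorted(rng.sample(range(dim), k))) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        history.append(fitness(scored[0]))
        elite = scored[: pop_size // 4]      # elitism: the best subsets survive unchanged
        children = list(elite)
        while len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            pool = sorted(set(a) | set(b))   # crossover: draw k indices from both parents
            child = set(rng.sample(pool, k))
            if rng.random() < mutation_rate: # mutation: swap one index for an unused one
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice([j for j in range(dim) if j not in child]))
            children.append(tuple(sorted(child)))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return list(best), history

X, y = make_data()
best, history = select_features(X, y, k=5)
print("selected dimensions:", best)
print("best validation accuracy per generation:", history)
```

Because the elite subsets are copied unchanged into each new generation, the best fitness in `history` never decreases. Note that this accuracy is measured on the same validation split that drives the selection, so it is optimistic; a separate held-out test set would be needed for a fair estimate.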

Key words: text classification, genetic algorithm, feature dimensionality reduction, Word2Vec, text feature



