Dimensionality reduction for text document using difference-similitude matrix

Journal of Computer Applications, 2005, Vol. 25, Issue (08): 1821-1823. DOI: 10.3724/SP.J.1087.2005.01821

HUANG Xiao-chun, YAN Pu-liu, XIA De-lin, CHEN Jian

School of Electronic Information, Wuhan University, Wuhan, Hubei 430079, China

Online: 2011-04-07 Published: 2005-08-01

Funding: National Natural Science Foundation of China (90204008)

Abstract

Abstract: Text collections contain many documents and large vocabularies, so the resulting document spaces are high-dimensional, and many automatic text categorization algorithms cannot be applied to them directly and effectively. The difference-similitude matrix (DSM) method reduces the dimensionality of the document space to a great extent. After preprocessing, a pre-classified collection is represented as an item-document matrix and then transformed into a DSM, in which pairs of documents from the same class are described by their similitude items and pairs of documents from different classes by their difference items. Processing the DSM yields a low-dimensional set of text features and, at the same time, a set of classification rules. Experiments suggest that, for large collections, the DSM method achieves high attribute reduction and sample reduction rates while maintaining good classification quality.

Key words: text categorization, dimensionality reduction, difference-similitude matrix (DSM)
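The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact definition of the DSM entries or of the reduction procedure, so everything here is an assumption made for illustration: documents are treated as sets of items (binary presence), a same-class pair stores the items on which the two documents agree, a different-class pair stores the items on which they differ, and the reduction step is a plain greedy hitting set over the difference entries, swapped in for the paper's own procedure. The names `build_dsm` and `greedy_item_subset` are hypothetical, and rule generation is not shown.

```python
from itertools import combinations


def build_dsm(doc_items, labels, vocabulary):
    """Build a difference-similitude matrix as {(i, j): (kind, items)} for i < j.

    Assumption: items are binary (present or absent in a document).
    """
    dsm = {}
    for i, j in combinations(range(len(doc_items)), 2):
        if labels[i] == labels[j]:
            # Same class: keep the items on which both documents agree.
            items = {t for t in vocabulary
                     if (t in doc_items[i]) == (t in doc_items[j])}
            dsm[(i, j)] = ("similitude", items)
        else:
            # Different classes: keep the items that tell the two documents apart.
            items = {t for t in vocabulary
                     if (t in doc_items[i]) != (t in doc_items[j])}
            dsm[(i, j)] = ("difference", items)
    return dsm


def greedy_item_subset(dsm):
    """Pick a small item set that intersects every non-empty difference entry.

    This greedy hitting set is an illustrative stand-in, not the paper's method.
    """
    remaining = [items for kind, items in dsm.values()
                 if kind == "difference" and items]
    selected = []
    while remaining:
        counts = {}
        for items in remaining:
            for t in items:
                counts[t] = counts.get(t, 0) + 1
        # Most frequent item first; ties broken alphabetically for determinism.
        best = min(counts, key=lambda t: (-counts[t], t))
        selected.append(best)
        remaining = [items for items in remaining if best not in items]
    return selected


# Toy pre-classified collection (made-up data, for illustration only).
docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"stock", "team"}]
labels = ["sports", "sports", "finance", "finance"]
vocabulary = sorted(set().union(*docs))

dsm = build_dsm(docs, labels, vocabulary)
reduced_items = greedy_item_subset(dsm)
print(reduced_items)  # ['goal']
```

In this toy collection a single item is enough to separate every pair of documents from different classes, so the five-item vocabulary shrinks to one feature. On real collections the selected set would be larger, and the similitude entries, unused by this sketch, would presumably play a role in merging same-class samples.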

CLC Number: TP391.1

HUANG Xiao-chun, YAN Pu-liu, XIA De-lin, CHEN Jian. Dimensionality reduction for text document using difference-similitude matrix[J]. Journal of Computer Applications, 2005, 25(08): 1821-1823.

