Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (8): 2184-2187. DOI: 10.11772/j.issn.1001-9081.2014.08.2184

• Papers from the 5th China Conference on Data Mining (CCDM 2014) •

Unsupervised discretization algorithm based on ensemble learning

XU Yingying 1,2, ZHONG Caiming 1,2

  1. College of Science and Technology, Ningbo University, Ningbo, Zhejiang 315210, China
    2. College of Information Science and Engineering, Ningbo University, Ningbo, Zhejiang 315210, China
  • Received: 2014-04-30  Revised: 2014-05-08  Online: 2014-08-01  Published: 2014-08-10
  • Corresponding author: XU Yingying
  • About the authors: XU Yingying (1990-), female, born in Tongcheng, Anhui, M.S. candidate; her research interests include machine learning and pattern recognition. ZHONG Caiming (1970-), male, born in Ningbo, Zhejiang, associate professor, Ph.D.; his research interests include pattern recognition and machine learning.
  • Supported by:

    National Natural Science Foundation of China

Abstract:

Some algorithms in pattern recognition and machine learning can only handle discrete attribute values, whereas many real-world data sets consist of continuous values. To address this discretization problem, an unsupervised method was proposed. First, K-means was used to partition the data set into several subgroups so as to obtain class label information; then a supervised discretization algorithm was applied to the partitioned data. Repeating this process produced multiple discretization results, which were then combined with an ensemble technique. Finally, the resulting minimum sub-intervals were merged, where the dimension to merge first and the adjacent intervals to merge were chosen according to the neighbor relationships among the data; the number of sub-intervals was estimated automatically from these neighbor relationships so that the intrinsic structure of the data set was preserved as far as possible. The discretized data were then fed to clustering algorithms such as spectral clustering, and the clustering quality was evaluated. The experimental results show that the clustering accuracy of the proposed algorithm is about 33% higher on average than that of four other methods, which demonstrates its feasibility and effectiveness. The discretized data obtained by the algorithm can also be used in data mining algorithms such as the ID3 decision tree algorithm.
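
A minimal sketch of the pipeline described in the abstract is given below, assuming Python with scikit-learn. It is an illustration, not the authors' implementation: K-means supplies pseudo-labels, a shallow decision tree stands in for the supervised discretizer (the paper does not prescribe this particular discretizer), the union of cut points over several runs plays the role of the ensemble's minimum sub-intervals, and the final neighbor-based merging step is omitted. The function names and parameters (supervised_cut_points, ensemble_discretize, n_runs, k, max_bins) are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def supervised_cut_points(x, labels, max_bins=8):
    # Fit a shallow decision tree on a single continuous feature against the
    # K-means pseudo-labels and read its split thresholds as cut points
    # (a stand-in for any supervised discretizer).
    tree = DecisionTreeClassifier(max_leaf_nodes=max_bins, random_state=0)
    tree.fit(x.reshape(-1, 1), labels)
    t = tree.tree_
    return np.sort(t.threshold[t.feature == 0])  # internal split nodes only

def ensemble_discretize(X, n_runs=10, k=5):
    # Repeat: K-means pseudo-labels -> per-feature supervised cut points.
    # Pooling the cut points of all runs splits every feature into the
    # ensemble's minimum sub-intervals.
    n_features = X.shape[1]
    cuts = [set() for _ in range(n_features)]
    for run in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=10, random_state=run).fit_predict(X)
        for j in range(n_features):
            cuts[j].update(supervised_cut_points(X[:, j], labels).tolist())
    # Map each value to the index of the minimum sub-interval it falls into.
    return np.column_stack(
        [np.digitize(X[:, j], np.sort(list(cuts[j]))) for j in range(n_features)]
    )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))       # toy continuous data
    print(ensemble_discretize(X)[:5])   # first five discretized rows

In the paper itself, these minimum sub-intervals are subsequently merged, with the dimension and adjacent intervals to merge chosen from the neighbor relationships among the data, which also fixes the final number of intervals; the sketch stops at the pooled cut points.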

CLC number: