Journal of Computer Applications, 2018, Vol. 38, Issue (11): 3105-3111. DOI: 10.11772/j.issn.1001-9081.2018041275


Feature selection for multi-label distribution learning with streaming data based on rough set

CHENG Yusheng1,2,3, CHEN Fei1, WANG Yibin1,2   

    1. School of Computer and Information, Anqing Normal University, Anqing Anhui 246011, China;
    2. University Key Laboratory of Intelligent Perception and Computing of Anhui Province, Anqing Anhui 246011, China;
    3. Key Laboratory of Data Science and Intelligence Application, Fujian Province University, Zhangzhou Fujian 363000, China
  • Received: 2018-04-28  Revised: 2018-06-20  Online: 2018-11-10  Published: 2018-11-10
  • Supported by:
    This work is partially supported by the Natural Science Research Funds of Education Department of Anhui Province (KJ2017A352) and the Key Laboratory of Data Science and Intelligence Application, Fujian Province University (D1801).

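  • Corresponding author: CHENG Yusheng
  • About the authors: CHENG Yusheng (born 1969), male, from Tongcheng, Anhui, professor, Ph.D.; research interests: rough sets, machine learning, data mining. CHEN Fei (born 1994), male, from Tongling, Anhui, master's student, CCF member; research interests: multi-label learning, rough sets. WANG Yibin (born 1970), male, from Anqing, Anhui, professor, M.S., CCF member; research interests: machine learning, multi-label learning.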

Abstract: Traditional feature selection algorithms cannot process streaming feature data, compute redundancy in a complicated way, and do not describe instances accurately enough. To address these problems, a multi-label Distribution learning Feature Selection with Streaming Data Using Rough Set (FSSRS) algorithm was proposed. Firstly, the online streaming feature selection framework was introduced into multi-label learning. Secondly, the original conditional probability was replaced by the dependency degree from rough set theory, so that the streaming feature selection algorithm relies only on information computed from the data itself, which makes it more efficient. Finally, since in the real world each label describes the same instance to a different degree, label distributions were used in place of traditional logical labels to describe instances more accurately. Experimental results on multiple datasets show that the proposed algorithm retains the features that are highly correlated with the label space, so that classification accuracy improves to a certain extent compared with using no feature selection.

Key words: rough set, multi-label, streaming data, feature selection, label distribution
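
The rough-set dependency degree mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' FSSRS algorithm: it assumes discretized feature values, a single logical label per instance instead of a label distribution, and a hypothetical greedy rule (`stream_select`) that keeps an arriving feature only if it raises the dependency degree. All function names and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def dependency(rows, feature_idx, labels):
    """Rough-set dependency gamma_B(D) = |POS_B(D)| / |U| for feature subset B."""
    if not rows:
        return 0.0
    # Group instances into equivalence classes by their values on the chosen features.
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row[j] for j in feature_idx)
        classes[key].append(i)
    # A class belongs to the positive region if all of its members share one label.
    positive = sum(
        len(members)
        for members in classes.values()
        if len({labels[i] for i in members}) == 1
    )
    return positive / len(rows)


def stream_select(rows, labels, arrival_order):
    """Hypothetical greedy rule: keep an arriving feature only if it raises dependency."""
    selected = []
    current = dependency(rows, selected, labels)
    for j in arrival_order:
        candidate = dependency(rows, selected + [j], labels)
        if candidate > current:
            selected.append(j)
            current = candidate
    return selected


# Toy data (assumed): feature 0 determines the label, feature 1 is irrelevant,
# feature 2 is redundant with feature 0.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
labels = ["a", "a", "b", "b"]
selected = stream_select(rows, labels, arrival_order=[1, 0, 2])
print(selected)
```

In this toy run the irrelevant feature is rejected because it leaves the dependency degree at zero, and the redundant feature is rejected because the dependency degree is already 1.0 when it arrives. Only counts taken from the data itself are used, which is the property the abstract attributes to the dependency-based criterion.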
