Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (3): 640-646. DOI: 10.11772/j.issn.1001-9081.2017.03.640

• The 4th Academic Conference on Big Data (CCF BIGDATA2016) •

Improving feature selection and matrix recovery ability by CUR matrix decomposition

LEI Hengxin, LIU Jinglei

  1. School of Computer and Control Engineering, Yantai University, Yantai Shandong 264005, China
  • Received: 2016-09-28 Revised: 2016-10-16 Online: 2017-03-10 Published: 2017-03-22
  • Corresponding author: LIU Jinglei
  • About the authors: LEI Hengxin (雷恒鑫), born in 1993 in Yanggu, Shandong, male, M.S. candidate; his main research interests are matrix decomposition and its applications. LIU Jinglei (刘惊雷), born in 1970 in Linyi, Shanxi, male, M.S., associate professor, CCF member; his main research interests are artificial intelligence and theoretical computer science.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61572419, 61572418, 61403328, 61403329) and the Natural Science Foundation of Shandong Province (ZR2014FQ016, ZR2014FQ026, 2015GSF115009, ZR2013FM011).

Abstract: To address the problems that the features of users and products cannot be selected quickly and accurately from large-scale data, and that users' behavioral preferences cannot be predicted accurately, a CUR (Column Union Row) matrix decomposition method was proposed. A small number of columns were selected from the original matrix to form the matrix C, and a small number of rows were selected to form the matrix R; the matrix U was then constructed by orthogonal-triangular (QR) decomposition. The matrices C and R are the feature matrices of users and products respectively. Because both consist of actual entries of the original data, concrete user and product features can be analyzed from them directly. To predict users' behavioral preferences more accurately, the CUR algorithm was improved to give higher stability and accuracy in matrix recovery. Experiments on a real dataset (the Netflix dataset) show that, compared with traditional matrix decomposition methods such as singular value decomposition and principal component analysis, the CUR matrix decomposition method achieves higher accuracy and better interpretability in feature selection, and the improved CUR matrix decomposition method achieves higher stability and precision in matrix recovery, with an accuracy above 90%. CUR matrix decomposition has important application value in recommender systems, for making recommendations to users, and in traffic systems, for predicting traffic flow.

Key words: Column Union Row (CUR) algorithm, feature selection, matrix recovery, interpretability, stability
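
The construction summarized in the abstract (sample columns for C, sample rows for R, build U through QR factorizations) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `cur_decompose`, the uniform sampling without replacement, and the choice U = C⁺AR⁺ are all assumptions made for illustration, since the abstract does not give the sampling rule, the exact form of U, or the improved recovery algorithm. The sketch also assumes that the sampled C and R have full rank.

```python
import numpy as np


def cur_decompose(A, c, r, rng):
    """Illustrative CUR sketch: returns C, U, R with A ~= C @ U @ R.

    Assumptions (not taken from the paper): columns and rows are sampled
    uniformly without replacement, and U = pinv(C) @ A @ pinv(R), computed
    through thin QR factorizations. C and R must have full rank.
    """
    m, n = A.shape
    col_idx = np.sort(rng.choice(n, size=c, replace=False))
    row_idx = np.sort(rng.choice(m, size=r, replace=False))

    C = A[:, col_idx]  # m x c, actual columns of A
    R = A[row_idx, :]  # r x n, actual rows of A

    # Thin QR factorizations: C = Qc @ Tc and R.T = Qr @ Tr,
    # with Tc (c x c) and Tr (r x r) upper triangular.
    Qc, Tc = np.linalg.qr(C)
    Qr, Tr = np.linalg.qr(R.T)

    # pinv(C) = inv(Tc) @ Qc.T and pinv(R) = Qr @ inv(Tr).T, hence
    # U = inv(Tc) @ (Qc.T @ A @ Qr) @ inv(Tr).T
    core = Qc.T @ A @ Qr               # c x r
    U = np.linalg.solve(Tc, core)      # inv(Tc) @ core
    U = np.linalg.solve(Tr, U.T).T     # right-multiply by inv(Tr).T
    return C, U, R, col_idx, row_idx
```

Because C and R are copies of real columns and rows, the returned `col_idx` and `row_idx` identify which products and users were selected, which is where the interpretability mentioned in the abstract comes from. For a matrix of exact rank k, sampling k columns and k rows that each have rank k makes `C @ U @ R` reproduce `A` up to rounding error.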

CLC Number: