Search data clustering based on wavelet and its application in variable selection

doi:10.11772/j.issn.1001-9081.2015.03.802

Journal of Computer Applications, 2015, Vol. 35, Issue (3): 802-806. DOI: 10.11772/j.issn.1001-9081.2015.03.802

YUAN Ming

Department of Statistics, Tianjin University of Finance and Economics, Tianjin 300222, China

Received: 2014-10-10 Revised: 2014-11-16 Online: 2015-03-13 Published: 2015-03-10
Corresponding author: YUAN Ming
About the author: YUAN Ming (born 1982), male, from Tianjin, lecturer, Ph.D. Main research interests: data mining, artificial intelligence, and applications of computer technology in economic research.
Funding:
Tianjin Philosophy and Social Science Planning Project (TJTJ13-002)

Abstract:

To address variable selection when building predictive models from online shopping search volume data, a clustering method based on the Continuous Wavelet Transform (CWT) and its inverse transform is proposed. Taking the characteristics of search volume data into account, the method decomposes each original series into periodic components at different time scales and reconstructs these components into input vectors, which are then clustered with a weighted Fuzzy C-Means (FCM) algorithm. Variables (keywords) are selected according to their membership function values within each cluster, and the selection is evaluated with a prediction model for China's monthly Consumer Price Index (CPI). The experimental results indicate that search volume series contain periodic components of different lengths and that keywords within the same cluster are highly consistent in commodity type. Compared with other variable selection methods, the prediction model based on clustering of wavelet-reconstructed series achieves higher prediction accuracy, with one-step and three-step relative prediction errors of 0.3891% and 0.5437% respectively, and the selected variables have a clear economic meaning. The proposed method is therefore particularly suitable for variable selection in high-dimensional predictive models in the big data era.

Key words: online shopping search volume, predictive model, variable selection, Continuous Wavelet Transform (CWT), fuzzy clustering
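
The pipeline outlined in the abstract (per-scale decomposition, weighted fuzzy clustering, membership-based keyword selection) can be illustrated with a minimal sketch. It is not the paper's implementation, and several parts are assumptions made only for illustration: a plain Ricker-wavelet convolution stands in for the CWT and inverse-CWT reconstruction, the per-scale weights are set to 1, the series are synthetic, and the keyword names (`kw_a` and so on), the helper names, and the scale values are hypothetical.

```python
import numpy as np


def ricker(points, a):
    """Ricker (Mexican hat) wavelet of width `a`, sampled at `points` positions."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - (t / a) ** 2) * np.exp(-(t ** 2) / (2.0 * a ** 2))


def scale_features(series, scales):
    """Concatenate one filtered component per scale into a single input vector.

    Assumption: convolution with a Ricker wavelet stands in for the
    CWT / inverse-CWT reconstruction described in the abstract.
    """
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    parts = []
    for a in scales:
        width = min(int(10 * a) + 1, len(x))
        parts.append(np.convolve(x, ricker(width, a), mode="same"))
    return np.concatenate(parts)


def weighted_fcm(X, n_clusters, weights, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy C-Means with a per-feature weighted squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)          # memberships of each sample sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        diff = X[:, None, :] - centers[None, :, :]
        d2 = np.maximum((weights * diff ** 2).sum(axis=2), 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers


def select_keywords(U, names):
    """Pick, for each cluster, the keyword with the highest membership value."""
    return [names[i] for i in U.argmax(axis=0)]


# Synthetic demo: three series with a 12-step cycle, three with a 4-step cycle.
rng = np.random.default_rng(0)
t = np.arange(96)
names = ["kw_a", "kw_b", "kw_c", "kw_d", "kw_e", "kw_f"]  # hypothetical keywords
periods = [12, 12, 12, 4, 4, 4]
series = [np.sin(2 * np.pi * t / p) + 0.1 * rng.standard_normal(t.size)
          for p in periods]

scales = (1.0, 3.0)                            # assumed scales, one per cycle length
X = np.array([scale_features(s, scales) for s in series])
weights = np.repeat([1.0, 1.0], t.size)        # one assumed weight per scale block
U, centers = weighted_fcm(X, n_clusters=2, weights=weights)
labels = U.argmax(axis=1)
selected = select_keywords(U, names)
print(labels, selected)
```

In this sketch each cluster contributes one representative keyword, which would then serve as a predictor in a downstream forecasting model; how many keywords to keep per cluster and how to set the scale weights are choices the abstract does not specify.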

CLC number:

TP391.4

YUAN Ming. Search data clustering based on wavelet and its application in variable selection[J]. Journal of Computer Applications, 2015, 35(3): 802-806.
