Multi-category active learning algorithm based on multiple clustering algorithms and multivariate linear regression
WANG Min¹, WU Yubo¹, MIN Fan²
1. School of Electrical Engineering and Information, Southwest Petroleum University, Chengdu, Sichuan 610500, China; 2. School of Computer Science, Southwest Petroleum University, Chengdu, Sichuan 610500, China
Abstract: Traditional lithology identification methods have low recognition accuracy and are difficult to integrate organically with geological experience. To address this problem, a multi-category Active Learning algorithm based on multiple Clustering algorithms and multivariate Linear regression (ALCL) was proposed. Firstly, multiple heterogeneous clustering algorithms were run to obtain a category matrix for each algorithm, and the category matrices were labeled and pre-classified by querying their common points. Secondly, the key examples used to train the weight coefficient model of the clustering algorithms were selected with the proposed priority largest search strategy and most confusing query strategy. Thirdly, the objective function was defined, and the weight coefficient of each clustering algorithm was obtained by training on the key examples. Finally, classification was computed with these weight coefficients, and the samples whose results had high confidence were classified. Experiments were carried out on six public lithology datasets of oil wells in the Daqing oilfield. Experimental results show that, at its highest, the classification accuracy of ALCL is 2.07%-14.01% higher than those of traditional supervised learning algorithms and other active learning algorithms. The results of hypothesis testing and significance analysis show that ALCL achieves a better classification effect in lithology identification.
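The weighting and classification steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the objective function, so an ordinary least-squares fit with a small ridge term is assumed here, and the clustering step and both query strategies are left out. The names `fit_cluster_weights`, `classify`, `votes`, `labeled`, `ridge` and `margin` are hypothetical, and the cluster outputs are assumed to be already aligned with class labels.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_cluster_weights(votes, labeled, n_classes, ridge=1e-6):
    """Assumed objective: least-squares weights, one per clustering algorithm.

    votes[i][k] is the class that clusterer k assigns to sample i;
    labeled maps a queried sample index to its true class.
    """
    n_alg = len(votes[0])
    # Accumulate the normal equations (A^T A + ridge * I) w = A^T b.
    ata = [[ridge if i == j else 0.0 for j in range(n_alg)] for i in range(n_alg)]
    atb = [0.0] * n_alg
    for i, y in labeled.items():
        for c in range(n_classes):
            row = [1.0 if v == c else 0.0 for v in votes[i]]
            target = 1.0 if y == c else 0.0
            for p in range(n_alg):
                atb[p] += row[p] * target
                for q in range(n_alg):
                    ata[p][q] += row[p] * row[q]
    return solve_linear_system(ata, atb)


def classify(votes, weights, n_classes, margin=0.0):
    """Weighted vote; None marks a sample whose top-two margin is too small."""
    out = []
    for sample in votes:
        scores = [0.0] * n_classes
        for k, v in enumerate(sample):
            scores[v] += weights[k]
        ranked = sorted(range(n_classes), key=lambda c: scores[c], reverse=True)
        confident = scores[ranked[0]] - scores[ranked[1]] > margin
        out.append(ranked[0] if confident else None)
    return out


# Toy data (made up for illustration): 6 samples, 3 clusterers, 2 classes.
votes = [[0, 0, 0], [0, 1, 0], [1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
labeled = {0: 0, 1: 0, 2: 1, 3: 1}  # queried "key examples" with true labels
weights = fit_cluster_weights(votes, labeled, n_classes=2)
predictions = classify(votes, weights, n_classes=2, margin=0.5)
```

In this toy data the first clusterer agrees with every queried label, so the fit puts almost all of the weight on it and the two unlabeled samples are classified by its vote. A sample whose top two scores differ by no more than `margin` is returned as `None`; in an active learning loop such a sample would be a natural candidate for a further query.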
WANG Min, WU Yubo, MIN Fan. Multi-category active learning algorithm based on multiple clustering algorithms and multivariate linear regression. Journal of Computer Applications, 2020, 40(12): 3437-3444.