Administrative division extracting algorithm for non-normalized Chinese addresses
(非规范化中文地址的行政区划提取算法)

Journal of Computer Applications (计算机应用), 2017, Vol. 37, Issue 3: 876-882. DOI: 10.11772/j.issn.1001-9081.2017.03.876

LI Xiaolin (李晓林)^1,2, HUANG Shuang (黄爽)^1,2, LU Tao (卢涛)^1,2, LI Lin (李霖)^3

1. School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, Hubei 430205, China;
2. Hubei Provincial Key Laboratory of Intelligent Robot (Wuhan Institute of Technology), Wuhan, Hubei 430205, China;
3. School of Resource and Environmental Sciences, Wuhan University, Wuhan, Hubei 430079, China

Received: 2016-08-26  Revised: 2016-10-18  Online: 2017-03-10  Published: 2017-03-22
Corresponding author: HUANG Shuang
About the authors: LI Xiaolin (born 1962), male, from Xiaogan, Hubei; associate professor, M.S.; research interests: data mining, machine learning, artificial intelligence. HUANG Shuang (born 1992), female, from Wuhan, Hubei; master's student; research interests: data mining, machine learning, artificial intelligence. LU Tao (born 1980), male, from Wuhan, Hubei; associate professor, Ph.D.; research interests: image/vision processing, computer vision, artificial intelligence. LI Lin (born 1960), male, from Xiaogan, Hubei; professor, doctoral supervisor, Ph.D.; research interests: geographic semantics and ontology, 3D modeling and visualization.
Supported by:
This work is partially supported by the Special Scientific Research Fund for the Public Welfare Industry of Surveying, Mapping and Geoinformation (201412014), the National High Technology Research and Development Program of China (863 Program) (2013AA12A202), the Natural Science Foundation of Hubei Province (2013CFA125), and the 7th Graduate Innovation Fund of Wuhan Institute of Technology (CX2015053).

Abstract

Chinese addresses on the Internet are often written in non-normalized forms, which makes them difficult to use directly in location-based services. To address this problem, an algorithm for extracting administrative divisions from non-normalized Chinese addresses is proposed. First, the raw data are preprocessed by grouping on the "road" (路) feature word. Next, an administrative division dictionary and a moving-window maximum matching algorithm are used to extract all candidate administrative division data sets from a Chinese address. Then, exploiting the hierarchical relationship among the administrative division elements of a Chinese address, conditional set operation rules for administrative divisions are established and applied to the extracted data sets. After that, a set of parsing rules based on the administrative division matching degree is established to compute the credibility of each candidate administrative division. Finally, the administrative division with the highest credibility and the most complete information is obtained. The usability of the algorithm was verified on about 250,000 Chinese addresses extracted from the Internet, with and without the "road" feature-word grouping and with and without the credibility calculation. Compared with current address matching techniques, the algorithm reaches an accuracy of 93.51%.

Key words: set operation, administrative division, Chinese address, moving window, matching degree, parsing rule
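The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the set operation rules, the parsing rules, or the credibility formula, so everything below is an assumption made for illustration. The toy gazetteer `DIVISIONS`, the helper names `window_max_match`, `full_chain` and `extract_division`, and the scoring (number of extracted names a candidate chain explains, ties broken by chain length) are all hypothetical. The "road" feature-word grouping step is omitted because the abstract does not describe how it works.

```python
# Hypothetical toy gazetteer: name -> list of (level, parent) entries.
# Level 1 = province, 2 = city, 3 = district. A name may be ambiguous,
# e.g. several cities have a district called 鼓楼区.
DIVISIONS = {
    "湖北省": [(1, None)],
    "江苏省": [(1, None)],
    "福建省": [(1, None)],
    "武汉市": [(2, "湖北省")],
    "南京市": [(2, "江苏省")],
    "福州市": [(2, "福建省")],
    "洪山区": [(3, "武汉市")],
    "鼓楼区": [(3, "南京市"), (3, "福州市")],
}
MAX_LEN = max(len(name) for name in DIVISIONS)


def window_max_match(address, dictionary=DIVISIONS, max_len=MAX_LEN):
    """Slide a window over the address and keep the longest dictionary hit
    found at each start position (forward maximum matching)."""
    hits, i = [], 0
    while i < len(address):
        for size in range(min(max_len, len(address) - i), 0, -1):
            word = address[i:i + size]
            if word in dictionary:
                hits.append(word)
                i += size
                break
        else:
            i += 1  # nothing in the dictionary starts here; shift by one character
    return hits


def full_chain(name, parent):
    """Follow parent links upward to build a province-to-leaf chain.
    Assumes parent names are unambiguous in the toy gazetteer."""
    chain = [name]
    while parent is not None:
        chain.append(parent)
        _, parent = DIVISIONS[parent][0]
    return tuple(reversed(chain))


def extract_division(address):
    """Return the candidate chain that explains the most extracted names,
    preferring the longer (more complete) chain on ties; None if no hit."""
    hits = window_max_match(address)
    best, best_key = None, (0, 0)
    for name in hits:
        for _level, parent in DIVISIONS[name]:
            chain = full_chain(name, parent)
            matched = len(set(hits) & set(chain))  # assumed "credibility" score
            key = (matched, len(chain))
            if key > best_key:
                best, best_key = chain, key
    return best
```

With this toy data, `extract_division("南京市鼓楼区中山路1号")` resolves the ambiguous 鼓楼区 to the Nanjing chain because that chain also contains the extracted 南京市, and `extract_division("洪山区雄楚大道693号")` fills in the missing province and city from the hierarchy. A real system would need a full gazetteer, handling of abbreviated names such as 南京 without 市, and the paper's actual rules.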

CLC number: TP391.1

LI Xiaolin, HUANG Shuang, LU Tao, LI Lin. Administrative division extracting algorithm for non-normalized Chinese addresses[J]. Journal of Computer Applications, 2017, 37(3): 876-882.

