计算机应用 (Journal of Computer Applications) ›› 2017, Vol. 37 ›› Issue (3): 876-882. DOI: 10.11772/j.issn.1001-9081.2017.03.876

• Data Science and Technology •

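Administrative division extraction algorithm for non-normalized Chinese addresses

LI Xiaolin (李晓林)1,2, HUANG Shuang (黄爽)1,2, LU Tao (卢涛)1,2, LI Lin (李霖)3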

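  1. School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205;
    2. Hubei Provincial Key Laboratory of Intelligent Robot (Wuhan Institute of Technology), Wuhan 430205;
    3. School of Resource and Environmental Sciences, Wuhan University, Wuhan 430079
  • Received: 2016-08-26 Revised: 2016-10-18 Online: 2017-03-10 Published: 2017-03-22
  • Corresponding author: HUANG Shuang
  • About the authors: LI Xiaolin (born 1962), male, from Xiaogan, Hubei, associate professor, M.S.; research interests: data mining, machine learning, artificial intelligence. HUANG Shuang (born 1992), female, from Wuhan, Hubei, master's student; research interests: data mining, machine learning, artificial intelligence. LU Tao (born 1980), male, from Wuhan, Hubei, associate professor, Ph.D.; research interests: image and visual processing, computer vision, artificial intelligence. LI Lin (born 1960), male, from Xiaogan, Hubei, professor, doctoral supervisor, Ph.D.; research interests: geographic semantics and ontology, 3D modeling and visualization.
  • Funding:
    Special Scientific Research Fund for Public Welfare in Surveying, Mapping and Geoinformation (201412014); National 863 Program project (2013AA12A202); Natural Science Foundation of Hubei Province (2013CFA125); 7th Graduate Innovation Fund of Wuhan Institute of Technology (CX2015053).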

Administrative division extracting algorithm for non-normalized Chinese addresses

LI Xiaolin1,2, HUANG Shuang1,2, LU Tao1,2, LI Lin3   

  1. School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan Hubei 430205, China;
    2. Hubei Provincial Key Laboratory of Intelligent Robot(Wuhan Institute of Technology), Wuhan Hubei 430205, China;
    3. School of Resource and Environmental Sciences, Wuhan University, Wuhan Hubei 430079, China
  • Received:2016-08-26 Revised:2016-10-18 Online:2017-03-10 Published:2017-03-22
  • Supported by:
    This work was partially supported by the Special Scientific Research Fund for Public Welfare in Surveying, Mapping and Geoinformation (201412014), the National High Technology Research and Development Program (863 Program) (2013AA12A202), the Natural Science Foundation of Hubei Province (2013CFA125), and the 7th Graduate Student Innovation Fund of Wuhan Institute of Technology (CX2015053).

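摘要 (Abstract): Because Chinese addresses on the Internet are expressed in non-normalized forms, the address information they carry is difficult to use directly in location-based services. To address this problem, an algorithm for extracting administrative divisions from non-normalized Chinese addresses is proposed. First, the raw data are preprocessed by grouping on the feature word "路" (road). Next, an administrative division dictionary and a moving-window maximum matching algorithm are used to extract the set of all possible administrative divisions from each address. Then, exploiting the hierarchical relationship among the administrative division elements of a Chinese address, conditional set operation rules are established and applied to the extracted sets. After that, a parsing rule for administrative division sets is built on the administrative division matching degree in order to compute the credibility of each candidate. Finally, the administrative division with the highest credibility and the most complete information is obtained for the address. Using about 250,000 Chinese addresses extracted from the Internet, the usability of the algorithm was verified with and without "路" feature-word grouping and with and without the credibility computation, and the algorithm was compared with current address matching techniques; its accuracy reached 93.51%.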

关键词 (Key words): set operation, administrative division, Chinese address, moving window, matching degree, parsing rule

Abstract: Chinese addresses on the Internet are usually non-normalized and therefore cannot be used directly in location-based services. To solve this problem, an algorithm for extracting administrative divisions from non-normalized Chinese addresses was proposed. Firstly, the original data were preprocessed by grouping on the "road" feature word. Secondly, all possible administrative division data sets were extracted from each Chinese address by using an administrative division dictionary and a moving-window maximum matching algorithm. Thirdly, based on the hierarchical relationship among the administrative division elements of a Chinese address, conditional set operation rules for administrative divisions were established and applied to the extracted data sets. Fourthly, an analytical rule for administrative division sets was established from the administrative division matching degree to calculate the credibility of each administrative division. Finally, the administrative division with the highest credibility and the most complete information was obtained. The availability of the algorithm was verified on about 250000 Chinese addresses extracted from the Internet, with and without "road" feature word grouping and with and without credibility calculation. Compared with current address matching technology, the algorithm reached an accuracy of 93.51%.

Key words: set operation, administrative division, Chinese address, moving window, matching degree, analytical rule
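
The abstract names two core steps: moving-window maximum matching against an administrative division dictionary, and a hierarchy-aware resolution of the matched candidates. The Python sketch below is only an illustration of that general idea, not the authors' implementation. The three-entry gazetteer `CHAINS`, the function names, and the scoring rule (count how many matched names each province/city/district chain contains) are all assumptions made for this example; the paper's actual set operation rules, matching-degree formula, and "road" feature-word grouping are not reproduced here.

```python
# Hypothetical toy gazetteer: each entry is a (province, city, district) chain.
CHAINS = [
    ("湖北省", "武汉市", "洪山区"),
    ("江苏省", "南京市", "鼓楼区"),
    ("福建省", "福州市", "鼓楼区"),
]

def build_index(chains):
    """Map every division name to the set of chains that contain it."""
    index = {}
    for chain in chains:
        for name in chain:
            index.setdefault(name, set()).add(chain)
    return index

INDEX = build_index(CHAINS)
MAX_LEN = max(len(name) for name in INDEX)  # widest window worth trying

def extract_candidates(address):
    """Moving-window maximum matching: at each position, try the longest
    window first and shrink it until a dictionary entry is found."""
    found = []
    i = 0
    while i < len(address):
        for size in range(min(MAX_LEN, len(address) - i), 0, -1):
            word = address[i:i + size]
            if word in INDEX:
                found.append(word)
                i += size  # jump past the matched name
                break
        else:
            i += 1  # no entry starts here; slide the window by one character
    return found

def resolve(address):
    """Return the chain supported by the most matched names, or None when
    nothing matches or several chains tie (an ambiguous address)."""
    names = set(extract_candidates(address))
    scores = {}
    for name in names:
        for chain in INDEX[name]:
            scores[chain] = scores.get(chain, 0) + 1
    if not scores:
        return None
    best = max(scores.values())
    winners = [chain for chain, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None

if __name__ == "__main__":
    print(resolve("南京市鼓楼区中山路1号"))
    print(resolve("洪山区珞喻路"))
    print(resolve("鼓楼区中山路"))
```

In this toy setup, an address that mentions only "鼓楼区" stays unresolved because two chains contain that district, while adding "南京市" breaks the tie. An address that mentions only "洪山区" resolves to the full three-level chain, which loosely mirrors the abstract's goal of returning the most complete administrative division.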

CLC Number: