Data crawler for Sina Weibo based on Python

Journal of Computer Applications, 2014, Vol. 34, Issue 11: 3131-3134. DOI: 10.11772/j.issn.1001-9081.2014.11.3131


ZHOU Zhonghua¹, ZHANG Huiran¹, XIE Jiang¹,²

1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
2. High Performance Computing Center, Shanghai University, Shanghai 200444, China

Received: 2014-07-28; Revised: 2014-08-04; Online: 2014-11-01; Published: 2014-12-01
Contact: ZHOU Zhonghua

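About the authors: ZHOU Zhonghua (born 1989), male, from Changzhou, Jiangsu, master's student, CCF member; research interests: bioinformatics and high-performance computing. ZHANG Huiran (born 1981), male, from Xinxiang, Henan, lecturer, Ph.D., CCF member; research interests: bioinformatics and high-performance computing. XIE Jiang (born 1971), female, from Enshi, Hubei, associate professor, Ph.D., CCF member; research interests: bioinformatics and high-performance computing.

Supported by: National Natural Science Foundation of China; Specialized Research Fund for the Doctoral Program of Higher Education; Key Project of the Science and Technology Commission of Shanghai Municipality.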

Abstract:

Most current social network research uses data from foreign social network platforms. However, Sina Weibo, the largest social network platform in China, provides no convenient data interface for researchers. This paper presents a Sina Weibo data crawler that supports parallel execution. The crawler fetches the follower information and post content of specified Weibo users in real time, and it supports keyword matching and parallel crawling. The serial crawler was compared with its parallel version, and Weibo data collected with the tool were used in an experiment on influenza. The results indicate that the parallel crawler achieves linear speedup and that the fetched data are timely and accurate.
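The speedup comparison between the serial and parallel crawlers can be made concrete with the usual definitions: speedup S = T_serial / T_parallel, and efficiency E = S / p for p workers, where E close to 1 corresponds to linear speedup. The following is a minimal sketch of that arithmetic; the function names and the timings are hypothetical and are not taken from the paper.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S = T_serial / T_parallel."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, workers: int) -> float:
    """Parallel efficiency E = S / p; E close to 1.0 indicates linear speedup."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(t_serial, t_parallel) / workers


# Hypothetical timings in seconds, for illustration only (not from the paper).
print(speedup(120.0, 30.0))        # 4.0
print(efficiency(120.0, 30.0, 4))  # 1.0
```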

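Abstract (translated from the Chinese):

Much current social network research relies on data from foreign platforms, while Sina Weibo in China lacks a good interface for researchers to collect data for analysis. To obtain Weibo data quickly, a Weibo data crawling tool that supports parallel execution was developed. The tool fetches, in real time, the follower information and post text of specified Weibo users; it uses keyword matching to find posts that meet specified conditions and fetches the related content; and it supports parallel crawling, so the information of multiple users can be fetched at the same time. Finally, the serial crawler was compared with its parallel version, and the tool was used to analyze influenza-related content in a portion of Weibo data. The experimental results show that the parallel crawler achieves a good speedup and obtains data quickly, and that the data are timely and accurate.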
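The three capabilities described in the abstract (per-user fetching, keyword matching, and crawling several users at once) might be combined roughly as follows. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not say which parallelization mechanism was used, so a thread pool is assumed here, and `FAKE_TIMELINES`, `fetch_posts`, `matches`, `crawl_user`, and `crawl_parallel` are hypothetical names. An in-memory dictionary stands in for the remote site so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the remote site, so the sketch runs offline.
FAKE_TIMELINES = {
    "user_a": ["Caught the flu this week", "Lovely weather today"],
    "user_b": ["Flu shots are available at the clinic"],
    "user_c": ["Reading a good book"],
}


def fetch_posts(user_id):
    """Return the posts of one user (a real crawler would issue HTTP requests here)."""
    return FAKE_TIMELINES.get(user_id, [])


def matches(post, keywords):
    """Case-insensitive keyword match: True if any keyword occurs in the post."""
    text = post.lower()
    return any(keyword.lower() in text for keyword in keywords)


def crawl_user(user_id, keywords):
    """Fetch one user's posts and keep only those that match a keyword."""
    return user_id, [p for p in fetch_posts(user_id) if matches(p, keywords)]


def crawl_parallel(user_ids, keywords, workers=4):
    """Crawl several users concurrently; the result is keyed by user id."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda uid: crawl_user(uid, keywords), user_ids)
        return dict(results)


if __name__ == "__main__":
    hits = crawl_parallel(["user_a", "user_b", "user_c"], ["flu"])
    for user_id, posts in hits.items():
        print(user_id, posts)
```

Threads are chosen in this sketch because fetching pages is I/O-bound; a process pool or a distributed setup would follow the same map-over-users pattern.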

CLC Number: TP391; TP311

ZHOU Zhonghua, ZHANG Huiran, XIE Jiang. Data crawler for Sina Weibo based on Python[J]. Journal of Computer Applications, 2014, 34(11): 3131-3134. (周中华, 张惠然, 谢江. 基于Python的新浪微博数据爬虫[J]. 计算机应用, 2014, 34(11): 3131-3134.)

