Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (5): 1488-1495. DOI: 10.11772/j.issn.1001-9081.2024050627

• Artificial Intelligence •

Visually guided word segmentation and part-of-speech tagging

Haiyan TIAN, Saihao HUANG, Dong ZHANG, Shoushan LI

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2024-05-17 Revised: 2024-10-14 Accepted: 2024-10-24 Online: 2024-11-01 Published: 2025-05-10
  • Contact: Dong ZHANG
  • About author: TIAN Haiyan, born in 2000 in Huai'an, Jiangsu, M.S. candidate. Her research interests include multimodal analysis.
    HUANG Saihao, born in 1999 in Nantong, Jiangsu, M.S. His research interests include text-to-SQL and multimodal analysis.
    ZHANG Dong, born in 1991 in Yangzhou, Jiangsu, Ph.D., associate professor. His research interests include sentiment analysis and multimodal analysis.
    LI Shoushan, born in 1980 in Yangzhou, Jiangsu, Ph.D., professor. His research interests include sentiment analysis and multimodal analysis.
  • Supported by:
    National Natural Science Foundation of China (62206193)

Abstract:

Chinese Word Segmentation (WS) and Part-Of-Speech (POS) tagging can effectively assist downstream tasks such as knowledge graph construction and sentiment analysis. However, existing work typically uses only textual information for WS and POS tagging, ignoring the many related images and videos on the Web. Therefore, efforts were made to mine relevant clues from this visual information to aid Chinese WS and POS tagging. Firstly, a set of detailed annotation guidelines was established, and a multimodal dataset, VG-Weibo, was annotated with WS and POS tags based on the text and image content of Weibo posts. Then, two multimodal information fusion methods with different decoding mechanisms, VGTD (Visually Guided Two-stage Decoding model) and VGCD (Visually Guided Collapsed Decoding model), were proposed to accomplish the joint task of WS and POS tagging. In the VGTD method, a cross-attention mechanism was adopted to fuse textual and visual information, and a two-stage decoding strategy was employed to first predict possible word spans and then predict the corresponding tags; in the VGCD method, a cross-attention mechanism was also utilized to fuse textual and visual information, together with a more appropriate Chinese representation and a collapsed decoding strategy. Experimental results on the VG-Weibo test set demonstrate that, on the WS and POS tagging tasks, the F1 scores of the VGTD method are 0.18 and 0.22 percentage points higher, respectively, than those of the traditional text-only Two-stage Decoding model (TD), and the F1 scores of the VGCD method are 0.25 and 0.55 percentage points higher, respectively, than those of the traditional text-only Collapsed Decoding model (CD). These results show that both VGTD and VGCD can exploit visual information effectively to improve the performance of WS and POS tagging.
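
To make the fusion and decoding described above concrete, below is a minimal PyTorch sketch of the collapsed-decoding idea: per-character text features attend over image-region features via cross-attention, and a classifier then assigns each character one collapsed WS+POS label (e.g., B-NN, I-NN). This is an illustrative sketch under assumed shapes and an assumed label inventory (4 segmentation positions × 30 POS tags), not the authors' released implementation; all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class VisuallyGuidedCollapsedTagger(nn.Module):
    """Hypothetical sketch: fuse character features with image-region
    features via cross-attention, then emit one collapsed WS+POS label
    (e.g., 'B-NN', 'I-NN') per character."""

    def __init__(self, hidden: int = 768, num_labels: int = 4 * 30):
        super().__init__()
        # Queries come from text characters; keys/values from visual regions.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8,
                                                batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, char_feats: torch.Tensor, img_feats: torch.Tensor):
        # char_feats: (batch, seq_len, hidden), e.g. from a Chinese encoder
        # img_feats:  (batch, regions, hidden), e.g. from a vision encoder
        fused, _ = self.cross_attn(char_feats, img_feats, img_feats)
        fused = self.norm(char_feats + fused)  # residual + layer norm
        return self.classifier(fused)          # per-character label logits

# Toy usage with random tensors standing in for encoder outputs.
model = VisuallyGuidedCollapsedTagger()
logits = model(torch.randn(2, 16, 768), torch.randn(2, 49, 768))
print(logits.shape)  # torch.Size([2, 16, 120])
```

A two-stage decoder in the spirit of VGTD would instead use the same fused features to first score candidate word spans and then classify each selected span with a POS tag, trading one large collapsed label space for two smaller prediction problems.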

Key words: Word Segmentation (WS), Part-Of-Speech (POS) tagging, multimodal data, visual information, social media

CLC number: