Journal of Computer Applications


Visually guided word segmentation and part-of-speech tagging

  

  • Received: 2024-05-17  Revised: 2024-10-14  Accepted: 2024-10-24  Online: 2024-11-01  Published: 2024-11-01

TIAN Haiyan, HUANG Saihao, ZHANG Dong, LI Shoushan

  1. Soochow University
  • Corresponding author: ZHANG Dong
  • Supported by:
    Research on Dialogue-Oriented Multimodal Emotion Information Extraction

Abstract: Word segmentation (WS) and part-of-speech (POS) tagging are two valuable foundational tasks that can effectively assist downstream tasks such as knowledge graph construction and sentiment analysis. Existing work typically uses only plain-text information for WS and POS tagging. However, web text is often accompanied by related images and videos. Therefore, clues were mined from this visual information to assist Chinese WS and POS tagging. First, a set of detailed annotation guidelines was established, and a multimodal dataset, VG-Weibo, was annotated with WS and POS labels on the basis of the text and image content of Weibo posts. Then, two multimodal fusion methods with different decoding mechanisms, VGTD and VGCD, were proposed to accomplish the joint task of WS and POS tagging. Specifically, VGTD adopted a cross-attention mechanism to fuse textual and visual information and employed a two-stage decoding strategy that first predicts candidate word spans and then predicts the corresponding POS labels; its F1 scores improved by 0.18 and 0.22 percentage points over the traditional text-only method TD on the WS and POS tagging tasks, respectively. VGCD also utilized a cross-attention mechanism to fuse textual and visual information but adopted a more suitable Chinese representation and a collapsed decoding strategy; its F1 scores improved by 0.25 and 0.55 percentage points over the traditional text-only method CD on the WS and POS tagging tasks, respectively. Experimental results on the VG-Weibo dataset demonstrate that both VGTD and VGCD effectively utilize visual information to improve the performance of WS and POS tagging.
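As a rough illustration of the architecture described above, the following minimal PyTorch sketch fuses text token features with image patch features via cross-attention and then decodes in two stages (word-span prediction, then POS classification). All module names, feature dimensions, and the boundary-based span representation are assumptions made for illustration, not the authors' released implementation.

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        def __init__(self, d_model=768, n_heads=8):
            super().__init__()
            # Text tokens (queries) attend to image patches (keys/values).
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                    batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, text_feats, image_feats):
            # text_feats:  (batch, seq_len,   d_model), e.g. BERT character features
            # image_feats: (batch, n_patches, d_model), e.g. projected ViT patches
            fused, _ = self.cross_attn(text_feats, image_feats, image_feats)
            return self.norm(text_feats + fused)   # residual + layer norm

    class TwoStageDecoder(nn.Module):
        # Stage 1: decide which spans are words; stage 2: assign POS tags to spans.
        def __init__(self, d_model=768, n_pos_tags=30, max_span_len=6):
            super().__init__()
            self.max_span_len = max_span_len
            self.span_scorer = nn.Linear(2 * d_model, 2)         # word / not a word
            self.pos_classifier = nn.Linear(2 * d_model, n_pos_tags)

        def forward(self, fused):
            batch, seq_len, _ = fused.shape
            spans, reprs = [], []
            for i in range(seq_len):
                for j in range(i, min(i + self.max_span_len, seq_len)):
                    # Represent span (i, j) by its boundary characters.
                    reprs.append(torch.cat([fused[:, i], fused[:, j]], dim=-1))
                    spans.append((i, j))
            reprs = torch.stack(reprs, dim=1)                    # (batch, n_spans, 2*d)
            return spans, self.span_scorer(reprs), self.pos_classifier(reprs)

    # Toy usage with random features standing in for real text/image encoders.
    text = torch.randn(1, 10, 768)      # 10 characters
    image = torch.randn(1, 49, 768)     # 7x7 patch grid
    fused = CrossModalFusion()(text, image)
    spans, word_logits, pos_logits = TwoStageDecoder()(fused)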

Key words: word segmentation, part-of-speech tagging, multimodal data, vision, social media
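For VGCD's collapsed decoding, one common realization is to assign each character a joint {B, M, E, S} × POS label, so that a single sequence labeler performs WS and POS tagging simultaneously. The sketch below illustrates that label scheme; the toy POS inventory and the example sentence are hypothetical, not the VG-Weibo tag set.

    from itertools import product

    POS_TAGS = ["NN", "VV", "AD"]           # toy POS inventory (assumption)
    BOUNDARIES = ["B", "M", "E", "S"]       # word-boundary scheme
    LABELS = [f"{b}-{p}" for b, p in product(BOUNDARIES, POS_TAGS)]

    def collapse(words_with_pos):
        # Turn [(word, pos), ...] into one joint label per character.
        labels = []
        for word, pos in words_with_pos:
            if len(word) == 1:
                labels.append(f"S-{pos}")
            else:
                labels.append(f"B-{pos}")
                labels.extend([f"M-{pos}"] * (len(word) - 2))
                labels.append(f"E-{pos}")
        return labels

    # "苏州大学 / 很 / 美" -> Soochow University / very / beautiful
    print(collapse([("苏州大学", "NN"), ("很", "AD"), ("美", "VV")]))
    # ['B-NN', 'M-NN', 'M-NN', 'E-NN', 'S-AD', 'S-VV']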

