Chinese Word Segmentation (WS) and Part-Of-Speech (POS) tagging can effectively assist downstream tasks such as knowledge graph construction and sentiment analysis. Existing work typically uses only textual information for WS and POS tagging, yet the Web also contains a large amount of associated image and video content. Therefore, this work mines clues from such visual information to aid Chinese WS and POS tagging. First, a set of detailed annotation standards was established, and a multimodal dataset, VG-Weibo, was annotated with WS and POS tags using the text and image content of Weibo posts. Then, two multimodal information fusion methods with different decoding mechanisms, VGTD (Visually Guided Two-stage Decoding model) and VGCD (Visually Guided Collapsed Decoding model), were proposed to accomplish the joint WS and POS tagging task. Specifically, the VGTD method adopts a cross-attention mechanism to fuse textual and visual information and a two-stage decoding strategy that first predicts candidate word spans and then predicts their corresponding tags; the VGCD method also uses cross-attention to fuse textual and visual information, together with a more appropriate Chinese representation and a collapsed decoding strategy. Experimental results on the VG-Weibo test set show that, on the WS and POS tagging tasks, the F1 scores of VGTD are 0.18 and 0.22 percentage points higher, respectively, than those of the traditional text-only Two-stage Decoding (TD) model, and the F1 scores of VGCD are 0.25 and 0.55 percentage points higher, respectively, than those of the traditional text-only Collapsed Decoding (CD) model. These results indicate that both VGTD and VGCD can effectively exploit visual information to improve WS and POS tagging performance.
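To make the cross-attention fusion described above concrete, the following is a minimal sketch of how text token features might attend to visual region features before per-token tagging. It is not the paper's implementation; the module names, dimensions, and the collapsed tag head are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-attention fusion of text and image features (assumed design)."""
    def __init__(self, d_model=768, n_heads=8, n_tags=60):
        super().__init__()
        # Text tokens act as queries; image region features act as keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # A single linear head over collapsed WS+POS labels (hypothetical label set size).
        self.tag_head = nn.Linear(d_model, n_tags)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, seq_len, d_model)   from some text encoder
        # image_feats: (batch, n_regions, d_model) from some visual encoder
        attended, _ = self.cross_attn(query=text_feats,
                                      key=image_feats,
                                      value=image_feats)
        fused = self.norm(text_feats + attended)  # residual fusion of the two modalities
        return self.tag_head(fused)               # per-token tag logits

# Usage with random tensors standing in for encoder outputs
model = CrossModalFusion()
text = torch.randn(2, 32, 768)
image = torch.randn(2, 49, 768)
logits = model(text, image)  # shape: (2, 32, 60)
```

A two-stage variant, as in VGTD, would replace the single tag head with separate span-prediction and tag-classification steps over the same fused representation.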