Assorted Attention Network for Cross-Lingual Language-to-Vision Retrieval

Tan Yu, Yi Yang, Hongliang Fei, Yi Li, Xiaodong Chen, Ping Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we tackle the cross-lingual language-to-vision (CLLV) retrieval task. In CLLV retrieval, given a text query in one language, the goal is to retrieve relevant images/videos from a database based on their visual content and their captions in another language. Because CLLV retrieval bridges both the modality gap and the language gap, it enables many international cross-modal applications. To tackle CLLV retrieval, we propose an assorted attention network (A2N) that simultaneously overcomes the language gap, bridges the modality gap, and fuses features from the two modalities in an effective manner. A2N represents each text query as a set of word features, and each image/video as a set of local visual features together with a set of word features from its caption in the other language. The relevance between the text query and an image/video is then obtained by matching the query's word features against these two sets of image/video features. To make this matching more effective, A2N merges the query's word features with the image/video's visual and caption word features into an assorted set and applies self-attention over the items of this set. On the one hand, attention between the query's word features and the image/video's visual features emphasizes the important word or visual features. On the other hand, attention between the image/video's visual features and its caption word features fuses the visual content and the textual information more effectively. Systematic experiments conducted on four datasets demonstrate the effectiveness of the proposed A2N in the CLLV retrieval task.
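The assorted-set self-attention described above can be sketched in a few lines of PyTorch. The snippet below is only an illustration of the idea, not the authors' implementation: the class name, the modality type embeddings, the mean pooling, and the cosine-similarity matching are all assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AssortedAttentionSketch(nn.Module):
    """Illustrative sketch: merge query words, visual regions, and caption
    words into one assorted set and run self-attention over all items.
    Names and design details here are assumptions, not the paper's code."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Learnable type embeddings marking each item's source (an assumption):
        # 0 = query word, 1 = visual region, 2 = caption word.
        self.type_emb = nn.Embedding(3, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_words, visual_feats, caption_words):
        # query_words:   (B, Nq, dim) word features of the query in language A
        # visual_feats:  (B, Nv, dim) local visual features of the image/video
        # caption_words: (B, Nc, dim) word features of the caption in language B
        B, Nq, _ = query_words.shape

        # Tag each item with its source, then merge into the assorted set.
        items = torch.cat([
            query_words + self.type_emb.weight[0],
            visual_feats + self.type_emb.weight[1],
            caption_words + self.type_emb.weight[2],
        ], dim=1)                                   # (B, Nq+Nv+Nc, dim)

        # Self-attention over the assorted set lets query words attend to
        # visual regions and caption words, and vice versa.
        attended, _ = self.attn(items, items, items)

        # Split back and pool into query-side and image/video-side embeddings.
        q = attended[:, :Nq].mean(dim=1)            # pooled query representation
        v = attended[:, Nq:].mean(dim=1)            # pooled image/video representation

        # Relevance score via cosine similarity (one simple matching choice).
        return F.cosine_similarity(q, v, dim=-1)    # (B,)

if __name__ == "__main__":
    model = AssortedAttentionSketch()
    score = model(torch.randn(2, 6, 512), torch.randn(2, 36, 512), torch.randn(2, 8, 512))
    print(score.shape)  # torch.Size([2])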

Original language: American English
Title of host publication: CIKM 2021 - Proceedings of the 30th ACM International Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery
Pages: 2444-2454
Number of pages: 11
ISBN (Electronic): 9781450384469
DOIs
State: Published - Oct 26 2021
Event: 30th ACM International Conference on Information and Knowledge Management, CIKM 2021 - Virtual, Online, Australia
Duration: Nov 1 2021 - Nov 5 2021

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Conference

Conference: 30th ACM International Conference on Information and Knowledge Management, CIKM 2021
Country/Territory: Australia
City: Virtual, Online
Period: 11/1/21 - 11/5/21

ASJC Scopus subject areas

  • General Business, Management and Accounting
  • General Decision Sciences

Keywords

  • computer vision
  • cross-lingual
  • cross-modal
  • deep learning
  • natural language understanding
  • retrieval
  • search
