Abstract:
Existing visual scene understanding methods mainly focus on identifying coarse-grained concepts about visual objects and their relationships, largely neglecting fine-grained scene understanding. In fact, many
data-driven applications on the Web (e.g., news reading and e-shopping) require accurate recognition of
much finer-grained concepts, i.e., entities, and proper linking of these entities to a knowledge graph (KG), which can
substantially improve their performance.
In light of this, in this paper we identify a new research task: visual entity
linking for fine-grained scene understanding. To accomplish the task, we first extract features of candidate
entities from different modalities, i.e., visual features, textual features, and KG features. Then, we design a
deep modal-attention neural-network-based learning-to-rank method that aggregates all features and maps
visual objects to entities in the KG. Extensive experimental results on a newly constructed dataset show
that our proposed method is effective, significantly improving accuracy from 66.46% to
83.16% compared with the baselines.