Abstract:
Existing visual scene understanding methods mainly focus on identifying coarse-grained concepts about visual objects and their relationships, largely neglecting fine-grained scene understanding. In fact, many
data-driven applications on the Web (e.g., news reading and e-shopping) require accurate recognition of
much finer-grained concepts, i.e., entities, and proper linking of these entities to a knowledge graph (KG), which can
substantially improve their performance.
In light of this, in this paper we identify a new research task: visual entity
linking for fine-grained scene understanding. To accomplish the task, we first extract features of candidate
entities from different modalities, i.e., visual features, textual features, and KG features. Then, we design a
deep modal-attention neural-network-based learning-to-rank method that aggregates all features and maps
visual objects to entities in the KG. Extensive experimental results on a newly constructed dataset show
that our proposed method is effective, significantly improving accuracy from 66.46% to
83.16% compared with the baselines.