Questions are conditioned on the title, abstract, figure, caption, and surrounding paper passages.
Overview
Existing figure QA often asks what is directly visible. Multimodal QUD asks what scientific readers wonder next.
Existing scientific figure QA benchmarks often focus on extracting visible information: labels, values, captions, or direct visual comparisons. Multimodal QUD instead targets questions raised by figure-text interaction, including mechanisms, evidence, implications, and the role a figure plays in the paper's argument.
The task captures open-ended questions that emerge when visual patterns interact with a paper's claims.
Each example pairs the visible figure with answer evidence from the surrounding paper context.
Representative questions
From extracting answers to asking why
Representative questions show how MQUD moves beyond reading off values or best-performing methods toward explaining why a figure pattern matters.
-
ChartQA
value extraction
What is the difference in value between Lamb and Corn?
-
SPIQA
figure lookup
Which pooling method is the most accurate for the AFEW database?
-
Multimodal QUD
figure-text inquiry
Why does the attention shift from text to image after the intervention?
arXiv preprint BibTeX
@misc{wu2026multimodalqud,
title={Multimodal QUD: Inquisitive Questions from Scientific Figures},
author={Wu, Yating and Rudman, William and Govindarajan, Venkata S. and Dimakis, Alexandros G. and Li, Junyi Jessy},
year={2026},
eprint={2604.23733},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.23733}
}