Multimodal QUD: Inquisitive Questions from Scientific Figures

Overview

Existing figure QA often asks what is directly visible. Multimodal QUD asks what scientific readers wonder next.

Existing scientific figure QA benchmarks often focus on extracting visible information: labels, values, captions, or direct visual comparisons. Multimodal QUD instead targets questions raised by figure-text interaction, including mechanisms, evidence, implications, and the role a figure plays in the paper's argument.

Input: Figure plus paper context

Questions are conditioned on the title, abstract, figure, caption, and surrounding paper passages.

Target: Researcher-like curiosity

The task captures open-ended questions that emerge when visual patterns interact with a paper's claims.

Grounding: Answer traces

Each example pairs the visible figure with answer evidence from the surrounding paper context.
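
As a rough illustration of the three points above, one example could be represented as a record like the following. This is a minimal sketch, not the released data format: the class name `MQUDExample`, every field name, and the `build_prompt` helper are hypothetical and chosen only to mirror the inputs listed here (title, abstract, figure, caption, surrounding passages) and the answer evidence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MQUDExample:
    """Hypothetical record layout; field names are assumptions, not the released schema."""
    title: str
    abstract: str
    figure_path: str               # location of the figure image (assumed to be stored as a file)
    caption: str
    context_passages: List[str]    # surrounding paper passages
    question: str                  # the open-ended question raised by figure and text together
    answer_evidence: List[str] = field(default_factory=list)  # passages that ground an answer


def build_prompt(ex: MQUDExample) -> str:
    """Assemble the textual conditioning context; the image would be passed separately."""
    passages = "\n".join(f"- {p}" for p in ex.context_passages)
    return (
        f"Title: {ex.title}\n"
        f"Abstract: {ex.abstract}\n"
        f"Caption: {ex.caption}\n"
        f"Context:\n{passages}"
    )


if __name__ == "__main__":
    # Placeholder values only; the question is the representative one quoted on this page.
    demo = MQUDExample(
        title="<paper title>",
        abstract="<paper abstract>",
        figure_path="<path/to/figure.png>",
        caption="<figure caption>",
        context_passages=["<passage before the figure>", "<passage after the figure>"],
        question="Why does the attention shift from text to image after the intervention?",
    )
    print(build_prompt(demo))
```

Keeping the answer evidence as a separate list of passages, rather than a single answer string, reflects the idea that each question is traced back to the surrounding paper text.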

Representative questions

From extracting answers to asking why

The examples below show how Multimodal QUD (MQUD) moves beyond reading off values or identifying the best-performing method, toward explaining why a figure pattern matters.

ChartQA (value extraction): What is the difference in value between Lamb and Corn?
SPIQA (figure lookup): Which pooling method is the most accurate for the AFEW database?
Multimodal QUD (figure-text inquiry): Why does the attention shift from text to image after the intervention?

Project links

Paper | Data | Code (coming soon) | Model (coming soon)

BibTeX (arXiv preprint)

@misc{wu2026multimodalqud,
  title={Multimodal QUD: Inquisitive Questions from Scientific Figures},
  author={Wu, Yating and Rudman, William and Govindarajan, Venkata S. and Dimakis, Alexandros G. and Li, Junyi Jessy},
  year={2026},
  eprint={2604.23733},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.23733}
}
