Presentation
Beyond End-to-End: Understanding the Limits of LLMs in Scientific Problem Solving
Description
Multimodal large language models (MLLMs) are now widely used across many applications, including scientific question answering that requires combining visual and textual inputs. However, existing benchmarks in this area are mostly end-to-end, making it difficult to pinpoint where models fail. To address this gap, we design an evaluation framework that decomposes scientific question answering into subtasks for fine-grained assessment. We evaluate two MLLMs, Gemini 2.5 Pro and Qwen2.5-VL-32B-Instruct, on questions involving high-resolution visual data. Results show that accurate answers are unattainable without scripting or tool use. Although both models can solve individual subtasks, such as mapping cities to coordinates or computing pixel positions, they often fail to integrate these abilities in end-to-end reasoning, producing large deviations. Our findings highlight the importance of benchmarks that expose reasoning bottlenecks and suggest that agent-based or multi-model approaches may be required to achieve reliable performance on complex scientific tasks.
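As an illustration of the kind of subtask mentioned above (computing a pixel position from geographic coordinates), here is a minimal Python sketch. It is not taken from the presentation: the function name `latlon_to_pixel`, the equirectangular grid layout, and the image dimensions are all assumptions chosen for the example, and the actual data and projection used by the authors may differ.

```python
def latlon_to_pixel(lat, lon, width, height):
    """Map latitude/longitude in degrees to a (row, col) pixel index.

    Assumption (hypothetical layout, not from the source): the image is a
    global equirectangular grid, with longitude running from -180 at the
    left edge to 180 at the right edge, and latitude running from 90 at
    the top edge to -90 at the bottom edge.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError("latitude must be within [-90, 90]")

    # Wrap longitude into [-180, 180) so that 180 and -180 coincide.
    lon = (lon + 180.0) % 360.0 - 180.0

    # Scale each coordinate linearly onto the pixel grid.
    col = int((lon + 180.0) / 360.0 * width)
    row = int((90.0 - lat) / 180.0 * height)

    # Clamp so the south pole (lat = -90) falls on the last row.
    return min(row, height - 1), min(col, width - 1)


if __name__ == "__main__":
    # A 0.1-degree global grid: 3600 columns by 1800 rows.
    print(latlon_to_pixel(0.0, 0.0, 3600, 1800))
```

A script like this makes the arithmetic explicit and repeatable, which is in line with the abstract's observation that accurate answers required scripting or tool use.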
Event Type
Workshop
Time
Sunday, 16 November 2025
2:40pm - 3:00pm CST
Location
241