What One View Reveals, Another Conceals: 3D-Consistent Visual Reasoning with LLMs
Abstract
Maintaining semantic label consistency across multiple views is a persistent challenge in 3D semantic object detection. Existing zero-shot approaches that combine 2D detections with vision-language features are often biased toward non-descriptive viewpoints and require a fixed label list. We propose a truly open-vocabulary algorithm that uses large language model (LLM) reasoning to relabel multi-view detections, mitigating errors caused by poor or ambiguous viewpoints and by occlusions. Our method actively samples informative views based on feature diversity and uncertainty, generates new label hypotheses via LLM reasoning, and recomputes confidences to build a spatial-semantic representation of objects. Experiments on controlled single-object and multi-object scenes show double-digit improvements in accuracy and sampling rate over widely used fusion methods based on YOLO and CLIP, as well as other LLM-based baselines. We demonstrate in multiple settings that LLM-guided Active Detection and Reasoning (LADR) balances detail preservation with reduced ambiguity and a low sampling rate. We also provide a theoretical convergence analysis showing exponential convergence to a stable and correct semantic label.
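The abstract describes view sampling driven by feature diversity and uncertainty without giving the scoring rule. The following is only a minimal sketch of one way such a selection step could look, not the authors' implementation: the function names, the cosine-distance diversity term, the entropy uncertainty term, and the `weight` parameter are all assumptions introduced for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def select_next_view(features, label_probs, selected, weight=1.0):
    """Pick the unselected view with the highest (hypothetical) score.

    score = distance to the nearest already-selected view (diversity)
            + weight * entropy of that view's label distribution (uncertainty)
    Returns None when every view has already been selected.
    """
    best_idx, best_score = None, -math.inf
    for i, feat in enumerate(features):
        if i in selected:
            continue
        # Diversity: how far this view's features are from what was already seen.
        diversity = min(
            (cosine_distance(feat, features[j]) for j in selected),
            default=1.0,  # no views selected yet: treat all as equally novel
        )
        score = diversity + weight * entropy(label_probs[i])
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Toy example: view 1 nearly duplicates view 0; view 2 is novel and ambiguous.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
label_probs = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
next_view = select_next_view(features, label_probs, selected={0})
```

In this toy setup the third view is chosen, since it is both far from the already selected view in feature space and has the most uncertain label distribution. In the pipeline the abstract outlines, the selected view would then be passed to the LLM relabeling and confidence update steps, which are not sketched here.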
Cite
Text
Kushnir and Freund. "What One View Reveals, Another Conceals: 3D-Consistent Visual Reasoning with LLMs." Transactions on Machine Learning Research, 2026.
Markdown
[Kushnir and Freund. "What One View Reveals, Another Conceals: 3D-Consistent Visual Reasoning with LLMs." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/kushnir2026tmlr-one/)
BibTeX
@article{kushnir2026tmlr-one,
title = {{What One View Reveals, Another Conceals: 3D-Consistent Visual Reasoning with LLMs}},
author = {Kushnir, Dan and Freund, László},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/kushnir2026tmlr-one/}
}