PixelWorld: Towards Perceiving Everything as Pixels

Abstract

Recent agentic language models increasingly accept raw camera pixels rather than tokenized text, underscoring the need for a unified perception paradigm. We explore this idea through Perceive Everything as Pixels (PEAP) and release PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a single pixel space. Experiments show that PEAP attains competitive accuracy on semantic-understanding tasks, indicating that a vision transformer can capture global textual semantics without explicit tokens. In contrast, reasoning-intensive benchmarks (math and code) exhibit sharp performance drops; however, Chain-of-Thought prompting partially mitigates this gap, hinting that explicit reasoning traces compensate for the missing token structure. We also observe that scenarios with tightly intertwined visual–text cues benefit from the unified pixel view, reducing preprocessing overhead and ambiguity relative to split-modality baselines. PixelWorld therefore provides a compact yet challenging yardstick and encourages wider adoption of PEAP for holistic evaluation of next-generation vision–language agents.
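
To make the PEAP setup concrete, below is a minimal sketch of what "rendering inputs into a single pixel space" could look like: rasterizing an arbitrary text-based input (prose, a table, math, or code) into one image that a vision-language model consumes instead of a token sequence. The function name `render_as_pixels` and all rendering parameters are illustrative assumptions (using Pillow ≥ 9.2), not the benchmark's actual pipeline.

```python
# Illustrative sketch of the PEAP idea: rasterize any text-based input into
# an image so a vision-language model can consume it without a tokenizer.
# Names and parameters are assumptions, not PixelWorld's actual code.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_as_pixels(text: str, width: int = 1024, margin: int = 16) -> Image.Image:
    """Render a text string onto a white canvas, wrapping lines to fit."""
    font = ImageFont.load_default()            # fixed-size bitmap font
    char_w = font.getbbox("M")[2]              # approximate glyph width (px)
    line_h = font.getbbox("Mg")[3] + 4         # glyph height plus leading
    chars_per_line = max(1, (width - 2 * margin) // char_w)

    lines = []
    for paragraph in text.splitlines() or [""]:
        lines.extend(textwrap.wrap(paragraph, chars_per_line) or [""])

    height = 2 * margin + line_h * len(lines)
    canvas = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(canvas)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_h), line, fill="black", font=font)
    return canvas

# The same call handles a question, a Markdown table, or a code snippet,
# giving every modality an identical pixel-space representation.
img = render_as_pixels("Q: What is 12 * 7?\nA: Think step by step.")
img.save("peap_input.png")                     # hand this image to a VLM
```

Under this view, the split-modality preprocessing the abstract mentions (separate tokenizers, table parsers, and equation renderers) collapses into one rasterization step, which is where the reduction in overhead and ambiguity comes from.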

Cite

Text

Lyu et al. "PixelWorld: Towards Perceiving Everything as Pixels." Transactions on Machine Learning Research, 2025.

Markdown

[Lyu et al. "PixelWorld: Towards Perceiving Everything as Pixels." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/lyu2025tmlr-pixelworld/)

BibTeX

@article{lyu2025tmlr-pixelworld,
  title     = {{PixelWorld: Towards Perceiving Everything as Pixels}},
  author    = {Lyu, Zhiheng and Ma, Xueguang and Chen, Wenhu},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/lyu2025tmlr-pixelworld/}
}