Truth Behind the Scene: Designing Evaluations Benchmarks to Assess LLMs' Task-Specific Understanding over Test-Taking Strategies

Abstract

Many existing benchmarks, such as MMLU, are limited in their ability to measure large language models' (LLMs') true task understanding, because models can exploit statistical patterns in their training data rather than genuine reasoning. We suggest new approaches to improve how benchmarks capture task-specific understanding in LLMs, revealing insights into their reasoning ability.

Cite

Text

Pham. "Truth Behind the Scene: Designing Evaluations Benchmarks to Assess LLMs' Task-Specific Understanding over Test-Taking Strategies." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I28.35337

Markdown

[Pham. "Truth Behind the Scene: Designing Evaluations Benchmarks to Assess LLMs' Task-Specific Understanding over Test-Taking Strategies." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/pham2025aaai-truth/) doi:10.1609/AAAI.V39I28.35337

BibTeX

@inproceedings{pham2025aaai-truth,
  title     = {{Truth Behind the Scene: Designing Evaluations Benchmarks to Assess LLMs' Task-Specific Understanding over Test-Taking Strategies}},
  author    = {Pham, Thao},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {29596--29598},
  doi       = {10.1609/AAAI.V39I28.35337},
  url       = {https://mlanthology.org/aaai/2025/pham2025aaai-truth/}
}