Are Machines Better at Slow Thinking? Unveiling Human-Machine Inference Gaps in Entailment Verification
Abstract
Humans make numerous inferences during text comprehension to understand its meaning. This paper aims to understand the similarities and differences between humans and state-of-the-art Large Language Models (LLMs) in their ability to judge valid inferences. To this end, we leverage a comprehensively curated entailment verification benchmark that includes datasets from three NLP domains (NLI, contextual QA, and rationales) containing multi-sentence premises and requiring different types of knowledge. Our findings reveal that LLMs are superior at multi-hop reasoning over extended contexts that requires slow thinking, while humans excel at simple deductive reasoning tasks. Using these insights, we introduce a fine-tuned Flan-T5 model that outperforms GPT-3.5 and rivals GPT-4, offering a superior open-source LLM for entailment verification. As a practical application, we showcase the efficacy of our fine-tuned model in enhancing self-consistency over model-generated CoT rationales, resulting in an average 6% performance boost across three multiple-choice question-answering datasets.
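For intuition, below is a minimal sketch of the rationale-filtering idea described in the abstract: score each sampled chain-of-thought with an entailment verifier and majority-vote only over answers whose rationales pass a threshold. This is not the paper's released code; the Flan-T5 checkpoint, prompt wording, and 0.5 cutoff are illustrative assumptions, not the paper's exact setup.

from collections import Counter

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative stand-in checkpoint; the paper fine-tunes its own Flan-T5 verifier.
MODEL_NAME = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that the model answers "Yes" to an entailment query.

    The yes/no prompt below is an assumed format, not the paper's exact one.
    """
    prompt = (f"Premise: {premise}\nHypothesis: {hypothesis}\n"
              "Does the premise entail the hypothesis? Answer Yes or No.")
    inputs = tokenizer(prompt, return_tensors="pt")
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits
    # Compare the first decoded token's logits for "Yes" vs "No".
    probs = torch.softmax(logits[0, -1, [yes_id, no_id]], dim=-1)
    return probs[0].item()

def filtered_self_consistency(context, samples, threshold=0.5):
    """Majority vote over sampled (rationale, answer) pairs, keeping only
    samples whose CoT rationale the verifier judges entailed by the context.
    Falls back to plain self-consistency if every rationale is filtered out."""
    kept = [ans for rat, ans in samples
            if entailment_score(context, rat) >= threshold]
    votes = kept or [ans for _, ans in samples]
    return Counter(votes).most_common(1)[0][0]

In use, `context` would carry the question (and any passage), and each element of `samples` would be a (rationale, answer) pair decoded from the LLM; the verifier prunes inconsistent chains before the vote.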
Cite
Text
Sanyal et al. "Are Machines Better at Slow Thinking? Unveiling Human-Machine Inference Gaps in Entailment Verification." ICLR 2024 Workshops: LLMAgents, 2024.

Markdown
[Sanyal et al. "Are Machines Better at Slow Thinking? Unveiling Human-Machine Inference Gaps in Entailment Verification." ICLR 2024 Workshops: LLMAgents, 2024.](https://mlanthology.org/iclrw/2024/sanyal2024iclrw-machines/)

BibTeX
@inproceedings{sanyal2024iclrw-machines,
  title = {{Are Machines Better at Slow Thinking? Unveiling Human-Machine Inference Gaps in Entailment Verification}},
  author = {Sanyal, Soumya and Xiao, Tianyi and Liu, Jiacheng and Wang, Wenya and Ren, Xiang},
  booktitle = {ICLR 2024 Workshops: LLMAgents},
  year = {2024},
  url = {https://mlanthology.org/iclrw/2024/sanyal2024iclrw-machines/}
}