Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness
Abstract
We introduce “pointer-guided segment ordering” (SO), a novel pre-training technique aimed at enhancing the contextual understanding of paragraph-level text representations in large language models. Our methodology leverages a self-attention-driven pointer network to restore the original sequence of shuffled text segments, addressing the challenge of capturing the structural coherence and contextual dependencies within documents. This pre-training approach is complemented by a fine-tuning methodology that incorporates dynamic sampling, augmenting the diversity of training instances and improving sample efficiency for various downstream applications. We evaluate our method on a diverse set of datasets, demonstrating its efficacy in tasks requiring sequential text classification across scientific literature and financial reporting domains. Our experiments show that pointer-guided pre-training significantly enhances the model’s ability to understand complex document structures, leading to state-of-the-art performance in downstream classification tasks.
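The segment ordering objective described above can be illustrated with a small sketch. The abstract does not specify the architecture beyond a "self-attention-driven pointer network", so the dot-product scoring here is an assumption, and all function names (`make_ordering_example`, `pointer_distribution`, `restore`) are hypothetical. The sketch only shows how a shuffled training instance and its pointer targets might be built, and how a pointer step could turn similarity scores into a distribution over shuffled positions.

```python
import math
import random


def make_ordering_example(segments, seed=0):
    """Shuffle segments and return (shuffled, targets).

    targets[i] is the index in `shuffled` of the segment that
    originally stood at position i, i.e. the position a pointer
    would need to select at output step i.
    """
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)  # order[j] = original position of shuffled[j]
    shuffled = [segments[k] for k in order]
    targets = [order.index(i) for i in range(len(segments))]
    return shuffled, targets


def pointer_distribution(query, keys):
    """Softmax over dot-product scores between one query and all keys.

    Assumption: plain dot-product attention stands in for whatever
    scoring function the paper actually uses.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def restore(shuffled, targets):
    """Recover the original order by following the pointer targets."""
    return [shuffled[t] for t in targets]
```

In an actual pre-training setup, the queries and keys would presumably be learned paragraph-level embeddings, and the targets would supervise a cross-entropy loss over the pointer distribution at each step; the sketch leaves the encoder and the loss out.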
Cite
Text
Hillebrand et al. "Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024. doi:10.1007/978-3-031-70359-1_23
Markdown
[Hillebrand et al. "Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024.](https://mlanthology.org/ecmlpkdd/2024/hillebrand2024ecmlpkdd-pointerguided/) doi:10.1007/978-3-031-70359-1_23
BibTeX
@inproceedings{hillebrand2024ecmlpkdd-pointerguided,
title = {{Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness}},
author = {Hillebrand, Lars and Pradhan, Prabhupad and Bauckhage, Christian and Sifa, Rafet},
booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
year = {2024},
pages = {386-402},
doi = {10.1007/978-3-031-70359-1_23},
url = {https://mlanthology.org/ecmlpkdd/2024/hillebrand2024ecmlpkdd-pointerguided/}
}