Large language models (LLMs) have demonstrated impressive capabilities across various text and multimodal tasks. However, certain applications, such as document and video analysis, in-context learning, and scaling at inference time, require the capacity to process long sequences of tokens and reason over them. The fixed context window of LLMs presents a challenge in these scenarios, as crucial information distributed across lengthy documents may be missed. This limitation highlights the need for models that can efficiently manage ultra-long contexts without compromising performance on typical tasks.
Strategies for extending the context window of language models fall into three categories: exact attention methods, approximate attention methods, and approaches that add extra modules. Exact attention methods such as Position Interpolation, NTK-aware scaling, Dynamic NTK, YaRN, and CLEX extend the usable context by rescaling or redesigning the position embeddings. Recent progress includes models such as GPT-4o, Gemini, and Claude, which support context windows of hundreds of thousands of tokens, although their closed-source nature limits reproducibility. Among open-source efforts, ProLong applies NTK-aware scaling but requires intensive computation, while Gradient relies on continued pretraining that compromises standard task performance.
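To make the position-embedding family concrete, below is a minimal sketch of the simplest of these techniques, linear Position Interpolation applied to rotary position embeddings (RoPE). The function names, head dimension, and the 8K-to-32K extension factor are illustrative assumptions, not taken from any of the systems mentioned above.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    """Return RoPE rotation angles for each position and frequency pair.

    Linear Position Interpolation divides the position index by `scale`
    (= target_len / original_len), so a sequence `scale` times longer is
    squeezed back into the position range seen during training.
    """
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return torch.outer(positions.float() / scale, inv_freq)  # (seq_len, dim/2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors x of shape (seq_len, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

# Illustrative example: extending a model trained at 8K to 32K positions => scale = 4.
positions = torch.arange(32_768)
angles = rope_angles(positions, dim=128, scale=32_768 / 8_192)
```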
Researchers from UIUC and NVIDIA have developed an efficient training approach for building ultra-long-context LLMs from aligned instruction-tuned models, extending context lengths from 128K to 1M, 2M, and 4M tokens. The method combines efficient continued pretraining to expand the context window with instruction tuning to preserve instruction-following and reasoning abilities. The resulting UltraLong-8B model achieves state-of-the-art results across a range of long-context benchmarks while maintaining strong performance on standard benchmarks, improving both long- and short-context capabilities. The work also provides a comprehensive analysis of key design choices, including the effects of scaling strategies and data composition.
The proposed approach consists of two main stages: continued pretraining and instruction tuning. Together, these stages enable efficient processing of ultra-long inputs while maintaining strong performance on standard tasks. For context extension, a YaRN-based scaling approach is adopted with fixed hyperparameters α = 1 and β = 4, rather than NTK-aware scaling strategies. The scale factor is computed from the target context length, with larger scaling factors applied to the RoPE embeddings to accommodate extended sequences and mitigate performance degradation at maximum lengths. For training data, high-quality SFT datasets spanning general, mathematics, and code domains are subsampled, and GPT-4o and GPT-4o-mini are used to refine responses and perform thorough data decontamination.
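The summary above does not include the implementation, but a YaRN-style adjustment of the RoPE inverse frequencies with a ramp defined by α = 1 and β = 4 and a scale factor s = target_length / original_length could look like the sketch below. The function name, the head dimension of 128, and the 128K-to-1M example are assumptions for illustration; YaRN's additional attention-temperature term is omitted for brevity.

```python
import math
import torch

def yarn_scaled_inv_freq(dim: int, original_len: int, target_len: int,
                         base: float = 10000.0,
                         alpha: float = 1.0, beta: float = 4.0) -> torch.Tensor:
    """YaRN-style rescaling of RoPE inverse frequencies.

    Each RoPE dimension i has wavelength lambda_i = 2*pi / theta_i. Dimensions
    whose wavelength is short relative to the original context (ratio > beta)
    are left unchanged; dimensions whose wavelength exceeds the original
    context (ratio < alpha) are fully interpolated by 1/s; dimensions in
    between are blended linearly via a ramp.
    """
    s = target_len / original_len                       # context scale factor
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    wavelength = 2 * math.pi / inv_freq
    ratio = original_len / wavelength                   # periods that fit in the original window

    # Ramp gamma in [0, 1]: 0 -> full interpolation, 1 -> no interpolation.
    gamma = ((ratio - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    return (1.0 - gamma) * inv_freq / s + gamma * inv_freq

# Illustrative example: extending Llama-style RoPE (head dim 128) from 128K to 1M tokens.
inv_freq_1m = yarn_scaled_inv_freq(dim=128, original_len=131_072, target_len=1_048_576)
```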
The proposed models excel at long-context retrieval, as measured by the Needle in a Haystack passkey retrieval test. Baseline models such as Llama-3-8B-Instruct-Gradient-1048k pass the test, whereas Llama3.1-8B-Instruct and Llama-3-8B-ProLong-512k-Instruct exhibit errors. In contrast, the UltraLong models achieve 100% accuracy across all input lengths and depths, demonstrating strong retrieval ability. They also attain the highest average scores on RULER for inputs up to 512K and 1M tokens, the highest F1 scores on LV-Eval within 128K and 256K token lengths, and the best performance on InfiniteBench. On general, math, and code benchmarks the models remain robust, with average scores of 62.47, 61.06, and 60.95, compared with the base model's 61.45.
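For readers unfamiliar with the evaluation, the passkey variant of Needle in a Haystack can be approximated by hiding a random key at a controlled depth inside filler text and asking the model to recall it. The prompt wording, filler sentence, and length/depth grid below are hypothetical and only illustrate the structure of the test, not the exact setup used in the evaluation.

```python
import random

def build_passkey_prompt(context_tokens: int, depth: float,
                         filler: str = "The grass is green. The sky is blue. ") -> tuple[str, str]:
    """Build a passkey-retrieval prompt of roughly `context_tokens` words.

    A random passkey sentence is inserted at the relative position `depth`
    (0.0 = start, 1.0 = end) inside repeated filler text; the model is then
    asked to recall the passkey.
    """
    passkey = str(random.randint(10_000, 99_999))
    needle = f"The pass key is {passkey}. Remember it. "
    n_repeats = max(1, context_tokens // len(filler.split()))
    haystack = [filler] * n_repeats
    haystack.insert(int(depth * len(haystack)), needle)
    prompt = "".join(haystack) + "\nWhat is the pass key? The pass key is"
    return prompt, passkey

# Sweep input lengths and needle depths, as in the retrieval test described above.
for length in (128_000, 512_000, 1_000_000):
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt, answer = build_passkey_prompt(length, depth)
        # accuracy = fraction of (length, depth) cells where the model outputs `answer`
```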