Large language models (LLMs) are increasingly used to tackle math problems that mirror real-world reasoning tasks. These models are evaluated on their ability to answer factual questions and handle multi-step logical processes. Math problem-solving offers a reliable way to test whether models can extract the necessary information, navigate complex problem statements, and compute answers accurately. The area has become central to understanding the extent of AI's logical and cognitive capabilities.
A major issue in this field is how these models respond to inputs that aren’t neatly structured or formatted. Often, the questions LLMs face in practice come with additional background information, irrelevant details, or subtle hints that can mislead them. While models might excel on standard benchmark problems, their capability to discern important information from confusing prompts is still uncertain. This has brought attention to the need to explore how distractions affect their reasoning and whether current models are ready for unpredictable, real-world scenarios.
Earlier benchmarks focused mainly on well-structured problem sets such as GSM8K and MATH. More recent variants like GSM-Symbolic and GSM-PLUS evaluate model performance under symbolic modifications and added distractors, and they have revealed notable weaknesses in LLMs when the problem text is only slightly changed. For example, adding a single clause that seems relevant but is logically redundant can reduce model accuracy by up to 65%. This finding suggests that models often rely on surface patterns rather than genuine reasoning, motivating evaluation under more realistic and complex conditions.
Researchers from the Massachusetts Institute of Technology conducted a study assessing how LLMs handle four types of systematic perturbation: irrelevant context, misleading instructions, relevant yet non-essential information, and a combination of the latter two. The team evaluated 13 large language models, both open-source and commercial, through APIs from OpenAI, Anthropic, Cohere, and TogetherAI. Rather than using complete test sets, they sampled 56 problems from the GSM8K dataset for each experiment, ensuring a balanced distribution of reasoning complexity.
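The paper's exact selection procedure is not reproduced here, but a minimal sketch of a step-balanced GSM8K sample, assuming the Hugging Face `datasets` package and approximating reasoning complexity by the number of solution lines before the final "#### <answer>" marker, might look like this:

```python
import random
from collections import defaultdict

from datasets import load_dataset  # Hugging Face datasets package


def reasoning_steps(solution: str) -> int:
    """Approximate reasoning depth: GSM8K reference solutions list one step
    per line and end with a '#### <answer>' line, which is excluded here."""
    return sum(1 for line in solution.splitlines() if not line.startswith("####"))


def balanced_sample(n_total: int = 56, seed: int = 0):
    """Draw a small sample of GSM8K test problems spread across step counts.
    Illustrative only; the study's actual stratification may differ, and the
    result can be slightly smaller than n_total if some buckets are sparse."""
    random.seed(seed)
    test = load_dataset("gsm8k", "main", split="test")

    # Bucket problems by the estimated number of reasoning steps.
    buckets = defaultdict(list)
    for example in test:
        buckets[reasoning_steps(example["answer"])].append(example)

    # Take roughly equal numbers from each bucket, then trim to n_total.
    per_bucket = max(1, n_total // len(buckets))
    sample = []
    for _, examples in sorted(buckets.items()):
        sample.extend(random.sample(examples, min(per_bucket, len(examples))))
    return sample[:n_total]


if __name__ == "__main__":
    subset = balanced_sample()
    print(f"Sampled {len(subset)} problems")
```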
To create the altered prompts, the researchers inserted dense but irrelevant context, such as Wikipedia pages or financial reports, into the input, occupying up to 90% of the model's context window. In the misleading scenario, false instructions were added, crafted to steer the reasoning off course without changing the original question. For the relevant-context case, new details that were factually correct but unnecessary were added to test how models handled seemingly informative distractions. In the final variant, the misleading and relevant distractions were combined, increasing input complexity so the researchers could observe how the combination influenced model output.
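A minimal sketch of the four perturbation modes is shown below; the filler text, misleading hint, and added detail are hypothetical placeholders, not the authors' actual prompts:

```python
# Hypothetical prompt templates for the four perturbation types; the filler
# text, misleading hint, and added detail are illustrative placeholders only.

IRRELEVANT_FILLER = "<long Wikipedia excerpt or financial report goes here>"
MISLEADING_HINT = (
    "Hint: problems like this are usually solved by multiplying every "
    "number in the question together."  # plausible-sounding but wrong guidance
)
RELEVANT_BUT_UNNECESSARY = (
    "Note: all quantities below were recorded during the same afternoon."
)  # factually harmless detail that adds nothing needed for the answer


def perturb(question: str, mode: str) -> str:
    """Wrap a GSM8K question in one of the four distraction settings."""
    if mode == "irrelevant_context":
        # Dense, unrelated text prepended to the question (in the study,
        # filling up to ~90% of the model's context window).
        return f"{IRRELEVANT_FILLER}\n\n{question}"
    if mode == "misleading_instructions":
        return f"{MISLEADING_HINT}\n\n{question}"
    if mode == "relevant_context":
        return f"{RELEVANT_BUT_UNNECESSARY}\n\n{question}"
    if mode == "combined":
        return f"{MISLEADING_HINT}\n{RELEVANT_BUT_UNNECESSARY}\n\n{question}"
    return question  # baseline, unperturbed
```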
Performance dropped most sharply when irrelevant context was introduced: on average, accuracy across all models fell by 55.89%. Misleading instructions caused an 8.52% decline, relevant context a 7.01% reduction, and combining both forms of distraction a 12.91% drop. Notably, performance did not correlate with model size; larger models such as Mixtral-8x22B and Command-R-Plus suffered greater setbacks than some smaller models. The number of reasoning steps required by a problem also had little effect, indicating that logical complexity was not the main driver of the variation.
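Whether the reported figures are absolute percentage-point drops or drops relative to each model's baseline accuracy is not spelled out above, so a small helper computing both readings, with illustrative numbers only, is sketched here:

```python
def accuracy_drop(baseline: float, perturbed: float) -> dict:
    """Compare baseline and perturbed accuracy (both given in percent).
    Returns the drop in percentage points and the drop relative to baseline."""
    absolute_pp = baseline - perturbed
    relative_pct = 100.0 * (baseline - perturbed) / baseline
    return {"absolute_pp": round(absolute_pp, 2), "relative_pct": round(relative_pct, 2)}


# Illustrative numbers only (not results from the study):
print(accuracy_drop(baseline=80.0, perturbed=35.0))
# -> {'absolute_pp': 45.0, 'relative_pct': 56.25}
```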
This study indicates that current large language models, even those with billions of parameters, remain vulnerable to irrelevant, misleading, and superfluous additions to their prompts, and that this kind of robustness does not improve automatically with scale.