Artificial intelligence systems have advanced significantly in emulating human-like reasoning, especially in mathematics and logic. These models do not just produce answers; they also lay out the logical steps that lead to them, making the problem-solving process easier to follow. Known as Chain-of-Thought (CoT) reasoning, this approach has become essential for tackling complex tasks.
A persistent issue with these models is inefficiency during inference: they often keep generating reasoning tokens after they have already reached a correct answer, producing unnecessary output and raising computational costs. It is unclear whether these models can internally recognize when an intermediate answer is correct. If they could, they could halt processing sooner, increasing efficiency without sacrificing accuracy.
Current strategies typically assess a model’s confidence through verbal prompts or by sampling and comparing multiple outputs. These are black-box methods: they ask the model about its own confidence rather than inspecting it directly, and they can be both imprecise and computationally expensive. White-box methods, in contrast, examine the model’s hidden states for signals of answer correctness, on the premise that internal activations may encode the validity of both intermediate and final solutions, an area that remains underexplored.
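To make the black-box route concrete, here is a minimal sketch of confidence estimation by sampling several answers and measuring agreement (a self-consistency-style heuristic). The `sample_answer` function is a hypothetical wrapper around whatever generation API is in use; nothing here is specific to the paper’s method.

```python
from collections import Counter

def self_consistency_confidence(sample_answer, question, n_samples=8):
    """Sample the model several times and treat the majority answer's share
    of the samples as a rough confidence score."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples  # confidence in [0, 1]
```

This kind of estimate requires many full generations per question, which is exactly the computational overhead that white-box probing aims to avoid.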
Addressing this, researchers from New York University and NYU Shanghai developed a lightweight probe—a simple two-layer neural network—to evaluate a model’s hidden states during intermediate reasoning steps. Their experiments with models like the DeepSeek-R1-Distill series and QwQ-32B, known for CoT reasoning, covered various datasets in mathematics and logic. The probe was trained to interpret hidden states for each reasoning segment and judge if the intermediate answer was correct.
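As a rough illustration of such a probe, the PyTorch sketch below defines a two-layer network that maps a hidden-state vector to a correctness probability. The layer sizes and other details are assumptions for illustration, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class CorrectnessProbe(nn.Module):
    """Lightweight two-layer probe mapping a hidden-state vector to the
    probability that the current intermediate answer is correct.
    hidden_dim and probe_dim are illustrative choices."""

    def __init__(self, hidden_dim: int = 4096, probe_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, 1),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim) -> correctness probability (batch,)
        return torch.sigmoid(self.net(hidden_state)).squeeze(-1)
```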
Their methodology involved splitting long CoT outputs into chunks, using markers such as “wait” or “verify” to signal logical breaks. The hidden state of the final token in each chunk served as a representative sample, and a separate model judged whether that chunk’s intermediate answer was correct, supplying the labels. These labeled hidden states were used to train the probe as a binary classifier, with hyperparameters tuned via grid search; the search often converged on linear probes, suggesting that correctness information is encoded linearly in the hidden states. Notably, the probe could predict correctness even before an answer was fully generated, indicating a capacity for anticipation.
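The sketch below illustrates this pipeline under stated assumptions: the marker list, tensor shapes, and training hyperparameters are placeholders rather than the paper’s exact values, and the features and labels are assumed to be precomputed. In line with the observation that grid search often favored linear probes, the classifier here is a single linear layer.

```python
import re
import torch
import torch.nn as nn

def split_cot_into_chunks(cot_text: str) -> list[str]:
    """Split a chain-of-thought trace into chunks that begin at transition
    markers such as "wait" or "verify" (illustrative marker list)."""
    chunks = re.split(r"(?=\b(?:wait|verify)\b)", cot_text, flags=re.IGNORECASE)
    return [c.strip() for c in chunks if c.strip()]

# Suppose `features` holds the final-token hidden state of each chunk
# (num_chunks, hidden_dim) and `labels` holds 0/1 correctness judgments
# from a separate grading model -- placeholder data stands in for both.
hidden_dim = 4096
features = torch.randn(512, hidden_dim)
labels = torch.randint(0, 2, (512,)).float()

# A linear probe trained as a binary classifier on the hidden states.
probe = nn.Linear(hidden_dim, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(20):
    optimizer.zero_grad()
    logits = probe(features).squeeze(-1)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
```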
The results were notable: the probe achieved ROC-AUC scores above 0.9 on some datasets, such as AIME, when applied to models like R1-Distill-Qwen-32B. Expected Calibration Error (ECE) stayed below 0.1, confirming high reliability; R1-Distill-Qwen-32B, for instance, showed an ECE of only 0.01 on GSM8K and 0.06 on MATH. The researchers also implemented a confidence-based early-exit strategy at inference time, using the probe’s confidence to decide when to stop reasoning. With a confidence threshold of 0.85, accuracy held at 88.2% while the inference token count dropped by 24%. At a threshold of 0.9, accuracy was 88.6% with a 19% token reduction. This dynamic strategy outperformed static exit strategies by up to 5% while using the same or fewer tokens.
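A minimal sketch of such an early-exit loop is shown below. Here `generate_next_chunk` and `final_token_hidden_state` are hypothetical helpers standing in for chunk-wise generation and hidden-state extraction, and `probe` is any callable mapping a hidden state to a correctness probability (for example, the probe sketched earlier); only the 0.85 threshold mirrors a setting reported above.

```python
def generate_with_early_exit(question, probe, generate_next_chunk,
                             final_token_hidden_state,
                             threshold: float = 0.85, max_chunks: int = 64):
    """Generate reasoning chunk by chunk and stop as soon as the probe's
    confidence in the current intermediate answer exceeds the threshold."""
    trace = []
    for _ in range(max_chunks):
        chunk = generate_next_chunk(question, trace)   # next reasoning segment
        trace.append(chunk)
        hidden = final_token_hidden_state(chunk)       # final-token hidden state
        confidence = float(probe(hidden))              # probability of correctness
        if confidence >= threshold:
            break                                      # stop early; answer looks correct
    return trace
```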