Star Wars Embarks on a Tactical Odyssey with Zero Company Strategy Game

Recent advances in artificial intelligence have significantly improved the ability of systems to emulate human-style reasoning, especially in fields like mathematics and logic. These models do not simply produce answers; they also lay out the logical sequence of steps used to reach them, offering a clearer view of how a solution is derived. This approach, commonly known as Chain-of-Thought (CoT) reasoning, has become a critical component in enabling machines to tackle complex problem-solving tasks.
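To see what such a reasoning trace looks like in practice, the following is a minimal sketch of eliciting a CoT trace with the Hugging Face transformers library. The checkpoint name, prompt, and generation settings are illustrative assumptions, not details taken from the study.

```python
# Minimal sketch: eliciting a chain-of-thought trace from a reasoning model.
# The model ID is an assumed DeepSeek-R1-Distill checkpoint; any similar
# step-by-step reasoning model would behave comparably.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Solve step by step: If 3x + 5 = 20, what is x?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The generated text interleaves intermediate reasoning steps ("3x = 15",
# "so x = 5") before the final answer -- the CoT trace discussed above.
output_ids = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```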

One prevalent challenge faced by researchers using these models is inefficiency during the inference process. Often, these reasoning models continue processing after arriving at the correct answer, resulting in unnecessary token generation and higher computational costs. The question remains: do these models possess an inherent sense of correctness? If they could internally recognize when an intermediate answer is correct, they could stop processing sooner, enhancing efficiency without sacrificing accuracy.

Traditional strategies often evaluate a model’s confidence in its answers through verbal prompts or by comparing multiple outputs. These black-box methods require the model to express its certainty, yet they tend to be imprecise and computationally demanding. Conversely, white-box techniques explore internal hidden states to identify signals potentially linked to the correctness of answers. Previous studies indicate that a model’s internal state can reflect the validity of final answers, but applying this understanding to intermediate steps in long reasoning sequences remains largely unexplored.
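To make the black-box versus white-box distinction concrete, here is a small sketch of both signals. The model name and prompts are illustrative assumptions; the key point is that the white-box route reads a hidden state directly rather than asking the model to verbalize its confidence with extra generated tokens.

```python
# Sketch contrasting the two confidence signals described above.
# Model ID and example text are illustrative, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Black-box: prompt the model to verbalize its certainty (costs extra tokens,
# and the stated number is often poorly calibrated).
blackbox_prompt = "Answer: x = 5. How confident are you, from 0 to 100?"

# White-box: read the hidden state of the last token directly; a separate
# probe can score it without any additional generation.
text = "If 3x + 5 = 20 then 3x = 15, so x = 5."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
last_hidden = out.hidden_states[-1][0, -1]  # final layer, last token
print(last_hidden.shape)  # a single hidden-state vector
```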

Addressing this research gap, a team from New York University and NYU Shanghai developed a lightweight probe—a simple two-layer neural network—to analyze a model’s hidden states at the intermediate stages of reasoning. The DeepSeek-R1-Distill series and QwQ-32B models, renowned for their step-by-step reasoning abilities, were employed for experimentation across datasets involving mathematical and logical tasks. The probe was trained to assess the internal state associated with each reasoning chunk and predict whether the current intermediate answer was correct.
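The article describes the probe only as a simple two-layer network over hidden states. A minimal PyTorch sketch of that kind of probe might look like the following; the hidden layer width and the 5120-dimensional input size are assumptions for illustration, not the authors' reported configuration.

```python
# A lightweight two-layer probe mapping a hidden state to the probability
# that the current intermediate answer is correct. Sizes are assumptions.
import torch
import torch.nn as nn

class CorrectnessProbe(nn.Module):
    def __init__(self, hidden_dim: int, probe_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) hidden state of a reasoning chunk's last token
        return torch.sigmoid(self.net(h)).squeeze(-1)  # P(answer is correct)

probe = CorrectnessProbe(hidden_dim=5120)  # assumed hidden size of a 32B model
scores = probe(torch.randn(4, 5120))
print(scores)  # four correctness probabilities in [0, 1]
```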

The researchers segmented each extended CoT output into smaller chunks, using indicators like “wait” or “verify” to mark pauses in the reasoning. The hidden state of the last token in each chunk served as its representation and was paired with a correctness label judged by another model, turning probe training into a binary classification task. Hyperparameters such as learning rate and hidden layer size were tuned via grid search, and most models converged to linear probes, indicating that correctness signals are often linearly embedded in hidden states. Notably, the probe could frequently confirm correctness before a complete answer had even been formed, suggesting genuine predictive ability. A rough sketch of this training setup appears below.
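The sketch uses placeholder data, an illustrative regex for the pause markers, and a plain linear probe (matching the observation that most runs reduced to one); the actual chunking rules, labels, and hyperparameter grid are the authors' and are not reproduced here.

```python
# Sketch of the training setup described above: split a CoT trace into chunks
# at pause markers such as "wait"/"verify", represent each chunk by its last
# token's hidden state, and fit a probe on binary correctness labels.
import re
import torch
import torch.nn as nn

def split_into_chunks(cot_text: str) -> list[str]:
    # Break the trace right before pause markers that signal re-evaluation.
    parts = re.split(r"(?=\b(?:wait|verify)\b)", cot_text, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

# Suppose chunk_states holds the last-token hidden state of every chunk
# (batch, hidden_dim) and labels holds 1/0 correctness judgments produced by
# a separate grader model. Both are random placeholders here.
hidden_dim = 5120
chunk_states = torch.randn(64, hidden_dim)
labels = torch.randint(0, 2, (64,)).float()

probe = nn.Sequential(nn.Linear(hidden_dim, 1))  # linear probe variant
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(20):
    optimizer.zero_grad()
    logits = probe(chunk_states).squeeze(-1)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
```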

The results were robust and quantifiable. The probes achieved ROC-AUC scores above 0.9 on some datasets, such as AIME, with models like R1-Distill-Qwen-32B. Expected Calibration Error (ECE) stayed below 0.1, indicating high reliability; R1-Distill-Qwen-32B, for instance, showed an ECE of 0.01 on GSM8K and 0.06 on MATH. The probe was then used to implement a confidence-based early exit strategy during inference, halting the reasoning process once its confidence exceeded a set threshold. With a confidence threshold of 0.85, accuracy remained at 88.2% while the number of tokens generated during inference was reduced.
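As a final illustration, a confidence-based early exit loop might be wired up as follows, assuming a probe that outputs a correctness probability (as in the probe sketch above) and a simplified chunk-by-chunk decoding helper. This is a sketch under those assumptions, not the authors' exact implementation.

```python
# Sketch of confidence-based early exit: generate the CoT in bounded chunks,
# score the latest hidden state with the trained probe after each chunk, and
# stop once the predicted probability of correctness crosses a threshold.
import torch

CONFIDENCE_THRESHOLD = 0.85  # the threshold discussed above

def generate_next_chunk(model, tokenizer, text, max_new_tokens=128):
    # Decode a bounded continuation; a fuller version would stop exactly at
    # the next "wait"/"verify" boundary rather than at a fixed token budget.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_ids = out_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)

def generate_with_early_exit(model, tokenizer, probe, prompt, max_chunks=30):
    text = prompt
    for _ in range(max_chunks):
        chunk = generate_next_chunk(model, tokenizer, text)
        if not chunk:
            break
        text += chunk
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][0, -1]  # final layer, last token
        confidence = probe(last_hidden.unsqueeze(0)).item()
        if confidence >= CONFIDENCE_THRESHOLD:
            break  # probe judges the intermediate answer correct; stop early
    return text
```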
