EPFL’s New Framework Unlocks Fully Unsupervised Adaptation in Foundation Models

Foundation models, large-scale neural networks trained on diverse text and image datasets, have profoundly changed how artificial intelligence systems approach language and vision tasks. Rather than being built for a single task, these models generalize knowledge acquired during pre-training to a wide array of functions. Once pre-trained, they can produce coherent responses, classify images, or solve problems without additional task-specific training, and this scalability and versatility is fundamental to AI advancement.

However, adapting these models to new, unseen tasks remains a notable challenge. High performance typically requires carefully crafted prompts or labeled examples to direct the model's behavior. Crafting prompts is often a trial-and-error process, and gathering labeled data can be costly and time-intensive. Moreover, in practical scenarios such support data may not be available at all, limiting how effective foundation models can be in zero-shot settings.

Various techniques have been developed to bridge the gap between model generality and task-specific performance. In-context learning lets a model imitate a task by including example input-output pairs in the prompt at inference time, while supervised fine-tuning adjusts model parameters using labeled data. A third approach, prompt engineering, constructs prompts that steer the model toward desired outputs. Although these methods improve performance, each depends on external support in the form of human input or labeled data, making them unsuitable for entirely unsupervised scenarios.
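
As a point of reference for what the EPFL work removes, the snippet below sketches ordinary labeled in-context learning: a handful of human-written input-output pairs is prepended to the query before inference. The task, examples, and prompt format here are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of labeled in-context learning. The example pairs stand in
# for the human-provided supervision that unsupervised adaptation avoids.
examples = [
    ("Review: The plot was dull and predictable. Sentiment:", "negative"),
    ("Review: A warm, funny, beautifully acted film. Sentiment:", "positive"),
]
query = "Review: The pacing dragged, but the ending was superb. Sentiment:"

# Build a few-shot prompt by concatenating the labeled demonstrations.
prompt = "".join(f"{x} {y}\n" for x, y in examples) + query

# `prompt` would then be sent to the model; the demonstrations steer its
# output without any weight updates.
print(prompt)
```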

Researchers from the Swiss Federal Institute of Technology Lausanne (EPFL) have introduced a joint inference framework to enable unsupervised adaptation. This framework allows foundation models to make coordinated predictions over multiple inputs without needing ground truth data or manual prompts. The research team proposed two techniques within this framework: unsupervised fine-tuning and unsupervised in-context learning. These methods allow models, even those with closed weights like GPT-4, to enhance accuracy without external guidance.
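
The core of the joint inference idea is to score a batch of candidate predictions together under the model rather than treating each input in isolation. The sketch below, written against a Hugging Face transformers causal language model, shows one way such a batch-level log-likelihood could be computed; the function name and scoring details are assumptions for illustration, not the authors' implementation.

```python
import torch


def joint_log_prob(model, tokenizer, prompts, answers):
    """Sum the log-likelihood of self-generated answers over a whole batch.

    Maximizing this batch-level (joint) score, rather than scoring each input
    independently, is one way to realize "coordinated predictions" over
    multiple inputs. Illustrative sketch; assumes a causal LM and tokenizer
    from Hugging Face transformers.
    """
    total = torch.zeros(())
    for prompt, answer in zip(prompts, answers):
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids

        logits = model(full_ids).logits                 # [1, T, vocab]
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        targets = full_ids[:, 1:]                        # next-token targets
        token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)

        # Score only the answer tokens, conditioned on the prompt prefix.
        answer_len = full_ids.shape[1] - prompt_ids.shape[1]
        total = total + token_lp[:, -answer_len:].sum()
    return total
```

In practice, such a score would be increased through lightweight parameter updates, with regularization to rule out degenerate solutions such as collapsing to a single answer, as described next.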

Unsupervised fine-tuning iteratively refines the model's predictions using only the model's own feedback. Predictions for a batch of inputs are generated collectively, and the optimization objective maximizes their joint probability rather than scoring each input independently. The method uses LoRA (Low-Rank Adaptation) for efficient weight updates and adds a regularization term to prevent trivial solutions, such as predicting the same output for every input.

For settings where weight access is restricted, as with GPT-4, the researchers developed unsupervised in-context learning. This technique emulates labeled in-context learning by treating the model's previously generated outputs as pseudo-labels and refining predictions over several iterations without any human annotations. Each iteration conditions the model on these prior examples to generate more accurate answers, effectively simulating a supervised learning loop with self-generated data.
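
For closed-weight models, the iterative pseudo-labeling loop can be sketched roughly as follows. The `generate` callable stands in for whatever API serves the model (for GPT-4, a chat-completion call); its name, the prompt format, and the fixed number of rounds are assumptions for illustration, not the authors' code.

```python
def unsupervised_icl(generate, queries, num_rounds=3):
    """Rough sketch of unsupervised in-context learning with pseudo-labels.

    `generate(prompt) -> str` is any black-box text-completion callable
    (e.g. a hosted API); no weights are touched and no human labels are used.
    """
    # Round 0: plain zero-shot answers for every query.
    answers = [generate(q) for q in queries]

    for _ in range(num_rounds):
        new_answers = []
        for i, query in enumerate(queries):
            # Use the other queries' current answers as pseudo-labeled
            # demonstrations, mimicking labeled in-context learning.
            demos = "".join(
                f"{q} {a}\n"
                for j, (q, a) in enumerate(zip(queries, answers))
                if j != i
            )
            new_answers.append(generate(demos + query))
        answers = new_answers  # refined predictions feed the next round
    return answers
```

A real system would also cap how many pseudo-labeled demonstrations fit into the context window; the loop above keeps them all purely for brevity.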

The improvements from these unsupervised methods were substantial. On GSM8K, a math-reasoning benchmark, applying unsupervised in-context learning to the Qwen2.5-Math model produced a 39.2% absolute improvement over the standard zero-shot baseline. Similarly, for the Llama-3.1-8B model evaluated across 13 natural language processing tasks, unsupervised fine-tuning yielded an average accuracy gain of 23% and matched fully supervised fine-tuning on 6 of the 13 tasks. In vision-language tasks, unsupervised in-context learning delivered strong gains as well.
