
s1: Simple test-time scaling

Muennighoff, Yang, et al.


Today, we will be reading this paper, which claims that its 32B model can compete with OpenAI's o1-preview in performance.

Summary of the Paper

To understand this particular brand of test-time scaling, we first must understand what test-time scaling is.

In classic LLMs, much of the focus is on train-time scaling: if a model is trained over a larger corpus, it will be better. In test-time scaling, the focus is on inference instead: extra compute is given to the model to improve its performance. Test-time scaling is the secret behind OpenAI's o1 model, and why it performs so well on reasoning tasks.

The paper introduces Budget Forcing (BF) to control the amount of test-time compute (a rough sketch follows the list below):

  • If the model generates more thinking tokens than the budget allows, force it to stop by appending the <end of thinking> token
  • If the model does not generate enough thinking tokens, suppress the <end of thinking> token and append 'Wait' so it keeps reasoning
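To make the mechanics concrete, here is a minimal Python sketch of how I imagine budget forcing in a generation loop. The `generate` callable, the delimiter string, the budget values, and the whitespace tokenizer are all placeholder assumptions; this is my reading of the paper, not the authors' implementation.

```python
from typing import Callable

# Hypothetical pieces: `generate` continues `text` by up to `max_new` tokens,
# stopping early if it emits the end-of-thinking delimiter on its own.
END_OF_THINKING = "<|end_of_thinking|>"
MIN_TOKENS, MAX_TOKENS = 512, 4096

def count_tokens(text: str) -> int:
    # Crude whitespace stand-in for a real tokenizer.
    return len(text.split())

def budget_force(prompt: str, generate: Callable[[str, int], str]) -> str:
    trace = generate(prompt, MAX_TOKENS)
    # Upper budget: if the delimiter is absent (the model ran to the cap),
    # append it ourselves to force the transition to the answer.
    if END_OF_THINKING not in trace:
        trace += END_OF_THINKING
    # Lower budget: strip the delimiter and append 'Wait' so the model keeps
    # reasoning; bound the number of extensions to avoid looping forever.
    for _ in range(4):
        if count_tokens(trace) >= MIN_TOKENS:
            break
        trace = trace.replace(END_OF_THINKING, "") + "\nWait,"
        trace += generate(prompt + trace, MAX_TOKENS - count_tokens(trace))
        if END_OF_THINKING not in trace:
            trace += END_OF_THINKING
    return trace
```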

It also introduces a new dataset, the s1K:

  • Only contains 1,000 reasoning samples; training on the full superset of 59K samples does not offer substantial gains
  • These samples must be
    • High quality: samples with API errors, bad formatting, non-existent image references, or inconsistent numbering are removed
    • Difficult: based on (1) model performance: they chose questions that both Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct fail to answer, and (2) reasoning trace length, i.e. the number of tokens used in reasoning
    • Diverse: using Claude 3.5 Sonnet, they classify each question into a specific domain. Then they randomly sample one of the domains, and sample one problem from it using a distribution that favors longer reasoning traces. Repeat until 1K samples are collected (a sketch of this sampling step follows the list)
  • The three 'guiding principles' above are essential to produce a good model
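The diversity step is easiest to see as code. Below is a rough Python sketch under my own assumptions: each candidate sample is a dict with a Claude-assigned 'domain' and a 'trace_len' (reasoning-token count), and the length bias is a simple power weighting; the paper's exact distribution may differ.

```python
import random
from collections import defaultdict

def sample_s1k(pool: list[dict], k: int = 1000, power: float = 2.0) -> list[dict]:
    # Group candidates by their (Claude-assigned) domain.
    by_domain = defaultdict(list)
    for ex in pool:
        by_domain[ex["domain"]].append(ex)

    chosen = []
    while len(chosen) < k and by_domain:
        # 1) Pick a domain uniformly at random.
        domain = random.choice(list(by_domain))
        candidates = by_domain[domain]
        # 2) Within the domain, favor longer reasoning traces.
        weights = [ex["trace_len"] ** power for ex in candidates]
        ex = random.choices(candidates, weights=weights, k=1)[0]
        chosen.append(ex)
        candidates.remove(ex)
        if not candidates:
            del by_domain[domain]
    return chosen
```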

Some interesting things to note regarding budget forcing:

  • Appending 'Wait' gives the best performance. This was tested against no additional string, 'Alternatively', and 'Hmm'
  • Token-conditional control (prompting the model with a token budget) does not work, since the model used (Qwen2.5-32B-Instruct) cannot count tokens reliably. BF is needed.
  • Step-conditional control does not change the number of tokens, just the number of steps; the model simply adjusts how many tokens each step gets (lmao)
  • Class-conditional control, i.e. telling the model to think more, works (hypothetical prompt templates for these three baselines follow the list)
  • Increasing the budget for BF actually results in worse performance
    • Hypothesis: there is a correlation such that shorter generations tend to be the ones where the model was on the right track from the get-go
    • Longer generations thus tend to be the ones where the model makes mistakes and backtracks a lot
    • LLMs, like humans, apparently also overthink
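For reference, here is roughly what those three conditional-control baselines look like as prompts. These templates are my paraphrase of the idea, not the paper's exact wording.

```python
# Hypothetical prompt templates for the three baselines compared against BF.

def token_conditional(question: str, budget: int) -> str:
    # Ask the model to cap its own thinking tokens; fails because
    # Qwen2.5-32B-Instruct cannot count tokens reliably.
    return f"{question}\nThink for at most {budget} tokens."

def step_conditional(question: str, steps: int) -> str:
    # Caps the number of steps, but the model just resizes each step,
    # so the total token count barely changes.
    return f"{question}\nThink in at most {steps} steps."

def class_conditional(question: str, hard: bool) -> str:
    # Coarse 'think more / think less' instruction; this one works.
    hint = "Take your time and think very long." if hard else "Answer quickly."
    return f"{question}\n{hint}"
```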

Limitations of test-time scaling

  • The improvement plateaus. Yes, increasing the BF budget will improve accuracy, but the gains steadily flatten out
  • The context window is restrictive. Longer reasoning traces must still fit inside it

My thoughts

I was quite surprised when I read that a 32B model approaches o1-preview in performance. After reading the paper, I am still not convinced, unless someone can reliably replicate the results. I mean, a 1K dataset + SFT on a small open-source model => o1-preview? C'mon, that's quite hard to believe.

My suspicion also lies in the 'difficulty' of the benchmarks. They used AIME, MATH500, and GPQA Diamond, all of which contain very difficult questions. If the questions were easy, I suspect BF would enforce 'overthinking' in the LLM, resulting in worse performance. Maybe the way forward is to determine the best BF 'budget' for improving LLM performance across different tasks.

Good read, nonetheless. If you have interesting papers to read, send them to yukiwuki07@gmail.com. Bye!