(Zhao et al. May 2025) introduce a self-play reinforcement learning paradigm dubbed Absolute Zero, designed to remove dependence on human-labeled reasoning traces and even curated prompt-response pairs altogether. Instead, a single unified language model both proposes tasks and solves them, with a code execution environment providing a verifiable source of reward.

The method, Absolute Zero Reasoner (AZR), trains a model to recursively improve its reasoning ability by playing against itself across three task types: deduction (predict the output from code + input), abduction (infer a plausible input from code + output), and induction (synthesize a program from input + output examples). Each interaction yields a reward: $r_\text{solve}$ simply signals whether the answer is correct, and $r_\text{propose}$ rewards proposals that provide a good learning signal, i.e., tasks that are neither too easy nor unsolvable. Advantage estimation is computed independently per task and per role. They maximize

$$ \mathbb{E}_{\tau\sim \pi_\text{propose}} \left[ r_\text{propose} + \lambda\, \mathbb{E}_{y\sim\pi_\text{solve}} \left[ r_\text{solve} \right] \right] $$
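
A minimal sketch of how these two rewards could be computed, assuming a sandboxed `run_program` helper and a task dictionary holding the proposed program with its input/output (the exact reward shaping in the paper may differ; the 1 − mean-solve-rate form below is just one way to realize "neither too easy nor unsolvable"):

```python
def r_solve(task_type, task, answer, run_program):
    """Binary solver reward: correctness is verified by the code executor, not a human label.
    `task` holds the proposed program plus its input/output; `run_program` is an assumed
    sandboxed-execution helper."""
    try:
        if task_type == "deduction":   # answer = predicted output for task["input"]
            return float(answer == run_program(task["program"], task["input"]))
        if task_type == "abduction":   # answer = a proposed input that reproduces task["output"]
            return float(run_program(task["program"], answer) == task["output"])
        if task_type == "induction":   # answer = a synthesized program matching held-out I/O pairs
            return float(all(run_program(answer, x) == y for x, y in task["pairs"]))
    except Exception:
        return 0.0
    return 0.0

def r_propose(solver_rewards):
    """Learnability reward for the proposer: a task the current solver always solves (too easy)
    or never solves (unsolvable) earns zero; otherwise, harder tasks earn more."""
    mean_solve = sum(solver_rewards) / len(solver_rewards)
    return 0.0 if mean_solve in (0.0, 1.0) else 1.0 - mean_solve
```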

The training loop is seeded with just the identity function (`lambda x: x`), meaning the initial task is simply to echo the input. Over time, the model builds a rich curriculum entirely from its own generations and maintains a replay buffer of past triplets to increase proposer diversity. The joint objective smoothly regulates both proposed-problem difficulty ($r_\text{propose}$) and solver ability ($r_\text{solve}$): because proposals are encouraged to increase in difficulty at the pace of the solver, no manually tuned curriculum schedule or external data is required.
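
Putting the pieces together, the sketch below mirrors the described loop, reusing `r_solve` and `r_propose` from above. `propose_task`, `solve_task`, and `rl_update` are hypothetical model methods, and the buffer handling and rollout counts are assumptions rather than the paper's exact recipe:

```python
import random

def azr_self_play(model, run_program, n_steps=1000, n_rollouts=8):
    """Self-play loop: the same model proposes and solves tasks, with a code
    executor as the only source of ground truth."""
    # Seed the curriculum with the identity program: the first task is to echo the input.
    buffer = [{"program": "lambda x: x", "input": 42, "output": 42}]

    for _ in range(n_steps):
        task_type = random.choice(["deduction", "abduction", "induction"])

        # Propose: condition on a few past triplets from the replay buffer for diversity.
        references = random.sample(buffer, k=min(3, len(buffer)))
        task = model.propose_task(task_type, references)

        # Verify the proposal with the executor; programs that crash are discarded.
        # (For induction the proposal would carry several I/O pairs; elided here.)
        try:
            task["output"] = run_program(task["program"], task["input"])
        except Exception:
            continue

        # Solve: multiple rollouts of the same model estimate the task's solve rate.
        solver_rewards = [
            r_solve(task_type, task, model.solve_task(task_type, task), run_program)
            for _ in range(n_rollouts)
        ]
        proposer_reward = r_propose(solver_rewards)

        # Update both roles; advantages are estimated separately per task type and per role.
        model.rl_update(task_type, proposer_reward, solver_rewards)

        # Keep the verified triplet so future proposals can build on it.
        buffer.append(task)
```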


Despite zero in-domain supervision, AZR achieves +13.2% over its 14B base model (+10.2% for the 7B) on a blend of code and math reasoning tasks, surpassing models trained with curated examples such as ACECoder-RM (22k coding traces) and PRIME-Zero (484k math examples). The authors find that the three task types are complementary and that removing any single one severely degrades final performance. Notably, the synthesized programs often include inline comments preceding code blocks, resembling explicit planning or step-by-step reasoning before execution. This kind of trace is typically expensive to collect via human annotation, so its emergence during self-play is especially promising. It is unclear, however, whether this behavior arises in experiments with just the pre-trained base model or is a byproduct of instruction tuning for general code clarity.

Since $r_\text{propose}$ can be thought of as a ceiling on how much the solver can improve in the near future by training on the proposed tasks, the proposer's ability to bootstrap good supervision is paramount. The authors note that the replay buffer is crucial for seeding the proposer and broadening coverage of the problem space (a 5% drop on math without it). Moreover, since the solver regulates the proposer, its initial coding competency appears to be a necessary catalyst for reasoning: applying AZR to Qwen2.5-7B-Coder yields a +15.2% improvement even on math, compared to +10.9% from the base model.