
Efficient RL for Large Language Models
with Intrinsic Exploration (PREPO)

Yan Sun1,2, Jia Guo2, Stanley Kok1, Zihao Wang2, Zujie Wen2, Zhiqiang Zhang2

1National University of Singapore, 2Ant Group

NeurIPS 2025 Efficient Reasoning Workshop

TL;DR

PREPO reduces the training cost of Reinforcement Learning with Verifiable Rewards (RLVR) by using two intrinsic metrics, Prompt Perplexity and Rollout Entropy, to filter training data and guide exploration.

1. The Problem

  • Costly: Standard RLVR generates thousands of rollouts per training run.
  • Inefficient: Many prompts are too easy or too hard, so all rollouts in a group receive the same reward and the group-relative advantage is zero (see the sketch below).
  • Goal: Data-efficient RLVR training that exploits intrinsic properties of the data.
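
A minimal sketch of the zero-advantage issue, assuming GRPO-style group-relative advantages (a common RLVR setup, not necessarily the paper's exact formulation): when every rollout of a prompt gets the same reward, the normalized advantages are all zero and the prompt contributes no learning signal.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize rewards within one prompt's rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([1, 1, 1, 1]))  # too easy: all advantages ~0 -> wasted rollouts
print(group_advantages([0, 0, 0, 0]))  # too hard: all advantages ~0 -> wasted rollouts
print(group_advantages([1, 0, 1, 0]))  # mixed outcomes: non-zero advantages, useful gradient
```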

2. Preliminary Analysis

(Figure: preliminary analysis)

3. Method: Online Prompt Selection

Strategy: Prompt Perplexity

At every training step, use prompt perplexity as a proxy to select the actual training batch $\mathcal{I}_{\rho}$ from a larger candidate batch $\mathcal{B}$. Training proceeds from low-PPL to high-PPL prompts (an easy-to-hard schedule); a selection sketch follows the figure below.

(Figure: ppl-schedule.png, the low-to-high perplexity prompt schedule)
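
A minimal sketch of the selection step, assuming a simple interface where each candidate prompt comes with per-token log-probabilities from the current policy; the helper names (`prompt_perplexity`, `select_batch`) and the keep-ratio `rho` are illustrative, not the repository's API.

```python
import math

def prompt_perplexity(token_logprobs):
    """PPL = exp(-mean log p(token)) over the prompt's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_batch(candidates, rho=0.5):
    """Keep the rho fraction of the candidate batch B with the lowest perplexity."""
    ranked = sorted(candidates, key=lambda c: prompt_perplexity(c[1]))
    keep = max(1, int(rho * len(ranked)))
    return [prompt for prompt, _ in ranked[:keep]]

# Toy example: four candidate prompts with per-token log-probs from the policy.
batch = [
    ("p1", [-0.2, -0.3, -0.1]),
    ("p2", [-1.5, -2.0, -1.8]),
    ("p3", [-0.6, -0.7, -0.9]),
    ("p4", [-1.1, -1.3, -1.0]),
]
print(select_batch(batch, rho=0.5))  # -> ['p1', 'p3'] (the two lowest-PPL prompts)
```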

4. Method: Rollout Weighting

Strategy: Relative Entropy

Prioritize diverse reasoning paths by weighting each rollout $o_i$ by its average token-level entropy ($V$ denotes the vocabulary):

$$\bar{H}_i = -\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \sum_{v \in V} p(v|x_{t}) \log p(v|x_{t})$$
(Figure: rollout_1.png)
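
A minimal sketch of the entropy computation and weighting, assuming per-token logits of shape `[G, T, V]` for a group of G rollouts; normalizing by the group mean is one plausible way to turn $\bar{H}_i$ into relative weights and is an assumption here, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def rollout_entropy(logits, mask):
    """Average token-level entropy per rollout.
    logits: [G, T, V] per-token logits; mask: [G, T] with 1 for real tokens."""
    logp = F.log_softmax(logits, dim=-1)                     # [G, T, V]
    token_h = -(logp.exp() * logp).sum(dim=-1)               # [G, T] token entropies
    return (token_h * mask).sum(dim=-1) / mask.sum(dim=-1)   # [G] mean over |o_i|

def relative_weights(avg_entropy):
    """Normalize entropies within the rollout group to obtain relative weights."""
    return avg_entropy / avg_entropy.mean().clamp_min(1e-8)

G, T, V = 4, 16, 32                 # 4 rollouts, 16 tokens, toy vocabulary of 32
logits = torch.randn(G, T, V)
mask = torch.ones(G, T)
h_bar = rollout_entropy(logits, mask)
print(relative_weights(h_bar))      # weights near 1, larger for more diverse rollouts
```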

5. Results

Tested on Qwen and Llama models across MATH500, AIME, and Olympiad benchmarks.

| Model | Method | Avg. Acc. | Rollouts |
| --- | --- | --- | --- |
| Qwen2.5-Math-7B | Random | 39.45% | 905K |
| Qwen2.5-Math-7B | PREPO | 39.59% | 540K (1.7× fewer) |
| Qwen3-4B | Random | 71.33% | 553K |
| Qwen3-4B | PREPO | 75.99% | 348K (1.6× fewer) |
(Figure: perf_v2.svg)