A training-free testbed for dense supervision
Evaluate a reward signal before you train.
Long-horizon LLM agents take hundreds of actions, yet an outcome-only reward tells them nothing about the steps along the way. QVal measures whether a dense supervision signal is value-aligned — whether it orders actions the way reference Q-values do — so signal quality can be judged on common ground, cheaply, and apart from the engineering of any training pipeline.
The tool
A testbed for the post-training community
What QVal measures, why value-alignment is the right target, and how to install, run, and extend it on your own methods and environments.
The bottleneck
Dense supervision methods — from intrinsic confidence and self-distillation to embedding similarity and generated code — all try to score intermediate steps. But the field evaluates them by dropping each into a full post-training pipeline and reading off the downstream accuracy gain.
That is expensive, it conflates the quality of the signal with the engineering choices around it — the algorithm, normalization, loss integration, interactions with other signals — and it makes methods that require different training setups impossible to compare. As a result, dense supervision methods are rarely measured on common ground.
“Can we evaluate dense supervision signals in isolation, before expensive post-training runs?”
The idea: value-alignment
A dense supervision method assigns a scalar score k(s, a) to each action.
QVal measures one property: whether that score is a strictly increasing function of the
reference Q-value — i.e. whether it ranks decisions the way their eventual return does.
The reference Q-value is the expected return of continuing from (s, a)
under a strong reference policy — a scripted optimal policy or planner where one exists,
a frontier model where it does not. Predictions live on incompatible scales — raw
scores, code outputs, log-probabilities, embedding distances — so we compare the
ordering they induce, reporting Spearman ρ against the
reference labels. Same inputs, same targets, fixed backbones: the signal is judged on
its own.
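As a minimal sketch of the comparison described above, the snippet below computes Spearman ρ as the Pearson correlation of average ranks, using only the standard library. The Q-values and the two signals are made-up numbers chosen for illustration, and the function names are not part of QVal.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman_rho(scores, q_values):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(scores), average_ranks(q_values)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


# Hypothetical reference Q-values for five (s, a) pairs.
q_ref = [0.10, 0.35, 0.20, 0.90, 0.60]
# Two made-up signals on very different scales that induce the same ordering.
log_probs = [-4.0, -1.5, -3.0, -0.1, -0.7]
raw_scores = [10, 400, 35, 9000, 1200]

rho_log = spearman_rho(log_probs, q_ref)
rho_raw = spearman_rho(raw_scores, q_ref)
rho_reversed = spearman_rho([-x for x in raw_scores], q_ref)
```

Both signals order the five pairs exactly as the reference does, so they receive the same ρ despite their incompatible scales, while negating a signal flips the sign. That scale-invariance is the reason a rank correlation fits this comparison.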
How the testbed works
The same five-stage recipe runs for every environment — which is exactly what makes QVal cheap to run and easy to extend.
- Collect trajectories. Roll agents out in diverse multi-turn environments and record their decision points.
- Sample state–action pairs. Draw (s, a) pairs along the trajectories as evaluation points.
- Label with reference Q-values. Restore each state, force the action, follow the reference policy, and record the discounted return (a Max-Value Monte Carlo estimate).
- Collect method predictions. Run the method under test to score every pair.
- Evaluate by rank correlation. Report Spearman ρ between the method's scores and the reference labels.
For one sampled point:
- Collect & sample: pluck a decision point → (s, a)
- Label (MVMC): best of k rollouts → Q*(s, a)
- Predict: the method scores it → k(s, a)
- Measure value-alignment: compare the ordering of k(s, a) with that of Q*(s, a)
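The labeling step can be sketched on a toy problem. The code below assumes that Max-Value Monte Carlo means taking the maximum discounted return over several rollouts ("best of k rollouts"). The corridor environment, the slip probability, the discount factor, and all function names are invented for illustration and are not QVal's or llenvs' API.

```python
import random

GOAL, MAX_STEPS, GAMMA = 5, 12, 0.95


def step(state, action, rng, slip=0.2):
    """Stateless transition on a toy corridor: returns (next_state, reward, done).

    state is (position, steps_taken); action is -1 or +1. With probability
    `slip` the move fails and the agent stays in place. All randomness comes
    from the `rng` argument, so the function keeps no hidden state.
    """
    pos, t = state
    if rng.random() >= slip:
        pos = max(0, min(GOAL, pos + action))
    done = pos == GOAL or t + 1 >= MAX_STEPS
    reward = 1.0 if pos == GOAL else 0.0
    return (pos, t + 1), reward, done


def reference_policy(state):
    """Scripted optimal policy for the corridor: always move toward the goal."""
    return +1


def rollout_return(state, action, rng, slip=0.2):
    """Force `action` at `state`, then follow the reference policy to the end."""
    ret, discount = 0.0, 1.0
    while True:
        state, reward, done = step(state, action, rng, slip)
        ret += discount * reward
        if done:
            return ret
        discount *= GAMMA
        action = reference_policy(state)


def mvmc_label(state, action, num_rollouts=8, seed=0, slip=0.2):
    """Best discounted return over `num_rollouts` rollouts (the k in 'best of k')."""
    rng = random.Random(seed)
    return max(rollout_return(state, action, rng, slip) for _ in range(num_rollouts))


start = (0, 0)
q_forward = mvmc_label(start, +1)   # forced first action moves toward the goal
q_backward = mvmc_label(start, -1)  # forced first action wastes a step
```

Because `step` takes the state as an argument instead of mutating an environment object, a rollout can begin from any recorded decision point, which is what makes labeling arbitrary (s, a) pairs possible.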
Quickstart
QVal is plug-and-play. The benchmark catalog ships pre-populated and its datasets come pre-collected, so evaluating the built-in methods on the built-in environments is four commands — and you never hand-write a config. One catalog is the single source of truth: the configs are generated from it, and the final step discovers and pairs every result on its own.
# clone llenvs alongside this repo, then:
uv pip install -e ".[all]"
python scripts/generate_configs.py
python scripts/pipeline/predict.py --config \
catalogs/qval_benchmark/configs/prediction/frozen_lake/100pt_8x8_q35-27-or_text.yaml
# discovers every prediction, pairs each
# GT/method combo, writes the correlations
python scripts/pipeline/evaluate.py
Models are served through the backends declared in
shared/configs/backends.yaml — add an API key (e.g. OpenRouter) or point at a
local vLLM server. For full runs, the wrappers in scripts/slurm/ submit each
step as a cluster job.
Extend the testbed
QVal is a tool first. Everything is anchored on one registry catalog
(catalogs/qval_benchmark/catalog.py): add a method or an environment by
appending a spec there and regenerating with generate_configs.py — it then runs
on every applicable model, environment, and modality, with no other plumbing.
A new method
A method turns a (state, action, next_state) step into a scalar.
Subclass DenseSignalMethod and implement
evaluate (override evaluate_batch if it batches), register
its type in method_factory.py, then append a
MethodSpec to the catalog and regenerate.
class MyMethod(DenseSignalMethod):
    def evaluate(self, point: EvaluationPoint) -> float:
        # point.state / action / next_state / history;
        # self.context: task & reward text, signal_type …
        return float(...)  # math.nan on failure
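To make that contract concrete, here is a self-contained sketch. `EvaluationPoint` and `DenseSignalMethod` below are simplified stand-ins written for this example, not QVal's real classes. The default `evaluate_batch` that loops over `evaluate` is an assumption, and the `pos=… goal=…` state format and the `DistanceToGoal` method are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class EvaluationPoint:
    """Stand-in with the fields named in the text; the real class lives in QVal."""
    state: str
    action: str
    next_state: str
    history: list = field(default_factory=list)


class DenseSignalMethod:
    """Stand-in base class, not QVal's implementation."""

    def evaluate(self, point):
        raise NotImplementedError

    def evaluate_batch(self, points):
        # Assumed default: score points one by one; override when batching helps.
        return [self.evaluate(p) for p in points]


class DistanceToGoal(DenseSignalMethod):
    """Hypothetical signal: negative distance to the goal after the step.

    Expects next_state text such as "pos=3 goal=5" (an invented format).
    """

    def evaluate(self, point):
        try:
            fields = dict(tok.split("=") for tok in point.next_state.split())
            return -abs(float(fields["goal"]) - float(fields["pos"]))
        except (KeyError, ValueError):
            return math.nan  # signal a failure instead of raising


method = DistanceToGoal()
points = [
    EvaluationPoint("pos=2 goal=5", "right", "pos=3 goal=5"),
    EvaluationPoint("pos=2 goal=5", "left", "pos=1 goal=5"),
    EvaluationPoint("pos=2 goal=5", "right", "unparseable"),
]
scores = method.evaluate_batch(points)
```

Returning `math.nan` for the unparseable point keeps the batch aligned with its inputs, so failed points can be handled separately before the rank correlation is computed.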
A new environment
Environments come through llenvs, a unified, stateless interface (env.step(state, action)
is pure) that makes Monte-Carlo rollouts from arbitrary states possible. Write an
environment-context YAML, add an EnvironmentSpec to the catalog,
collect a dataset once, and regenerate.
adapter: my_env  # an llenvs adapter
env_name: my_env:eval
max_steps: 40
reward_signal_name: task_completion
task_description: >-
  The agent acts in …
evaluator_extractor:  # method text → number
  type: numeric
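The `evaluator_extractor` entry maps a method's text output to a number. One plausible reading of a numeric extractor is sketched below. The function name, the regular expression, and the choice to take the last number in the text are all assumptions, not QVal's actual behavior.

```python
import math
import re

# Optional sign, digits with an optional fraction (or a bare fraction),
# and an optional exponent.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def extract_numeric(text):
    """Return the last number found in `text`, or NaN if there is none."""
    matches = _NUMBER.findall(text)
    return float(matches[-1]) if matches else math.nan


value = extract_numeric("Estimated value: 0.75")
missing = extract_numeric("I cannot estimate this.")
```

Taking the last match suits outputs that reason first and answer at the end. A different convention, such as the first match or a tagged field, would work equally well as long as it is applied uniformly to every method.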
The study
Applying QVal at scale
We point the testbed at the field — 21 methods across seven families, six backbones, and four environments, over 1.2K experiments — and report what actually aligns with value.
Four environments at release
Spanning text and vision, from closed-action navigation to open-ended shell use.
QVal-v1.0 — the benchmark
To demonstrate the testbed, we benchmark 21 dense supervision methods spanning seven families across six open-weight backbones (9B–122B) and four environments, in text and vision — over 1.2K evaluation experiments. Explore every method's value-alignment below.
What we found
The interactive table above holds the full numbers; here is what they add up to. Every result below holds across model sizes, environments, and observation modalities.
Main results
Simple prompting is the strongest baseline.
Ranking and direct value prediction achieve the highest value-alignment in every environment and backbone, consistently out-aligning more recent and more specialized dense-supervision methods. Direct value elicitation should be treated as a required baseline, not an afterthought.
Performance clusters by family.
Correlations fall into tight bands within each of the seven families, which supports the taxonomy: the family a method belongs to predicts its behavior better than its individual design tricks. Code methods are the exception, with by far the widest spread — their effectiveness hinges on how readily a state and action space can be captured in code.
Added complexity rarely helps.
Within a family, elaborate variants don't reliably beat the simplest one.
Multi-estimate and batched/sequential direct prompting don't clearly beat
direct-single; privileged self-distillation (sdpo-gt)
doesn't improve on plain sdpo; averaging generated functions
(codegen-avg) raises the mean over codegen only slightly
and does not reduce the variance. QVal exposes this without running a single training job.
Difficulty doesn't predict alignment.
From closed-action FrozenLake to open-ended TerminalBench, value-alignment does not fall monotonically with task difficulty. Direct prompting stays positive everywhere; code and ranking weaken in open-ended settings — code even turns negative on TerminalBench — while self-distillation does the opposite, growing stronger where richer intermediate feedback is available. Alignment depends on a method's interaction with each environment, not on difficulty alone.
Robustness
Text beats vision.
On environments that provide both, methods recover reference values more reliably from text than from images. Parsing pixels is harder, and for the model–method pairs we examine the extra visual context does not make up for it. This reflects the abstraction available to each modality, not an intrinsic inferiority of visual feedback.
Rankings survive the choice of target.
Relabelling with state-values V(s) instead of Q-values largely
preserves the method ordering, so the conclusions don't hinge on a single target.
Absolute scores do shift — code and pre-trained methods align better with
state-values, direct prompting with Q-values — reflecting how each method consumes
its input.
Labels are robust to the reference backbone.
On TerminalBench, reference values estimated by two independent frontier models (GPT-5.5 and Claude Opus 4.7) yield closely matching method correlations. The labels capture a stable notion of downstream progress rather than one model's idiosyncrasies.
What it means
Signal quality is conflated with training recipes.
Because plain prompting is so competitive when measured directly, much of the reported progress of complex methods may stem from changes in data, compute, exploration, prompting, or optimization rather than from a better dense signal. Measuring alignment first separates the two.
A low score rules a method out; a high score is a green light, not a guarantee.
QVal measures signal quality on its own and leaves integration — normalization against other signals, the optimizer, interactions with the environment — as a separate downstream question. It is a cheap diagnostic that filters candidate signals before expensive training runs, not a replacement for them.