A training-free testbed for dense supervision

Evaluate a reward signal before you train.

Long-horizon LLM agents take hundreds of actions, yet an outcome-only reward tells them nothing about the steps along the way. QVal measures whether a dense supervision signal is value-aligned — whether it orders actions the way reference Q-values do — so signal quality can be judged on common ground, cheaply, and apart from the engineering of any training pipeline.


Sergio Hernández-Gutiérrez · Matteo Merler · Ilze Amanda Auzina · Joschka Strüber · Ameya Prabhu · Matthias Bethge
Tübingen AI Center · Fondazione Bruno Kessler · arXiv preprint coming soon


The tool

A testbed for the post-training community

What QVal measures, why value-alignment is the right target, and how to install, run, and extend it on your own methods and environments.

The bottleneck

Dense supervision methods — from intrinsic confidence and self-distillation to embedding similarity and generated code — all try to score intermediate steps. But the field evaluates them by dropping each into a full post-training pipeline and reading off the downstream accuracy gain.

That is expensive, it conflates the quality of the signal with the engineering choices around it — the algorithm, normalization, loss integration, interactions with other signals — and it makes methods that require different training setups impossible to compare. As a result, dense supervision methods are rarely measured on common ground.

“Can we evaluate dense supervision signals in isolation, before expensive post-training runs?”

The idea: value-alignment

A dense supervision method assigns a scalar score k(s, a) to each action. QVal measures one property: whether that score is a strictly increasing function of the reference Q-value — i.e. whether it ranks decisions the way their eventual return does.

k(s, a) = φ(Q^π(s, a)) where φ is strictly increasing

The reference Q-value is the expected return of continuing from (s, a) under a strong reference policy — a scripted optimal policy or planner where one exists, a frontier model where it does not. Predictions live on incompatible scales — raw scores, code outputs, log-probabilities, embedding distances — so we compare the ordering they induce, reporting Spearman ρ against the reference labels. Same inputs, same targets, fixed backbones: the signal is judged on its own.
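To make the rank-based comparison concrete, here is a minimal sketch in plain Python. The helper names (`average_ranks`, `spearman_rho`) and the toy numbers are illustrative assumptions, not part of QVal. It shows that a signal on a completely different scale still reaches ρ = 1 as long as it is a strictly increasing function of the reference values.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(scores, reference):
    """Spearman rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(scores), average_ranks(reference)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return math.nan  # a constant signal induces no ordering
    return cov / math.sqrt(var_x * var_y)


# Toy reference Q-values for five (s, a) pairs (made-up numbers).
q_ref = [0.10, 0.80, 0.35, 0.00, 0.60]

# A signal on a different scale that is a strictly increasing function of Q.
aligned = [math.log(q + 1e-3) for q in q_ref]
# A signal that reverses the ordering.
reversed_signal = [-q for q in q_ref]

rho_aligned = spearman_rho(aligned, q_ref)
rho_reversed = spearman_rho(reversed_signal, q_ref)
```

Because only ranks enter the computation, log-probabilities, distances, and raw scores can all be compared against the same labels without any rescaling.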

How the testbed works

The same five-stage recipe runs for every environment — which is exactly what makes QVal cheap to run and easy to extend.

Collect trajectories. Roll agents out in diverse multi-turn environments and record their decision points.
Sample state–action pairs. Draw (s, a) pairs along the trajectories as evaluation points.
Label with reference Q-values. Restore each state, force the action, follow the reference policy over k rollouts, and record the best discounted return: a Max-Value Monte Carlo (MVMC) estimate.
Collect method predictions. Run the method under test to score every pair.
Evaluate by rank correlation. Report Spearman ρ between the method's scores and the reference labels.
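The labeling stage can be sketched on a toy problem. Everything below is an assumption made for illustration: the corridor environment, the noisy stand-in reference policy, the discount factor, the rollout count, and the function names. It is not QVal's implementation; it only shows the restore, force, follow, and take-the-best pattern.

```python
import random


def step(state, action):
    """Pure transition for a toy corridor: positions 0..5, goal at 5.

    Returns (next_state, reward, done) without mutating anything.
    """
    next_state = max(0, min(5, state + action))
    done = next_state == 5
    return next_state, (1.0 if done else 0.0), done


def noisy_reference_policy(state, rng):
    """Stand-in reference policy: heads to the goal, slips back 20% of the time."""
    return 1 if rng.random() > 0.2 else -1


def discounted_return(state, action, policy, rng, gamma=0.95, max_steps=20):
    """Force `action` from `state`, then follow `policy` and sum discounted rewards."""
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        state, reward, done = step(state, action)
        total += discount * reward
        if done:
            break
        discount *= gamma
        action = policy(state, rng)
    return total


def mvmc_label(state, action, policy, k=8, seed=0):
    """Max-value Monte Carlo label: the best return over k independent rollouts."""
    rng = random.Random(seed)
    return max(discounted_return(state, action, policy, rng) for _ in range(k))


# Label two candidate actions from the same restored state.
q_forward = mvmc_label(state=3, action=+1, policy=noisy_reference_policy)
q_backward = mvmc_label(state=3, action=-1, policy=noisy_reference_policy)
```

Taking the maximum rather than the mean makes the label less sensitive to slips by the reference policy, so an action is judged by the best outcome it still permits.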

[Figure: the pipeline for one sampled point. Collect & sample: pluck a decision point → (s, a). Label (MVMC): best of k rollouts → Q*(s, a). Predict: the method scores it → k(s, a). Measure value-alignment: Spearman ρ.]

From trajectories to a value-aligned score: one sampled pair, repeated across the dataset.

Quickstart

QVal is plug-and-play. The benchmark catalog ships pre-populated and its datasets come pre-collected, so evaluating the built-in methods on the built-in environments is four commands — and you never hand-write a config. One catalog is the single source of truth: the configs are generated from it, and the final step discovers and pairs every result on its own.

install

# clone llenvs alongside this repo, then:
uv pip install -e ".[all]"

1 · generate configs from the catalog

python scripts/generate_configs.py

2 · predict — methods + Monte-Carlo ground truth

python scripts/pipeline/predict.py --config \
  catalogs/qval_benchmark/configs/prediction/frozen_lake/100pt_8x8_q35-27-or_text.yaml

3 · evaluate — no arguments, auto-paired

# discovers every prediction, pairs each
# GT/method combo, writes the correlations
python scripts/pipeline/evaluate.py

Models are served through the backends declared in shared/configs/backends.yaml — add an API key (e.g. OpenRouter) or point at a local vLLM server. For full runs, the wrappers in scripts/slurm/ submit each step as a cluster job.

Extend the testbed

QVal is a tool first. Everything is anchored on one registry catalog (catalogs/qval_benchmark/catalog.py): add a method or an environment by appending a spec there and regenerating with generate_configs.py — it then runs on every applicable model, environment, and modality, with no other plumbing.
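As a rough picture of what "one catalog, generated configs" can look like, the sketch below expands a small registry into one config per applicable combination. All names here are hypothetical: the `MethodSpec` and `EnvironmentSpec` stand-ins, their fields, the placeholder entries, and the `generate_configs` function are assumptions for illustration and do not reflect the contents of catalogs/qval_benchmark/catalog.py or scripts/generate_configs.py.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MethodSpec:  # hypothetical stand-in, not QVal's real MethodSpec
    name: str
    modalities: tuple  # observation modalities the method can consume


@dataclass(frozen=True)
class EnvironmentSpec:  # hypothetical stand-in, not QVal's real EnvironmentSpec
    name: str
    modalities: tuple  # observation modalities the environment provides


# Placeholder catalog entries (made-up names).
METHODS = [
    MethodSpec("method_a", ("text", "vision")),
    MethodSpec("method_b", ("text",)),
]
ENVIRONMENTS = [
    EnvironmentSpec("env_text_vision", ("text", "vision")),
    EnvironmentSpec("env_text_only", ("text",)),
]
MODELS = ["backbone_a", "backbone_b"]


def generate_configs(methods, environments, models):
    """Expand the catalog into one config per applicable combination."""
    configs = []
    for method, env, model in product(methods, environments, models):
        # Only modalities supported by both the method and the environment apply.
        shared = [m for m in env.modalities if m in method.modalities]
        for modality in shared:
            configs.append({"method": method.name, "environment": env.name,
                            "model": model, "modality": modality})
    return configs


configs = generate_configs(METHODS, ENVIRONMENTS, MODELS)
```

With this shape, appending one spec to a list is enough to fan a new method or environment out across every compatible model and modality.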

+ A new method

A method turns a (state, action, next_state) step into a scalar. Subclass DenseSignalMethod and implement evaluate (override evaluate_batch if it batches), register its type in method_factory.py, then append a MethodSpec to the catalog and regenerate.

my_method.py

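import math  # for math.nan on failure
# DenseSignalMethod and EvaluationPoint come from the QVal codebase.

class MyMethod(DenseSignalMethod):
    def evaluate(self, point: EvaluationPoint) -> float:
        # point exposes state / action / next_state / history;
        # self.context carries task & reward text, signal_type …
        return float(...)        # your score here; return math.nan on failure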
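For a feel of what a filled-in method might look like, here is a self-contained toy. The `EvaluationPoint` and `DenseSignalMethod` classes below are minimal stand-ins written for this example (only the field and method names come from the template above), and `GoalOverlapMethod` is a made-up heuristic, not one of the benchmarked methods.

```python
import math
from dataclasses import dataclass, field


@dataclass
class EvaluationPoint:  # stand-in with the fields named in the template
    state: str
    action: str
    next_state: str
    history: list = field(default_factory=list)


class DenseSignalMethod:  # stand-in base class, not QVal's implementation
    def evaluate(self, point):
        raise NotImplementedError

    def evaluate_batch(self, points):
        # Default: score points one at a time.
        return [self.evaluate(p) for p in points]


class GoalOverlapMethod(DenseSignalMethod):
    """Toy signal: fraction of goal words that appear in the next state."""

    def __init__(self, goal_text):
        self.goal_words = set(goal_text.lower().split())

    def evaluate(self, point):
        try:
            if not self.goal_words:
                return math.nan
            seen = set(point.next_state.lower().split())
            return len(self.goal_words & seen) / len(self.goal_words)
        except AttributeError:
            return math.nan  # malformed point: report failure, do not crash


method = GoalOverlapMethod("mug on table")
points = [
    EvaluationPoint("holding mug", "put mug on table", "mug on table"),
    EvaluationPoint("holding mug", "drop mug", "mug on floor"),
]
scores = method.evaluate_batch(points)
```

Returning `math.nan` instead of raising keeps one bad point from aborting a whole run and lets the evaluation step decide how to treat failures.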

+ A new environment

Environments come through llenvs, a unified, stateless interface (env.step(state, action) is pure) that makes Monte-Carlo rollouts from arbitrary states possible. Write an environment-context YAML, add an EnvironmentSpec to the catalog, collect a dataset once, and regenerate.

shared/configs/environments/my_env.yaml

adapter: my_env               # an llenvs adapter
env_name: my_env:eval
max_steps: 40
reward_signal_name: task_completion
task_description: >-
  The agent acts in …
evaluator_extractor:          # method text → number
  type: numeric
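Why a pure `step(state, action)` matters can be shown with a small sketch. The grid, the `GridState` type, and the `(next_state, reward, done)` return shape are assumptions for illustration; this is not the llenvs API, only the stateless pattern it is described as following.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GridState:  # hypothetical immutable state type
    row: int
    col: int
    steps_taken: int = 0


MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIZE, GOAL = 4, (3, 3)


def step(state, action):
    """Pure transition: returns (next_state, reward, done), never mutates `state`."""
    d_row, d_col = MOVES[action]
    row = min(SIZE - 1, max(0, state.row + d_row))
    col = min(SIZE - 1, max(0, state.col + d_col))
    next_state = replace(state, row=row, col=col, steps_taken=state.steps_taken + 1)
    done = (row, col) == GOAL
    return next_state, (1.0 if done else 0.0), done


# Because step is pure, one saved state can be branched on every action.
saved = GridState(row=3, col=2)
branches = {action: step(saved, action) for action in MOVES}
```

Since the environment holds no hidden state, restoring a decision point is just reusing the saved value, which is what makes labeling arbitrary (s, a) pairs by rollout practical.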

The study

Applying QVal at scale

We point the testbed at the field — 21 methods across seven families, six backbones, and four environments, over 1.2K experiments — and report what actually aligns with value.

Four environments at release

Spanning text and vision, from closed-action navigation to open-ended shell use.

FrozenLake: goal-directed navigation
ALFWorld: embodied reasoning
OpenApps: computer & application use
TerminalBench: programming & agentic terminal

QVal-v1.0 — the benchmark

To demonstrate the testbed, we benchmark 21 dense supervision methods spanning seven families across six open-weight backbones (9B–122B) and four environments, in text and vision — over 1.2K evaluation experiments. Explore every method's value-alignment below.

21 methods · 7 families · 6 backbones · 4 environments · 1.2K+ experiments · 0 training runs

What we found

The interactive table above holds the full numbers; here is what they add up to. Every result below holds across model sizes, environments, and observation modalities.

Main results

Simple prompting is the strongest baseline.

Ranking and direct value prediction achieve the highest value-alignment in every environment and backbone, consistently out-aligning more recent and more specialized dense-supervision methods. Direct value elicitation should be treated as a required baseline, not an afterthought.

Performance clusters by family.

Correlations fall into tight bands within each of the seven families, which supports the taxonomy: the family a method belongs to predicts its behavior better than its individual design tricks. Code methods are the exception, with by far the widest spread — their effectiveness hinges on how readily a state and action space can be captured in code.

Added complexity rarely helps.

Within a family, elaborate variants don't reliably beat the simplest one. Multi-estimate and batched/sequential direct prompting don't clearly beat direct-single; privileged self-distillation (sdpo-gt) doesn't improve on plain sdpo; averaging generated functions (codegen-avg) lifts the mean over codegen only slightly while leaving the variance. QVal exposes this without running a single training job.

Difficulty doesn't predict alignment.

From closed-action FrozenLake to open-ended TerminalBench, value-alignment does not fall monotonically with task difficulty. Direct prompting stays positive everywhere; code and ranking weaken in open-ended settings — code even turns negative on TerminalBench — while self-distillation does the opposite, growing stronger where richer intermediate feedback is available. Alignment depends on a method's interaction with each environment, not on difficulty alone.

Robustness

Text beats vision.

On environments that provide both, methods recover reference values more reliably from text than from images. Parsing pixels is harder, and for the model–method pairs we examine the extra visual context does not compensate. This reflects the abstraction available to each modality, not an intrinsic inferiority of visual feedback.

Rankings survive the choice of target.

Relabelling with state-values V(s) instead of Q-values largely preserves the method ordering, so the conclusions don't hinge on a single target. Absolute scores do shift — code and pre-trained methods align better with state-values, direct prompting with Q-values — reflecting how each method consumes its input.

Labels are robust to the reference backbone.

On TerminalBench, reference values estimated by two independent frontier models (GPT-5.5 and Claude Opus 4.7) yield closely matching method correlations. The labels capture a stable notion of downstream progress rather than one model's idiosyncrasies.

What it means

Signal quality is conflated with training recipes.

Because plain prompting is so competitive when measured directly, much of the reported progress of complex methods may stem from changes in data, compute, exploration, prompting, or optimization rather than from a better dense signal. Measuring alignment first separates the two.

A low score rules a method out; a high score is a green light, not a guarantee.

QVal measures signal quality on its own and leaves integration — normalization against other signals, the optimizer, interactions with the environment — as a separate downstream question. It is a cheap diagnostic that filters candidate signals before expensive training runs, not a replacement for them.

Authors & citation

Sergio Hernández-Gutiérrez¹, Matteo Merler²,*, Ilze Amanda Auzina¹,*, Joschka Strüber¹, Ameya Prabhu¹,†, Matthias Bethge¹,†

¹ Tübingen AI Center, University of Tübingen · ² Fondazione Bruno Kessler · * equal contribution † equal advising
Correspondence: sergio.hernandez@bethgelab.org
