Scoring

How a single run becomes a number, and how two runs become a verdict.

The composite

Each (task, config) run yields a composite score in [0, 1], combining three dimensions:

obj01   = objective checks passed / total            (1.0 when a task has no checks)
judge01 = judge.score0to1                             (omitted when no rubric/judge)
costAdjustment ∈ [−COST_WEIGHT, +COST_WEIGHT]         (0 for the baseline run)

base    = (W_OBJ · obj01 + W_JUDGE_eff · judge01) / (W_OBJ + W_JUDGE_eff)
composite = clamp(base + costAdjustment, 0, 1)

Constants (src/scoring/composite.ts):

Constant	Value	Role
`W_OBJ`	`0.6`	Objective weight.
`W_JUDGE`	`0.4`	Judge weight. `W_JUDGE_eff` equals `W_JUDGE` when a judge score exists and `0` when it doesn't, so without a judge the objective score alone determines `base`.
`COST_WEIGHT`	`0.1`	Magnitude cap of the cost adjustment.

So objective correctness dominates, the judge refines it, and cost nudges.
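
As a minimal sketch of the formula above (not the actual contents of src/scoring/composite.ts), the re-weighting can be written as follows. The function name `compositeScore` and its signature are assumptions made for illustration; only the weights come from the table.

```typescript
// Weights as documented in the constants table.
const W_OBJ = 0.6;
const W_JUDGE = 0.4;

const clamp = (x: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, x));

// Hypothetical signature: judge01 is undefined when no judge score exists.
function compositeScore(
  obj01: number,
  judge01: number | undefined,
  costAdjustment: number,
): number {
  // Drop the judge weight entirely when there is no judge score.
  const wJudgeEff = judge01 === undefined ? 0 : W_JUDGE;
  const base =
    (W_OBJ * obj01 + wJudgeEff * (judge01 ?? 0)) / (W_OBJ + wJudgeEff);
  return clamp(base + costAdjustment, 0, 1);
}
```

With no judge score, `compositeScore(0.5, undefined, 0)` reduces to the objective score alone, because the judge weight drops out of both the numerator and the denominator.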

Objective dimension

runObjectiveChecks executes every check against the post-run sandbox and returns { passed, total }. obj01 = passed/total; a task with no checks scores a neutral 1.0 on this axis (it can't fail what it doesn't assert).
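
The mapping from check counts to obj01 can be sketched as below. The `ObjectiveResult` type and the `objectiveScore` helper are hypothetical names; only the `{ passed, total }` shape and the neutral 1.0 rule come from the text.

```typescript
// Shape returned by the objective check runner, per the description above.
interface ObjectiveResult {
  passed: number;
  total: number;
}

// Hypothetical helper: a task with no checks scores a neutral 1.0.
function objectiveScore({ passed, total }: ObjectiveResult): number {
  return total === 0 ? 1.0 : passed / total;
}
```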

Judge dimension

Best-effort and optional. Returns { score0to1, rationale } or undefined. Two interchangeable backends:

SDK (judgeWithLLM) — Anthropic SDK, prompt-cached. Needs ANTHROPIC_API_KEY.
CLI (claudeCliJudge) — runs claude -p --output-format json. Works on a Claude subscription with no API key. Pure text reasoning: no sandbox, no --dangerously-skip-permissions.

If neither is available, the dimension is dropped (re-weighted out). A judging failure never throws, so it cannot break a run.
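
One way to get this never-throws behaviour is a wrapper that tries each backend and swallows failures. This is a sketch under assumptions: the `JudgeBackend` type, the `judgeBestEffort` name, and the try-in-order strategy are illustrative, not the project's actual wiring of judgeWithLLM and claudeCliJudge.

```typescript
interface JudgeResult {
  score0to1: number;
  rationale: string;
}

// Hypothetical backend signature; the real backends may take other arguments.
type JudgeBackend = (prompt: string) => Promise<JudgeResult | undefined>;

// Tries each backend in order; any failure falls through to the next one.
async function judgeBestEffort(
  prompt: string,
  backends: JudgeBackend[],
): Promise<JudgeResult | undefined> {
  for (const backend of backends) {
    try {
      const result = await backend(prompt);
      if (result && Number.isFinite(result.score0to1)) return result;
    } catch {
      // A failing backend must not break the run.
    }
  }
  // No usable score: the caller drops the judge dimension.
  return undefined;
}
```

Returning `undefined` rather than a default score keeps the distinction between "judged poorly" and "not judged", which is what lets the composite re-weight the dimension out.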

Cost dimension

costMetrics projects tokens, tool calls, steps, and wall-clock time from the run. The candidate's cost is scored relative to the baseline (reference) run: a cheaper candidate gets a positive adjustment, a pricier one a negative adjustment, and the magnitude is bounded by COST_WEIGHT. The baseline itself is scored with a neutral cost term.

Cost is the noisy axis on live runs — see Concepts → determinism.
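
The exact cost formula is not spelled out here, so the following is only one plausible shape for a bounded, baseline-relative adjustment. It assumes cost has been collapsed to a single positive scalar (for example total tokens) and that the adjustment scales linearly with relative savings; both are assumptions, as is the `costAdjustment` function name.

```typescript
// Documented cap on the magnitude of the cost adjustment.
const COST_WEIGHT = 0.1;

// Assumption: cost is a single positive scalar, e.g. total tokens.
function costAdjustment(candidateCost: number, baselineCost: number): number {
  // Without a usable reference, stay neutral.
  if (!(baselineCost > 0) || !Number.isFinite(candidateCost)) return 0;
  // Positive when the candidate is cheaper, negative when pricier.
  const relativeSaving = (baselineCost - candidateCost) / baselineCost;
  const bounded = Math.max(-1, Math.min(1, relativeSaving));
  return COST_WEIGHT * bounded;
}
```

Under this sketch, scoring the baseline against itself gives an adjustment of 0, matching the neutral cost term described above.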

From scores to a verdict

The comparator sums per-task composite deltas and applies the thresholds (ThresholdsSchema, all overridable):

Threshold	Default	Effect
`minCompositeGain`	`0.01`	Net gain must exceed this to be `improved` (else `neutral`).
`regressionCompositeDrop`	`0.05`	A per-task composite drop beyond this magnitude is a hard regression.
`objectiveDropIsRegression`	`true`	Any drop in objective checks passed on any task is a hard regression, regardless of gains elsewhere.

Verdict

improved: net gain > minCompositeGain and zero hard regressions.
neutral: zero hard regressions, but net gain does not exceed minCompositeGain.
regressed: at least one hard regression (objective drop, per-task drop beyond threshold, dropped task, or non-finite score).
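
The verdict rules can be sketched as a single pass over per-task pairs. The `TaskPair` shape, the `objPassed` field, and the `decideVerdict` name are hypothetical; the thresholds and their defaults are the ones in the table above.

```typescript
interface Thresholds {
  minCompositeGain: number;
  regressionCompositeDrop: number;
  objectiveDropIsRegression: boolean;
}

const DEFAULT_THRESHOLDS: Thresholds = {
  minCompositeGain: 0.01,
  regressionCompositeDrop: 0.05,
  objectiveDropIsRegression: true,
};

interface TaskScore {
  composite: number;
  objPassed: number;
}

// Hypothetical shape: a missing candidate means the task was dropped.
interface TaskPair {
  baseline: TaskScore;
  candidate?: TaskScore;
}

type Verdict = "improved" | "neutral" | "regressed";

function decideVerdict(
  pairs: TaskPair[],
  t: Thresholds = DEFAULT_THRESHOLDS,
): Verdict {
  let netGain = 0;
  let hardRegressions = 0;
  for (const { baseline, candidate } of pairs) {
    // Dropped tasks and non-finite scores are hard regressions.
    if (
      !candidate ||
      !Number.isFinite(candidate.composite) ||
      !Number.isFinite(baseline.composite)
    ) {
      hardRegressions++;
      continue;
    }
    const delta = candidate.composite - baseline.composite;
    netGain += delta;
    const bigDrop = delta < -t.regressionCompositeDrop;
    const objDrop =
      t.objectiveDropIsRegression && candidate.objPassed < baseline.objPassed;
    if (bigDrop || objDrop) hardRegressions++;
  }
  if (hardRegressions > 0) return "regressed";
  return netGain > t.minCompositeGain ? "improved" : "neutral";
}
```

Checking hard regressions before net gain is what makes a large win on one task unable to mask an objective drop on another.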

Promotion

shouldPromote() returns true only for improved with zero hard regressions — the single condition under which the pr module will open a PR. The gate is fail-closed: ambiguity never promotes.
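
A fail-closed gate of this kind can be sketched as below. This is an illustration, not the real shouldPromote(): the `Comparison` shape and the `shouldPromoteSketch` name are assumptions.

```typescript
type Verdict = "improved" | "neutral" | "regressed";

// Hypothetical comparison result; the real one likely carries more fields.
interface Comparison {
  verdict: Verdict;
  hardRegressions: number;
}

// Fail-closed: anything other than a clean "improved" returns false.
function shouldPromoteSketch(c: Comparison | undefined): boolean {
  if (!c) return false;
  return c.verdict === "improved" && c.hardRegressions === 0;
}
```

The explicit `hardRegressions === 0` check is redundant if the verdict is computed correctly, but keeping it means a bug upstream cannot cause a promotion.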

Tuning

Pass custom thresholds through the orchestrator/programmatic API when the defaults don't fit your risk tolerance — e.g. raise minCompositeGain to demand a larger win, or widen regressionCompositeDrop to tolerate small wobbles. The hard objective rule should generally stay on.
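
As an illustration of layering overrides on the documented defaults, the sketch below merges a partial override object and rejects negative values. The `resolveThresholds` helper and its validation rule are assumptions; the real entry point and ThresholdsSchema may behave differently.

```typescript
interface Thresholds {
  minCompositeGain: number;
  regressionCompositeDrop: number;
  objectiveDropIsRegression: boolean;
}

// Defaults as listed in the thresholds table.
const DEFAULTS: Thresholds = {
  minCompositeGain: 0.01,
  regressionCompositeDrop: 0.05,
  objectiveDropIsRegression: true,
};

// Hypothetical helper: overrides win, everything else keeps its default.
function resolveThresholds(overrides: Partial<Thresholds> = {}): Thresholds {
  const merged: Thresholds = { ...DEFAULTS, ...overrides };
  if (!(merged.minCompositeGain >= 0) || !(merged.regressionCompositeDrop >= 0)) {
    throw new RangeError("thresholds must be non-negative numbers");
  }
  return merged;
}

// Demand a larger win before calling a change "improved".
const strict = resolveThresholds({ minCompositeGain: 0.03 });
// Tolerate small per-task wobbles; the objective rule stays on.
const tolerant = resolveThresholds({ regressionCompositeDrop: 0.08 });
```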