Scoring
How a single run becomes a number, and how two runs become a verdict.
The composite
Each (task, config) run yields a composite score in [0, 1], combining
three dimensions:
obj01 = objective checks passed / total (1.0 when a task has no checks)
judge01 = judge.score0to1 (omitted when no rubric/judge)
base = (W_OBJ · obj01 + W_JUDGE · judge01) / (W_OBJ + W_JUDGE_eff)
composite = clamp(base + costAdjustment, 0, 1)
Constants (src/scoring/composite.ts):
| Constant | Value | Role |
|---|---|---|
W_OBJ |
0.6 |
Objective weight. |
W_JUDGE |
0.4 |
Judge weight. W_JUDGE_eff = 0 when there's no judge score, so objective alone fully determines base. |
COST_WEIGHT |
0.1 |
Magnitude cap of the cost adjustment. |
So objective correctness dominates, the judge refines it, and cost nudges.
Objective dimension
runObjectiveChecks executes every check against the post-run sandbox and
returns { passed, total }. obj01 = passed/total; a task with no checks
scores a neutral 1.0 on this axis (it can't fail what it doesn't assert).
Judge dimension
Best-effort and optional. Returns { score0to1, rationale } or undefined.
Two interchangeable backends:
- SDK (
judgeWithLLM) — Anthropic SDK, prompt-cached. NeedsANTHROPIC_API_KEY. - CLI (
claudeCliJudge) — runsclaude -p --output-format json. Works on a Claude subscription with no API key. Pure text reasoning: no sandbox, no--dangerously-skip-permissions.
If neither is available, the dimension is dropped (re-weighted out) — judging never throws to break a run.
Cost dimension
costMetrics projects tokens, tool calls, steps, and wall-clock from the run.
The candidate's cost is scored relative to the baseline run (reference):
cheaper → positive adjustment, pricier → negative, bounded by COST_WEIGHT. The
baseline itself is scored with a neutral cost term.
Cost is the noisy axis on live runs — see Concepts → determinism.
From scores to a verdict
The comparator sums per-task composite deltas and applies the thresholds
(ThresholdsSchema, all overridable):
| Threshold | Default | Effect |
|---|---|---|
minCompositeGain |
0.01 |
Net gain must exceed this to be improved (else neutral). |
regressionCompositeDrop |
0.05 |
A per-task composite drop beyond this magnitude is a hard regression. |
objectiveDropIsRegression |
true |
Any drop in objective checks passed on any task is a hard regression, regardless of gains elsewhere. |
Verdict
improved— net gain> minCompositeGainand zero hard regressions.neutral— no meaningful change.regressed— at least one hard regression (objective drop, per-task drop beyond threshold, dropped task, or non-finite score).
Promotion
shouldPromote() returns true only for improved with zero hard
regressions — the single condition under which the pr module will open a PR.
The gate is fail-closed: ambiguity never promotes.
Tuning
Pass custom thresholds through the orchestrator/programmatic API when the
defaults don't fit your risk tolerance — e.g. raise minCompositeGain to demand
a larger win, or widen regressionCompositeDrop to tolerate small wobbles. The
hard objective rule should generally stay on.