Concepts
The core ideas behind SkillCI and how a verdict is decided.
Config artifacts
The things SkillCI tests are agent configuration artifacts — the files that steer a coding agent's behavior:
- Claude Code:
CLAUDE.md,.claude/skills/**,.claude/hooks/**,.claude/commands/*.md,.claude/settings.json. - Cursor:
.cursor/rules/*.mdc,.cursorrules. - Codex:
AGENTS.md/ codex config.
A directory of these, discovered and normalized, becomes a ConfigSet.
Baseline vs. candidate
- Baseline — the current, trusted config. The control.
- Candidate — the proposed change. The treatment.
SkillCI runs the same task suite under each and compares. This A/B structure is what lets it attribute a behavior change to the config change rather than to noise — though see the caveat on variance.
Tasks & fixtures
A task is a unit of work the agent attempts:
- a fixture repo (copied fresh into an isolated sandbox per run),
- a prompt handed to the agent headlessly,
- objective checks evaluated after the run,
- an optional judge rubric,
- a per-task timeout.
See Writing Tasks.
The three scoring dimensions
Every run is scored on three axes, folded into one composite in [0, 1]:
| Dimension | Weight | Meaning |
|---|---|---|
| Objective | 0.6 |
Deterministic checks: did files/commands/tests come out right? |
| LLM-as-judge | 0.4 |
Rubric-scored quality for what objective checks can't capture. |
| Cost / efficiency | ±0.1 adj. |
Tokens/steps/time — a relative bonus or penalty vs. the baseline run. |
Objective and judge form a weighted base (the judge is re-weighted out when absent, so objective alone still scores). Cost is a small adjustment on top. See Scoring for the exact formula.
The verdict
The comparator aggregates per-task composites into one of three verdicts:
| Verdict | Meaning |
|---|---|
improved |
Candidate is better, with zero hard regressions. |
neutral |
No meaningful change. |
regressed |
Candidate is worse on at least one hard dimension. |
The regression gate (fail-closed)
A candidate is blocked from promotion if any of these hold (defaults shown):
- Objective drop — fewer objective checks pass on any task
(
objectiveDropIsRegression: true). - Per-task composite drop beyond
regressionCompositeDrop: 0.05. - Net composite gain
≤ minCompositeGain: 0.01→ at bestneutral, notimproved. - A dropped task or any non-finite score.
Promotion (shouldPromote() → open a PR) requires improved and zero hard
regressions. The gate is deliberately conservative: when in doubt, it does not
promote.
skillci run exits non-zero only on regressed, so a neutral candidate
doesn't break your build — it just doesn't get a PR.
A note on determinism
Objective checks, the composite math, and the gate are deterministic. Real agent runs are not — token cost and the agent's exact output vary run-to-run. A single live run is a useful signal, not a verdict to bet the farm on. For trustworthy automated promotion, average multiple runs and/or widen the cost tolerance (see the roadmap in the README). The offline demo and the test suite are fully deterministic.
Where the pieces live
See Architecture for the module map and data flow.