SkillCI scores and gates changes to your skills, hooks, rules, and
CLAUDE.md — running a real agent twice, baseline vs. candidate, and blocking the merge on regressions.
A baseline-vs-candidate run, scored three ways, emitting a verdict — fully offline and deterministic.
Teams still ship changes to the configuration that steers their coding agents on
vibes, with zero testing. A new line in a shared CLAUDE.md
can quietly make the agent slower, pricier, or worse, and nobody notices for weeks. Agent config is a production
artifact now. SkillCI is regression testing for it.
Run a task suite twice, drive a real agent headlessly, score each run, compare, and gate.

             ┌─ baseline config ──┐                        ┌─────────┐
    tasks ──▶│                    ├─▶ run agent ─▶ score ──┤ compare ├─▶ VERDICT
             └─ candidate config ─┘   (sandboxed)          └─────────┘   improved /
                                                                         neutral /
                                                                         regressed

Objective correctness dominates, the LLM judge refines it, and cost nudges.
Objective correctness: deterministic pass/fail from command exit codes, file existence and contents, test suites, and expected diffs.
LLM judge: rubric-scored quality for what checks can't capture. Runs via the Anthropic SDK or the claude CLI.
Cost: tokens, tool calls, steps, and wall-clock time, applied as a relative bonus or penalty against the baseline run.
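
The three signals above can be folded into a single comparison. The following is a minimal sketch, not SkillCI's actual scoring code: the `RunScore` shape, the weights, the `epsilon` threshold, and the use of tokens as the only cost measure are all assumptions chosen to illustrate "correctness dominates, the judge refines, cost nudges".

```typescript
// Hypothetical shapes and weights; SkillCI's real types and numbers may differ.
type Verdict = "improved" | "neutral" | "regressed";

interface RunScore {
  checksPassed: number; // deterministic checks that passed
  checksTotal: number;  // deterministic checks that ran
  judge: number;        // rubric score from the LLM judge, normalized to 0..1
  tokens: number;       // total tokens, standing in for cost
}

// Assumed weights: correctness dominates, the judge refines, cost nudges.
const WEIGHTS = { correctness: 0.7, judge: 0.25, cost: 0.05 };

function composite(run: RunScore, baseline: RunScore): number {
  const correctness =
    run.checksTotal === 0 ? 0 : run.checksPassed / run.checksTotal;
  // Relative cost delta against the baseline run, clamped to [-1, 1].
  // Cheaper than baseline is positive (bonus), pricier is negative (penalty).
  const rawDelta =
    baseline.tokens === 0 ? 0 : (baseline.tokens - run.tokens) / baseline.tokens;
  const costDelta = Math.max(-1, Math.min(1, rawDelta));
  return (
    WEIGHTS.correctness * correctness +
    WEIGHTS.judge * run.judge +
    WEIGHTS.cost * costDelta
  );
}

function verdict(baseline: RunScore, candidate: RunScore, epsilon = 0.02): Verdict {
  const delta = composite(candidate, baseline) - composite(baseline, baseline);
  if (delta > epsilon) return "improved";
  if (delta < -epsilon) return "regressed";
  return "neutral";
}

const baseline: RunScore = { checksPassed: 9, checksTotal: 10, judge: 0.8, tokens: 12000 };
const candidate: RunScore = { checksPassed: 7, checksTotal: 10, judge: 0.85, tokens: 9000 };

console.log(verdict(baseline, candidate)); // prints "regressed"
```

In this example the candidate is cheaper and scores slightly higher with the judge, but it fails two more deterministic checks, so the small cost bonus cannot offset the correctness loss.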
Deterministic to start — then point it at a real config change, no API key.

    # 1 · offline demo (no network, no keys)
    npm install
    npm run demo

    # 2 · build, then evaluate a real config change (live)
    npm run build
    node dist/cli/index.js run \
      --agent claude-code \
      --baseline ./config/baseline \
      --candidate ./config/candidate

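
To block a merge, a CI job only needs the verdict turned into a process exit code. The sketch below shows one way that could look. The `Report` shape and the idea that a run produces a JSON report with a `verdict` field are assumptions for illustration, not documented SkillCI output.

```typescript
type Verdict = "improved" | "neutral" | "regressed";

// Hypothetical report shape; the real output format is not specified here.
interface Report {
  verdict: Verdict;
}

// Maps a report to a CI exit code: nonzero fails the job, which blocks the merge.
function gateExitCode(reportJson: string): number {
  const report = JSON.parse(reportJson) as Report;
  return report.verdict === "regressed" ? 1 : 0;
}

console.log(gateExitCode('{"verdict":"regressed"}')); // prints 1
console.log(gateExitCode('{"verdict":"neutral"}'));   // prints 0
```

Treating "neutral" as a pass is a design choice in this sketch; a stricter gate could require "improved" before allowing the merge.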
The agent runs via claude -p; so does the judge when no ANTHROPIC_API_KEY is set.

| Agent | Artifacts | Headless invocation |
|---|---|---|
| Claude Code | skills, hooks, slash commands, CLAUDE.md, settings | claude -p … --output-format json |
| Cursor | .cursor/rules/*.mdc, .cursorrules | cursor-agent |
| Codex | AGENTS.md / codex config | codex exec |
Gate every change to your skills, rules, and instructions — before they reach production.