SkillCI
CI for agent configuration

Test your AI config before you trust it.

SkillCI scores and gates changes to your skills, hooks, rules, and CLAUDE.md — running a real agent twice, baseline vs. candidate, and blocking the merge on regressions.

225 tests passing Runs on a Claude subscription — no API key Claude Code · Cursor · Codex
SkillCI demo: a baseline-vs-candidate evaluation producing a verdict

A baseline-vs-candidate run, scored three ways, emitting a verdict — fully offline and deterministic.

The problem

You wouldn't merge untested code.

Yet teams ship changes to the configuration that steers their coding agents — on vibes, with zero testing. A new line in a shared CLAUDE.md can quietly make the agent slower, pricier, or worse — and nobody notices for weeks. Agent config is a production artifact now. SkillCI is regression testing for it.

How it works

Baseline vs. candidate, then a verdict.

Run a task suite twice, drive a real agent headlessly, score each run, compare, and gate.

          ┌─ baseline config ─┐                      ┌──────────┐
 tasks ──▶│                   ├─▶ run agent ─▶ score ─┤ compare  ├─▶ VERDICT
          └─ candidate config ┘   (sandboxed)         └──────────┘   improved /
                                                                     neutral /
                                                                     regressed
  1. A candidate is a proposed change to one or more agent config artifacts.
  2. SkillCI runs sandboxed tasks twice — baseline vs. candidate — driving the agent headlessly in an isolated fixture.
  3. Each run is scored on three dimensions and folded into one composite.
  4. A comparator emits a verdict with an explicit list of any regressions.
  5. A pull request opens only when the candidate is improved with zero hard regressions.
Scoring

Three dimensions, one composite.

Objective correctness dominates, the LLM judge refines it, and cost nudges.

Objective 0.6

Deterministic pass/fail: command exit codes, file existence and contents, test suites, expected diffs.

LLM judge 0.4

Rubric-scored quality for what checks can't capture. Runs via the Anthropic SDK or the claude CLI.

Cost ±0.1

Tokens, tool calls, steps, wall-clock — a relative bonus or penalty against the baseline run.

Quickstart

Offline in seconds. Live on your subscription.

Deterministic to start — then point it at a real config change, no API key.

# 1 · offline demo — no network, no keys
npm install
npm run demo

# 2 · build, then evaluate a real config change — live
npm run build
node dist/cli/index.js run \
  --agent claude-code \
  --baseline ./config/baseline \
  --candidate ./config/candidate
Exit code is the gate — non-zero only when the verdict is regressed. The agent runs via claude -p; so does the judge when no ANTHROPIC_API_KEY is set.
Agents

Cross-agent by design.

AgentArtifactsHeadless invocation
Claude Codeskills, hooks, slash commands, CLAUDE.md, settingsclaude -p … --output-format json
Cursor.cursor/rules/*.mdc, .cursorrulescursor-agent
CodexAGENTS.md / codex configcodex exec

Stop shipping agent config on vibes.

Gate every change to your skills, rules, and instructions — before they reach production.