Architecture
SkillCI is a set of small, single-responsibility modules under `src/<module>/`,
each with colocated `*.test.ts` files, all compiling against the shared contracts in
`src/core/contracts.ts`.
Module map
| Module | Responsibility | Key exports |
|---|---|---|
| `core` | Canonical types + zod schemas for the whole domain. | `Task`, `ConfigSet`, `ObjectiveCheck`, `AgentRunResult`, `Score`, `Verdict`, `Comparison`, `Thresholds`, `AgentAdapter` |
| `artifacts` | Discover & normalize agent config; diff config sets. | `discoverConfigSet`, `applyConfigSet`, `diffConfigSets` |
| `sandbox` | Isolated fixture workdirs, command exec (timeout + process-group kill), file diffs. | `createSandbox`, `withSandbox`, `LocalSandboxBackend` |
| `agents` | Agent adapters + availability/error helpers. | `ClaudeCodeAdapter`, `CursorAdapter`, `CodexAdapter`, `MockAgentAdapter`, `getAdapter` |
| `tasks` | Load & validate task suites and fixtures. | `loadTasks`, `getSampleTasks` |
| `scoring` | Objective checks, LLM judge (SDK/CLI), cost, composite. | `runObjectiveChecks`, `judgeWithLLM`, `claudeCliJudge`, `costMetrics`, `composite`, `computeScore` |
| `compare` | Aggregate scores → deltas → verdict; promotion rule. | `compareOutcomes`, `shouldPromote` |
| `report` | Render JSON / markdown / colorized terminal reports. | `renderTerminalReport`, `renderMarkdownReport`, `renderJsonReport` |
| `pr` | Open a GitHub PR via `gh`, gated on verdict (dry-run default). | PR opener |
| `orchestrator` | Wire a full baseline-vs-candidate run. | `runEvaluation`, `runDemo` |
| `cli` | The `skillci run\|validate\|tasks` commands. | bin entry |
Data flow (one run)
discoverConfigSet(baseline) ─┐
discoverConfigSet(candidate) ┘
│
for each task: ▼
┌─────────────── runOneSide(baseline) ───────────────┐
│ createSandbox(fixture) → applyConfigSet │
│ → adapter.run(claude -p …) → AgentRunResult │
│ → computeScore: │
│ runObjectiveChecks · judge · costMetrics │
│ → composite → Score │
└────────────────────────────────────────────────────┘
(same for candidate, scored RELATIVE to baseline cost)
│
compareOutcomes(baseline, candidate, thresholds)
│
Comparison { verdict, regressions, deltas }
│
renderReport(…) + shouldPromote() → pr.open() / dry-run
│
CLI exit: non-zero iff verdict === 'regressed'
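
The flow above can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not the orchestrator's actual code: the `Steps` interface, the `runOnce` name, the simplified `Score` and `Comparison` shapes, and the `"unchanged"` verdict value are hypothetical (only `'improved'` and `'regressed'` appear in this document). The steps are synchronous here for brevity; a real run would presumably be asynchronous.

```typescript
// Simplified stand-ins for the real contracts in src/core/contracts.ts (assumed shapes).
type Verdict = "improved" | "unchanged" | "regressed"; // "unchanged" is an assumption
type Side = "baseline" | "candidate";

interface Score {
  taskId: string;
  composite: number;
}

interface Comparison {
  verdict: Verdict;
  regressions: string[];
  deltas: Record<string, number>;
}

// Hypothetical seam: the real run wires sandbox, adapter, and scoring behind these steps.
interface Steps {
  // For the candidate side, the baseline score is passed in so cost can be scored relative to it.
  runOneSide(side: Side, taskId: string, baseline?: Score): Score;
  compare(baseline: Score[], candidate: Score[]): Comparison;
}

function runOnce(
  taskIds: string[],
  steps: Steps,
): { comparison: Comparison; exitCode: number } {
  const baseline: Score[] = [];
  const candidate: Score[] = [];
  for (const taskId of taskIds) {
    const b = steps.runOneSide("baseline", taskId);
    baseline.push(b);
    candidate.push(steps.runOneSide("candidate", taskId, b));
  }
  const comparison = steps.compare(baseline, candidate);
  // Non-zero exit if and only if the verdict is "regressed".
  return { comparison, exitCode: comparison.verdict === "regressed" ? 1 : 0 };
}
```

Injecting the steps keeps the sketch runnable without any agent, which mirrors how the offline pipeline is exercised.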
Design principles
- One source of truth. Every module compiles against `core/contracts.ts` (types + zod). Schemas validate at the boundaries (task loading, config discovery, judge output) so bad data fails fast with a typed error.
- Offline by default. The `MockAgentAdapter` makes the entire pipeline run with no network/keys. That's what tests and the demo use, and it is the fallback when a real agent is unavailable.
- Graceful degradation. Real adapters and the judge never throw to crash a run: missing CLI/key → typed unavailability or a dropped (re-weighted) dimension.
- Fail-closed gate. `compare` treats any objective drop, over-threshold per-task drop, dropped task, or non-finite score as a hard regression; promotion requires `improved` with none of those.
- Sandboxes are disposable. Each run gets a fresh recursive copy of the fixture (minus `.git`/`node_modules`) under `os.tmpdir()`, always disposed, even on throw (`withSandbox`). Command timeouts kill the whole process group so a forked grandchild can't keep the run alive.
- Pluggable backends/backplanes. `SandboxBackend` (local today, container later), `AgentAdapter` (per agent), and the judge `JudgeFn` (SDK or CLI) are all swappable behind interfaces.
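
As an illustration of the fail-closed gate, here is a minimal sketch of the four hard-regression conditions and the promotion rule. The `TaskScore` and `GateThresholds` shapes, the `maxPerTaskDrop` field, and the function names `findHardRegressions` and `shouldPromoteSketch` are assumptions made for this example; the actual `compareOutcomes` and `shouldPromote` may differ in shape and detail.

```typescript
// Hypothetical per-task score; the real Score type lives in src/core/contracts.ts.
interface TaskScore {
  taskId: string;
  objective: number; // objective-check score (assumed numeric)
  composite: number; // overall composite score
}

// Hypothetical threshold shape, named for this sketch only.
interface GateThresholds {
  maxPerTaskDrop: number; // largest tolerated composite drop on a single task
}

function findHardRegressions(
  baseline: TaskScore[],
  candidate: TaskScore[],
  thresholds: GateThresholds,
): string[] {
  const reasons: string[] = [];
  const candidateById = new Map<string, TaskScore>();
  for (const c of candidate) candidateById.set(c.taskId, c);

  for (const b of baseline) {
    const c = candidateById.get(b.taskId);
    if (!c) {
      reasons.push(`${b.taskId}: dropped task`);
      continue;
    }
    const values = [b.objective, b.composite, c.objective, c.composite];
    if (!values.every((v) => Number.isFinite(v))) {
      // NaN or Infinity cannot be compared safely, so fail closed.
      reasons.push(`${b.taskId}: non-finite score`);
      continue;
    }
    if (c.objective < b.objective) {
      reasons.push(`${b.taskId}: objective drop`);
    }
    if (b.composite - c.composite > thresholds.maxPerTaskDrop) {
      reasons.push(`${b.taskId}: per-task drop over threshold`);
    }
  }
  return reasons;
}

// Promotion requires an "improved" verdict and no hard regressions at all.
function shouldPromoteSketch(verdict: string, hardRegressions: string[]): boolean {
  return verdict === "improved" && hardRegressions.length === 0;
}
```

The point of collecting reasons instead of returning a boolean is that a report can list every failing task, while the gate itself only needs to know whether the list is empty.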
Testing model
- Unit tests are colocated and offline. CLIs are never spawned for real in unit tests; `execa` is mocked to pin invocation contracts.
- `compare`, `composite`, and the gate are exhaustively unit-tested (deterministic).
- The offline demo (`npm run demo`) is an end-to-end smoke test that doubles as a CI gate. See CI Integration.
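
To show what "pinning an invocation contract" can look like, the sketch below uses a dependency-injected recording fake in place of a module-level `execa` mock. That is a deliberate substitution to keep the example free of test-framework APIs. The `Exec` type, `runClaude`, and `makeRecordingExec` are hypothetical names; `claude -p` comes from the data-flow diagram, and passing the prompt as the next argument is an assumption.

```typescript
// Hypothetical exec signature: a command plus its argument list.
type Exec = (file: string, args: string[]) => Promise<{ stdout: string }>;

// Hypothetical adapter helper. Only `claude -p` is taken from the diagram;
// the prompt-as-next-argument layout is assumed for illustration.
async function runClaude(exec: Exec, prompt: string): Promise<string> {
  const { stdout } = await exec("claude", ["-p", prompt]);
  return stdout;
}

// A recording fake stands in for the real process spawner, so no CLI is ever started.
function makeRecordingExec(stdout: string): {
  exec: Exec;
  calls: Array<{ file: string; args: string[] }>;
} {
  const calls: Array<{ file: string; args: string[] }> = [];
  const exec: Exec = (file, args) => {
    calls.push({ file, args });
    return Promise.resolve({ stdout });
  };
  return { exec, calls };
}
```

A unit test would call `runClaude` with the fake and then assert on `calls`, so any change to the command name or argument order fails the test without touching the network.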