Architecture

SkillCI is a set of small, single-responsibility modules under src/<module>/, each with colocated *.test.ts, all compiling against the shared contracts in src/core/contracts.ts.

Module map

| Module | Responsibility | Key exports |
| --- | --- | --- |
| `core` | Canonical types + zod schemas for the whole domain. | `Task`, `ConfigSet`, `ObjectiveCheck`, `AgentRunResult`, `Score`, `Verdict`, `Comparison`, `Thresholds`, `AgentAdapter` |
| `artifacts` | Discover & normalize agent config; diff config sets. | `discoverConfigSet`, `applyConfigSet`, `diffConfigSets` |
| `sandbox` | Isolated fixture workdirs, command exec (timeout + process-group kill), file diffs. | `createSandbox`, `withSandbox`, `LocalSandboxBackend` |
| `agents` | Agent adapters + availability/error helpers. | `ClaudeCodeAdapter`, `CursorAdapter`, `CodexAdapter`, `MockAgentAdapter`, `getAdapter` |
| `tasks` | Load & validate task suites and fixtures. | `loadTasks`, `getSampleTasks` |
| `scoring` | Objective checks, LLM judge (SDK/CLI), cost, composite. | `runObjectiveChecks`, `judgeWithLLM`, `claudeCliJudge`, `costMetrics`, `composite`, `computeScore` |
| `compare` | Aggregate scores → deltas → verdict; promotion rule. | `compareOutcomes`, `shouldPromote` |
| `report` | Render JSON / markdown / colorized terminal reports. | `renderTerminalReport`, `renderMarkdownReport`, `renderJsonReport` |
| `pr` | Open a GitHub PR via `gh`, gated on verdict (dry-run default). | PR opener |
| `orchestrator` | Wire a full baseline-vs-candidate run. | `runEvaluation`, `runDemo` |
| `cli` | The `skillci run\|validate\|tasks` commands. | bin entry |

Data flow (one `run`)

discoverConfigSet(baseline) ─┐
discoverConfigSet(candidate) ┘
                             │
   for each task:            ▼
     ┌─────────────── runOneSide(baseline) ───────────────┐
     │ createSandbox(fixture) → applyConfigSet            │
     │ → adapter.run(claude -p …) → AgentRunResult        │
     │ → computeScore:                                    │
     │     runObjectiveChecks · judge · costMetrics       │
     │     → composite → Score                            │
     └────────────────────────────────────────────────────┘
     (same for candidate, scored RELATIVE to baseline cost)
                             │
   compareOutcomes(baseline, candidate, thresholds)
                             │
              Comparison { verdict, regressions, deltas }
                             │
        renderReport(…)  +  shouldPromote() → pr.open() / dry-run
                             │
              CLI exit: non-zero iff verdict === 'regressed'
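
The tail of this flow (compare, promotion, exit code) can be sketched in a few lines. This is a minimal illustration, not the actual `compare` module: the `TaskScore` shape, the `maxTaskDrop` parameter, the `'unchanged'` verdict value, and the names `compareSketch`, `shouldPromoteSketch`, and `exitCodeFor` are all assumptions made for the example. Only the `'regressed'` and `'improved'` verdicts and the fail-closed conditions come from this document.

```typescript
// Hypothetical, simplified shapes; the real contracts live in src/core/contracts.ts.
type Verdict = 'improved' | 'unchanged' | 'regressed'; // 'unchanged' is an assumed value

interface TaskScore {
  taskId: string;
  composite: number;       // assumed: higher is better
  objectivePassed: number; // assumed: number of objective checks that passed
}

interface Comparison {
  verdict: Verdict;
  regressions: string[];
  deltas: Record<string, number>;
}

// Fail-closed comparison: any hard-regression condition forces 'regressed'.
function compareSketch(
  baseline: TaskScore[],
  candidate: TaskScore[],
  maxTaskDrop: number, // assumed per-task threshold on the composite score
): Comparison {
  const byId = new Map<string, TaskScore>(
    candidate.map((s): [string, TaskScore] => [s.taskId, s]),
  );
  const regressions: string[] = [];
  const deltas: Record<string, number> = {};
  let total = 0;

  for (const base of baseline) {
    const cand = byId.get(base.taskId);
    if (cand === undefined) {
      regressions.push(`${base.taskId}: dropped task`);
      continue;
    }
    const delta = cand.composite - base.composite;
    if (!Number.isFinite(delta)) {
      regressions.push(`${base.taskId}: non-finite score`);
      continue;
    }
    deltas[base.taskId] = delta;
    total += delta;
    if (cand.objectivePassed < base.objectivePassed) {
      regressions.push(`${base.taskId}: objective drop`);
    }
    if (delta < -maxTaskDrop) {
      regressions.push(`${base.taskId}: per-task drop over threshold`);
    }
  }

  const verdict: Verdict =
    regressions.length > 0 ? 'regressed' : total > 0 ? 'improved' : 'unchanged';
  return { verdict, regressions, deltas };
}

// Promotion requires 'improved' with no regressions at all.
const shouldPromoteSketch = (c: Comparison): boolean =>
  c.verdict === 'improved' && c.regressions.length === 0;

// CLI exit code: non-zero iff the verdict is 'regressed'.
const exitCodeFor = (c: Comparison): number => (c.verdict === 'regressed' ? 1 : 0);
```

In this sketch a candidate that raises the composite score but passes fewer objective checks is still reported as regressed, which is the point of a fail-closed gate: a better average cannot hide a hard failure.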

Design principles

One source of truth. Every module compiles against core/contracts.ts (types + zod). Schemas validate at the boundaries (task loading, config discovery, judge output) so bad data fails fast with a typed error.
Offline by default. MockAgentAdapter lets the entire pipeline run with no network access or API keys. Tests and the demo use it, and it is the fallback when a real agent is unavailable.
Graceful degradation. Real adapters and the judge never crash a run by throwing: a missing CLI or key yields a typed unavailability result, or the affected scoring dimension is dropped and the remaining dimensions are re-weighted.
Fail-closed gate. compare treats any objective drop, over-threshold per-task drop, dropped task, or non-finite score as a hard regression; promotion requires a verdict of 'improved' with none of those.
Sandboxes are disposable. Each run gets a fresh recursive copy of the fixture (minus .git/node_modules) under os.tmpdir(), and withSandbox always disposes it, even on throw. Command timeouts kill the whole process group so a forked grandchild can't keep the run alive.
Pluggable backends/backplanes. SandboxBackend (local today, container later), AgentAdapter (per agent), and the judge JudgeFn (SDK or CLI) are all swappable behind interfaces.
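
The "dropped (re-weighted) dimension" idea from the graceful-degradation principle can be illustrated with a small weighted average. This is a sketch under stated assumptions, not the project's `composite` function: the dimension names, the weights, and the name `compositeSketch` are hypothetical and chosen only for the example.

```typescript
// Hypothetical dimension names and weights, chosen only for illustration.
type Dimension = 'objective' | 'judge' | 'cost';

const DEFAULT_WEIGHTS: Record<Dimension, number> = { objective: 2, judge: 1, cost: 1 };

// A dimension that could not be measured (for example, the judge was
// unavailable) is simply absent from `scores`.
function compositeSketch(
  scores: Partial<Record<Dimension, number>>,
  weights: Record<Dimension, number> = DEFAULT_WEIGHTS,
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const dim of Object.keys(weights) as Dimension[]) {
    const value = scores[dim];
    if (value === undefined) continue; // dropped dimension: contributes no weight
    weighted += value * weights[dim];
    totalWeight += weights[dim];
  }
  // Dividing by the weight actually used re-normalizes the remaining dimensions.
  // With nothing measurable the result is NaN, which a fail-closed gate would
  // treat as a regression rather than a pass.
  return totalWeight > 0 ? weighted / totalWeight : Number.NaN;
}

// All three dimensions present: (2*1 + 1*0.5 + 1*0) / 4
const full = compositeSketch({ objective: 1, judge: 0.5, cost: 0 });

// Judge unavailable: (2*1 + 1*0) / 3, so the missing judge is not scored as zero.
const withoutJudge = compositeSketch({ objective: 1, cost: 0 });
```

Dividing by the sum of the weights that were actually used means an unavailable judge neither inflates nor deflates the score; it only narrows what the score is based on.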

Testing model

Unit tests are colocated and offline. They never spawn real CLIs; execa is mocked to pin the invocation contracts.
compare, composite, and the gate are exhaustively unit-tested (deterministic).
The offline demo (npm run demo) is an end-to-end smoke test that doubles as a CI gate. See CI Integration.
