Writing Tasks
A task is one unit of work the agent attempts, plus how to score the result.
Tasks live as task.json files, each beside the fixture repo it operates on.
Anatomy
A task directory:
my-tasks/
fix-typo/
task.json
README.md # + any other fixture files the task needs
task.json:
{
"id": "fix-readme-typo",
"title": "Fix the typos in README.md",
"agent": "claude-code",
"fixtureDir": ".",
"prompt": "README.md has two typos: 'libary' -> 'library' and 'recieve' -> 'receive'. Fix both and change nothing else.",
"checks": [
{ "kind": "fileExists", "path": "README.md" },
{ "kind": "fileContains", "path": "README.md", "substring": "library" },
{ "kind": "fileContains", "path": "README.md", "substring": "receive" }
],
"timeoutMs": 60000
}
Fields
| Field | Type | Notes |
|---|---|---|
id |
string | Unique within the suite. |
title |
string | Human-readable. |
agent |
claude-code | cursor | codex |
Which agent this task targets. |
fixtureDir |
string | Path to the fixture repo, relative to the task file's own directory ("." = this dir). Copied fresh into a sandbox per run. |
prompt |
string | Handed to the agent headlessly. |
checks |
ObjectiveCheck[] | Evaluated after the run (default []). |
judgeRubric |
{ criteria: string } |
Optional. Enables the LLM judge for this task. |
timeoutMs |
int > 0 | Per-task wall-clock budget (default 120000). |
The fixture's
.gitandnode_modulesare not copied into the sandbox.
Objective checks
Deterministic pass/fail. Four kinds (discriminated on kind):
// 1. command — run a shell command in the sandbox; pass on the expected exit
{ "kind": "command", "cmd": "npm run lint", "expectExitZero": true }
// 2. fileExists — a path (relative to the sandbox) must exist
{ "kind": "fileExists", "path": "src/util/slugify.ts" }
// 3. fileContains — a file must contain a substring
{ "kind": "fileContains", "path": "src/util/slugify.ts", "substring": "export function slugify" }
// 4. testSuite — run a test command; passes when it exits 0
{ "kind": "testSuite", "cmd": "npm test" }
expectExitZero defaults to true; set it false to assert a command fails.
Design checks so they are unambiguous and config-independent — they should
verify the outcome, not how the agent got there.
Judge rubric
Add a rubric to score quality beyond pass/fail. Keep criteria specific and behavior-focused:
{
"judgeRubric": {
"criteria": "Did the agent make the minimal change? Penalize touching unrelated files, missing edge cases, or undocumented exported functions."
}
}
When a judgeRubric is present and a judge backend is available, the run is
scored qualitatively (weight 0.4). With no rubric, the judge dimension is
simply omitted and objective alone carries the base score. See Scoring.
Tips for tasks that discriminate
The point is to tell a good config from a bad one. A task only does that if the config can plausibly change the outcome:
- Pick tasks where guidance matters (conventions, edge cases, "don't do X").
- Mix objective checks (the floor) with a rubric (the nuance).
- Avoid tasks a strong agent always passes regardless of config — they produce
neutraland tell you nothing. - Keep fixtures small; the sandbox is copied per run, twice per task.
Loading your suite
skillci tasks --tasks ./my-tasks # list & sanity-check
skillci run --agent claude-code --tasks ./my-tasks \
--baseline ./base --candidate ./cand
Each task.json is validated against the schema on load; an invalid task fails
fast with a clear error.