dag-runner

A tiny task runner that takes a JSON DAG of shell commands, runs nodes in parallel as their dependencies finish, and renders inline progress. It powers nix run .#health-checks today and is the planned replacement for ix-fleet’s sequential per-node loops.

The runner is meant for short, hands-off batches: spawn a fan-out of independent jobs, follow their progress, and exit with a worst-case status. It is not a long-running supervisor. For the design rationale (why not process-compose, why not devenv-tasks), see the corresponding AGENTS.md section.

Usage

dag-runner <spec.json> [--output auto|tui|plain|json] [--only NAMES]

--output auto (default) picks tui when stdout is a TTY and plain otherwise. json writes NDJSON events, ending with a final summary line, to stdout; all other output still goes to stderr.
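The auto rule above can be sketched in a few lines. This is only an illustration: resolve_output_mode is a hypothetical helper, not part of dag-runner, and the TTY check is passed in as a boolean so the logic stays easy to test.

```python
import sys

VALID_MODES = ("auto", "tui", "plain", "json")

def resolve_output_mode(requested: str, stdout_is_tty: bool) -> str:
    """Map an --output value to a concrete mode (hypothetical helper)."""
    if requested not in VALID_MODES:
        raise ValueError(f"unknown output mode: {requested}")
    if requested == "auto":
        # auto: tui on an interactive terminal, plain when piped or redirected
        return "tui" if stdout_is_tty else "plain"
    return requested
```

A caller would typically pass sys.stdout.isatty() as the second argument.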

--only restricts the run to the named nodes (comma-separated, repeatable: --only a,b --only c). Unknown names and edges left dangling by the cut (a kept node depending on a dropped one) are rejected before any node is spawned, so a filtered run keeps the same “every kept node has every dep it needs” invariant as an unfiltered run.
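As a rough sketch of the --only checks described above (assuming the nodes map has already been parsed from the spec; filter_nodes and its error messages are hypothetical, not the runner's actual code):

```python
def filter_nodes(nodes: dict, only: list[str]) -> dict:
    """Keep only the named nodes; reject unknown names and dangling edges.

    `only` holds the raw values of each --only flag, e.g. ["a,b", "c"].
    """
    keep = {name for arg in only for name in arg.split(",") if name}

    unknown = sorted(keep - set(nodes))
    if unknown:
        raise ValueError(f"unknown node(s) in --only: {', '.join(unknown)}")

    # A kept node must not depend on a node that the filter dropped.
    for name in sorted(keep):
        for dep in nodes[name].get("depends_on", []):
            if dep not in keep:
                raise ValueError(f"{name} depends on dropped node {dep}")

    return {name: nodes[name] for name in sorted(keep)}
```

Both checks run before anything is returned, which mirrors the "rejected before any node is spawned" rule.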

Spec schema

The spec is a single JSON object with a nodes map. Each entry is a node keyed by name.

field         type              required          meaning
command       string[]          yes               argv. command[0] is the program, the rest are arguments. Must be non-empty.
depends_on    string[]          no, default []    Names of other nodes that must succeed first.
env           {string: string}  no, default {}    Extra env vars layered on top of the runner’s own env. Entries here shadow inherited vars; missing entries are inherited from the parent.
timeout_secs  u64               no, default null  Wall-clock seconds before the child is SIGTERMed (then SIGKILLed after ~500ms grace). On expiry the outcome is failed with exit code 124 (matches coreutils timeout) and the captured stderr ends with dag-runner: node timed out after Ns.
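The env and timeout_secs semantics can be illustrated with Python's subprocess module. This is a sketch under assumptions, not the runner's implementation: run_node and grace_secs are hypothetical names, and output capture is left out for brevity.

```python
import os
import subprocess

def run_node(command, env=None, timeout_secs=None, grace_secs=0.5):
    """Run one node's argv and return its exit code, or 124 on timeout."""
    # Spec entries shadow inherited variables; everything else is inherited.
    merged_env = {**os.environ, **(env or {})}
    child = subprocess.Popen(command, env=merged_env)
    try:
        return child.wait(timeout=timeout_secs)
    except subprocess.TimeoutExpired:
        child.terminate()              # SIGTERM on POSIX
        try:
            child.wait(timeout=grace_secs)
        except subprocess.TimeoutExpired:
            child.kill()               # SIGKILL on POSIX
            child.wait()
        return 124                     # same code coreutils timeout reports
```

With timeout_secs left as None the wait never expires, which corresponds to the default null in the table.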

Validation runs before any node is spawned and rejects:

Nodes are spawned in topological order; siblings without a dependency relationship may run concurrently. When the runner has to break ties (independent roots, or siblings inside one layer), it walks names in lexicographic order so logs stay stable across runs.
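One way to picture the ordering rule is to group nodes into layers and sort each layer by name. The sketch below does that with a simple Kahn-style loop; topo_layers is a hypothetical helper, and the real runner may start a node as soon as its own dependencies finish rather than waiting for a whole layer.

```python
def topo_layers(nodes: dict) -> list[list[str]]:
    """Group nodes into dependency layers, names sorted within each layer."""
    remaining = {
        name: set(spec.get("depends_on", [])) for name, spec in nodes.items()
    }
    layers = []
    while remaining:
        # Everything whose dependencies are all satisfied is ready to run.
        ready = sorted(name for name, deps in remaining.items() if not deps)
        if not ready:
            stuck = ", ".join(sorted(remaining))
            raise ValueError(f"cycle or missing dependency among: {stuck}")
        layers.append(ready)
        for name in ready:
            del remaining[name]
        for deps in remaining.values():
            deps.difference_update(ready)
    return layers
```

Sorting each layer is what keeps the order stable across runs, regardless of how the JSON object happened to be ordered.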

Example

{
  "nodes": {
    "fetch":   { "command": ["curl", "-fsSL", "https://example.test/data.json", "-o", "data.json"] },
    "lint":    { "command": ["jq", ".", "data.json"], "depends_on": ["fetch"] },
    "convert": { "command": ["./bin/convert", "data.json", "out.bin"], "depends_on": ["fetch"], "env": { "RUST_LOG": "debug" } },
    "upload":  { "command": ["./bin/upload", "out.bin"], "depends_on": ["lint", "convert"] }
  }
}

lint and convert run in parallel after fetch. upload waits for both. A failure in fetch propagates: lint, convert, and upload all end up skipped.
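The propagation rule can be sketched as a small fixed-point loop: a node is skipped when any dependency failed or was itself skipped. propagate_skips is a hypothetical helper written for illustration only.

```python
def propagate_skips(nodes: dict, failed: set[str]) -> set[str]:
    """Return every node that ends up skipped, given the failed nodes."""
    skipped: set[str] = set()
    changed = True
    while changed:
        changed = False
        for name in sorted(nodes):
            if name in failed or name in skipped:
                continue
            deps = nodes[name].get("depends_on", [])
            # One failed or skipped dependency is enough to skip this node.
            if any(dep in failed or dep in skipped for dep in deps):
                skipped.add(name)
                changed = True
    return skipped
```

For the example spec, a failed fetch yields lint, convert, and upload as skipped, while a failed lint skips only upload.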

Output modes

In every mode, after all nodes settle, a one-line summary plus a per-node breakdown (and captured stdout/stderr for any failed nodes) is written to stderr.

--output json event schema

One JSON object per line. Three event shapes, discriminated by event:

{ "event": "node_started",  "node": "fetch", "ts_ms": 12 }
{ "event": "node_finished", "node": "fetch", "outcome": "succeeded", "exit_code": null, "duration_ms": 412 }
{ "event": "node_finished", "node": "lint",  "outcome": "failed",    "exit_code": 1,    "duration_ms": 87  }
{ "event": "node_finished", "node": "upload","outcome": "skipped",   "exit_code": null, "duration_ms": 0   }
{ "event": "summary", "total": 4, "succeeded": 1, "failed": 1, "skipped": 2, "duration_ms": 510 }

field        type                                notes
node         string                              Node name from the spec.
ts_ms        u128                                Milliseconds since the runner started (only on node_started).
outcome      "succeeded" | "failed" | "skipped"  Final state. skipped means one of its dependencies did not succeed.
exit_code    i32 | null                          Set when outcome == "failed". null otherwise. A spawn error (binary missing, etc.) surfaces as outcome: "failed" with exit_code: 127.
duration_ms  u128                                On node_finished, time the runner spent on that node (from spawn to exit, or zero for skipped). On summary, total wall-clock time.
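A consumer of the JSON stream might look like the following sketch. It relies only on the fields documented above; summarize_events is a hypothetical name, and unknown event types are ignored on purpose.

```python
import json

def summarize_events(lines):
    """Collect per-node outcomes and the final summary from NDJSON lines."""
    outcomes = {}
    summary = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event["event"] == "node_finished":
            outcomes[event["node"]] = event["outcome"]
        elif event["event"] == "summary":
            summary = event
    return outcomes, summary
```

Passing sys.stdin as lines lets the function read directly from dag-runner <spec.json> --output json through a pipe.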

Ordering guarantees:

Exit code

exit_code = max(worst node exit code, 1 if any node was skipped else 0)

Concretely:

CI pipelines should treat any non-zero exit as a stop signal and read stderr for the per-node breakdown and captured child output.
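The exit-code formula can be written out as a short function. This is a sketch assuming each node's result is an (outcome, exit_code) pair as in the node_finished events, and that the worst exit code counts as 0 when no node failed; runner_exit_code is a hypothetical name.

```python
def runner_exit_code(results: dict) -> int:
    """results maps node name -> (outcome, exit_code or None)."""
    worst = max(
        (
            code
            for outcome, code in results.values()
            if outcome == "failed" and code is not None
        ),
        default=0,  # assumption: no failed node means a worst code of 0
    )
    any_skipped = any(outcome == "skipped" for outcome, _ in results.values())
    return max(worst, 1 if any_skipped else 0)
```

For instance, a node that times out (exit code 124) with two skipped dependents gives 124, and a run where everything succeeds gives 0.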