Governance benchmark

1,200 labeled actions. One reproducible decision path.

The benchmark measures the stable policy/check path, which classifies commands without executing them. It validates how the labeled fixtures are classified; it does not demonstrate complete safety.

Current checked-in result

Allow

400 expected / 400 correct
Precision 100% / Recall 100%

Warn

400 expected / 400 correct
Precision 100% / Recall 100%

Block

400 expected / 400 correct
Precision 100% / Recall 100%

Coverage

Twelve balanced categories cover Git reads, tests and linting, file reads, shell inspection, package publishing, scoped SQL deletion, privilege escalation, destructive Git history, secret access, protected force pushes, destructive SQL, and broad filesystem deletion.
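The structural claims about the fixture set (1,200 actions, twelve balanced categories, unique commands) can be checked mechanically. The sketch below is illustrative only: the fixture shape (a dict with "command" and "category" keys) and the function name check_fixtures are hypothetical, and it assumes "balanced" means every category holds the same number of cases, which would be 100 each for 1,200 actions across twelve categories.

```python
from collections import Counter


def check_fixtures(fixtures, expected_total=1200, expected_categories=12):
    """Return a list of structural problems in a fixture set (empty if none).

    Each fixture is assumed to be a dict with "command" and "category" keys;
    that shape is hypothetical, not the benchmark's actual schema.
    """
    problems = []

    if len(fixtures) != expected_total:
        problems.append(f"expected {expected_total} fixtures, found {len(fixtures)}")

    # Every command string should appear exactly once.
    command_counts = Counter(f["command"] for f in fixtures)
    duplicates = [cmd for cmd, n in command_counts.items() if n > 1]
    if duplicates:
        problems.append(f"{len(duplicates)} duplicated command(s)")

    # "Balanced" is interpreted here as equal case counts per category.
    per_category = Counter(f["category"] for f in fixtures)
    if len(per_category) != expected_categories:
        problems.append(
            f"expected {expected_categories} categories, found {len(per_category)}"
        )
    if len(set(per_category.values())) > 1:
        problems.append("categories are not balanced")

    return problems
```

An empty list means the set passed all four checks; each string in a non-empty list names one failed check.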

Methodology boundary

Each command is unique, labeled, and evaluated without execution. The report includes overall accuracy, per-decision precision and recall, a confusion matrix, false-safe rate, overblock rate, and per-category results. The older 230-case runtime compatibility suite remains separate.
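The reported metrics can all be derived from pairs of expected and actual decisions. The following is a minimal sketch, not the benchmark's own implementation. The function name score and the input shape are hypothetical. Precision and recall use their standard definitions, but the definitions of false-safe rate and overblock rate are assumptions: here a false-safe case is one decided less strictly than its label, an overblock case is one decided more strictly, and both are divided by the total case count.

```python
from collections import Counter

# Decisions ordered from least to most strict.
DECISIONS = ("allow", "warn", "block")
SEVERITY = {decision: rank for rank, decision in enumerate(DECISIONS)}


def score(pairs):
    """Score (expected, actual) decision pairs without executing anything."""
    confusion = Counter(pairs)  # (expected, actual) -> count
    total = sum(confusion.values())

    per_decision = {}
    for d in DECISIONS:
        true_pos = confusion[(d, d)]
        predicted = sum(confusion[(e, d)] for e in DECISIONS)  # column total
        expected = sum(confusion[(d, a)] for a in DECISIONS)   # row total
        per_decision[d] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / expected if expected else 0.0,
        }

    correct = sum(confusion[(d, d)] for d in DECISIONS)
    # Assumed definitions: compare strictness of actual vs. expected decision.
    false_safe = sum(
        n for (e, a), n in confusion.items() if SEVERITY[a] < SEVERITY[e]
    )
    overblock = sum(
        n for (e, a), n in confusion.items() if SEVERITY[a] > SEVERITY[e]
    )

    return {
        "accuracy": correct / total if total else 0.0,
        "per_decision": per_decision,
        "confusion": confusion,
        "false_safe_rate": false_safe / total if total else 0.0,
        "overblock_rate": overblock / total if total else 0.0,
    }
```

Under these assumed definitions, a run in which all 400 cases per decision match their labels yields accuracy, precision, and recall of 1.0 with false-safe and overblock rates of 0.0.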

This benchmark does not prove complete command coverage, sandbox isolation, guaranteed interception, or governance of commands that bypass Termyte.