1,200 labeled actions. One reproducible decision path.
The benchmark measures the stable, non-executing policy/check path. It validates fixture classification, not complete safety.
Current checked-in result
Allow: 400 expected / 400 correct, precision 100%, recall 100%
Warn: 400 expected / 400 correct, precision 100%, recall 100%
Block: 400 expected / 400 correct, precision 100%, recall 100%
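As a rough illustration of how per-decision figures like these can be derived, the sketch below computes precision and recall from (expected, predicted) label pairs. It is not Termyte's actual scoring code: the function name `precision_recall`, the lowercase label strings, and the synthetic input are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical label names; the real fixture format is not shown in this document.
DECISIONS = ("allow", "warn", "block")


def precision_recall(pairs):
    """Return {decision: (precision, recall)} for (expected, predicted) pairs."""
    pairs = list(pairs)
    expected = Counter(e for e, _ in pairs)
    predicted = Counter(p for _, p in pairs)
    correct = Counter(e for e, p in pairs if e == p)
    result = {}
    for d in DECISIONS:
        # Precision: of everything predicted as d, how much was really d.
        precision = correct[d] / predicted[d] if predicted[d] else 0.0
        # Recall: of everything expected to be d, how much was predicted as d.
        recall = correct[d] / expected[d] if expected[d] else 0.0
        result[d] = (precision, recall)
    return result


# Synthetic stand-in for the checked-in result: 400 cases per decision, all correct.
perfect = [(d, d) for d in DECISIONS for _ in range(400)]
scores = precision_recall(perfect)

# A single "block" case misread as "allow" lowers allow precision and block recall.
one_miss = perfect[:-1] + [("block", "allow")]
degraded = precision_recall(one_miss)
```

With every case correct, each decision scores 1.0 on both measures. In the degraded run, allow precision falls to 400/401 and block recall to 399/400, which is why both measures are reported per decision.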
Coverage
Twelve balanced categories cover Git reads, tests and linting, file reads, shell inspection, package publishing, scoped SQL deletion, privilege escalation, destructive Git history, secret access, protected force pushes, destructive SQL, and broad filesystem deletion.
Methodology boundary
Each command is unique, labeled, and evaluated without execution. The report includes overall accuracy, per-decision precision and recall, a confusion matrix, false-safe rate, overblock rate, and per-category results. The older 230-case runtime compatibility suite remains separate.
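The report fields can be sketched from a single confusion count. The example below is illustrative only and makes two assumptions that this page does not confirm: that the false-safe rate is the share of expected-block cases that were allowed, and that the overblock rate is the share of expected-allow cases that were blocked. The function name `summarize`, the label strings, and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical label names; the real report schema is not shown in this document.
DECISIONS = ("allow", "warn", "block")


def summarize(pairs):
    """Summarize (expected, predicted) pairs into a small report dict."""
    pairs = list(pairs)
    confusion = Counter(pairs)  # keyed by (expected, predicted)
    total = len(pairs)
    accuracy = sum(confusion[(d, d)] for d in DECISIONS) / total if total else 0.0
    expected_block = sum(confusion[("block", p)] for p in DECISIONS)
    expected_allow = sum(confusion[("allow", p)] for p in DECISIONS)
    # ASSUMED definition: expected-block cases that were allowed.
    false_safe = confusion[("block", "allow")] / expected_block if expected_block else 0.0
    # ASSUMED definition: expected-allow cases that were blocked.
    overblock = confusion[("allow", "block")] / expected_allow if expected_allow else 0.0
    return {
        "accuracy": accuracy,
        "confusion": dict(confusion),
        "false_safe_rate": false_safe,
        "overblock_rate": overblock,
    }


# Made-up data, not a benchmark result: 400 cases per decision with six errors.
sample = (
    [("allow", "allow")] * 396 + [("allow", "block")] * 4
    + [("warn", "warn")] * 400
    + [("block", "block")] * 398 + [("block", "allow")] * 2
)
report = summarize(sample)
```

On this made-up sample, accuracy is 1194/1200, the false-safe rate is 2/400, and the overblock rate is 4/400. Keeping the two error rates separate reflects that allowing a dangerous command and blocking a harmless one are different kinds of failure.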
This benchmark does not prove complete command coverage, sandbox isolation, guaranteed interception, or governance of commands that bypass Termyte.