Most audit products say "we have 60 checks." We say "we have 60 checks and here are their current precision numbers." Every tool's accuracy metric is published daily in a public report — including the bad ones. Nothing to hide, only proof to provide.
No tool ships to production without passing all three validation tiers. They complement each other: the laboratory (fixture tests), the industry standard (parity runs), and the wild web (production spot checks).
Each tool has dozens of hand-curated test pages: positive (containing the issue), negative (clean), and edge (boundary cases). Every night at 02:00 UTC, CI runs each tool against its fixtures, compares the actual findings against the expected codes, and computes precision and recall.
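A minimal sketch of that nightly scoring step, assuming each fixture records the finding codes it expects alongside the codes the tool actually reported (the fixture shape and the codes below are illustrative, not the real schema):

```python
def score_tool(fixtures: list[dict]) -> tuple[float, float]:
    """Aggregate precision/recall over a tool's fixture suite.

    Each fixture carries 'expected' and 'actual' sets of finding codes.
    """
    tp = fp = fn = 0
    for fixture in fixtures:
        expected, actual = set(fixture["expected"]), set(fixture["actual"])
        tp += len(expected & actual)  # correctly reported codes
        fp += len(actual - expected)  # false alarms
        fn += len(expected - actual)  # missed issues
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# One positive page, one clean page, one edge case (codes are hypothetical):
fixtures = [
    {"expected": {"A11Y-001"}, "actual": {"A11Y-001"}},  # positive: caught
    {"expected": set(), "actual": {"A11Y-001"}},         # negative: false alarm
    {"expected": {"A11Y-002"}, "actual": set()},         # edge: missed
]
print("precision=%.2f recall=%.2f" % score_tool(fixtures))  # precision=0.50 recall=0.50
```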
Tools with industry counterparts (Lighthouse, axe-core, Mozilla Observatory, SSL Labs, Google Rich Results Test, IAB CMP Validator) run in parallel with their reference tool against a fixed cohort of 10 sites. The same 10 sites are re-run every day, and the cohort doesn't change until our tool reaches parity with the reference.
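The text above doesn't pin down the parity metric itself; one plausible sketch is a per-site agreement score between our findings and the reference tool's, averaged over the 10-site cohort (the function, site, and codes here are hypothetical, and mapping our codes onto the reference tool's vocabulary is assumed to happen upstream):

```python
def agreement_rate(ours: dict[str, set[str]], reference: dict[str, set[str]]) -> float:
    """Jaccard-style overlap of finding sets, averaged over the cohort.

    Keys are site URLs; values are the finding codes each tool reported.
    """
    scores = []
    for site, our_codes in ours.items():
        ref_codes = reference[site]
        union = our_codes | ref_codes
        scores.append(len(our_codes & ref_codes) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

ours = {"https://example.com": {"PERF-LCP", "PERF-CLS"}}
ref = {"https://example.com": {"PERF-LCP"}}
print(f"agreement={agreement_rate(ours, ref):.2f}")  # agreement=0.50
```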
Lab is lab and industry is industry. But how do tools behave on real customer sites, with aggressive CDNs, JS-heavy SPAs, and badly coded legacy CMSes? Every Monday, 10 random findings per plugin are pulled from last week's production scans, and our in-house Compliance/QA team tags each as ✅ true_positive / ❌ false_positive / ❓ uncertain.
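A hedged sketch of that Monday sampling step, assuming findings are stored with a plugin name and a scan date (both field names are invented for illustration):

```python
import random
from datetime import date, timedelta

def sample_for_review(findings: list[dict], per_plugin: int = 10) -> list[dict]:
    """Pick up to `per_plugin` random findings per plugin from the last 7 days."""
    week_ago = date.today() - timedelta(days=7)
    by_plugin: dict[str, list[dict]] = {}
    for f in findings:
        if f["scanned_on"] >= week_ago:  # assumed date field on each finding
            by_plugin.setdefault(f["plugin"], []).append(f)
    sample = []
    for candidates in by_plugin.values():
        sample.extend(random.sample(candidates, min(per_plugin, len(candidates))))
    return sample  # each sampled finding then gets a QA tag:
                   # true_positive / false_positive / uncertain
```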
The triad is the baseline; some plugin types need more: determinism checks for LLM-based tools, W3C ACT participation for accessibility tools, and regulatory expert panels for compliance tools.
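For the determinism check, one simple form is to run an LLM-based tool repeatedly on the same input and require identical findings; the `run_tool` callable below is a stand-in for the real tool interface, which for an LLM back end implies temperature 0 and a pinned model version:

```python
from typing import Callable

def is_deterministic(run_tool: Callable[[str], set[str]], page: str, runs: int = 5) -> bool:
    """Run the tool `runs` times on the same page; findings must never vary."""
    baseline = run_tool(page)
    return all(run_tool(page) == baseline for _ in range(runs - 1))
```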
Saying "this is compliant, this isn't" isn't enough. How the finding was produced is always visible in the report. The user can answer "how much should I trust this?" from the report itself.
Every finding's JSON output also carries mandatory method (how it was measured), limitations (caveats), and evidence.guidelines (regulator reference) fields. When an auditor asks "how did you produce this finding?" the answer comes straight from the report.
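A sketch of what such a finding could look like, with a guard that enforces the mandatory fields; the values are invented, and only the method, limitations, and evidence.guidelines keys come from the contract described above:

```python
import json

finding = {
    "code": "CMP-003",  # hypothetical finding code
    "status": "non_compliant",
    "method": "Headless-browser scan of the consent banner before any interaction",
    "limitations": "Banners rendered only after geo-detection may be missed",
    "evidence": {"guidelines": "EDPB Guidelines 05/2020 on consent"},
}

def validate_finding(f: dict) -> None:
    """Reject any finding missing the mandatory provenance fields."""
    for field in ("method", "limitations"):
        if not f.get(field):
            raise ValueError(f"finding missing mandatory field: {field}")
    if not f.get("evidence", {}).get("guidelines"):
        raise ValueError("finding missing mandatory field: evidence.guidelines")

validate_finding(finding)
print(json.dumps(finding, indent=2))
```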
If a plugin drops to 71% precision, the public report shows 71%. We fix the code, rerun the next day, and repeat until the curve goes back up. We don't delete the old number.
Each daily report stays at /validation-report/archive/<date>.html. "It was 78% in February 2026, 96% in May 2026" — improvement is provable.
The customer contract carries this line: "EvidaLux publishes precision/recall/agreement metrics for every plugin daily in a public report." This is a legal commitment.
Not just HTML: the raw JSON lives at /api/validation/latest, so competitors and auditors alike can fetch and verify it from a script.
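A minimal sketch of such a script using only the standard library; the endpoint path is the one published above, but the host name and the response shape (per-plugin objects with precision and recall) are assumptions:

```python
import json
from urllib.request import urlopen

BASE_URL = "https://evidalux.example"  # hypothetical host

with urlopen(f"{BASE_URL}/api/validation/latest") as resp:
    report = json.load(resp)

for plugin in report.get("plugins", []):  # assumed response shape
    print(plugin["name"], plugin["precision"], plugin["recall"])
```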
60 audit tools. Every precision number, every day, public. Proof before pitch.