Methodology Over Marketing

Why trust EvidaLux?

Most audit products say "we have 60 checks." We say "we have 60 checks and here are their current precision numbers." Every tool's accuracy metric is published daily in a public report — including the bad ones. Nothing to hide, only proof to provide.

  • 60 audit tools (plugins)
  • 3+9 validation methods (triad + 9 add-ons)
  • ~660 golden corpus fixtures
  • Daily regeneration of the public report
  • ≥0.95 precision target
3 core methods

Triple-validation architecture

No tool ships to production without passing all three. They complement each other: laboratory, industry-standard, and the wild web.

Method 1

Golden Corpus — Known-truth inputs

Each tool has dozens of hand-curated test pages: positive (containing the issue), negative (clean), and edge (boundary cases). Every night at 02:00 UTC, CI runs each tool against its fixtures, compares actual findings against expected codes, and computes precision and recall. A sketch of that diff follows the list below.

  • Min 5 positive + 3 negative + 3 edge fixtures per tool
  • Automated diff between expected and actual finding codes
  • Target: precision ≥ 0.95, recall ≥ 0.90
  • Below threshold → the tool is downgraded in the Confidence Taxonomy or pulled from production
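
A minimal sketch of the nightly diff, assuming a per-fixture set of expected finding codes and a run_tool callable; both names are illustrative, not EvidaLux internals. Only the 0.95/0.90 thresholds come from the list above.

```python
# Sketch of the nightly golden-corpus check. Fixture layout and
# run_tool() are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Fixture:
    html_path: str
    expected_codes: set[str]  # finding codes this page should trigger

def score_tool(fixtures: list[Fixture], run_tool) -> tuple[float, float]:
    """Diff expected vs. actual finding codes; return (precision, recall)."""
    tp = fp = fn = 0
    for fx in fixtures:
        actual = set(run_tool(fx.html_path))   # codes the plugin reported
        tp += len(actual & fx.expected_codes)  # correct findings
        fp += len(actual - fx.expected_codes)  # reported but not expected
        fn += len(fx.expected_codes - actual)  # expected but missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Gate from the list above: below threshold, downgrade or pull the tool.
# precision, recall = score_tool(fixtures, my_plugin.run)
# assert precision >= 0.95 and recall >= 0.90
```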
Method 2

Cross-Tool Agreement — Beat the industry standard

Tools with an industry counterpart (Lighthouse, axe-core, Mozilla Observatory, SSL Labs, Google Rich Results Test, IAB CMP Validator) are run in parallel with the reference tool against the same 10 sites. The cohort doesn't change until our tool reaches parity; the same 10 sites are re-run every day. A sketch of the agreement score follows the list below.

  • ~25 of our plugins have a reference-tool counterpart
  • Target: agreement ≥ 0.90; > 15% deviation → DETECTED status
  • The 10-site cohort is locked until we hit parity
  • When the reference tool errs: 3-source majority vote (reference + second tool + human review)
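
A hedged sketch of the agreement computation, assuming each tool's output on a site can be reduced to a dict of shared check IDs mapped to pass/fail booleans; the mapping between our finding codes and the reference tool's is the hard part, and it is glossed over here.

```python
# Sketch: cross-tool agreement over the locked 10-site cohort.
# The binary per-check outcome model is an assumption.
def site_agreement(ours: dict[str, bool], reference: dict[str, bool]) -> float:
    """Fraction of shared checks on one site where both tools agree."""
    shared = ours.keys() & reference.keys()
    if not shared:
        return 1.0  # nothing comparable on this site
    return sum(ours[k] == reference[k] for k in shared) / len(shared)

def cohort_agreement(pairs: list[tuple[dict, dict]]) -> float:
    """Mean agreement across the cohort (one (ours, reference) pair per site)."""
    return sum(site_agreement(o, r) for o, r in pairs) / len(pairs)

# Gate from the list above: >= 0.90 is the target; deviation over 15%
# (agreement < 0.85) downgrades findings to DETECTED.
```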
Method 3

Production Sampling — Wild-web calibration

Lab is lab. Industry tools are industry tools. But how do the tools behave on real customer sites, behind aggressive CDNs, on JS-heavy SPAs, on badly coded legacy CMSes? Every Monday, 10 random findings per plugin are pulled from the previous week's production scans, and our in-house Compliance/QA team tags each one as ✅ true_positive / ❌ false_positive / ❓ uncertain. A Cohen's κ sketch for the reviewer-calibration gate follows the list below.

  • 600 findings reviewed manually per week (60 plugins × 10 samples)
  • Reviewer calibration: Cohen's κ ≥ 0.80 mandatory
  • Production precision published weekly in the public report
  • Confirmed false positive → becomes a fixture, golden corpus grows
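
A minimal Cohen's κ over the three tags above. The two-reviewer overlap sample it assumes is an illustration, not a documented EvidaLux procedure.

```python
# Cohen's kappa for reviewer calibration over the same set of findings.
from collections import Counter

LABELS = ("true_positive", "false_positive", "uncertain")

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Agreement between two reviewers, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # Chance agreement: both raters pick the same label independently.
    expected = sum((ca[l] / n) * (cb[l] / n) for l in LABELS)
    if expected == 1.0:
        return 1.0  # degenerate case: both always use a single label
    return (observed - expected) / (1.0 - expected)

# Calibration gate from the list above: kappa must be >= 0.80.
```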
+9 add-on methods

Defense layers stacked on top of the triad

The triad is the base; depending on plugin type, more layers are stacked on top: determinism checks for LLM-based tools, W3C ACT participation for accessibility, regulatory expert panels for compliance.

5. Mutation Testing — controlled degradation, blind-spot detection (monthly)
6. Adversarial Test Suite — pages designed to fool the plugin
7. Determinism Check — 10× rerun on LLM plugins, Cohen's κ
8. Inter-Rater (3 LLMs) — GPT-4o + Claude Opus + Gemini majority vote
9. Public Benchmark — W3C ACT-rules, IAB CMP Validator, SchemaOrg test corpus
10. Regulatory Expert Panel — biannual external counsel (GDPR + EAA) review
11. Drift Detection — daily 3σ deviation alerts on finding rate (sketched below)
12. Customer False-Positive Loop — "Is this wrong?" button → fixture feedback
+ Confidence Taxonomy — every finding gets a VERIFIED/DETECTED/SUSPECTED badge
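
For the drift-detection layer (item 11), a sketch under assumptions: a 30-day rolling baseline and one finding-rate sample per day, neither of which is specified here.

```python
# Sketch of a 3-sigma drift alert on a plugin's daily finding rate.
from statistics import mean, stdev

def drift_alert(history: list[float], today: float, sigmas: float = 3.0) -> bool:
    """True if today's rate deviates > `sigmas` std devs from the baseline."""
    baseline = history[-30:]                  # rolling 30-day window (assumption)
    mu, sd = mean(baseline), stdev(baseline)  # needs >= 2 historical points
    if sd == 0.0:
        return today != mu                    # flat baseline: any change is drift
    return abs(today - mu) > sigmas * sd
```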
Confidence Taxonomy

Every finding's epistemic status is disclosed

Saying "this is compliant, this isn't" isn't enough. How the finding was produced is always visible in the report. The user can answer "how much should I trust this?" from the report itself.

🟢 VERIFIED — HTTP/DNS/TLS handshake parse, JSON-LD parse, header read. Deterministic, no judgment.
🟡 DETECTED — DOM regex, heuristic substring, HEAD probe. Tolerance margin exists.
🟠 SUSPECTED — LLM response interpretation, vision-LLM screenshot review, NLP semantic claim.
INCONCLUSIVE — plugin ran but data is insufficient. Tells the user "rerun this scan."
🔍 MANUAL_REQUIRED — plugin can't audit; human review needed. Saying "automated check not possible" is honesty.
OUT_OF_SCOPE — plugin not applicable. E.g. an EU-only plugin is skipped on a US-only site.

Every finding's JSON output also carries mandatory method (how it was measured), limitations (caveats), and evidence.guidelines (regulator reference) fields. When an auditor asks "how did you produce this finding?" the answer comes straight from the report.
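
An illustrative payload, with field values invented for the example; only the mandatory field names (method, limitations, evidence.guidelines) and the confidence badge come from the text.

```python
# Example finding with the mandatory disclosure fields. All values
# below are invented for illustration.
import json

finding = {
    "code": "COOKIE_BANNER_MISSING",   # hypothetical finding code
    "confidence": "DETECTED",          # badge from the taxonomy above
    "method": "DOM regex over rendered HTML (heuristic)",
    "limitations": "Banners injected after user interaction may be missed.",
    "evidence": {
        "guidelines": "GDPR Art. 7; ePrivacy Directive Art. 5(3)",
    },
}
print(json.dumps(finding, indent=2))
```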

Transparency principle

Nothing to hide

Bad numbers get published too

If a plugin drops to 71% precision, the public report shows 71%. We fix the code, rerun the next day, and repeat until the curve goes up. We don't delete the old number.

Historical archive

Each daily report stays at /validation-report/archive/<date>.html. "It was 78% in February 2026, 96% in May 2026" — improvement is provable.

Contractual clause

The customer contract carries this line: "EvidaLux publishes precision/recall/agreement metrics for every plugin daily in a public report." This is a legal commitment.

Raw data is public

Not just HTML — raw JSON too: /api/validation/latest. Competitors and auditors alike can access and verify it programmatically.
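
A sketch of that scripted access, assuming a placeholder base domain and response shape; only the /api/validation/latest path comes from the text.

```python
# Pull the public metrics feed and print per-plugin precision.
# The domain and the response fields ("plugins", "id", "precision")
# are assumptions for the example.
import json
from urllib.request import urlopen

with urlopen("https://evidalux.example/api/validation/latest") as resp:
    report = json.load(resp)

for plugin in report.get("plugins", []):
    print(plugin["id"], plugin["precision"])
```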

Methodology over marketing

60 audit tools. Every precision number, every day, public. Proof before pitch.

Live validation report · See all 5 modules