EvidaLux Tools Validation Report

Every run of our 64 audit tools — behaviour lock, agreement with external references, and real-site oracles. Methodology over marketing — bad numbers publish too.

Generated: 2026-07-19T03:01:05.785186+00:00

Section 1 — Cross-Tool Agreement

Diff against industry-reference tools on the same 10-site cohort. cohort: 2026-05-cohort1

21 of 22 plugins measured · cohort coverage 91% · median agreement 1.00

PEND rows mean the reference-tool integration is queued. Each integration may use a different per-site match metric (see plugin.metric). Median agreement is taken across implemented rows only, and each row's agreement is over the sites that could be measured — read it next to coverage, not alone. A row whose coverage is at or below 50% cannot be GREEN (band_capped_by_coverage).

Plugin	Reference tool	Band	Agreement	Agreed / Compared	Comparison metric
`accessibility.axe`	Pa11y (htmlcs runner)	GREEN	1.00	10 / 10	hierarchical WCAG agreement (Pa11y htmlcs vs plugin axe-core): T1 filtered-SC Jaccard ≥0.50 · T2 violation-presence boolean · T3 noise tolerance (one side empty ≤ engine noise floor 2). WCAG 4.1.1 (deprecated in 2.2) excluded from both sides.
`compliance.iab_tcf`	httpx + lxml + Set-Cookie scan	GREEN	1.00	8 / 8 2 of 10 sites unmeasured	boolean TCF-surface agreement (independent vantage: Set-Cookie inspection for euconsent-v2 + raw-text JS-marker scan + lxml DOM parse — plugin uses BeautifulSoup + script-body scan only; empty/empty = both agree page has no IAB CMP)
`compliance.iab_tcf_verified`	Playwright + iab-tcf (TC-string decode)	GREEN	1.00	8 / 8 2 of 10 sites unmeasured	cmp_id agreement (independent Playwright probe + iab-tcf library decode of tcString vs plugin's JS-callback cmpId; divergence = CMP misconfiguration. Empty/empty = both unable to probe — common when playwright is missing from env)
`quality.api_test`	schemathesis (OpenAPI / Swagger spec loader)	GREEN	1.00	6 / 6 4 of 10 sites unmeasured	discovered-endpoint set-Jaccard ≥0.40 with mutual-empty match + \|union\| ≤2 small-set noise floor. Reference = schemathesis OpenAPI 2.0 / 3.x loader, probing 13 common spec locations (/openapi.json, /swagger.json, /v3/api-docs, /api-docs, /api/openapi.json, ...); spec content-sniffed (must contain `openapi`/`swagger` token) so HTML catch-alls don't false-positive. Plugin = quality.api_test's `_discover_endpoints` discovery sub-component called DIRECTLY (bypassing the registry runner) so the 50-request rate-limit burst probe doesn't fire — the cross-tool is comparing the discovery surface, and firing the burst against 10 cohort sites daily would add ~500 net-new requests/day of noise without signal. Paths normalised both sides: {param} placeholders + concrete digit / UUID / long-hex segments → {}, lowercased, trailing slash stripped — so spec-driven enumeration and convention-probe heuristics meet at the static-prefix shape. 0.40 threshold (vs 0.50 elsewhere) reflects fundamental engine heterogeneity (spec enumeration vs convention probing); most cohort sites hit mutual-empty match (no public OpenAPI spec AND no responding conventional /api* path).
`quality.lighthouse_perf`	Google PageSpeed Insights API	GREEN	1.00	9 / 9 1 of 10 sites unmeasured	accessibility delta ≤10 vs PSI v5 (strict, DOM-static — the environment-stable axis); performance + best-practices deltas ≤45 each (sanity cap — Lighthouse FAQ documents these categories as throttle/network-divergent between PSI's GCE upstream + production Chromium and our local subprocess; agreement metric reports per-category deltas in detail for transparency rather than averaging the strict a11y delta into the noisy perf/bp deltas)
`quality.owasp_zap_scan`	nuclei cve+exposure+misconfig templates	GREEN	1.00	8 / 8 2 of 10 sites unmeasured	HIGH/CRITICAL presence boolean (nuclei cve+exposure+misconfig templates vs plugin's ZAP spider+active-scan alerts; mirror of vulnerability_nuclei↔ZAP — set-Jaccard on rule IDs is undefined across two engines with disjoint namespaces; agreement = both surface at least one HIGH/CRITICAL OR both are clean of HIGH/CRITICAL)
`quality.responsive_test`	Google PageSpeed Insights API (Lighthouse mobile-friendly audits subset)	GREEN	0.90	9 / 10	3-axis responsive feature-vector agreement ≥2/3 (overflow / small_text / tap_targets). Reference = PSI Lighthouse mobile-strategy audits — content-width (overflow), font-size (small_text), tap-targets — scoring an axis as 'issue' when audit score is non-null AND < 0.9 (mirrors meta_tags charitable projection). Plugin = Playwright Chromium 3-viewport probe (mobile 375×812 / tablet 768×1024 / desktop 1440×900) projected to overflow / small_text axes; tap_targets has no plugin equivalent (always False on plugin side) so it's surfaced for transparency but ≥2/3 threshold means the 2 instrumented axes carry the agreement signal. Plugin runtime_error (Chromium / playwright missing in env) → site _skip with PEND, same env-graceful treatment as iab_tcf_verified.
`quality.vulnerability_nuclei`	OWASP ZAP REST API (baseline + passive)	GREEN	1.00	10 / 10	HIGH/CRITICAL presence boolean (OWASP ZAP spider + passive-scan alerts vs nuclei cve/exposure/misconfig/tech templates; engines and rule sets differ, so cross-tool set-Jaccard is undefined — agreement = both surface at least one HIGH/CRITICAL OR both are clean of HIGH/CRITICAL)
`search.lighthouse_seo`	Google PageSpeed Insights API	GREEN	1.00	9 / 9 1 of 10 sites unmeasured	seo score within ±10 of PSI v5 (mobile+desktop avg)
`security.exposed_files`	nuclei http/exposures/{files,configs} templates	GREEN	1.00	10 / 10	leaked-path set-Jaccard ≥0.50 (nuclei http/exposures/{files,configs} templates vs plugin's custom path-prober + content-sniffer; both emit a set of leaked URL paths per origin; empty/empty = both agree origin has no exposures)
`security.headers`	Mozilla Observatory v2 (MDN)	GREEN	1.00	8 / 8 2 of 10 sites unmeasured	letter grade within one band of Mozilla Observatory v2 (#685). Adjudicated against Observatory + hand curl: the two scales weight headers differently — chiefly a missing CSP, which Mozilla penalises to F where we land at D — so this compares relative posture, not an absolute grade, and a two-band gap is a real methodology difference, not a plugin bug. Sites that answer our fetch with a 4xx/5xx anti-bot wall (their real posture is behind the wall) are dropped, not scored.
`security.tls_deep`	SSL Labs API	GREEN	1.00	9 / 9 1 of 10 sites unmeasured	A-F grade-band agreement within ±1 letter (Qualys SSL Labs API v3 vs plugin's score-deduction grader; engines disagree on HSTS-preload + OCSP-stapling weightings the plugin's v1.0 doesn't probe, so exact-grade equality would over-penalize — the broad health bucket is the defensible signal)
`seo.broken_links`	linkchecker (W3C)	GREEN	0.90	9 / 10	in-domain set-Jaccard ≥0.50 OR union ≤2 small-set noise floor (linkchecker --recursion-level=1 --check-extern; external broken findings counted in transparency but excluded from agreement; small-set floor recognises that heterogeneous crawlers — linkchecker BFS vs our SharedFetcher — legitimately diverge on the single-broken-URL case from JS-injection, sitemap-only references, or anti-bot 403/200 flips)
`seo.canonical_audit`	lxml + httpx canonical-chain follower	GREEN	1.00	10 / 10	4-axis canonical classification agreement ≥3/4 (canonical_present / self_canonical / cross_host / chain_two_hop). Reference = vanilla httpx + lxml clean-room reimplementation of the plugin's chain-follow logic. Independent fetcher (no SharedFetcher etag/cache layer) and parser (lxml vs BeautifulSoup4). Plan §3.2 nominally maps this to Lighthouse SEO subset, but PSI's canonical audit doesn't follow chains — would silently drop the plugin's distinguishing checks; this row's purpose is to cross-tool exactly those.
`seo.freshness`	httpx + lxml independent sitemap lastmod walker	YELLOW	0.89	8 / 9 1 of 10 sites unmeasured	stale_ratio agreement within ±0.20 (httpx + lxml independent sitemap.xml lastmod walker vs plugin's combined sitemap + JSON-LD + HTTP Last-Modified scan; same 540-day stale cut-off, 50-URL cap; clean-room ref intentionally skips secondary signals so any plugin regression in sitemap parsing surfaces in the agreement number)
`seo.hreflang_validator`	langcodes (BCP 47) + lxml	GREEN	1.00	8 / 8 2 of 10 sites unmeasured	set-equality on invalid hreflang codes (langcodes/BCP 47 vs plugin regex; same page parsed independently via lxml vs BeautifulSoup; empty/empty = no hreflang or all codes pass)
`seo.meta_tags`	Google PageSpeed Insights API (Lighthouse SEO audits subset)	YELLOW	0.70	7 / 10	6-axis meta-tag feature-vector agreement ≥5/6 (title / description / viewport / canonical / is-crawlable / html-has-lang). Reference = PSI Lighthouse SEO audits (charitable OR across mobile+desktop strategies); plugin = BeautifulSoup raw-HTML parse. H1-count + Open Graph completeness excluded — no Lighthouse equivalent (kept in plugin output for customer reports). Lighthouse renders real Chromium with JS, so SPA-injected meta tags surface as divergence — the cross-tool's most valuable signal in this row.
`seo.robots_txt_audit`	Google robotstxt parser	GREEN	0.90	9 / 10	boolean homepage indexability agreement (Protego vs plugin verdict on User-agent: *; empty/empty = both report no readable robots.txt)
`seo.sitemap`	httpx + lxml inline-XSD sitemap.org 0.9 validator	GREEN	0.90	9 / 10	sitemap presence boolean parity + URL count delta ≤ max(10% of larger, 2 small-set noise floor). Reference = vanilla httpx + lxml with inline sitemap.org 0.9 XSD (urlset + sitemapindex schemas embedded clean-room, not fetched). Independent fetcher (no SharedFetcher), independent XSD-validation pipeline (plugin only well-formedness-checks). Both sides mirror plugin discovery (robots.txt Sitemap: directives → /sitemap.xml fallback, cap 5 docs) and count top-level <loc> only — no recursion into sitemap-index children — so URL counts are apples-to-apples; ref XSD-valid count surfaced in per-site detail for schema-conformance transparency.
`seo.structured_data`	validator.schema.org	PEND	—	—	Integration queued
`tech.dns_health`	dig +dnssec + Google DNS-over-HTTPS	GREEN	1.00	10 / 10	7-axis DNS posture feature-vector agreement ≥6/7 (AAAA / NS≥2 / SPF / DMARC-present / DMARC-strong / CAA / DNSSEC). Reference = OR of dig (BIND binary, system resolver) and Google DoH (independent resolver via HTTPS); plugin = dnspython. DKIM excluded — selector probing is informational only (selectors are private, miss ≠ absence).
`tech.stack_detection`	python-Wappalyzer (open-source fingerprint engine)	GREEN	0.90	9 / 10	set-Jaccard ≥0.30 within the plugin-detectable tech universe (python-Wappalyzer 2,000+ fingerprints filtered to the ~40 techs our PATTERNS table claims to fingerprint, so detector-coverage gaps — databases, OSes, build tools — don't count as plugin failures; name aliases normalised; small-set floor when filtered union ≤2; full Wappalyzer detections surfaced in ours_only_outside_universe / ref_only_outside_universe for transparency)

Section 2 — Real-World Correctness

Plugin findings on random real sites (US/Europe/Turkey/Arab/South America) scored against a real-browser + HTTP ground truth. Unreachable sites are INCONCLUSIVE (never gate); a gate failure is a false positive/negative on a reachable site.

Region	When	Verdict	Reachable	Inconclusive	FP	FN
arab, europe, south_america, turkey, us	2026-07-19 02:37 UTC	FAIL	88	12	1	0
arab, europe, south_america, turkey, us	2026-07-18 20:31 UTC	PASS	85	15	0	0
arab, europe, south_america, turkey, us	2026-07-18 19:22 UTC	FAIL	84	16	5	0
arab, europe, south_america, turkey, us	2026-07-18 18:18 UTC	PASS	18	7	0	0
arab, europe, south_america, turkey, us	2026-07-18 18:00 UTC	FAIL	20	5	3	0
arab, europe, south_america, turkey, us	2026-07-18 16:09 UTC	FAIL	20	5	44	1
arab, europe, south_america, turkey, us	2026-07-04 06:08 UTC	FAIL	20	5	1	1
arab, europe, south_america, turkey, us	2026-07-03 06:20 UTC	FAIL	19	6	1	1

Latest run — 2026-07-19 02:37 UTC

Plugin	Confirmed	FP	Inconclusive
`accessibility.eaa_mapping`	0	0	174
`compliance.accessibility_statement`	77	0	99
`compliance.age_verification`	0	0	88
`compliance.ai_disclosure`	0	0	88
`compliance.child_consent`	0	0	88
`compliance.cookie_consent`	0	0	198
`compliance.cross_border_transfer`	0	0	176
`compliance.dark_pattern`	88	0	176
`compliance.data_subject_request`	8	0	80
`compliance.dpo_contact`	0	0	88
`compliance.eu_representative`	0	0	88
`compliance.geo_consistency`	0	0	176
`compliance.iab_tcf`	0	0	88
`compliance.iab_tcf_verified`	6	0	82
`compliance.legal_disclosure`	85	0	179
`compliance.odr_link`	88	0	0
`compliance.pay_or_consent_wall`	0	0	88
`compliance.pricing_indication`	0	0	176
`compliance.privacy_policy_content`	37	0	120
`compliance.purchase_disclosure`	0	0	176
`compliance.required_pages`	280	0	72
`quality.api_test`	0	0	225
`security.exposed_files`	165	0	11
`security.headers`	627	0	306
`security.tls_deep`	256	1	90
`seo.broken_links`	33	0	35
`seo.canonical_audit`	52	0	5
`seo.duplicate_content`	0	0	89
`seo.freshness`	35	0	27
`seo.hreflang_validator`	76	0	7
`seo.meta_tags`	163	0	0
`seo.robots_txt_audit`	17	0	1
`seo.sitemap`	86	0	5
`seo.structured_data`	82	0	10
`tech.dns_health`	352	0	176
`tech.stack_detection`	0	0	88

Wrong findings — latest run (gate)

false_positive security.tls_deep/tls.no_https — https://www.isbank.com.tr/ — TLS handshake succeeded independently

EvidaLux Tools Validation Report

Section 1 — Cross-Tool Agreement

Section 2 — Real-World Correctness

Latest run — 2026-07-19 02:37 UTC

Wrong findings — latest run (gate)

Past reports

Legend — column meanings

Cross-Tool Agreement