Methodology · public reference · 2026-05-05

How Arenza measures AI visibility and accuracy.

A plain-language summary of the probe pipeline behind the dashboard. Six AI assistants, four markets, 75 prompts per (brand, market) pair, and a separate accuracy pillar with a verbatim quote behind every wrong claim.

Generative AI assistants now mediate an estimated 30–40% of branded research queries that previously flowed through classical web search. Unlike a search engine, an assistant collapses the answer down to a single paragraph: a buyer no longer scans ten links and decides which to trust — they read one synthesized response and act. That response may or may not mention a given brand, and may or may not state facts about that brand correctly.

This produces a measurement gap. Brands have decades of telemetry for the search surface (rank, impression share, click-through) and roughly nothing for the AI surface. Arenza closes that gap with a probe-based pipeline that measures two distinct properties of how a brand appears in AI answers:

  • Visibility — whether and how prominently the brand is named when a buyer asks a category question, across the assistants and markets that matter.
  • Accuracy — whether the assistant's stated facts about the brand match a canonical reference, with severity tiers and the verbatim quote behind every flagged claim.

Credibility in this space comes from the methodology being inspectable, not from marketing copy. This page documents how the pipeline works in enough detail to be independently reproduced or critiqued.

Sampling design

For each tracked brand we populate a coverage grid of (assistant × market) cells. Six assistants are in scope today, including ChatGPT (OpenAI), Gemini (Google), and Perplexity. Four markets are in scope: United States (en-US), United Kingdom (en-GB), Germany (de-DE), and Japan (ja-JP). That yields 6 × 4 = 24 cells per brand. Markets are realized through prompt-language localization plus, where the assistant API supports it, a system-locale hint.

Within each cell the pipeline draws between 50 and 200 samples, scaled upward for high-volume brands and downward for cold-start brands where cost dominates marginal precision. The standard normal-approximation interval at the worst-case probability p = 0.5 yields N ≈ 96 for a ±10% margin of error and N ≈ 384 for ±5%; the per-cell sample size is recorded with every result so confidence intervals can be reconstructed downstream rather than asserted by us in marketing copy.
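
A minimal sketch of that arithmetic, assuming the standard normal approximation the text cites; function names are illustrative, not the pipeline's actual code:

  import math

  Z_95 = 1.96  # two-sided 95% critical value

  def samples_for_margin(margin, p=0.5):
      """Samples needed so the 95% half-width is <= margin at proportion p."""
      return math.ceil(Z_95 ** 2 * p * (1 - p) / margin ** 2)

  def reconstruct_ci(mention_rate, n):
      """Rebuild the 95% interval downstream from a recorded per-cell N."""
      half = Z_95 * math.sqrt(mention_rate * (1 - mention_rate) / n)
      return (max(0.0, mention_rate - half), min(1.0, mention_rate + half))

  samples_for_margin(0.10)  # 97; the text's N ≈ 96 is the pre-ceiling value 96.04
  samples_for_margin(0.05)  # 385; pre-ceiling 384.16 ≈ 384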

Across a quarterly research cohort of 1,000 brands at the standard probe count, this pipeline produces approximately 24,000 (brand, assistant, market) data points per quarter, the base on which the public benchmark reports are built.

Prompt generation

Probes are constructed to mirror the questions a buyer asks during genuine research, not the questions a brand wishes to be asked. Two design rules govern every prompt set:

  • Unbranded majority. At least 70% of probes per brand do not contain the brand's name, the brand's product names, or any brand-specific jargon. Branded probes return tautological positive responses (the brand is the subject of the question, so it appears in the answer) and therefore measure little about organic surfacing. The 70% floor is enforced programmatically by the prompt-writer agent that drafts the matrix; the unbranded share is logged per brand for audit.
  • Buyer voice. Probes are written in the first person from the perspective of a buyer with a stated job-to-be-done, not in the third person from an analyst's perspective. “I'm setting up a small video studio and need…” produces measurably different retrieval and different mentions than “list the top brands for video studios.”

Inside each cell, prompts span three orthogonal axes: 5 personas (role, seniority, organization size, budget tier, technical depth, derived from the brand's stated ICP), 5 intent classes (discovery, comparison, validation, problem-solving, purchase), and 3 temperature samples (or three independent draws separated in time, where the assistant API does not expose a temperature control). The Cartesian product yields 5 × 5 × 3 = 75 prompt variants per (brand, market) pair, each issued to each assistant in scope.
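
The expansion itself is mechanical. A minimal sketch with illustrative axis values (the real personas are derived per brand from the stated ICP):

  from itertools import product

  personas = ["solo_creator", "smb_ops_lead", "mid_market_it",
              "enterprise_buyer", "technical_evaluator"]  # illustrative only
  intents = ["discovery", "comparison", "validation",
             "problem-solving", "purchase"]
  draws = [0, 1, 2]  # temperature samples, or time-separated draws

  prompt_matrix = [
      {"persona": persona, "intent": intent, "draw": draw}
      for persona, intent, draw in product(personas, intents, draws)
  ]
  assert len(prompt_matrix) == 75  # 5 × 5 × 3 per (brand, market) pair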

Mention extraction

Every returned answer is parsed for brand mentions in two stages.

The first stage is a deterministic regex layer keyed on the brand name and the brand's known aliases — the legal name, common abbreviations, the domain stem, and product family names. This stage is fast, cheap, and recovers the large majority of mentions across the multi-LLM coverage grid.

The second stage is an LLM-judge fallback that runs only when the regex layer is ambiguous (e.g., the brand's name is a common English word like “Apple” or “Square,” or the answer refers to the brand by a paraphrase like “the German camera company that makes the SL2”). The judge is given the answer, the brand's canonical description, and a strict yes/no/uncertain rubric. Judge calls are themselves sampled at low temperature and aggregated by majority. The split between regex-resolved and judge-resolved mentions is recorded per cell so the pipeline's reliance on the more expensive, more drift-prone judge layer is auditable.
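
A minimal sketch of the two-stage cascade, assuming a hypothetical judge_mention() wrapper around the majority-voted LLM judge; alias handling is simplified:

  import re

  def alias_pattern(aliases):
      """One case-insensitive pattern over the brand's known aliases."""
      escaped = sorted((re.escape(a) for a in aliases), key=len, reverse=True)
      return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

  def detect_mention(answer, aliases, ambiguous, judge_mention):
      """Stage 1: deterministic regex. Stage 2: judge, only when ambiguous."""
      if alias_pattern(aliases).search(answer):
          if not ambiguous:
              return "mentioned", "regex"
          # Common-word brand names ("Apple", "Square") fall through to the judge.
          return judge_mention(answer), "judge"
      if ambiguous:
          # Paraphrase mentions carry no alias string; only the judge can see them.
          return judge_mention(answer), "judge"
      return "absent", "regex"

The second element of the return value records which layer resolved the mention, which is the per-cell regex/judge split described above.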

Wrong-claim detection (3 severity tiers)

For every mention, the surrounding sentence is parsed for assertions about the brand: founding year, headquarters, product features, pricing, customer base, and other structured facts. Each parsed assertion is compared against a canonical reference for the brand, constructed in priority order from (1) the brand's own llms.txt / llms-full.txt if published, (2) a structured profile maintained by the brand or its agency inside the Arenza dashboard, and (3) public structured-data sources such as Wikidata.
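
A minimal sketch of the priority-order resolution, using hypothetical attribute names for the three sources:

  def canonical_reference(brand):
      """Resolve canonical truth in priority order; None marks a measurement gap."""
      for source in (brand.llms_txt,           # (1) published llms.txt / llms-full.txt
                     brand.dashboard_profile,  # (2) brand/agency-maintained profile
                     brand.wikidata_profile):  # (3) public structured data
          if source is not None:
              return source
      return None  # reported as a gap, never imputed (see Limitations)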

Discrepancies are recorded as wrong claims and assigned one of three severity levels:

  • Critical (factual error) — a verifiable fact stated incorrectly: wrong founding year, wrong headquarters city, wrong price, wrong product capability. Acted on first because it directly misleads a buyer about a hard fact.
  • High (positioning error) — a fact stated in a way that materially mischaracterizes the brand's position: a premium offering described as “budget,” a B2B-only product described as consumer, a flagship described as deprecated.
  • Medium (attribution error) — a fact correctly stated but attributed to the wrong entity: a feature of a competitor attributed to the brand, or vice versa. Important to fix because it dilutes brand-attribution share even when individual statements are technically true.

Every wrong-claim record includes a verbatim quote of the offending sentence and a pointer to the canonical reference contradicting it, so brand-side review is auditable rather than opaque.
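
A minimal sketch of what one wrong-claim record carries, with illustrative field names rather than the pipeline's actual schema:

  from dataclasses import dataclass
  from enum import Enum

  class Severity(Enum):
      CRITICAL = "factual_error"      # wrong hard fact
      HIGH = "positioning_error"      # materially mischaracterized position
      MEDIUM = "attribution_error"    # true fact, wrong entity

  @dataclass(frozen=True)
  class WrongClaim:
      brand: str
      assistant: str
      market: str
      field: str              # e.g. "founding_year", "headquarters", "pricing"
      asserted_value: str     # what the assistant said
      canonical_value: str    # what the reference says
      severity: Severity
      verbatim_quote: str     # the offending sentence, copied exactly
      reference_source: str   # "llms.txt" | "dashboard_profile" | "wikidata"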

Share-of-voice formula

For a brand b, assistant a, market m, the share-of-voice is the intent-weighted mean mention rate across the prompt matrix:

SoV(b, a, m) = Σ_i w_i · ( 1 / |P_i| ) · Σ_{p ∈ P_i} mentions(b, p, a, m) / N_{p,a,m}

where:
  i           ∈ {discovery, comparison, validation, problem-solving, purchase}
  w_i         per-intent weight (uniform by default; configurable per brand)
  P_i         set of prompts in intent class i
  N_{p,a,m}   per-cell sample size for prompt p in (assistant a, market m)

Competitive share-of-voice replaces the per-prompt mention rate with mentions(b, p, a, m) divided by the summed mentions of the brand and a configured competitor set, so “our 18% vs their 22%” comparisons are computed against the same prompt matrix and the same sample sizes.
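
A minimal sketch of the roll-up, assuming mention counts have already been extracted; the input shapes are illustrative:

  def share_of_voice(cells, weights):
      """
      cells: {intent: [(mentions, n_samples), ...]} for one (brand, assistant, market).
      weights: {intent: w_i}, summing to 1 (uniform by default).
      Computes sum_i w_i * mean over prompts in intent class i of mentions / N.
      """
      sov = 0.0
      for intent, prompts in cells.items():
          rates = [m / n for m, n in prompts]
          sov += weights[intent] * sum(rates) / len(rates)
      return sov

  intents = ("discovery", "comparison", "validation", "problem-solving", "purchase")
  uniform = {i: 1 / len(intents) for i in intents}  # the default w_i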

Cross-LLM divergence index

A central observation motivating the multi-LLM design is that the six assistants disagree, sometimes substantially, on which brands surface for a given prompt. Reporting on only one assistant systematically under-reports cross-assistant variance and conceals divergence the brand needs to act upon.

We quantify the disagreement with a cross-LLM divergence index based on the Jensen-Shannon divergence between assistant answer distributions. For a fixed prompt p in market m, let D_a be the empirical distribution over the brands mentioned by assistant a across the cell's samples (treating “no brand mentioned” as a distinct outcome). For each pair of assistants we compute:

JSD(D_a, D_a') = ½ · KL(D_a || M) + ½ · KL(D_a' || M)
M               = ½ · (D_a + D_a')

CLDI(p, m) = ( 1 / 15 ) · Σ_{a < a'} JSD(D_a, D_a')   over all C(6,2) = 15 pairs

The Jensen-Shannon divergence is symmetric, bounded in [0, log 2], and well-defined even for distributions with disjoint support. CLDI close to 0 indicates the assistants surface the same brands at the same rates; CLDI close to log 2 ≈ 0.693 indicates they surface entirely disjoint brand sets. A brand-level CLDI is the prompt-weighted mean over the prompt matrix; a high CLDI brand has a fragmented assistant footprint and is structurally exposed to single-assistant blind spots.
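
A minimal sketch of the index using natural-log KL, so the upper bound is log 2 ≈ 0.693; each distribution is a dict over brand labels, with "<none>" as the no-mention outcome:

  import math
  from itertools import combinations

  def kl(p, q):
      """KL(p || q); terms with zero mass in p contribute nothing."""
      return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)

  def jsd(p, q):
      keys = set(p) | set(q)
      p_full = {k: p.get(k, 0.0) for k in keys}
      q_full = {k: q.get(k, 0.0) for k in keys}
      m = {k: 0.5 * (p_full[k] + q_full[k]) for k in keys}  # mixture distribution
      return 0.5 * kl(p_full, m) + 0.5 * kl(q_full, m)

  def cldi(distributions):
      """Mean pairwise JSD; 15 pairs for six assistants."""
      pairs = list(combinations(distributions, 2))
      return sum(jsd(a, b) for a, b in pairs) / len(pairs)

  chatgpt = {"BrandA": 0.6, "BrandB": 0.3, "<none>": 0.1}  # illustrative counts
  gemini  = {"BrandC": 0.8, "<none>": 0.2}
  jsd(chatgpt, gemini)  # near log 2: largely disjoint brand sets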

Limitations and transparency

Honest limits of the methodology:

  • Probe-based, not census. The pipeline measures a structured sample of (prompt, assistant, market) cells, not the population of every answer ever generated. The roll-up estimators are unbiased under the design assumptions, and per-cell N is recorded so confidence intervals can be reconstructed.
  • Stochasticity is mitigated, not eliminated. Repeating the same probe yields different answers. Per-cell sampling tightens the within-cell estimator, but a single user's single query at a single moment may still observe an outcome away from the cell mean. The methodology measures a distribution, not a guaranteed individual experience.
  • Model updates break temporal comparability. An assistant's underlying model may change between observation windows, breaking direct period-over-period comparison. We record the assistant identifier and observation timestamp; we do not yet have a principled framework for normalizing measurements across model updates. This is honestly an open problem.
  • Locale via prompt language plus system locale. Market targeting is realized through prompt-language localization and, where the API supports it, an explicit system-locale hint. This does not fully capture per-user personalized assistant answers driven by sign-in state, location, and prior conversation history.
  • Wrong-claim detection requires canonical truth. Accuracy measurement compares assistant assertions to a brand's canonical reference. Brands without a published llms.txt, without a maintained dashboard profile, and without a Wikidata presence are reported as a measurement gap rather than imputed.
  • Sampling biases. The persona / intent matrix is derived from the brand's stated ICP; if the ICP description is wrong, the matrix systematically misweights the buyer-funnel coverage. We mitigate this by surfacing the matrix to the agency for review before the first scan locks.

Collection windows. Quarterly research cohort scans run on a published calendar; ad-hoc per-brand scans run continuously. The dashboard always shows the scan timestamp alongside the score so a number is never read out of context.

LLM versions. The pipeline records the assistant identifier and, where exposed by the API, the underlying model snapshot. Comparing scores across observation windows that span a model update should be done with that caveat in mind; the dashboard surfaces a model-change marker on relevant trend charts.