Brand managers are asking a question that their existing tools can't answer: does our brand appear when someone asks an AI about our product category? Traditional monitoring — press mentions, social listening, review tracking — doesn't touch this. And the commercial tools that have emerged to fill the gap mostly share the same structural weaknesses that make them unreliable at enterprise scale.
This post covers what to measure, why, how to structure the pipeline, and where the hard problems still are. The approach here is based on working through these problems hands-on — not benchmarking tools from the outside.
What LLM brand monitoring actually is
Traditional brand monitoring tracks mentions — where your brand name appears in human-generated content. LLM brand monitoring is a different problem: when a language model generates a response to a relevant query, does it include your brand, and how does it characterize it?
LLMs don't retrieve content the way a search engine does. A model like ChatGPT or Gemini produces a synthesized response reflecting its training data and, for retrieval-augmented models, what it fetches in real time. Your brand's presence in that response depends on how clearly and authoritatively it's represented in the sources those models learned from — not just your search ranking. A brand can have strong organic positions and very low AI visibility, or the reverse. The optimization levers are different, and you need separate instrumentation for each.
The four metrics that matter
Four metrics give a complete enough picture of LLM brand presence to be actionable: coverage, frequency and position, sentiment accuracy, and citation source. Everything else we've tried was either too noisy or collapsed into one of these four.
Coverage in practice
Coverage, whether your brand shows up at all across the query space, is the hardest of the four to measure reliably because it depends entirely on query selection. You need a representative sample of what real users actually ask LLMs in your category, not just the branded queries you already track in your SEO stack. Get the query corpus wrong and every other metric is measuring the wrong thing.
Start with a base corpus from a keyword intelligence source (SEMrush, DataForSEO), then use an LLM to generate conversational variants — comparative framing, "which is better" phrasing, scenario-based questions. After deduplication and intent clustering, somewhere in the 800–1500 range per brand per cycle is workable. The goal is statistical coverage of the intent space, not exhaustiveness.
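To make that concrete, here is a minimal sketch of the variant-generation step using the OpenAI Python SDK. The prompt wording, model choice, and whitespace-normalized dedup key are illustrative assumptions, not the production pipeline; intent clustering (embeddings plus k-means or similar) would run after this step.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

VARIANT_PROMPT = (
    "Rewrite the keyword '{kw}' as {n} distinct conversational questions "
    "a real user might ask an AI assistant. Include comparative and "
    "scenario-based phrasings. One question per line."
)

def expand_keyword(keyword: str, n: int = 5) -> list[str]:
    """Generate conversational variants of one seed keyword."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable model works here
        messages=[{"role": "user",
                   "content": VARIANT_PROMPT.format(kw=keyword, n=n)}],
    )
    lines = resp.choices[0].message.content.splitlines()
    # Strip list bullets/numbering the model may prepend.
    return [l.lstrip("0123456789.-• ").strip() for l in lines if l.strip()]

def build_corpus(seed_keywords: list[str]) -> list[str]:
    """Seed keywords -> conversational variants -> naive dedup."""
    seen, corpus = set(), []
    for kw in seed_keywords:
        for q in [kw, *expand_keyword(kw)]:
            key = " ".join(q.lower().split())  # whitespace-normalized dedup key
            if key not in seen:
                seen.add(key)
                corpus.append(q)
    return corpus
```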
Frequency and position
Frequency alone undersells the signal: being the third brand mentioned in a list of five is a different thing from being the primary recommendation. Scoring mentions positionally (first mention at full weight, subsequent mentions at diminishing weight) produces a weighted frequency score more correlated with actual user behavior than raw counts.
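A minimal sketch of that positional scoring, assuming a geometric decay; the 0.5 constant is an illustrative choice, not a calibrated one:

```python
def weighted_frequency(mention_ranks: list[int], decay: float = 0.5) -> float:
    """Positionally weighted mention score for one brand.

    mention_ranks: 0-based rank of each mention across responses
    (0 = named first). First mentions get full weight, later
    mentions diminishing weight.
    """
    return sum(decay ** r for r in mention_ranks)

# Named first in one response, third in another:
# 1.0 + 0.25 = 1.25 weighted, versus a raw count of 2.
print(weighted_frequency([0, 2]))
```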
Sentiment accuracy
Sentiment accuracy is where things get operationally interesting. LLMs generate factually incorrect characterizations of brands more often than you'd expect — wrong use cases, outdated pricing, misattributed features. A secondary evaluation pass — using an LLM to score each brand mention against a ground-truth fact sheet — catches these. Mischaracterizations can then be flagged and, where possible, traced back to the source content that introduced the error.
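A sketch of that evaluation pass, again using the OpenAI SDK. The judge prompt and the JSON shape are assumptions, and the fact sheet is whatever ground truth the brand team maintains:

```python
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are verifying a brand mention against a fact sheet.
Fact sheet:
{facts}

Mention (extracted from an AI assistant's response):
{mention}

Return JSON: {{"accurate": true/false, "errors": ["..."]}} listing any
factually wrong claims (wrong use case, outdated pricing, misattributed
features)."""

def check_mention(mention: str, fact_sheet: str) -> dict:
    """Score one extracted brand mention against ground truth."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(facts=fact_sheet,
                                                  mention=mention)}],
    )
    return json.loads(resp.choices[0].message.content)
```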
Citation source
Citation source applies specifically to retrieval-augmented models: Perplexity, Bing Copilot, and Google AI Overviews (though each cites differently — Perplexity averages around 20 inline citations per response, Bing Copilot around 7, Google AI Overviews provides source lists but inline attribution is inconsistent). When a model cites a source alongside your brand mention, it tells you which content is doing the work and where brand authority in LLMs actually comes from. Storing these citations over time builds a clear picture of the citation ecosystem around your brand.
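Here is one way to shape those stored citations, as a minimal sketch; the field names are ours, and the model label is a pipeline convention rather than anything the APIs return:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from urllib.parse import urlparse

@dataclass
class CitationRecord:
    """One cited source observed alongside a brand mention."""
    query: str
    model: str    # e.g. "perplexity-sonar"; label is ours, not the API's
    brand: str
    url: str
    observed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    @property
    def domain(self) -> str:
        # Aggregating by domain over time reveals which sources
        # consistently carry the brand's visibility.
        return urlparse(self.url).netloc
```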
Why most SaaS tools fall short at enterprise scale
There are now dozens of commercial LLM brand monitoring tools — the space has grown fast, with OpenAI's acquisition of Promptfoo in early 2026 signaling further consolidation ahead. Most share three structural problems.
Sampling bias. Most tools use a fixed, generic query set rather than a brand-specific, intent-stratified sample. You end up measuring visibility on queries that don't reflect actual user behavior in your category. The numbers look clean but the denominator is wrong.
No longitudinal data. Brand visibility in LLMs changes as models are updated, as the content ecosystem shifts, as competitors publish more authoritative content. Most tools are snapshots, not time series. The right approach is storing every query response with full timestamps and model version metadata (sketched below), so you can answer the only question that actually matters once you have a baseline: when did this change, and what changed around the same time?
No custom entity sets. Enterprise clients care about specific product lines, how sub-brands are characterized, how they compare to specific competitors on specific use cases. Off-the-shelf tools give you what they've pre-built. It rarely matches the reporting hierarchy a real brand team uses.
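The record shape that makes the longitudinal approach work is simple. A minimal sketch, with field names as assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class QueryObservation:
    """One stored response: the unit of the time series."""
    query_id: str
    query_text: str
    provider: str               # "openai", "google", "perplexity", ...
    model_version: str          # the exact version string the API reported
    raw_response: str           # full text, kept for future re-parsing
    brands_mentioned: list[str]
    queried_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```

Nothing exotic: the leverage is in capturing the model version on every row and never discarding old records.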
Query set design is probably 60% of the problem. A sophisticated storage and analytics layer on top of a badly designed query corpus produces very precise garbage. This is where most early implementations go wrong.
A pipeline architecture that works
Five stages, with the design decisions that matter at each one: build the query corpus (covered above), run it against each model, extract structured brand mentions from the raw responses, score them for position, sentiment accuracy, and citations, then store everything behind an analytics layer. Storage carries the least obvious decisions, so it's worth spelling out.
The store is ClickHouse: a ReplacingMergeTree table, partitioned by date and brand. Keep the full raw response alongside the structured extract, because what you want to analyze will change and re-parsing historical data is expensive. An llm_entity_mentions materialized view pre-computes coverage and frequency daily so the dashboard layer doesn't hit the raw tables on every load.
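A minimal sketch of that schema via the clickhouse_connect client; column names other than llm_entity_mentions are illustrative, and the partitioning and ordering keys are one reasonable choice among several:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # connection details assumed

client.command("""
CREATE TABLE IF NOT EXISTS llm_responses (
    query_id      String,
    brand         LowCardinality(String),
    provider      LowCardinality(String),
    model_version String,
    raw_response  String,   -- full text, kept for re-parsing
    mentioned     UInt8,
    mention_rank  UInt16,   -- 0 = first brand named
    queried_at    DateTime
) ENGINE = ReplacingMergeTree
PARTITION BY (toYYYYMMDD(queried_at), brand)
ORDER BY (brand, query_id, queried_at)
""")

client.command("""
CREATE MATERIALIZED VIEW IF NOT EXISTS llm_entity_mentions
ENGINE = SummingMergeTree
ORDER BY (brand, day)
AS SELECT
    brand,
    toDate(queried_at) AS day,
    count()            AS responses,
    sum(mentioned)     AS mentions   -- coverage = mentions / responses
FROM llm_responses
GROUP BY brand, day
""")
```

What actually moves the needle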
Entity clarity matters more than most things. LLMs represent brands better when those brands have clear, unambiguous entity definitions across the web: a maintained Wikipedia page, consistent structured data on your own site and in third-party listings, a brand name that doesn't blur into another entity. If your brand name is a common word or shared with another company, the model will conflate them in ways that are hard to fix without fixing the underlying content ambiguity first.
Structured data has a real effect. Schema.org markup reaches LLM training pipelines — via Web Data Commons and similar corpora — and Microsoft has confirmed that Bing uses schema markup to inform how its LLMs understand content semantically. For retrieval-augmented models, it also improves how structured content is indexed and surfaced. The discipline of writing well-structured content produces cleaner training signal, which is part of why SEO-best-practice pages tend to get better AI representation than poorly structured ones.
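For illustration, the kind of Organization markup in question, built here as a Python dict and serialized to JSON-LD; the brand name and URLs are hypothetical:

```python
import json

# Hypothetical brand; the point is consistent, unambiguous entity markup.
organization_jsonld = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "ExampleBrand",
    "url": "https://www.examplebrand.com",
    "sameAs": [
        "https://en.wikipedia.org/wiki/ExampleBrand",
        "https://www.linkedin.com/company/examplebrand",
    ],
    "description": "ExampleBrand makes workflow automation software "
                   "for finance teams.",
}

# Embedded in each page as <script type="application/ld+json">...</script>
print(json.dumps(organization_jsonld, indent=2))
```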
Third-party citation is the highest-leverage lever. The sources that consistently appear in retrieval-augmented responses alongside positive brand mentions follow a clear pattern: industry analyst reports, high-authority review platforms (G2, which acquired Capterra and Software Advice from Gartner in early 2026; Gartner Peer Insights; major editorial outlets), and well-cited independent coverage. Research suggests brands are roughly 6.5x more likely to be cited via third-party sources than through their own domain. The practical implication: if you want to improve LLM visibility, the highest-ROI investment is accurate, prominent coverage in the sources LLMs already trust, not more content on your own site.
What's still unsolved
Three problems remain open, and I don't think anyone has clean solutions to them yet.
The biggest open problem is attribution from AI visibility to business outcome. We can measure whether your brand appears in LLM responses, how often, how prominently, how accurately. What we can't yet reliably measure is whether that visibility translates into conversions, signups, or pipeline. The path from "user asked ChatGPT, got a response mentioning Brand X" to "became a customer" is invisible — to us and, as far as I can tell, to everyone working in this space.
The second problem is model version tracking. Storing model version metadata with every record helps, but the major providers update continuously without always announcing what changed. Distinguishing "visibility dropped because the content ecosystem shifted" from "visibility dropped because the model was updated" is genuinely hard. Canary queries with known expected outputs can detect model drift — it's imprecise, but it's the most practical approach available right now.
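A minimal sketch of the canary approach; the queries and expected markers below are placeholders, and real canaries would be category-relevant prompts with stable expected brand sets:

```python
# Canary queries paired with a marker the response is expected to
# contain while the model is stable. Placeholders, not real canaries.
CANARIES = {
    "What is the capital of France?": "paris",
    "Who wrote Pride and Prejudice?": "austen",
}

def canary_drift(ask_model) -> float:
    """Fraction of canaries whose expected marker disappeared.

    ask_model: callable(query) -> response text. A jump in this score
    on a day the content ecosystem didn't move points at a model
    update rather than an ecosystem shift.
    """
    misses = sum(
        1 for query, marker in CANARIES.items()
        if marker not in ask_model(query).lower()
    )
    return misses / len(CANARIES)
```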
Finally, cross-model results are messier than expected. A brand's visibility profile on GPT-4o often diverges from its profile on Gemini or Claude — not just in magnitude but in which query types trigger the brand to appear. Different training data and different architectures produce different priors, but there's no reliable way to predict where those gaps will show up ahead of time. The only practical answer is to monitor all major models rather than picking the most popular one as a proxy for the rest.
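In practice that means fanning the same corpus across one adapter per provider. A minimal sketch, where the adapters and the substring-based mention check are simplifying assumptions:

```python
from collections.abc import Callable

# One adapter per provider: callable(query) -> response text.
# Wiring up the real API clients is assumed to happen elsewhere.
MODELS: dict[str, Callable[[str], str]] = {
    # "gpt-4o": ask_openai, "gemini": ask_gemini, "claude": ask_anthropic,
}

def coverage_by_model(corpus: list[str], brand: str) -> dict[str, float]:
    """Per-model coverage: share of queries whose response mentions the brand."""
    return {
        name: sum(brand.lower() in ask(q).lower() for q in corpus) / len(corpus)
        for name, ask in MODELS.items()
    }
```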
If you're working on something in this space, I'm happy to compare notes — reach out on LinkedIn.