Nibbs
Opinion · AI & Machine Learning · Web & Internet

I went down a GEO rabbit hole and came out with a scoring framework

A breakdown of the GEO scoring framework I built while researching AI visibility measurement, including the two-track model, the Gemini no-grounding problem, and the two execution mistakes that silently destroy score tracking over time.

I did not set out to build anything. I got curious about Generative Engine Optimisation (GEO) because I wanted to optimise my blog and IBZim for it.

So I decided to research how it actually works under the hood. How do you know if an AI mentions your brand? How would you track it? How would you compare your visibility across ChatGPT, Gemini, Perplexity, and Claude at the same time? One week later I had a scoring framework, a set of formulas, and a very clear picture of why most GEO tools on the market are giving people confident-looking wrong answers. Here is what I found.

What GEO actually is (and why it matters now)


GEO is the practice of making sure your brand shows up when someone asks an AI a question related to your space. Think of it as SEO, but instead of
ranking on Google's results page, you are being cited in an AI response.

This matters more than it might seem. A growing number of people are skipping search engines entirely and just asking ChatGPT or Perplexity for recommendations. If those tools never mention your brand, you are invisible to a segment of your audience that you are probably not even measuring.

The obvious question, then, is: how do you measure it?
That is where it gets complicated.

The problem nobody warns you about

For a start, the five major AI models (ChatGPT, Gemini, Perplexity, Claude, and Grok) do not return comparable data. At all.

Perplexity and OpenAI give you structured citation lists: ranked source URLs with positions. Gemini gives you grounding metadata when it decides to search. Claude and Grok mostly give you prose. Just text. No citations, no URLs, nothing structured.

If you send the same query to all five, pull whatever data comes back, and
average it into a single "GEO score," you are mixing fundamentally different
signals as if they are the same thing. A URL cited at position 3 is not the
same as a brand name appearing somewhere in a paragraph.

One is a hard signal.
The other is soft.

Most GEO tools do exactly this. The dashboard looks clean. The score looks
precise. But underneath, the methodology is treating apples and car parks as
the same fruit.

The fix I landed on was splitting measurement into two tracks.

Track A (Citation Score) covers the providers that return structured data:
Perplexity, OpenAI, and Gemini when it actually searches. For each query, you
check whether your domain appears in the citation list and where. Position
matters, so I used a logarithmic decay formula that rewards top placement
without making positions five through ten worthless. Being cited eighth still
means the model read your content and used it. That has value.
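
The exact decay formula is not spelled out here, so the following is only a sketch of the idea in Python. It assumes a DCG-style discount of 1 / log2(position + 1) and that citation URLs include a scheme such as https://. The names `citation_position` and `citation_score` are hypothetical.

```python
import math
from typing import Optional
from urllib.parse import urlparse


def citation_position(citations: list[str], domain: str) -> Optional[int]:
    """Return the 1-based position of the first citation on `domain`, or None."""
    domain = domain.lower()
    for index, url in enumerate(citations, start=1):
        host = (urlparse(url).hostname or "").lower()
        # Match the bare domain and any subdomain (www.ibzim.com, blog.ibzim.com).
        if host == domain or host.endswith("." + domain):
            return index
    return None


def citation_score(position: Optional[int]) -> float:
    """Assumed logarithmic decay: 1.0 at position 1, shrinking slowly after that."""
    if position is None:
        return 0.0  # the provider searched, but the domain was not cited
    return 1.0 / math.log2(position + 1)
```

Under this assumed curve, position 1 scores 1.0, position 3 scores 0.5, and position 8 still scores about 0.32, which matches the intent that lower positions keep some value.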

Track B (Presence Score) covers Claude, Grok, and ungrounded Gemini. Here
you are doing NLP on the response text. Did the brand get mentioned? How early in the answer? What is the sentiment around it? You run that detection with a fast, cheap model given a strict extraction prompt returning JSON. Not regex. Regex will miss "the publication IBzim" versus "ibzim.com" versus "IBZim" and cannot read sentiment.
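
As a rough illustration of the presence track, here is a Python sketch of the prompt and the parsing around it. The model call itself is left out because it depends on the provider. The JSON schema, the field names, and the 0.5 / 0.3 / 0.2 weighting in `presence_score` are all assumptions made for the example, not part of the framework as described.

```python
import json
from dataclasses import dataclass


@dataclass
class Presence:
    mentioned: bool
    first_mention_offset: float  # 0.0 = start of the answer, 1.0 = the very end
    sentiment: float             # -1.0 (negative) to 1.0 (positive)


def build_extraction_prompt(brand: str, aliases: list[str], answer: str) -> str:
    """Build a strict, JSON-only extraction prompt for a cheap model."""
    names = ", ".join([brand, *aliases])
    return (
        "You are an extraction tool. Reply with JSON only, no prose.\n"
        f"Brand and aliases: {names}\n"
        'Schema: {"mentioned": bool, "first_mention_offset": number 0 to 1, '
        '"sentiment": number -1 to 1}\n'
        f"Answer to analyse:\n{answer}"
    )


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def parse_presence(raw_json: str) -> Presence:
    """Parse the model's JSON reply, clamping values into their expected ranges."""
    data = json.loads(raw_json)
    return Presence(
        mentioned=bool(data["mentioned"]),
        first_mention_offset=_clamp(float(data.get("first_mention_offset", 1.0)), 0.0, 1.0),
        sentiment=_clamp(float(data.get("sentiment", 0.0)), -1.0, 1.0),
    )


def presence_score(p: Presence) -> float:
    """Hypothetical weighting: a mention counts most, then earliness, then sentiment."""
    if not p.mentioned:
        return 0.0
    earliness = 1.0 - p.first_mention_offset
    sentiment = (p.sentiment + 1.0) / 2.0  # rescale from -1..1 to 0..1
    return 0.5 + 0.3 * earliness + 0.2 * sentiment
```

Passing the aliases in the prompt is what lets the model treat "IBZim" and "ibzim.com" as the same brand, which is the case regex handles badly.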

The headline GEO Score fuses both tracks, weighted 65/35 in favour of citations because they are the harder and more reliable signal.
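
The 65/35 fusion is a plain weighted sum. A minimal Python sketch, assuming both track scores have already been normalised to the 0 to 1 range and that the headline number is scaled to 0 to 100 (`geo_score` is a hypothetical name):

```python
CITATION_WEIGHT = 0.65
PRESENCE_WEIGHT = 0.35


def geo_score(citation: float, presence: float) -> float:
    """Fuse the two track scores (each 0..1) into a headline score (0..100)."""
    for name, value in (("citation", citation), ("presence", presence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be between 0 and 1, got {value}")
    return 100.0 * (CITATION_WEIGHT * citation + PRESENCE_WEIGHT * presence)
```

For example, a citation score of 0.5 and a presence score of 0.2 fuse to roughly 39.5.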

The Gemini problem (and why it matters)

This is the finding that surprised me most.

Gemini does not always search, which is funny given that it is Google's model. On roughly 30 to 40 percent of queries, it answers from its training data and returns no grounding data whatsoever. No citations. Nothing.

Most GEO tools read that as "you were not cited."

That is wrong. "Gemini did not search on this query" and "Gemini searched and did not find you" are two completely different situations pointing to two
completely different fixes. Conflating them silently deflates your score and
sends you in the wrong direction.

The correct treatment is to flag no-grounding Gemini responses as presence
track only. You do not penalise the brand for a model behaviour the brand
cannot control.
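
In code, that routing rule might look like the Python sketch below. The `Response` shape and the `grounded` flag are hypothetical stand-ins for whatever each provider's API actually returns. The point is only that an ungrounded Gemini answer goes to the presence track instead of being scored as a citation miss.

```python
from dataclasses import dataclass, field
from enum import Enum


class Track(Enum):
    CITATION = "citation"
    PRESENCE = "presence"


STRUCTURED_PROVIDERS = {"perplexity", "openai"}
PROSE_PROVIDERS = {"claude", "grok"}


@dataclass
class Response:
    provider: str
    text: str
    citations: list[str] = field(default_factory=list)
    grounded: bool = False  # hypothetical: True when Gemini returned grounding metadata


def track_for(response: Response) -> Track:
    """Decide which track a single response should be scored on."""
    provider = response.provider.lower()
    if provider in STRUCTURED_PROVIDERS:
        return Track.CITATION
    if provider in PROSE_PROVIDERS:
        return Track.PRESENCE
    if provider == "gemini":
        # No grounding means Gemini did not search, so this is not a citation miss.
        return Track.CITATION if response.grounded else Track.PRESENCE
    raise ValueError(f"unknown provider: {response.provider}")
```

A grounded Gemini response that does not cite you still lands on the citation track and scores zero there, which is the "searched and did not find you" case.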

Trust me, this is the line between a GEO tool that is honest and one that just looks credible. To be fair, we can also acknowledge that the major GEO tools specifically address this, and that they also include GEO scoring for the Google AI Overview found in Search.

The two things that will break your tracking over time

Getting the measurement right is only half the problem. The execution details
will kill your data if you ignore them.

Non-determinism - Run the same query on the same model on different days and you will get different citations. Your score can swing by ten points for no reason. The fix is to run each query two or three times per scan and average the results. That doubles or triples your API cost per scan, which is not nothing, but without it your score is noise.
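
A small Python sketch of the averaging step, assuming a hypothetical `run_query` callable that sends one query to one model and returns that run's score. Returning the spread alongside the mean is an extra added here, so a noisy query can be spotted.

```python
from statistics import mean, pstdev
from typing import Callable


def averaged_score(
    run_query: Callable[[str], float], query: str, runs: int = 3
) -> tuple[float, float]:
    """Run the same query several times; return (mean score, population std dev)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    scores = [run_query(query) for _ in range(runs)]
    return mean(scores), pstdev(scores)
```

A consistently large spread on a query suggests its score should be read with more caution than the average alone implies.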

Query drift - Your GEO score is only meaningful relative to a fixed set of
queries. If the query set changes between scans, even slightly, your
score-over-time chart becomes useless. The score moved because the questions moved, not because your visibility changed. Pin the query set per brand and never silently regenerate it.
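
One way to make drift visible is to fingerprint the pinned query set and refuse to compare scans whose fingerprints differ. The Python sketch below assumes the query set is a plain list of strings. The function names are hypothetical, and treating any change at all (including reordering or a stray space) as drift is a deliberately strict choice.

```python
import hashlib
import json


def query_set_fingerprint(queries: list[str]) -> str:
    """Hash the exact query list so any edit, however small, changes the result."""
    canonical = json.dumps(queries, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_same_query_set(pinned_fingerprint: str, queries: list[str]) -> None:
    """Raise if the queries for this scan differ from the pinned set."""
    if query_set_fingerprint(queries) != pinned_fingerprint:
        raise ValueError(
            "query set changed since it was pinned; scores are not comparable"
        )
```

Storing the fingerprint next to each scan result means a changed query set shows up as an explicit error or a new series, not as a mysterious movement in the chart.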

I for one think query drift is the less obvious of the two and the more
damaging. You can eyeball a volatile score and suspect something is off. A slowly drifting query set is invisible.

What the final framework produces

A headline score from 0 to 100, broken into five bands:

| Score | Label | What it means |
| --- | --- | --- |
| 80 - 100 | Dominant | You own these queries |
| 60 - 79 | Strong | Consistently surfaced |
| 40 - 59 | Emerging | Present but inconsistent |
| 20 - 39 | Faint | Occasionally surfaced |
| 0 - 19 | Invisible | AI does not know you exist |
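
The bands above map onto a small lookup. In this Python sketch, the handling of fractional scores (79.5 counts as Strong, because each band starts at its lower bound) is an assumption, and `band_for` is a hypothetical name.

```python
# (lower bound, label) pairs, highest band first
BANDS = [
    (80, "Dominant"),
    (60, "Strong"),
    (40, "Emerging"),
    (20, "Faint"),
    (0, "Invisible"),
]


def band_for(score: float) -> str:
    """Return the band label for a headline score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: every valid score is at least 0")
```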


Below that, a per-engine breakdown showing where you are cited versus where you are invisible. That is the insight that drives action. "I'm at position 3 in Perplexity but invisible in Gemini" is a sentence that tells you exactly what to fix.

And a query-level table showing which specific questions you won and who beat you on the ones you lost. That is the consultant layer. The line that makes someone actually go and update their content.

Why I did not build it

So why do all that research, take the time to design a framework, and then not build anything with it? Because building was never the plan. Optimising was. Some platforms have already built similar frameworks, so I decided to stick to optimising my content instead of trying to build an Otterly clone.

The thinking was not wasted. It gave me a clear picture of what
rigorous GEO measurement actually looks like, and what most tools are quietly
getting wrong. If you are evaluating GEO tools right now, the two-track
distinction and the Gemini no-grounding problem are the things worth
scrutinising first. That is where the gap between a dashboard that looks
credible and one that is actually honest tends to live.