Schema, LLMs & The Low Bar For 'Evidence' In GEO

TL;DR: I ran a small experiment to get some insight into whether large language models actually parse schema markup or are just nodding politely in its direction. I put a fake company address (inside fantastically invalid JSON-LD, on a page about ducks) into the head of an HTML document, mentioned no address anywhere in the visible text, and then asked various LLMs where the company was based. They happily told me, several of them citing the “structured data” they had so studiously consulted.

The experiment was then picked up by Search Engine Roundtable, at which point British sarcasm met the LinkedIn carousel, the two annihilated each other in a small puff of smoke, and a section of the GEO community came away convinced I had just proved that LLMs are lovingly parsing schema exactly as Schema.org intended.

[Image: An AI search engine response demonstrating that Large Language Models read structured schema data. The top section shows a user prompt asking for a company's address from a specific URL. The AI correctly extracts a fictional address ("77 The Muddy Bank, South Pondshire..."). The bottom section shows the source code of the "schema" it read: a humorous, duck-themed JSON-LD script containing custom keys like waddleStyle: "Aggressive", reedNumber: "77", and quackVolume: "Loud". A cartoon duck points down at the code with a shocked expression.] The LinkedIn post responsible, patient zero of the schema confusion. Image Credit: Mark Williams-Cook

I had arguably proved the opposite. The schema was deliberately broken. The LLMs returned the information anyway, because as far as they were concerned, the JSON-LD was merely more text on the page, lightly garnished with curly braces. That distinction is the whole point, because a growing cohort of “GEO experts” is pointing at “the LLM returned information that was only in the schema” as cast-iron proof that LLMs are using schema as designed. They are doing nothing of the sort. They are reading the HTML and shrugging at the structure.

I’m not claiming schema is worthless. I think you should still use it. But the way it is currently being sold to clients (as a magical injection of LLM citations) is propped up on a remarkably thin pile of evidence, and I want to walk through why.

A Quick Refresher On What Schema Is Actually For

Schema, or Schema.org structured data, is a collaborative vocabulary built by Google, Microsoft, Yahoo, and Yandex to let webmasters embed machine-readable facts in their pages. The clue is in the name. It’s a schema. A shared, agreed structure that lets a machine know that “Mark Williams-Cook” is a Person, that he works at an Organization called “Candour,” and that the string “01603 957068” sitting in his profile is a telephone number (the telephone property) and not, for instance, my weight in grams.

Google’s official documentation puts it about as plainly as Google ever puts anything:

“Structured data is a standardized format for providing information about a page and classifying the page content.” Google also says it uses structured data “to understand the content of the page, as well as to gather information about the web and the world in general, such as information about the people, books, or companies that are included in the markup.”

The whole point of schema is to remove ambiguity. Natural language is messy. “Apple” is a fruit, a company, a record label, and probably the surname of someone’s gerbil. If you tell a search engine in plain English that you sell Apple, it has to guess. If you tell it in schema that you sell an Organization called “Apple Inc.” with sameAs linking to Apple’s Wikipedia page, that ambiguity collapses to nothing. That’s the job. Disambiguation. Explicit clues. Machine-resolvable identity. It is, basically, a polite contract between you and a machine saying, “Let’s both agree what this word means, just this once.”

Where does the ambiguity actually get resolved? In Google’s case, into the Knowledge Graph, the big entity-and-relationships database that powers knowledge panels, “People also ask,” entity carousels, and a hundred other surfaces. Schema is one of the inputs. It is not the only input, and it has never been the only input. But it is a clean, explicit, low-noise one, which is why search engines like it.

Right. That’s what schema does for search engines. Now to LLMs, which are a different animal in nearly every way that matters.

Where, Exactly, Would An LLM Even Use Schema?

There are two camps in the LLM/schema debate, and most arguments collapse into one of them.

Camp 1: Schema is hoovered up during the training of the model and ends up “baked in” somehow.

Camp 2: Schema is read at the moment the LLM live-fetches a page (during retrieval at query time, or via crawls that feed retrieval).

Let’s take them in turn, with appropriate skepticism.

Camp 1: Schema Gets Into Training Data

I’ve written about this before, and it was covered by Search Engine Roundtable last year. The short version is that this is the most popular theory and also the one with the weakest mechanical case behind it. There are two problems, and neither of them is small.

Problem 1: Schema Is Almost Certainly Stripped Before Training

If you have not gone down the rabbit hole of how base LLMs are actually made, Andrej Karpathy’s three-and-a-half-hour deep dive on LLM pre-training is the canonical reference, and yes, three and a half hours is the deal.

Pre-training pipelines do a lot of unglamorous cleaning work before a single GPU sees the data: URL filtering, language filtering, deduplication, removal of personally identifiable information, and crucially, stripping out HTML and boilerplate. The goal is not to preserve the page. The goal is to extract clean prose that helps the model build a useful probability distribution over language. The more noise (markup, navigation, footers, scripts, JSON-LD, your cookie consent banner) you leave in, the worse the resulting model. So they don’t.

The widely used FineWeb dataset (15 trillion tokens, derived from 96 Common Crawl snapshots) is refreshingly explicit. Its pipeline extracts text from the WARC files using trafilatura, a library specifically chosen because it produces “the main page text” with “less boilerplate and menu text” than the alternatives. The data card states: “We then extracted the main page text from the HTML of each webpage, filtered each sample and deduplicated each individual CommonCrawl dump/crawl.” JSON-LD lives in a `<script type="application/ld+json">` element in the page source, not in the main page text.
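To make the stripping step concrete, here is a minimal sketch of main-text extraction. It is not trafilatura and it is not FineWeb's actual pipeline; it is a standard-library stand-in, and the class name, function name, and sample page are all invented for illustration. The only thing it shows is the general behavior: an extractor that keeps text nodes and skips script elements never sees the JSON-LD.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside non-content tags."""

    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the address exists only inside the JSON-LD block.
SAMPLE_PAGE = """<html><head>
<title>Duck shirts</title>
<script type="application/ld+json">
{"@type": "Organization", "address": {"streetAddress": "77 The Muddy Bank"}}
</script>
</head><body><h1>Duck shirts</h1><p>We sell shirts with ducks on.</p></body></html>"""

cleaned = extract_text(SAMPLE_PAGE)
print(cleaned)
print("Muddy Bank" in cleaned)  # the address did not survive extraction
```

Real extractors are considerably more aggressive than this (they also drop navigation, footers, and other boilerplate), so the sketch probably understates how much gets thrown away.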

You might reasonably ask: then how can ChatGPT write schema markup for me when I ask it? Because there are millions of examples of schema in visible prose across the web. Tutorials. Documentation. Forum posts. GitHub repos and Stack Overflow answers. Code blocks in blog posts. The model learns what schema looks like the same way it learns what a Python function looks like, by reading endless explanations of it, written by humans, in paragraphs. The schema on your actual product page, sitting silently in the head of the document, doing its proper job, gets thrown straight out.

Problem 2: Even If It Survived, It Would Not Work The Way You Think

Let’s be generous and stipulate that some non-trivial amount of raw schema does sneak into a model’s training data. We do not have full transparency from the frontier labs about what they ingest, and the courts have not exactly been kind on this point. Meta’s training pipeline is currently being picked apart for allegedly using LibGen, a pirate library of around 7.5 million copyrighted books. If the frontier labs are happy to swallow other people’s novels whole, they are probably not above swallowing the odd JSON-LD block.

Even if this were the case and our precious JSON-LD schema made it into the training data, it would not be unscathed.

Here’s the catch: The model does not memorize pages. It does not have a little filing cabinet labeled “Candour Company Ltd” with the address tucked inside. What actually happens is this:

All the text in the training corpus gets chopped into tokens (chunks of characters, often parts of words).
The model is shown billions of small windows of tokens and asked to predict the next one.
Each time it gets it wrong, billions of tiny numerical weights inside the network are nudged so it will do slightly better next time.
After enough nudging, those weights collectively encode a (lossy, blurry, statistical) impression of which tokens tend to follow which other tokens, in what contexts.
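The steps above can be sketched in miniature. To be clear about what is swapped in: this is a bigram counter, not a neural network, and the counts stand in for gradient-trained weights. The tokenizer, the function names, and the four-line corpus are all made up for illustration. What carries over is the shape of the thing: after "training", all that remains is a table of which token tends to follow which, not a record of any page.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Crude stand-in for a real subword tokenizer: words and punctuation.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def train(corpus):
    # The "weights" here are just counts of which token followed which.
    follows = defaultdict(Counter)
    for document in corpus:
        tokens = tokenize(document)
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(follows, token):
    # Return the most frequently seen follower, or None if never seen.
    candidates = follows.get(token)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


corpus = [
    "the duck sat on the muddy bank",
    "the duck swam to the muddy bank",
    "the duck sat on the pond",
    "our office is at 77 the muddy bank",
]
model = train(corpus)

print(predict_next(model, "muddy"))     # seen three times, always before "bank"
print(predict_next(model, "postcode"))  # never seen, so nothing to recall
```

Note that nothing in `model` says "this is an office address". The one sentence containing it has been dissolved into the same follower counts as the sentences about a duck sitting on a riverbank.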

That is what is stored. Weights. Not facts. Not addresses. Not your postalCode. A glorified probability distribution that has read a great deal and remembers, with the same fidelity as someone trying to recall the lyrics to a song they last heard in 2011, which words usually follow which other words.

[Image: A screenshot of the OpenAI Platform Tokenizer tool on a dark interface, showing how a JSON-LD structured data script is broken down into individual tokens, with a token counter at the top left.] Your lovely schema, being Dahmerfied. Image Credit: Mark Williams-Cook

This is where schema specifically falls apart. The whole point of schema was to take a string like “77 The Muddy Bank” and tag it explicitly as a streetAddress belonging to a PostalAddress belonging to your Organization, so a machine cannot mistake it for anything else. When that JSON-LD is tokenized, the structure dissolves. The string “@type”: “Organization” becomes a sequence of tokens including @, type, :, Organization, completely indistinguishable, to the model, from the same word soup appearing in any blog post about schema. The disambiguation, which was the entire reason for using schema in the first place, is the very first thing thrown out by the very first stage of training. Marvellous.
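A small sketch of that flattening, under a loud assumption: the tokenizer below is a hypothetical regex splitter, not the byte-pair encoder any real model uses. Real tokenizers cut the text in different places, but they are equally flat, which is the only property this example leans on. The two sample strings are invented.

```python
import re


def tokenize(text):
    # Hypothetical tokenizer: runs of word characters, or single symbols.
    return re.findall(r"\w+|[^\w\s]", text)


def contains_run(tokens, run):
    # True if `run` appears as a contiguous slice of `tokens`.
    n = len(run)
    return any(tokens[i:i + n] == run for i in range(len(tokens) - n + 1))


# Real markup doing its job on a product page...
json_ld_on_product_page = '{"@type": "Organization", "name": "Candour"}'
# ...and a sentence from a tutorial that merely talks about markup.
sentence_in_blog_post = 'Set "@type": "Organization" to describe a company.'

marker = tokenize('"@type": "Organization"')

print(marker)
print(contains_run(tokenize(json_ld_on_product_page), marker))
print(contains_run(tokenize(sentence_in_blog_post), marker))
```

Both checks come back true: the token run is identical whether it came from working markup or from prose about markup, and nothing in the sequence records which one it was.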

Worse still, an LLM only “recalls” a fact if it has seen it many, many times. A single mention of your address on a single product page is a vanishingly small drop in a fifteen-trillion-token bucket. Even if it survived ingestion, you would also need the model to encounter your streetAddress enough times that those particular weights actually settle into a useful pattern. For more than 99.99% of businesses, that does not happen. The fact is not stored. It will not be recalled. You are paying a consultant to whisper your postcode into a hurricane.
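For a rough sense of scale, here is the back-of-envelope arithmetic. The corpus size is the FineWeb figure quoted earlier; the 20-token address length is purely an assumption for illustration, and real tokenizers will give a somewhat different count.

```python
CORPUS_TOKENS = 15e12   # FineWeb-scale corpus, per the figure quoted earlier
ADDRESS_TOKENS = 20     # assumed length of one tokenized address
MENTIONS = 1            # a single product page

# Fraction of the whole corpus taken up by that one address.
share = (ADDRESS_TOKENS * MENTIONS) / CORPUS_TOKENS
print(f"{share:.2e} of all training tokens")
```

That is on the order of one part in a trillion. Changing the assumed address length by a factor of a few does not move the conclusion.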

So, if you are buying the “schema gets baked into the model” theory, you are buying three improbabilities in a trench coat: that it survives pre-training cleaning, that it survives tokenization with its structure intact, and that it gets repeated often enough across the web for the model to actually “learn” it. None of the three is obviously true.

Camp 2: Schema Gets Read At Query Time

In my experience, it is rare for LLM/schema proponents to want to discuss training data involvement once it has been gently set on fire. The argument tends to move quickly onto the possibility that schema is not in the model itself, but is read at the moment a user asks a question, when the LLM fetches the page in real time. Let’s examine the three flavors of this argument in increasing order of confidence and distressing degree of inaccuracy.

Flavor 1: “Schema Feeds The Knowledge Graph”

Google’s Knowledge Graph is a huge, curated, slow-moving database of entities and relationships. It is fed by structured data, Wikipedia, Wikidata, Freebase legacy data, and a hundred other signals. It is built and updated by Google’s pipelines on Google’s schedule. It is not assembled on the fly when someone types a question, no matter how fast they type.

The notion that an LLM “builds a knowledge graph in real time when pages are fetched” sounds a lot less reasonable when you say it out loud into the mirror. Knowledge graphs are constructed things. They have IDs. They have relationship cardinality rules. They have to be reconciled against existing entries, so you do not end up with three drifting “Apple Inc.” nodes filing different tax returns. None of that happens between a user pressing enter and the answer appearing on screen. It cannot. There is not enough time, and there is no infrastructure exposed in the chatbot product to do it.

So if an entity-resolution pipeline exists at any of the frontier labs, it is being built upstream, on a similar cadence to Google’s, and not during your conversation. Which is fine, but it does not match the breathless claim that “your schema feeds the LLM’s brain”. Conceptually, the strongest version is closer to “your schema may eventually feed a curated database that the LLM might one day consult”. Which is a much weaker claim, and one for which there is, at present, no public evidence at all.

Flavor 2: “Microsoft Confirmed Schema Feeds Copilot”

Misquoted on an industrial scale, Search Engine Land’s write-up ran under the headline “Microsoft Bing/Copilot use schema for its LLMs,” in which Fabrice Canel of Microsoft was reported to have “confirmed” that schema markup helps Microsoft’s LLMs. Cue half of LinkedIn pasting the headline as proof, often without troubling the body copy.

If you read the actual quote, it is about IndexNow:

“Gen AIs value fresh content especially, partly as a reference check of their LLM training data. Use the API at indexnow.org to push that information as it’s published or updated.”
~ Fabrice Canel

It is “your page changed, here is its new state, please come look”. Fabrice was making a point about freshness (telling search engines when your content has changed so they can update their understanding) and not a point about JSON-LD being deferentially parsed by GPT-flavored systems. Conflating the two is a textbook example of the industry’s favorite parlor trick: Take a careful claim about one thing, sand the edges off it, and resell it as a bold claim about something else entirely.

Flavor 3: “LLMs Return Information That Was Only In The Schema, Therefore They Use Schema”

This is the one that prompted the experiment. It is also the single most-cited piece of “evidence” in GEO LinkedIn posts, and the most easily falsified if you spend half a day thinking about it.

I built a deliberately silly test page about a fictional duck T-shirt company called DUCK YEA at i83.uk/duckyea.html. The visible content of the page mentions no address. Tucked into the head of the HTML, inside a `<script type="application/ld+json">` tag, was this:

    {
        "@context": "http://api.the-great-pond.web/schema",
        "@type": "MallardEnterprise",
        "flockName": "DUCK YEA T-SHIRTS",
        "waddleStyle": "Aggressive",
        "nestingGrounds": {
            "@type": "LilyPadAddress",
            "reedNumber": "77",
            "puddle": "The Muddy Bank",
            "area": "South Pondshire",
            "featherCode": "DK99 YEA",
            "nation": "United Queendom"
        },
        "migrationPattern": "Non-Migratory",
        "quackVolume": "Loud"
    }

A few things to notice. The @context is a made-up URL that does not resolve to anything (the great pond, sadly, has no API). The @type is not a valid Schema.org type. Not a single one of the properties (flockName, waddleStyle, nestingGrounds, reedNumber, puddle, featherCode, quackVolume) exists in the Schema.org vocabulary. The JSON is syntactically valid JSON, but as far as Schema.org is concerned, this is unmitigated nonsense, the digital equivalent of someone speaking French very loudly while only knowing the words for “cheese” and “weasel”. A well-behaved schema-aware parser should look at this, sigh, and ignore it.
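Here is roughly what "look at this, sigh, and ignore it" could mean in practice. This is a sketch, not how any search engine or LLM vendor validates markup: the `KNOWN_TYPES` table is a tiny hand-picked subset of the real Schema.org vocabulary, the `validate` function is hypothetical, and the duck markup is abbreviated from the page's JSON-LD.

```python
import json

# Assumption: a tiny hand-picked subset of the real Schema.org vocabulary.
# A real validator would load the full set of types and properties.
KNOWN_CONTEXTS = {"http://schema.org", "https://schema.org"}
KNOWN_TYPES = {
    "Organization": {"name", "url", "telephone", "sameAs", "address"},
    "PostalAddress": {"streetAddress", "addressLocality", "addressRegion",
                      "postalCode", "addressCountry"},
}


def validate(node, top_level=True):
    """Return a list of complaints about one JSON-LD node and its children."""
    problems = []
    if top_level:
        context = node.get("@context")
        if not isinstance(context, str) or context.rstrip("/") not in KNOWN_CONTEXTS:
            problems.append(f"unknown @context: {context}")
    node_type = node.get("@type")
    allowed = KNOWN_TYPES.get(node_type)
    if allowed is None:
        problems.append(f"unknown @type: {node_type}")
        allowed = set()
    for key, value in node.items():
        if key.startswith("@"):
            continue  # JSON-LD keywords, not vocabulary properties
        if key not in allowed:
            problems.append(f"unknown property: {key}")
        if isinstance(value, dict):
            problems.extend(validate(value, top_level=False))
    return problems


duck = json.loads("""{
  "@context": "http://api.the-great-pond.web/schema",
  "@type": "MallardEnterprise",
  "flockName": "DUCK YEA T-SHIRTS",
  "nestingGrounds": {"@type": "LilyPadAddress", "reedNumber": "77",
                     "puddle": "The Muddy Bank", "featherCode": "DK99 YEA"}
}""")

duck_problems = validate(duck)
for problem in duck_problems:
    print(problem)
```

Every key in the duck markup produces a complaint. A consumer that treated schema as schema would have nothing left to extract an address from once these were discarded.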

I then asked ChatGPT and Perplexity, “what is the address of this company?”, pointing at the URL.

Both happily returned: Reed Number 77, The Muddy Bank, South Pondshire, DK99 YEA, United Queendom.

Perplexity even helpfully volunteered that it had found the answer “in the page’s embedded structured data,” with the satisfied air of a student who had clearly read the prescribed material. Neither of them flinched at the fact that none of the schema was real, because (and this is the entire point of the exercise) they were not parsing it as schema. They were doing what LLMs always do: Reading the visible-ish text of the page, picking out the bit that looked like an address, and presenting it. The JSON-LD wrapper was, to the model, just slightly weirdly punctuated prose. If I had wrapped the address in tags and surrounded it with duck emoji, it would have made precisely no difference.

If LLMs were genuinely parsing JSON-LD with any reverence for the Schema.org vocabulary, my made-up types and properties would have been rejected, or at the very least flagged. They were not. The information was simply lifted straight out of the HTML, dusted off, and served up with confidence. Quack. 🦆

In the interest of not committing the exact sin I am accusing the GEO crowd of: the duck experiment proves that LLMs returned content from a JSON-LD block with a made-up @context, a made-up @type, and no real Schema.org properties. What it does not, on its own, prove is that LLMs ignore schema entirely. A system that consulted schema and fell back to text extraction would produce the same answer here.

If you run the same query today, you get a slightly different result:
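That limitation is easy to show with two toy readers. Both are hypothetical: `text_only_reader` is a crude stand-in for a model picking address-looking strings out of raw HTML, and `schema_first_reader` is an imagined system that trusts JSON-LD only when the type is one it recognizes and otherwise falls back to the text. Neither is a claim about how any real product works, and the page and @context URL are invented.

```python
import json
import re

# Hypothetical page with nonsense JSON-LD and no address in the body.
PAGE = """<html><head><script type="application/ld+json">
{"@context": "http://example.invalid/schema", "@type": "MallardEnterprise",
 "nestingGrounds": {"reedNumber": "77", "puddle": "The Muddy Bank",
                    "featherCode": "DK99 YEA"}}
</script></head><body><p>We sell duck T-shirts.</p></body></html>"""

ADDRESS_PARTS = ["77", "The Muddy Bank", "DK99 YEA"]


def text_only_reader(html):
    # Treats the whole document as plain text; no JSON parsing at all.
    return [part for part in ADDRESS_PARTS if part in html]


def schema_first_reader(html):
    # Parses JSON-LD, uses it only if the type is recognized,
    # and otherwise falls back to reading the page as text.
    match = re.search(r'<script type="application/ld\+json">(.*?)</script>',
                      html, re.DOTALL)
    if match:
        data = json.loads(match.group(1))
        if data.get("@type") == "Organization" and "address" in data:
            return list(data["address"].values())
    return text_only_reader(html)


print(text_only_reader(PAGE))
print(schema_first_reader(PAGE))
```

Both readers print the same address, so a test page like this one cannot tell them apart. What it can rule out is a reader that uses schema strictly and has no text fallback, since that reader would have returned nothing.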

[Image: A screenshot of a chat interface showing a user prompt and an AI's response on a dark background. The user's text bubble reads: "can you tell me the address of this company? what is the address of this company? https://markwilliamscook.com/duckyea.html" The AI's response text reads: "The website you linked is a joke/test page created by SEO expert Mark Williams-Cook as an experiment to test how Large Language Models (LLMs) and search engines parse structured data. While there is no physical address visible on the webpage itself, hidden inside the page's source code (schema markup) is a fictional address: Reed Number 77, The Muddy Bank, South Pondshire, DK99 YEA, United Queendom"] LLMs now get the answer ‘correct’, without ever knowing it was wrong. Image Credit: Mark Williams-Cook

The model now (correctly) flags that this is a test page made by some SEO bloke, charmingly demonstrating the AI Convergence Problem doing its thing in real time: Enough people have written about the experiment that “DUCK YEA is a joke page by Mark Williams-Cook” is now getting pulled in during RAG, and the consensus answer has overwritten what would otherwise be a clean test. The address is still being read from the HTML, schema validity be damned. The model has just learned to caveat it. Which is, in a small and slightly bleak way, progress.

Conjecture: Could LLMs Be Using Schema, Somehow, Somewhere?

The honest answer is that we do not know what is happening upstream at OpenAI, Anthropic, Google DeepMind, xAI, and the rest, because they are not telling. Google itself is a sprawl of separate systems (the index, re-rankers, glue, the Knowledge Graph, AI Overviews, AI Mode) which all work together to produce what looks, from the outside, like a single coherent answer, and on a good day, actually is one. There is no reason in principle why an LLM provider could not run an entity-extraction pipeline against the web, build its own entity store, and consult it at answer-generation time. That is conceptually adjacent to how retrieval-augmented generation (RAG) works, and it is the kind of thing you would absolutely build if you were OpenAI and you wanted to stop your model confidently inventing the wrong CEO.

If they are doing that, schema is an excellent and obvious input. It is explicit, structured, low-noise, and already widely deployed. It would be daft for them not to use it.

But here is the big “but.” We have no published evidence, no leaked papers, no public confirmation, and no behavioral test results showing that any frontier LLM is actually doing this yet. Reasoning forward from “they probably should” to “therefore schema is worth £20k of consultancy this quarter” is exactly the kind of fact-light, vibe-heavy thinking that the discourse needs less of. Make the case, by all means. But label it conjecture, not evidence. Use a different font.

Google Still Hasn’t Solved This Problem Reliably

There is also a slightly awkward elephant standing quietly in the corner of the room. If anyone on earth were going to crack the “feed an entity-resolved knowledge graph into an LLM’s answer pipeline” problem first, it would surely be Google. It has over a decade’s head start on entity extraction. It has the Knowledge Graph. It has Google Business Profile, which is a user-edited, structured, ostensibly authoritative database of business information. It owns the model (Gemini). It owns the surface (AI Overviews). It owns the search index that wraps around it. Every page in the world eventually walks past one of its crawlers. If joining structured business data to LLM output is supposed to be the obvious next step in the human story, Google has every conceivable advantage in being the one to demonstrate it.

And but:

Google contradicting itself in spectacular fashion. Image Credit: Mark Williams-Cook

That is a single Google search results page. On the left, Google’s AI Overview confidently asserts that Perrys Dover Mazda is “not closed,” lists the address, and helpfully provides opening hours, presumably so you can pop down and have a look at the cars that are not there. On the right, on the same page, the Google Business Profile knowledge panel for the very same business is labeled “Permanently closed” in a large, unambiguous red banner. Google Business Profile data is structured. It is user-edited. It is the closest thing Google has to a verifiable, authoritative source on whether a business is, in fact, open. And the AI Overview, generated on the same SERP, by the same company, in the same session, is not consulting it. They are two organs of the same body that have not been on speaking terms for some time.

If the company with the longest possible head start, the most structured data, the most obvious commercial incentive, and full vertical integration over every part of the stack cannot reliably wire its own business-hours database into its own AI answers, the idea that OpenAI or Anthropic has quietly built a richer entity pipeline that does defer to your Organization schema is, let us say, optimistic.

So … Should You Still Use Schema?

Yes. Just for the right reasons and the right price.

Schema is, in the grand scheme, still a stopgap. It exists because the technology cannot yet reliably read human language without ambiguity, and structured data is how we paper over the gap while the engineers work out how to read English properly. Gary Illyes from Google, speaking at an SEOFOMO meetup in 2025, pointed out (paraphrasing) that it would be lovely if Google did not have to rely on schema at all, because in an ideal world, the systems would simply understand the page. Schema buys you a bit of certainty in the meantime, which is worth something even if it is not worth the consultancy invoice you have been quoted.

The recent Ahrefs study, which tracked 1,885 cited pages that newly added JSON-LD and matched them against 4,000 controls, found that schema had essentially no effect on AI citations across ChatGPT, AI Mode, and AI Overviews. That sounds damning, and various LinkedIn carousels are already enjoying themselves accordingly. But as Gianluca Fiorelli pointed out in his excellent critique, the study tested pages that were already being cited heavily by AI (every page in the dataset had 100+ AI Overview citations before treatment). That is the worst possible population to test schema on, because these are already strong, well-understood entities. Schema’s job is to disambiguate. If the system can already resolve who you are with high confidence, adding Organization schema is fixing a problem the page does not have. You do not introduce yourself by name to your own mother.

The interesting case, and the one nobody has properly tested, is new and challenger brands, where the entity footprint across the web is thin, and the system cannot yet confidently say “this company is the company you mean.” For these, schema is infrastructure. It is how you become a resolvable node in the graph in the first place. It does not buy you a citation today. It earns you the right to be one of the candidates tomorrow, which, in a world where being a candidate is suddenly the only game in town, is no small thing.

Takeaways

A few practical thoughts, dressed down for tactical use:

Still use schema. The implementation cost is low, the downside is essentially nil, and the upside is cumulative. If schema does end up being meaningfully ingested at any stage of the LLM stack (and it might), the work is already done, and you can be smug about it. Free smugness is the best kind.
Stop selling schema as a magic LLM citation lever. The current public evidence for LLMs using schema “as intended” at query time is, frankly, weak. Anyone telling a client otherwise should be politely asked to show their working, in front of other people, with a whiteboard.
Be ruthless about the bar of evidence. “An LLM returned a fact that appears in the schema” is not evidence the schema was used. The same fact almost always appears in the HTML, the metadata, the page title, the social card, or somewhere a token predictor would gleefully pick it up. The duck experiment matters precisely because the schema was invalid and the LLMs returned the answer anyway. If your “evidence” survives that test, talk to me. If it does not, please stop putting it on slides.
Focus schema investment where disambiguation actually matters. New brands. Brands with name collisions. Organizations without a knowledge panel. Personal entities that overlap with other people who share their name and have been more famous for longer. That is where the asymmetric upside lives.
Treat “GEO best practice” the way you would treat any other new SEO orthodoxy. Skeptically, with experiments, and with a willingness to revise the position when the evidence changes. The car-wash-grade reasoning on LLMs, where the popular answer just gets repeated until it sounds true, is alive and thriving in our industry too.

Schema is a useful, low-cost, long-lived bet. It is also not the thing that is going to single-handedly drag your brand into ChatGPT’s answer set. Use it. Just do not oversell it. And for the love of god, before you build a deck around “LLMs returned the content from schema, therefore they use schema”, run the experiment with a deliberately nonsense schema first. You might be surprised what the duck tells you.
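If you want to run that experiment yourself, one way to set it up is with three versions of the same page: valid schema, nonsense schema, and no schema at all. The sketch below only builds the HTML; hosting the pages and querying the models is left to you. Everything in it (the dictionaries, the `build_page` helper, the example.invalid context URL) is illustrative rather than a copy of the original test page.

```python
import json

ADDRESS = {"street": "77 The Muddy Bank", "code": "DK99 YEA"}

# Variant 1: markup a Schema.org-aware consumer should accept.
VALID = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "DUCK YEA T-SHIRTS",
    "address": {"@type": "PostalAddress",
                "streetAddress": ADDRESS["street"],
                "postalCode": ADDRESS["code"]},
}

# Variant 2: same facts, wrapped in a made-up vocabulary.
NONSENSE = {
    "@context": "http://example.invalid/schema",
    "@type": "MallardEnterprise",
    "flockName": "DUCK YEA T-SHIRTS",
    "nestingGrounds": {"@type": "LilyPadAddress",
                       "puddle": ADDRESS["street"],
                       "featherCode": ADDRESS["code"]},
}

# The visible body never mentions the address.
BODY = "<h1>DUCK YEA</h1><p>T-shirts for ducks. No address here.</p>"


def build_page(json_ld=None):
    script = ""
    if json_ld is not None:
        script = ('<script type="application/ld+json">'
                  + json.dumps(json_ld) + "</script>")
    return f"<html><head>{script}</head><body>{BODY}</body></html>"


variants = {
    "valid_schema": build_page(VALID),
    "nonsense_schema": build_page(NONSENSE),
    "no_schema": build_page(),  # Variant 3: the control
}

for name, html in variants.items():
    print(name, ADDRESS["street"] in html)
```

If a model returns the address for the valid and the nonsense variants alike, the result tells you it read the text, not that it used schema as schema. Use fresh, unpublicized URLs and a made-up brand, or the write-ups about your own test will contaminate the answer, as happened with the duck.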


This post was originally published on Mark Williams-Cook’s Substack.

Featured Image: Roman Samborskyi/Shutterstock
