How ChatGPT Actually Picks Sources

I keep getting the same question from clients and SEOs (GEOs?).

“How do we show up in ChatGPT?”

The answer is always the same. Write good content, do listicles, comment on Reddit.

The usual.

But how do we actually know any of that works? Most of it gets repeated on faith, one expert quoting the last.

So, instead of taking it on trust, I spent a few days reading what ChatGPT sends my browser underneath the answer. The raw network traffic, in readable JSON.

This is a walk-through of what I found, roughly in the order I found it.

Before you quote a number from this, read this. It’s one person, one logged-in Pro account, a few days of traffic, not a population study. I logged about 1,240 source records across a few dozen searches. The structural findings, the fields ChatGPT uses and how they behave, are firm, because you only need to see a field once to know it’s real, and I saw them repeatedly. The numbers and percentages are a different matter. They come from a small batch of mostly SaaS and tech queries, so treat them as direction, not measurement. I flag which is which throughout.

How This Differs From The Big Visibility Studies, And What You Can Take To The Bank

There are two ways to do this kind of study, and they point in opposite directions.

The big studies, the ones the platforms and the well-funded tools run, fire thousands of prompts, record which brands appear in the answers, and roll that up into share-of-voice reports. Large sample, but black box. They only ever see the finished answer, so they have to infer the machinery underneath from the output.

This is the other way round. I read the network traffic, the JSON the engine sends to my own browser, and lift out the engine’s own internal labels: the result_source it stamps on each result, the turn_use_case it files each query under, the vendor names, the search queries it wrote, the model it actually ran. I’m not measuring how often something happens across a population. I’m documenting that the machine has a thing, and what the machine calls it.

That difference decides what you can trust here, so I’m going to be blunt about it.

2 Confidence Levels, Do Not Mix Them Up

Structural Facts (High Confidence)

That result_source exists and carries serp, labrador, bright, oxylabs. That bright is Bright Data and oxylabs is Oxylabs. That there are six turn_use_case values. That text queries skip the web entirely. That Thinking fires dozens of site: and price-verification sub-queries. These are read straight off the wire. One clean capture proves a field exists and what it’s named, and a prompt-based study, however big, can’t see any of it.

Frequency Observations (Directional Only)

Anything with a percentage or a ranking, “70% bright,” “Reddit is the most cited domain,” “YouTube never gets cited,” comes from tens of queries on a single account, and my own query choice skews it. I picked SaaS and tech, which is exactly why Reddit and the tech review hubs lead here; a batch of health or fashion queries would crown different ones. Read these as the shape of the thing, not the measurement. Where a direction has a mechanical reason behind it (Reddit is text so it gets quoted, YouTube is video (metadata) so it doesn’t), trust the direction and ignore the exact number.

First, The Boring Truth About ‘Packet Analysis’

Skip this section if you don’t want to get into nitty-gritty technical details.

My first instinct was wrong. You cannot sniff packets and read queries, because the payload is TLS-encrypted, so a capture hands you scrambled ciphertext for the actual messages. What the capture does leak is the metadata.

The destination hostname, the IPs, and the fact that the ChatGPT app talks over QUIC (HTTP/3), not plain TCP. That is why, in the screenshot below, Wireshark can still show “openai” in the handshake. It reads the unencrypted server name, not the conversation. QUIC obfuscates its first packet with fixed keys from the spec, so a tool can unwrap that opening packet to show the ClientHello.

Image Credit: Suganthan Mohanadasan

The real request and response bodies sit in later protected payloads that stay unreadable. So the readable layer is the browser itself, after decryption, in the Network panel.

That’s where the queries, the answers, and all the metadata live as JSON.

This is HTTP inspection, not packet sniffing, and it’s worth saying because half the people who try this start with Wireshark and give up. (I know I did lol.)

Two things that didn’t work, so you don’t repeat them.

Driving a clean automated Chrome got me hard blocked by Cloudflare within a few queries on a different engine: the “verifying you are human” wall just loops forever in an automated browser, so I moved to my real Chrome with my real sessions.
On ChatGPT, the answer never showed up in my capture at first, because it streams over a long-lived connection opened at page load that a hook installed mid-session can’t see. More on both later.

The Field That Labels Every Source

I opened DevTools, turned on Preserve log, ran a normal query, and searched the responses for anything that looked like a label.

The field that came back was result_source. It sits on every web result ChatGPT pulls; you never see it in the answer, and it takes one of four values.

Mark Williams-Cook shared that he had found three of these; I came across the fourth. I then saw Metehan’s post, and it looks like he may have already found it too. But honestly, this isn’t really about who found what first. It’s more about sharing what we’re seeing, comparing notes, and learning from each other.

Here’s one source from the traffic, trimmed to the fields that matter.

{
  "attribution": "TechRadar",
  "url": "https://www.techradar.com/best/...",
  "snippet": "...",
  "pub_date": "2026-05-09",
  "result_source": "labrador"
}

The four values it uses:

`result_source`	What it is
`serp`	The open web baseline, mostly seen on news (Yahoo, StreetInsider)
`labrador`	An allowlist of established publishers: Reuters, The Guardian, the WSJ, the FT, Wikipedia, even arXiv. Snippets run to ~1,080 characters, basically full-article extracts
`bright`	Bright Data, a commercial web scraper. Dominant for shopping, finance, weather, local
`oxylabs`	Oxylabs, a rival scraper. Regional and local press, some open web

labrador looks like a licensed tier: several of those publishers have signed content deals with OpenAI, and it isn’t one you get into unless you own a national newspaper.

bright and oxylabs are the interesting pair. The names point at Bright Data and Oxylabs, two commercial scraping firms that happen to be direct rivals. I can’t see a contract in the traffic, so I won’t claim ChatGPT pays them, but its open web fetching runs through both, and the field tells you which one fetched each result. (We’ve been Oxylabs customers for a long time for our SaaS Keyword Insights.)

Across everything I logged, bright did the bulk of the fetching, especially on commercial, shopping, finance and weather queries. oxylabs skewed regional and local, labrador stayed on news and reference, and serp mostly turned up on news. To put names to the tiers, labrador carried Reuters, the WSJ, Wikipedia and TechRadar, bright pulled Reddit, Forbes and rtings, and oxylabs brought the Gulf press like Khaleej Times and Gulf News.

I even caught the split inside one weather query, bright taking the global data sites like the Met Office while oxylabs handled the local Gulf press. (I live in Dubai, by the way.) In that one query, the breakdown came out like this.

Source Pipeline

metoffice.gov.uk bright
accuweather.com bright
timeanddate.com bright
khaleejtimes.com oxylabs
gulfnews.com oxylabs
whatson.ae oxylabs
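
Once you log more than one query, grouping the rows by pipeline is easier than eyeballing them. This is a minimal sketch, assuming each logged row is an object with a source and a pipeline property. Those property names and the groupByPipeline helper are hypothetical, for illustration only; the pipeline strings are the result_source values from the traffic.

```javascript
// Group logged sources by the pipeline (result_source value) that fetched them.
// The { source, pipeline } row shape is an assumption made for this sketch.
function groupByPipeline(rows) {
  const groups = {};
  for (const { source, pipeline } of rows) {
    if (!groups[pipeline]) groups[pipeline] = [];
    // List each domain once per pipeline, even if it was fetched repeatedly.
    if (!groups[pipeline].includes(source)) groups[pipeline].push(source);
  }
  return groups;
}

// The weather query breakdown, typed in by hand.
const weatherRows = [
  { source: 'metoffice.gov.uk', pipeline: 'bright' },
  { source: 'accuweather.com', pipeline: 'bright' },
  { source: 'timeanddate.com', pipeline: 'bright' },
  { source: 'khaleejtimes.com', pipeline: 'oxylabs' },
  { source: 'gulfnews.com', pipeline: 'oxylabs' },
  { source: 'whatson.ae', pipeline: 'oxylabs' },
];

console.log(groupByPipeline(weatherRows));
```

The result is one array of domains per pipeline, which makes a split like the global-versus-Gulf one easy to spot across many queries.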

The AI SEO/GEO Takeaway

You’re mostly competing in the scraped tier, so be cleanly scrapable. Put your facts and numbers in plain HTML text, never behind a script or inside a PDF or an image. The licensed tier is mostly closed, so the lever you’ve got is third-party coverage, PR, brand mentions, links, and Reddit, to land on the pages the scrapers actually reach.

The Queries That Never Reach The Web

The next thing I noticed was that some queries produced no network search whatsoever. Before ChatGPT searches, it files your query into a bucket, in a field called turn_use_case. I saw six of them across the questions I tried: instant search, shopping, text, local, thinking, and image generation.

The one to care about is text. When ChatGPT files your question as text, it doesn’t search. It answers from its training corpus and stops.

The obvious cases end up here: “how do I change a flat tyre,” “write a Python function to merge two sorted lists,” and “translate this into 4 languages” all came back text with an empty network tab.

The one that should worry you is that “latest treatment guidelines for type 2 diabetes” also came back text, a current, high-stakes question you’d assume it researches. It didn’t; it answered from training. No E-E-A-T here. Oops!

Of 10 deliberately current questions I tried, three were handled this way with no search at all.

The wording decides the bucket, not the topic.

“best coffee near me” flips to the local pipeline, “best 4K TVs to buy” triggers shopping, but “best 4K TVs with reviews” stayed a normal search.

A maths question quietly jumped to a reasoning model under thinking, while “Tesla stock price this week” stayed instant search.

Bear in mind, these are results from my limited testing. I’ll do more tests when I find some more time.
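
If you keep a log of your own test queries, a short tally shows which ones never searched. This is a rough sketch under stated assumptions: the log shape, the turnUseCase property name and the exact label strings ('text', 'local', 'shopping', 'instant_search') are placeholders I made up for illustration, so check them against the turn_use_case values in your own capture.

```javascript
// Tally logged queries per bucket and list the ones filed as text,
// which are the ones answered without any web search.
// Log shape and label strings are assumptions, not confirmed values.
function summariseUseCases(log) {
  const counts = {};
  const noSearch = [];
  for (const { query, turnUseCase } of log) {
    counts[turnUseCase] = (counts[turnUseCase] || 0) + 1;
    if (turnUseCase === 'text') noSearch.push(query);
  }
  return { counts, noSearch };
}

const sampleLog = [
  { query: 'how do I change a flat tyre', turnUseCase: 'text' },
  { query: 'latest treatment guidelines for type 2 diabetes', turnUseCase: 'text' },
  { query: 'best coffee near me', turnUseCase: 'local' },
  { query: 'best 4K TVs to buy', turnUseCase: 'shopping' },
  { query: 'Tesla stock price this week', turnUseCase: 'instant_search' },
];

console.log(summariseUseCases(sampleLog));
```

The noSearch list is the one to read first: those are the queries where no page of yours could have been fetched at all.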

The AI SEO/GEO Takeaway

Before you spend a penny on a page, check the query even searches. If it’s a how-to or a definition, it may be answered from training, where no page can get in, however good it is. Spend your effort where it actually fetches.

If you want to be mentioned for such queries, you’d need to spend a lot of time building authority and wait for your brand to be included in future training data. (For example, make sure crawlers like Common Crawl can see your site.)

How One Question Fans Out Into Dozens Of Searches (Fan-Out Queries)

ChatGPT also exposes the searches it runs for you, if you pull the full conversation back from its own API. On the fast model, it’s minimal: one reworded query and done, maybe optimized for speed over depth. On the thinking model, asked to compare several products, it ran roughly 15 to 40 sub-queries off the single question. (The number depended on the complexity of the question.)

Here’s a slice of what it actually ran for one compare job.

"Profound AI search visibility pricing AI engines tracked 2026"
"AthenaHQ pricing AI search visibility tool"
"site:peec.ai/pricing Peec AI Starter Professional Advanced 50 prompts 150 prompts"
"Peec AI pricing $95 $245 $495 official" (a guessed price, then searched to confirm)
"Scrunch AI pricing" (not in my prompt, found mid-research)
...around 40 of these for one comparison

Three things stand out in there. It fires site: probes straight at vendor pricing pages.

It guesses a price and then searches to confirm it. And it keeps widening as it goes, picking up tools you never named and chasing their pricing, too.
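
With 15 to 40 sub-queries per question, it helps to flag the two patterns automatically. The sketch below is a simple regex pass over a list of query strings; classifyFanOut is a hypothetical helper name, and the two patterns are my own rough heuristics rather than anything ChatGPT exposes.

```javascript
// Flag fan-out queries that use a site: operator or carry a price figure.
// Both regexes are heuristics written for this sketch.
function classifyFanOut(queries) {
  return queries.map((query) => ({
    query,
    siteProbe: /(^|\s)site:\S+/.test(query),   // e.g. site:example.com/pricing
    priceCheck: /[$€]\s?\d/.test(query),        // e.g. $95 or € 89
  }));
}

const fanOut = [
  'Profound AI search visibility pricing AI engines tracked 2026',
  'site:peec.ai/pricing Peec AI Starter Professional Advanced 50 prompts 150 prompts',
  'Peec AI pricing $95 $245 $495 official',
  'Scrunch AI pricing',
];

console.table(classifyFanOut(fanOut));
```

Rows with siteProbe set tell you which URLs the model aimed at directly; rows with priceCheck set are the guess-then-confirm searches.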

It doesn’t only search either; the page-reading is just as literal. It ran find for $, €, 99 and even “Company,” then used the browsing tool’s own open and click commands to pull up the results it wanted, run server-side, not an agent on your screen.

The same happens to your own site. Ask it “keyword insights pricing,” and it runs a site:keywordinsights.ai/pricing probe, guesses something like “Starter $58, Professional $145, Advanced $299,” then opens the page and reads the HTML for the currency symbol to confirm.

The AI SEO/GEO Takeaway

Put your key numbers and data in plain HTML text, never inside an image, because in this case with pricing it greps the page for $ and € and can’t read a graphic. Also, you need to make sure you survive a site:yourdomain.com/pricing probe in this use case and write for the cleaned-up query it actually runs, not the messy phrase a person types. Avoid JavaScript-based toggles and dynamic data loading.
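
You can run a crude version of that check on your own pricing page. The sketch below takes the raw HTML as served (before any JavaScript runs), strips scripts, styles and tags with regexes, and looks for currency-prefixed numbers in what is left. It is an approximation under my own assumptions, not a reproduction of what ChatGPT’s tool does; visiblePrices is a hypothetical helper and regex stripping is no substitute for a real HTML parser.

```javascript
// Rough check: which prices appear in the static text of a page?
// Pass in the raw HTML response body, not the rendered DOM.
function visiblePrices(html) {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop inline scripts
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop stylesheets
    .replace(/<[^>]+>/g, ' ');                   // strip remaining tags
  return text.match(/[$€]\s?\d[\d,.]*/g) || [];
}

// A price written into the markup is found.
const staticPage = '<table><tr><td>Starter</td><td>$58</td></tr></table>';
// A price that only exists inside a script is not.
const scriptedPage = '<div id="pricing"></div><script>render({ starter: "$58" })</script>';

console.log(visiblePrices(staticPage));
console.log(visiblePrices(scriptedPage));
```

If the second case describes your page, the number is effectively invisible to anything that reads the HTML for a currency symbol.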

Fetched, Cited, And Mentioned Aren’t The Same

This is the distinction people muddle most, so it’s worth being exact. Three different things can happen to a source.

Fetched. The model pulls your page into context. This is the result_source object. The reader never sees it.
Cited. It attaches your page as the source behind a specific sentence, the footnote you can click.
Mentioned. Your brand name appears in the answer, often as a chip linking to your site, but it isn’t the source of the claim.

They’re three separate outcomes, and you can win or lose each on its own.

To see the gap between them, I took a batch of commercial and recommendation queries and split what ChatGPT fetched from what it cited.

This is the small, tech-skewed sample, so read what follows as a pattern, not a number to bank on.

Across that batch, Reddit and YouTube were both fetched heavily, 278 and 201 times. But Reddit was cited 11 times and YouTube not once.

I think the reason is mechanical. A citation has to bind to text the model actually pulled, and when it fetches a YouTube page in search, it gets the metadata, not the actual video transcript.

A Reddit thread is all there in the page. This isn’t just my sample either. Ahrefs, across 1.4 million ChatGPT prompts, found Reddit cited at 1.93% against YouTube’s 0.51%, and Profound found the same gap.
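
To put the fetched-versus-cited gap on one scale, you can divide citations by fetches. A small sketch using the counts from my batch; citationRate is a hypothetical helper, and the ratio is rough because a fetch and a citation are not strictly the same unit.

```javascript
// Share of fetches that ended up as a citation. Returns null when nothing
// was fetched, so a zero denominator never produces NaN.
function citationRate(fetched, cited) {
  if (fetched === 0) return null;
  return cited / fetched;
}

const batch = [
  { domain: 'reddit.com', fetched: 278, cited: 11 },
  { domain: 'youtube.com', fetched: 201, cited: 0 },
];

for (const { domain, fetched, cited } of batch) {
  const rate = citationRate(fetched, cited);
  console.log(domain, (rate * 100).toFixed(1) + '%');
}
```

Same caveat as everywhere else: on a sample this small the ratio shows direction, not a benchmark.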

A few other patterns, same caveat on sample size. Reddit was the single most-cited domain, narrowly, and after that no one ran away with it. The citations spread thin across review hubs like rtings and TechRadar and vendor pages cited for their own specs.

Here’s the top of the cited list across that batch.

Vendor pages get cited too, but for their own facts, the pricing and specs. Zoho, Semrush, and the VPNs earned citations that way. The verdict on which one is best still gets cited to a third party. You can be mentioned without being cited, and cited without being mentioned.

Two mechanics sit underneath this. Citations bind to a specific sentence, not the whole answer, so being topically relevant isn’t enough; you have to be the best support for a precise claim.

And results are deduped by domain, so 20 thin pages from your site collapse into one.

One strong page per claim beats a pile of weak ones.

So, don’t go around creating thousands of low-quality, thin pages to address each fan-out query.
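
Deduping by domain is easy to picture in code. This is an illustrative sketch only: I can see that results collapse per domain, but not which page survives, so keeping the first URL seen is an assumption, and dedupeByDomain and the example.com URLs are made up for the demo.

```javascript
// Collapse a list of result URLs to one per domain.
// Keeping the first URL per domain is an assumption for this sketch.
function dedupeByDomain(urls) {
  const firstSeen = new Map();
  for (const url of urls) {
    const host = new URL(url).hostname.replace(/^www\./, '');
    if (!firstSeen.has(host)) firstSeen.set(host, url);
  }
  return [...firstSeen.values()];
}

const fetchedUrls = [
  'https://www.example.com/pricing',
  'https://example.com/pricing-for-agencies',
  'https://example.com/pricing-for-startups',
  'https://www.reddit.com/r/SEO/',
];

console.log(dedupeByDomain(fetchedUrls));
```

Three pages from one site leave a single slot, which is why one strong page does more than several thin variants.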

The AI SEO/GEO Takeaway

You can’t cite yourself. The claim about you gets sourced from someone else, so earn third-party coverage on review sites and Reddit, win on text rather than video, and put one strong page behind each claim, because it dedupes by domain.

The Model Explains Its Own Method

I went looking for a hidden ranking score first and found nothing. That kind of logic, a domain authority number, a trust weight, a formula, never reaches your browser, because it stays on OpenAI’s servers.

So, anyone selling you “ChatGPT’s ranking formula” is selling you snake oil.

What the traffic does have is the thinking model’s chain of thought, saved in the conversation, where it describes its own sourcing in plain words.

For facts, the pricing and the specs, it goes to the official page first, and it says so.

Comparing Ahrefs, it reads the official page, notes it “lists Lite at $129, Standard at $249, and Advanced at $449,” and decides “pricing page seems more current, so I should cite that.” It wants the source it trusts, and the current one.

Then it hits the wall this whole post is about.

On Profound, it reasons that “the pricing isn’t showing up directly in the search result, likely because it’s loaded with JavaScript.” Same on Peec, where “the pricing doesn’t show up directly, likely hidden with JavaScript.”

So, it stops trying to read them and falls back. “I can quote third-party sources since the official page is hard to parse and doesn’t show prices,” it writes, and it notes it should “use citations from G2 where appropriate.”

That’s the whole game in one trace. The model wanted Profound’s and Peec’s own numbers. Their pricing sat behind JavaScript, so it couldn’t read them, and it cited G2 instead. Your facts, someone else’s page, because yours wouldn’t parse.

These quotes are the model’s own, from the saved reasoning, not mine.

The AI SEO/GEO Takeaway

Own your facts, in plain HTML. Your pricing and spec numbers need to sit in crawlable text, not loaded by JavaScript and not baked into an image, because the model reads the page itself and gives up when it can’t. A JavaScript pricing table doesn’t just rank badly; it hands your numbers to G2.

The opinion you earn separately, through reviews, Reddit, and honest comparison content, which is where the recommendation gets cited from. A clean, readable pricing page with no third-party coverage gets your facts read and someone else recommended.

What I Could Not See

There’s no visible ranking logic, as above, so why one source beats another, beyond the model’s own narration, stays server-side.

Personalization is real and selective.

On a query that overlapped my own work, ChatGPT pulled in my past conversations, with the sources listed as personal_sources: ["convo_search", "gmail", "files"].

It used one of my past chats inside a generic “best tools” answer, but only on one of the three conversations I checked, the one that matched my history.

So, part of some answers is built from a user’s private data you can never optimize for, which is one reason two people get different answers and visibility scores wobble.

Local is capped. There’s a config value, local_results_limit, set to 2.

Ask for the best coffee near you, and ChatGPT returns two places, not a top 10. For local, you’re in the top 2, or you aren’t there.

One thing I genuinely can’t call yet. My read on shopping comes from a single shopping query, and it flatly contradicts what Mark saw on his single query, so the shopping mix is unsettled until someone runs a proper batch.

And the broader caveat, said plainly. The structure I’m sure of, because I saw it across roughly 1,240 records. The percentages come from a small batch of commercial queries, mostly SaaS and tech, so they need a bigger run across real verticals before anyone banks on them.

That run is the next piece.

Run It Yourself

None of this needs special access or requires you to be connected to the Matrix and become an operator, just your own browser.

Open ChatGPT, press Cmd+Option+I for DevTools, open Network, tick Preserve log, run a query, then press Cmd+Option+F and search the responses for result_source.

That alone shows you the pipeline behind each link.

For the rest, the fan-out and the citations and the reasoning, open the Console, type allow pasting once, and run this against a conversation that searched the web.

// Reuse the access token of the session you are already logged in to.
const t = (await (await fetch('/api/auth/session')).json()).accessToken;
// Fetch the open conversation; its ID is the part of the URL after /c/.
const c = await (await fetch('/backend-api/conversation/' + location.pathname.split('/c/')[1], {headers: {Authorization: 'Bearer ' + t}})).json();
const rows = [];
// JSON.stringify calls the replacer on every nested value, so it doubles as a tree walker.
JSON.stringify(c, (k, v) => {
  if (v && v.result_source) {
    const d = (v.attribution || v.url || '?').toString();
    rows.push({source: d.replace('https://', '').replace('www.', '').split('/')[0], pipeline: v.result_source});
  }
  return v;
});
console.table(rows);

It reads only your own session, so nothing leaves your machine. The output is a plain table of each source and the pipeline that fetched it.

source pipeline
techradar.com labrador
whathifi.com labrador
soundguys.com bright
rtings.com bright
khaleejtimes.com oxylabs
streetinsider.com serp

Change what the loop collects, and you can pull the searches, the citations, and the reasoning the same way.
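
One way to make that swap easy is a small walker that collects every value stored under a given key. This is a sketch: collectByKey is a hypothetical helper, and the sample object below is invented to show the shape of the call, not copied from a real conversation. The only key names taken from the traffic are result_source and turn_use_case; for the searches and the reasoning, inspect the JSON to find the keys your own capture uses.

```javascript
// Walk any parsed JSON value and collect every value stored under `key`.
function collectByKey(root, key) {
  const found = [];
  const visit = (node) => {
    if (Array.isArray(node)) {
      node.forEach(visit);
    } else if (node && typeof node === 'object') {
      for (const [name, value] of Object.entries(node)) {
        if (name === key) found.push(value);
        visit(value); // keep descending, matches can be nested
      }
    }
  };
  visit(root);
  return found;
}

// Invented stand-in for a conversation payload.
const sampleConversation = {
  turns: [
    { turn_use_case: 'shopping', results: [{ result_source: 'bright' }, { result_source: 'labrador' }] },
    { turn_use_case: 'text', results: [] },
  ],
};

console.log(collectByKey(sampleConversation, 'result_source'));
console.log(collectByKey(sampleConversation, 'turn_use_case'));
```

In the console, you would pass the fetched conversation object in place of the sample and change the key name per run.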

A Free Extension Now Captures Most Of This

If pasting scripts into your own console isn’t your thing, there’s now an easier route. Olivier de Segonzac already ran a free Chrome extension that pulls ChatGPT’s search and fan-out data.

He read this research and extended it to capture three of the signals I took apart above.

The turn_use_case bucket. The intent label ChatGPT files each turn under, so you can spot when a query flips to shopping, local, or text before it even answers.
The reference-type mix. How many of the answer’s citations were products versus search results, news, or images, parsed straight from the reference tokens.
The result_source pipeline. The scraper behind each cited result, charted per conversation, so the Bright Data, Oxylabs, Labrador, and SERP split shows up without you reading a line of JSON.

It runs locally on your own session and exports straight to Excel. Grab it from the Chrome Web Store, and Olivier wrote up the update here.

So, back to the question we opened with. Does the usual advice hold up? Mostly. Reddit earns citations and topped my cited list. Listicles and review sites make up much of the rest. Good content still matters, but only the part the model can actually read. The rest it reads off someone else’s page.

Which is the real lesson. ChatGPT isn’t a search engine, so stop optimizing for one.

It reads your own page for the facts, if it can parse them, and everyone else’s for the opinion, and only when the question is worth a search. Build for that.

And treat all of this, mine included, as a snapshot of a system that changes by the week. The structure holds. The numbers move.

While I was in the traffic, I also found a pile of things with nothing to do with sourcing: the bot wall that stops you scripting it, a hidden shopping engine, and 573 live experiments running on the account. Those will be published separately.

I’ve also done similar analysis on Perplexity, Gemini, etc., so I’ll be sharing those soon.

This post was originally published on Suganthan.

Featured Image: Viktoriia_M/Shutterstock
