Google Chief Scientist Jeff Dean said Flash's low latency and cost are why Google can run Search AI at scale. Retrieval is a design choice, not a limitation, he added.
In an interview on the Latent Space podcast, Dean explained why Flash became the production tier for Search. He also laid out why the pipeline that narrows the web to a handful of documents will likely persist.
Google started rolling out Gemini 3 Flash as the default for AI Mode in December. Dean’s interview explains the rationale behind that call.
Why Flash Is The Production Tier
Dean called latency the critical constraint for running AI in Search. As models handle longer and more complex tasks, speed becomes the bottleneck.
“Having low latency systems that can do that seems really important, and Flash is one direction, one way of doing that.”
Podcast hosts noted Flash's dominance across services like Gmail and YouTube. Dean said Search is part of that expansion, with Flash's use growing across AI Mode and AI Overviews.
Flash can serve at this scale because of distillation. Each generation's Flash inherits the previous generation's Pro-level performance, getting more capable without getting more expensive to run.
“For multiple Gemini generations now, we've been able to make the sort of Flash version of the next generation as good or even significantly better than the previous generation's Pro.”
That's the mechanism that makes the architecture sustainable. Google pushes frontier models for capability development, then distills those capabilities into Flash for production deployment. Flash is the tier Google designed to run at Search scale.
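As a rough illustration of the distillation mechanism described above, here is a minimal sketch of a soft-target distillation loss in plain Python. The temperature value and toy logits are illustrative assumptions, not details from the interview:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's.

    Minimizing this pushes a small (Flash-like) student model to mimic the
    output distribution of a large (Pro-like) teacher model.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]
aligned_student = [3.9, 1.1, 0.4]   # already mimics the teacher
random_student = [0.5, 4.0, 1.0]    # disagrees with the teacher

# A student that mimics the teacher gets a much lower loss.
print(distillation_loss(teacher, aligned_student) <
      distillation_loss(teacher, random_student))  # True
```

The idea is that the teacher's full probability distribution carries more signal than hard labels alone, which is one reason a distilled student can approach teacher-level quality at a fraction of the serving cost.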
Retrieval Over Memorization
Beyond Flash's role in Search, Dean described a design philosophy that keeps external content central to how these models work. Models shouldn't waste capacity storing facts they can retrieve.
“Having the model devote precious parameter space to remember obscure facts that could be looked up is actually not the best use of that parameter space.”
Retrieval from external sources is a core capability, not a workaround. The model looks things up and works through the results rather than carrying everything internally.
Why Staged Retrieval Likely Persists
AI search can't read the entire web at once. Current attention mechanisms are quadratic, meaning computational cost grows rapidly as context length increases. Dean said “a million tokens kind of pushes what you can do.” Scaling to a billion or a trillion isn't feasible with current methods.
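The quadratic scaling is easy to see with a back-of-the-envelope count of pairwise attention scores. This is a simplification that ignores constants, hidden dimensions, and memory, but it shows why each 1,000x jump in context multiplies the work a million-fold:

```python
def attention_cost(context_tokens: int) -> int:
    """Pairwise score count for full self-attention:
    every token attends to every other token."""
    return context_tokens ** 2

# 1K -> 1M -> 1B tokens: each 1,000x step in context
# means a 1,000,000x step in attention work.
for tokens in (1_000, 1_000_000, 1_000_000_000):
    print(f"{tokens:>13,} tokens -> {attention_cost(tokens):.1e} pairwise scores")
```

Under this rough model, a billion-token context would cost a trillion times more than a thousand-token one, which is why Dean frames trillion-token attention as requiring new techniques rather than brute-force scaling.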
Dean's long-term vision is models that give the “illusion” of attending to trillions of tokens. Achieving that requires new techniques, not just scaling what exists today. Until then, AI search will likely keep narrowing a broad candidate pool to a handful of documents before generating a response.
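A toy sketch of that staged narrowing, assuming a cheap lexical first stage over the whole corpus and a stand-in second-stage reranker over the small candidate pool. All function names and scoring rules here are hypothetical, not Google's pipeline:

```python
def staged_retrieval(query_terms, corpus, pool_size=100, final_k=3):
    """Two-stage funnel: cheap scoring over everything, then a pricier
    rerank of the candidate pool, returning a handful of documents
    for the model to actually read."""
    # Stage 1: cheap lexical-overlap score applied to the full corpus.
    def cheap_score(doc):
        words = set(doc.lower().split())
        return sum(term in words for term in query_terms)

    pool = sorted(corpus, key=cheap_score, reverse=True)[:pool_size]

    # Stage 2: more expensive rerank on the small pool
    # (stand-in heuristic: boost overlap, penalize length).
    def rerank_score(doc):
        return cheap_score(doc) * 2 - len(doc.split()) * 0.01

    return sorted(pool, key=rerank_score, reverse=True)[:final_k]

corpus = [
    "flash models serve search at low latency",
    "gemini pro handles complex reasoning tasks",
    "distillation transfers pro capability into flash",
    "unrelated document about gardening tips",
]
print(staged_retrieval(["flash", "latency"], corpus, pool_size=3, final_k=2))
```

The shape matters more than the scoring details: the expensive step only ever sees the small pool, so total cost stays bounded even when the corpus is the entire web.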
Why This Matters
The model reading your content in AI Mode is getting better every generation. But it's optimized for speed over reasoning depth, and it's designed to retrieve your content rather than memorize it. Being findable through Google's existing retrieval and ranking signals is the path into AI search results.
We've tracked every model swap in AI Mode and AI Overviews since Google launched AI Mode with Gemini 2.0. Google shipped Gemini 3 to AI Mode on release day, then began rolling out Gemini 3 Flash as the default a month later. Most recently, Gemini 3 became the default for AI Overviews globally.
Every model generation follows the same cycle: frontier for capability, then distillation into Flash for production. Dean presented this as the architecture Google expects to maintain at Search scale, not a temporary fallback.
Looking Ahead
Based on Dean's comments, staged retrieval is likely to persist until attention mechanisms move past their quadratic limits. Google's investment in Flash suggests the company expects to use this architecture across multiple model generations.
One change to watch is automatic model selection. Google's Robby Stein described the concept previously, which involves routing complex queries to Pro while keeping Flash as the default.
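What tier routing could look like can be sketched with a simple heuristic. The complexity markers and tier names below are purely illustrative assumptions, not Google's actual routing logic:

```python
COMPLEX_MARKERS = ("compare", "explain why", "step by step", "prove")

def route_query(query: str) -> str:
    """Hypothetical router: send queries showing complexity markers to a
    Pro-tier model, keep everything else on the fast Flash default."""
    text = query.lower()
    return "pro" if any(marker in text for marker in COMPLEX_MARKERS) else "flash"

print(route_query("best pizza near me"))                     # flash
print(route_query("compare quadratic vs linear attention"))  # pro
```

In practice, a production router would likely use a learned classifier rather than keyword matching, but the economics are the same: pay for Pro-level reasoning only on the queries that need it.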
Featured Image: Robert Way/Shutterstock

