How AI helped build hreflang XML sitemaps at scale

As AI tool usage has become more common, I’ve seen impressive examples of people building tools to automate complex processes that once required significant manual effort. I’ve also seen teams adopt AI simply because it’s available, often with little practical benefit.

My approach is to focus on AI applications that save time and solve real problems.

Recently, I needed to align the SEO architecture for more than a dozen websites across three separate businesses, eight regional domains, and multiple languages, including three English dialects, Italian, Japanese, Spanish, Thai, French, and Korean.

Historically, mapping thousands of URLs to create cohesive hreflang XML sitemaps would have required specialized software or days of spreadsheet work. Instead, I used Google Gemini to build a custom Python script that handled the heavy lifting.

Here’s how the project evolved from an initial prompt into a highly customized automation tool, and what it taught me about using AI for technical SEO.

Where AI delivers the most value

I use AI primarily for practical, time-saving tasks, including:

  • Generating regex patterns when I need a quick solution without researching syntax from scratch.
  • Creating complex spreadsheet formulas for reporting workflows that rely on manual data exports.
  • Accelerating research and planning for projects that require competitive analysis across multiple business lines.
  • Building custom automation tools for recurring SEO and data-processing tasks.

The hreflang project discussed here falls into that final category.

Mapping hreflang at scale

The challenge was clear: map thousands of URLs across more than a dozen multilingual websites into accurate hreflang XML sitemaps.

Rather than tackling the project manually, I used Google Gemini to help build a custom Python solution.

Here’s how the process unfolded.

Phase 1: Asking for an approach, not just a script

A common pitfall when using generative AI for coding is asking it to sprint before it knows the direction. If you simply type, “Write a Python script to create an hreflang sitemap,” you’ll get a generic, fragile piece of code that breaks the moment it encounters real-world data.

Instead, I started by asking for an approach. I explained the scenario: multiple regional domains, organic growth over several years resulting in mismatched URL slugs, translated subfolders, and appended revision years.

Gemini suggested a multi-step, data-driven approach:

  • Crawl the websites to collect live URLs and their metadata.
  • Use Python in Google Colab to process the raw data.
  • Run an exact match cluster first to group identical slugs.
  • Use an advanced semantic AI model (such as SentenceTransformers) to fuzzy match translated pages based on their titles and normalized URLs.
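
The exact-match step in that list can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the script Gemini produced: the function names `slug_of` and `exact_match_clusters` and the example domains are assumptions made for this sketch.

```python
from collections import defaultdict
from urllib.parse import urlparse


def slug_of(url):
    """Return the last non-empty path segment, lowercased ('' for a homepage)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[-1].lower() if segments else ""


def exact_match_clusters(urls):
    """Group URLs that share an identical slug.

    Returns (matched, unmatched): matched maps a slug to the URLs sharing it,
    unmatched lists URLs whose slug appears only once.
    """
    clusters = defaultdict(list)
    for url in urls:
        clusters[slug_of(url)].append(url)
    matched = {slug: group for slug, group in clusters.items() if len(group) > 1}
    unmatched = [group[0] for group in clusters.values() if len(group) == 1]
    return matched, unmatched


# Hypothetical URLs for illustration only.
urls = [
    "https://example.com/blog/seo-guide/",
    "https://example.co.uk/seo-guide",
    "https://example.it/guida-seo/",
]
matched, unmatched = exact_match_clusters(urls)
```

In this sketch the Italian URL lands in `unmatched`, which is the set a semantic model would then try to pair by title and normalized URL.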

Phase 2: Crawling and data collection

Following the strategy, I used a crawler to spider all the regional websites. The goal was to generate a unified comma-separated values (CSV) file containing the live URLs, status codes, title tags, and H1s. Screaming Frog worked perfectly for this application.

A critical point: Your AI output is only as good as your crawl data (remember the old saying, “garbage in, garbage out”).

An AI script will fail to map an obvious “exact match” if the target URL is a 404 or a 301 redirect in your source data. You must filter your CSV to include only indexable content before feeding it to the script.
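
One way to apply that filter before matching is sketched below with Python's built-in csv module. The column names ("Address", "Status Code", "Indexability") are assumptions about the crawl export and are passed as parameters so they can be adjusted to the real headers; `indexable_rows` is a hypothetical helper name.

```python
import csv
import io


def indexable_rows(csv_text, status_col="Status Code", indexability_col="Indexability"):
    """Keep only rows with a 200 status that are not flagged as non-indexable.

    Column names are assumptions about the crawl export; change them to match
    the actual CSV headers.
    """
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Drop redirects, 404s, and anything else that is not a 200.
        if (row.get(status_col) or "").strip() != "200":
            continue
        # If the export has an indexability column, respect it.
        flag = row.get(indexability_col)
        if flag is not None and flag.strip().lower() != "indexable":
            continue
        kept.append(row)
    return kept


# Hypothetical crawl data for illustration only.
sample = (
    "Address,Status Code,Indexability\n"
    "https://example.com/a,200,Indexable\n"
    "https://example.com/b,301,Non-Indexable\n"
    "https://example.com/c,404,Non-Indexable\n"
)
clean = indexable_rows(sample)
```

Only the first sample row survives, so redirected and missing pages never reach the matching step.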

Phase 3: The Google Colab sandbox

Google Colab provides a free, cloud-based Jupyter notebook environment where you can write, paste, and execute Python code without worrying about local installations or environment variables. You can access it through Google Drive. I found the free version had enough capacity to handle this project.

I uploaded the CSV to Colab, and Gemini provided the initial Python script. The script used a domain-mapping routine to assign language codes, clean the URLs, and generate an XML tree. The initial output was far from perfect.
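
To make the domain-mapping and XML steps concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree. It is not the script from the project: the `DOMAIN_LANG` table, its domains and language codes, and the function name `build_hreflang_sitemap` are all hypothetical, and the input is assumed to be a list of already-matched URL clusters.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical domain-to-hreflang mapping; replace with the real domains.
DOMAIN_LANG = {
    "example.com": "en-US",
    "example.co.uk": "en-GB",
    "example.it": "it-IT",
}


def build_hreflang_sitemap(clusters):
    """Return sitemap XML in which every URL lists all alternates in its cluster."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for cluster in clusters:
        # Pair each URL with its language code; skip domains not in the mapping.
        tagged = [
            (url, DOMAIN_LANG[urlparse(url).netloc])
            for url in cluster
            if urlparse(url).netloc in DOMAIN_LANG
        ]
        for url, _ in tagged:
            url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = url
            # hreflang annotations are reciprocal and include the page itself.
            for alt_url, lang in tagged:
                ET.SubElement(
                    url_el,
                    f"{{{XHTML_NS}}}link",
                    {"rel": "alternate", "hreflang": lang, "href": alt_url},
                )
    return ET.tostring(urlset, encoding="unicode")


xml_out = build_hreflang_sitemap(
    [["https://example.com/pricing/", "https://example.it/pricing/"]]
)
```

Each page in a cluster gets its own `<url>` entry carrying the full set of `xhtml:link` alternates, which is what keeps the annotations reciprocal across regional sites.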

Phase 4: The iteration (where the real work happens)

If you expect AI to deliver a flawless, edge-case-proof script on the first try, you’ll be disappointed. You’ve probably heard the comparison of AI tools to interns, meaning you need to check their work. That’s very true.

The real value of AI lies in the iteration. As we ran the script, we encountered numerous unmatched URLs, leaving pages orphaned rather than grouping them with their international counterparts.

Here’s how I iteratively trained the AI to handle the nuances of human-managed websites.

The directory flattening problem

The U.S. site had recently reorganized its blog into topical folders, while the Mexican and Italian sites hadn’t yet been reorganized.

I prompted Gemini with these specific mismatched examples. It responded by adding a URL flattener function to the script, which stripped the topical folders behind the scenes so the translated slugs could align cleanly.
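
A flattener of this kind might look like the sketch below. The blog root name ("blog"), the folder depth, and the function name `flatten_blog_url` are assumptions for illustration; the actual function Gemini wrote is not shown in the source.

```python
from urllib.parse import urlparse


def flatten_blog_url(url, blog_root="blog"):
    """Collapse /blog/<topic>/.../<slug>/ to /blog/<slug>/ as a matching key.

    The published URL is left untouched; only the comparison key changes.
    'blog' as the section name is an assumption.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > 2 and segments[0] == blog_root:
        # Drop the topical folders between the blog root and the slug.
        segments = [blog_root, segments[-1]]
    if not segments:
        return "/"
    return "/" + "/".join(segments) + "/"


# Hypothetical URLs: a reorganized site and one still using a flat blog.
key_us = flatten_blog_url("https://example.com/blog/manufacturing/plant-tour/")
key_it = flatten_blog_url("https://example.it/blog/plant-tour")
```

Both keys come out as `/blog/plant-tour/`, so a reorganized page and its flat-structured counterpart can be compared on equal terms.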

The aggressive semantic trap

To prevent the AI from mixing up different topics, we implemented concept traps. Initially, they were too strict. A UK article about the manufacturing sector wouldn’t match an Italian article because the U.S. title was slightly more generic.

I instructed Gemini to loosen the traps for generic industries while keeping them strictly enforced for critical acronyms (such as “SEO” versus “SEM”). This gave the AI the breathing room it needed to match creative translations.
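
The source doesn't show how the concept traps were coded, so the following is only one plausible shape for the loosened rule: a candidate pair is rejected when the critical acronyms in the two titles differ, and generic industry wording is left for the similarity score to judge. `STRICT_TERMS` and `passes_concept_trap` are hypothetical names.

```python
import re

# Illustrative list of acronyms that must never be confused with one another.
STRICT_TERMS = {"seo", "sem"}


def tokens(text):
    """Lowercase alphanumeric tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def passes_concept_trap(title_a, title_b, strict_terms=STRICT_TERMS):
    """Allow a candidate match only if both titles carry the same strict terms.

    Generic wording (for example, industry names) is deliberately not checked
    here, so looser translations can still be matched by the similarity model.
    """
    return (tokens(title_a) & strict_terms) == (tokens(title_b) & strict_terms)


# Hypothetical titles for illustration only.
ok = passes_concept_trap("SEO tips for manufacturers", "Consigli SEO per il settore manifatturiero")
blocked = passes_concept_trap("SEO budgeting guide", "SEM budgeting guide")
```

Keeping the strict list short is the design choice that matters here: every term added to it is another way for a legitimate translated pair to be rejected.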

The translated slug epiphany

The biggest breakthrough came while auditing the Mexican blog orphans. For example, the Spanish URL /detras-de-escenas-historias... is a direct translation of the English /behind-the-scenes-stories... I pointed this out to Gemini.

Instead of forcing me to hard-code hundreds of manual matches, Gemini updated the script to create a “Combined Semantic Signature.” It dynamically translated core operational words in the slugs, effectively bridging the language gap for the semantic matching model and connecting dozens of orphaned pages almost instantly.
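
As a rough illustration of the idea, the sketch below translates slug words through a small glossary and joins the result with the page title into one string for the matching model. What the real “Combined Semantic Signature” contained isn't described in detail, so the glossary, the stopword list, and the function names are assumptions; the slugs use only the visible portion of the truncated examples above.

```python
import re

# Hypothetical Spanish-to-English glossary for recurring slug words.
SLUG_GLOSSARY = {"detras": "behind", "escenas": "scenes", "historias": "stories"}
# Small illustrative stopword list covering both languages.
STOPWORDS = {"de", "la", "el", "the", "of"}


def translate_slug(slug, glossary=SLUG_GLOSSARY):
    """Split a slug into words, drop stopwords, and translate known words."""
    words = [w for w in re.split(r"[-_/]+", slug.lower()) if w and w not in STOPWORDS]
    return " ".join(glossary.get(w, w) for w in words)


def combined_signature(title, slug):
    """Join the title and the translated slug into one string for matching."""
    return f"{title.strip()} | {translate_slug(slug)}"


es_key = translate_slug("/detras-de-escenas-historias")
en_key = translate_slug("/behind-the-scenes-stories")
```

Both slugs reduce to `behind scenes stories`, which gives a similarity model a shared signal even when the titles are translated loosely.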

The project reinforced a simple lesson: AI works best when it’s treated as a collaborator rather than a shortcut.

  • Be the strategist, let AI be the coder: Don’t just demand a final product. Discuss the architecture, edge cases, and logic first. Treat AI like a junior developer that needs clear architectural direction.
  • Provide concrete examples: When a script fails, don’t just say, “It’s broken.” For this project, I provided either exact URLs that failed and the URLs they should have matched with, or groups of URLs with mismatches. AI needs concrete patterns to fix its logic.
  • Embrace the iterative loop: Expect to run the code, identify anomalies, and feed them back into the prompt. Each iteration makes the tool significantly smarter.
  • Leverage Google Colab: You don’t need to be a Python expert to use Python for SEO. Colab bridges the technical gap, allowing you to run complex data science libraries directly in your browser.

By the end of the project, we had a robust, highly customized Python script that could process a massive CSV and generate a cross-referenced hreflang XML sitemap in minutes.

AI isn’t going to replace technical SEOs anytime soon. However, SEOs who know how to collaborate with AI to build custom, scalable, and useful tools will have a significant advantage.

Contributing authors are invited to create content for Search Engine Land and are chosen for their expertise and contribution to the search community. Our contributors work under the oversight of the editorial staff and contributions are checked for quality and relevance to our readers. Search Engine Land is owned by Semrush. Contributor was not asked to make any direct or indirect mentions of Semrush. The opinions they express are their own.
