BigSERPEnergy

Crawl budget on large sites: what actually matters

Crawl budget only bites when the URLs you could crawl exceed what gets crawled. Most of the waste is self-inflicted, and most of it is fixable.

Image: rows of server blades in a data center. Photo: panumas nikhomkhai / Pexels

On a site with a few hundred pages, crawl budget is not your problem. Google will happily crawl everything you publish, often within hours. The concept only starts to matter when the number of URLs a search engine could crawl exceeds the number it is willing to crawl in a reasonable window. That is the world large sites live in, and it changes how you think about technical SEO.

Google describes crawl budget as determined by two things: crawl capacity, meaning how much it can crawl without straining your servers, and crawl demand, meaning how much it wants to crawl based on how valuable and fresh your URLs appear. You influence both. This article is about doing that on purpose.

The waste is usually self-inflicted

When crawlers spend their time on the wrong URLs, the cause is almost always something the site is generating itself. The usual suspects:

  • Faceted navigation. Filters and sort orders that combine into a near-infinite space of URLs, most of which are duplicates or near-duplicates of each other.
  • Session identifiers and tracking parameters that turn one page into many addresses.
  • Infinite or very deep pagination that invites crawlers into long, low-value tails.
  • Soft 404s and redirect chains that burn requests without delivering anything indexable.

The first job is not to make crawling faster; it is to stop wasting it. Every request a crawler spends on a parameter-laced duplicate is a request it did not spend on the page you actually want indexed and refreshed.
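To get a feel for how much of a URL set is parameter noise, one rough approach is to normalize URLs and group the ones that collapse to the same address. The sketch below is illustrative only: the IGNORED_PARAMS list is an assumption, and you would replace it with the parameters that genuinely do not change content on your site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list: parameters assumed not to change what the page shows.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                  "gclid", "fbclid", "sessionid"}


def normalize_url(url: str) -> str:
    """Drop ignored parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in IGNORED_PARAMS]
    kept.sort()
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))


def group_duplicates(urls):
    """Map each normalized URL to the raw URLs that collapse into it."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize_url(url), []).append(url)
    return groups
```

Groups with many members are the places where one page is being served under many addresses.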

Read your logs

Server log files are the only source that tells you what crawlers actually did, not what you hope they did. Before changing anything, look at which URLs and templates are consuming crawl requests, and how much of that is going to pages you would be happy to lose.
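As a starting point, a log summary can be as simple as counting crawler requests per template and status code. The following is a minimal sketch that assumes the Combined Log Format and uses a made-up bucketing rule (first path segment, with parameterized URLs counted separately); both are assumptions to adapt. It matches on the user-agent string only, which can be spoofed, so treat the numbers as approximate unless you also verify the requesting IPs.

```python
import re
from collections import Counter

# Assumes Combined Log Format; adjust the pattern to your server's log format.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def template_of(path: str) -> str:
    """Hypothetical bucketing: first path segment, parameterized URLs split out."""
    base, _, query = path.partition("?")
    label = "/" + base.strip("/").split("/")[0]
    return label + " +params" if query else label


def crawl_counts(lines, agent_token="Googlebot"):
    """Count requests per (template, status) for agents containing agent_token."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and agent_token in match.group("agent"):
            key = (template_of(match.group("path")), match.group("status"))
            counts[key] += 1
    return counts
```

Calling crawl_counts(open("access.log")).most_common(20), with a file name of your own, gives a first view of where requests are going.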

Controls, and what they really do

The tools for steering crawlers are simple, but they are easy to misuse because they do different jobs that people often conflate.

robots.txt

Disallowing a path stops crawling, not indexing. A URL blocked in robots.txt can still appear in results if other pages link to it, just without a useful snippet. Use it to keep crawlers out of low-value spaces like internal search results, not as a way to remove pages from the index.
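Before shipping a robots.txt change, it helps to test a few URLs against the rules. Here is a small sketch using Python's standard urllib.robotparser; the rules and example.com URLs are hypothetical. One caveat: the standard-library parser does plain prefix matching and does not implement the * and $ wildcards that Google supports, so keep test rules to simple path prefixes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /search and /cart are assumed to be low-value spaces.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for url in ("https://example.com/search?q=boots",
            "https://example.com/products/boots"):
    # can_fetch answers "may this agent crawl this URL", nothing about indexing.
    print(url, parser.can_fetch("Googlebot", url))
```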

noindex

A noindex directive is how you keep a page out of the index. The catch is that a crawler has to be able to fetch the page to see the directive, so a URL you both disallow in robots.txt and want noindexed will never have its noindex read. Pick one job per URL and apply the matching tool.
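The conflict described above is easy to check mechanically. The sketch below looks for noindex in a robots meta tag or an X-Robots-Tag header and flags URLs that are also disallowed. It is deliberately simplified: it ignores user-agent-scoped header values, it expects the header under the exact key "X-Robots-Tag", and the blocked flag is something you would supply from your own robots.txt check.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str, headers: dict) -> bool:
    """True if noindex appears in the meta tag or the X-Robots-Tag header."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header = headers.get("X-Robots-Tag", "")
    header_directives = {d.strip().lower() for d in header.split(",")}
    return "noindex" in (parser.directives | header_directives)


def audit(blocked_by_robots_txt: bool, html: str, headers: dict) -> str:
    """Classify a URL by which control is actually in effect."""
    noindex = has_noindex(html, headers)
    if blocked_by_robots_txt and noindex:
        return "conflict: disallowed, so the noindex cannot be read"
    if blocked_by_robots_txt:
        return "crawl blocked"
    return "noindex" if noindex else "indexable"
```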

Canonical tags

Canonicals consolidate signals among genuine duplicates, telling search engines which version to prefer. They are a hint, not a command, and they do not save crawl budget on their own, since the duplicates still get crawled. They are about consolidating authority, not about reducing crawl.
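One way to see this in your own data is to count how many crawl requests landed on URLs that declare a different canonical. The sketch below assumes you already have two inputs, both hypothetical here: a list of crawled URLs taken from your logs, and a mapping from each URL to the canonical it declares, taken from your own crawl of the site.

```python
from collections import Counter


def non_canonical_hits(crawled_urls, canonical_of):
    """Count crawl requests that hit URLs declaring a different canonical.

    Returns a Counter keyed by the canonical target, so the largest values
    point at the pages whose duplicates absorb the most requests.
    """
    hits = Counter()
    for url in crawled_urls:
        # URLs with no declared canonical are treated as self-canonical.
        target = canonical_of.get(url, url)
        if target != url:
            hits[target] += 1
    return hits
```

A high count for one target suggests the duplicates should be removed at the source, for example by no longer linking to them, rather than relying on the canonical alone.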

Make the important URLs easy and the rest invisible

The strategy that follows from all this is straightforward to state and hard to execute. Make the URLs you care about easy to discover, through clean internal links, sensible sitemaps, and shallow depth. Make the URLs you do not care about hard to reach or impossible to index, by trimming parameters, blocking low-value spaces, and capping pagination depth.
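"Shallow depth" can be measured as the number of clicks from the home page. A breadth-first search over the internal link graph gives that number for every reachable page. In this sketch the link graph is a plain dictionary you would build from your own crawl, and the max_depth threshold of 3 is an arbitrary illustration, not a recommendation from any search engine.

```python
from collections import deque


def click_depths(links, start):
    """Breadth-first search: minimum number of clicks from start to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def too_deep(links, start, important, max_depth=3):
    """Important URLs that are deeper than max_depth or not reachable at all."""
    depths = click_depths(links, start)
    return sorted(url for url in important
                  if depths.get(url, float("inf")) > max_depth)
```

Pages that come back from too_deep are candidates for better internal linking; unreachable ones are orphans that only a sitemap or an external link can surface.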

Sitemaps deserve a note. On a large site, a clean, segmented set of sitemaps is both a discovery aid and a diagnostic tool. If you split sitemaps by template or section, the index-coverage data in Search Console tells you which parts of the site are being indexed well and which are not, which is far more actionable than one number for the whole domain.
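Splitting sitemaps by section can be automated. The following sketch groups URLs by their first path segment, which is only one possible segmentation and is an assumption here, and builds one sitemap document per group. It does not enforce the sitemap protocol's limit of 50,000 URLs per file, and it leaves out the XML declaration and the sitemap index file that would tie the segments together.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def segment_of(url: str) -> str:
    """Hypothetical segmentation rule: the first path segment of the URL."""
    first = urlsplit(url).path.strip("/").split("/")[0]
    return first or "root"


def build_sitemaps(urls):
    """Return {filename: xml_string} with one sitemap per segment."""
    by_segment = defaultdict(list)
    for url in urls:
        by_segment[segment_of(url)].append(url)

    files = {}
    for segment, members in sorted(by_segment.items()):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in sorted(members):
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = url
        files[f"sitemap-{segment}.xml"] = ET.tostring(urlset, encoding="unicode")
    return files
```

Submitting each file separately is what makes the per-section coverage comparison possible.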

Rendering is part of the budget

If your pages depend on client-side JavaScript to show their main content, you are asking crawlers to do more work per URL, and that work competes for the same capacity. The safest position on a large site is to serve the content that matters in the initial HTML, through server-side rendering or static generation, so that a crawler does not have to execute and render a page to understand it. Heavy client-side rendering at scale is a reliable way to slow discovery without anyone noticing why.
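A quick way to spot templates that depend on client-side rendering is to check whether key phrases appear in the raw HTML response, before any JavaScript runs. The sketch below takes HTML you have already fetched without a browser and reports which phrases are present in its visible text. The phrases are whatever you choose per template; the ones in the usage note are invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script, style, and template."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Text of the document with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def content_in_initial_html(html: str, required_phrases):
    """Map each phrase to whether it appears in the unrendered HTML text."""
    text = visible_text(html).lower()
    return {phrase: phrase.lower() in text for phrase in required_phrases}
```

For a product template you might call content_in_initial_html(raw_html, ["Trail Boots", "Add to cart"]); any False value means that piece of content only exists after rendering.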

Takeaways
  • Crawl budget only matters when crawlable URLs exceed what gets crawled, which is normal at scale.
  • Most waste is self-inflicted by facets, parameters, and deep pagination. Stop the waste first.
  • robots.txt blocks crawling, noindex blocks indexing, canonicals consolidate duplicates. Do not mix up their jobs.
  • Serve the important content in the initial HTML so crawlers spend less per URL.

Technical SEO at scale is less about clever tricks than about discipline: knowing what crawlers are doing, removing the work that does not need doing, and pointing the budget at the pages that pay you back.
