How to fix SaaS crawl budget: a diagnostic guide. Block leaked app and staging URLs at the source. PropSaaS Growth.

Most SaaS companies chasing a crawl budget problem are solving the wrong thing. Googlebot's crawl rate is rarely the constraint for a marketing site of a few hundred or a few thousand pages. The real issue is junk URLs leaking in from systems that were never meant to be indexed: the app domain, the staging environment, the dynamic product interface. Fix the source of the junk and the indexation problem follows.

Google's own crawl budget guidance applies primarily to sites with a million or more unique pages that change weekly, or 10,000-plus pages that change daily. Most SaaS marketing sites do not hit these thresholds. Leaked app and staging URLs are how they get pushed over anyway, often without anyone on the marketing team realizing the leak exists.

This article is a triage guide, not a generic tips list. You will get a diagnostic framework to identify which crawl budget problem you actually have, the six URL patterns unique to SaaS product architectures, the specific fix that maps to each source, and the connection between crawl budget and AI search visibility that most teams are missing.

The marketing site is almost never the source of SaaS crawl waste. The app subdomain, the docs site, and the staging environment are. Block the source, and the marketing site indexes the way it should.

Does crawl budget actually matter for your SaaS?

Crawl budget matters for your SaaS if you have more than 10,000 pages changing daily, a growing "Discovered - currently not indexed" population in GSC, or evidence that app, staging, or documentation URLs are reaching Google's crawl queue. For a 200-page marketing site on a fast server, crawl budget is almost certainly not your problem, and time spent optimizing it is time taken from work that would move rankings.

Google's documentation is explicit that crawl budget management is intended for very large or frequently changing sites. As Google's crawl budget guide puts it, if your site does not have a large number of pages that change rapidly, or your pages tend to be crawled the same day they are published, you do not need to worry about it. Several popular SEO guides overstate the problem and imply that every site needs crawl budget optimization. That is not what Google says.

The SaaS exception is what makes the topic worth a guide. Even a small marketing site can develop a real crawl budget constraint when the product architecture leaks URLs at scale. A 3,000-page marketing site is fine on its own. The same company can quietly be exposing tens of thousands of app, session, and staging URLs to Googlebot, and those URLs compete for the same crawl capacity.

Three diagnostic signals indicate a crawl budget problem regardless of page count:

  • A rising "Discovered - currently not indexed" count in the GSC Pages report.
  • Erratic or declining crawl volume in the GSC Crawl Stats report.
  • Slow indexation of new content that you know is internally linked and in the sitemap.

If none of those signals is present, you can stop here. If one or more is, the rest of this guide is the triage path.

How Google defines crawl budget

Google describes crawl budget as determined by two elements: crawl capacity limit and crawl demand. Taken together, they define the set of URLs Googlebot can and wants to crawl. Crawl capacity limit is how fast Googlebot can crawl without overloading your server. Crawl demand is how much Google actually wants to crawl your content, based on its popularity, freshness, and perceived inventory. Most guides reduce this to "the number of pages Google crawls per day," and that framing leads directly to the wrong fixes.

The two-variable model matters because it tells you which lever to pull. Crawl budget is not a fixed daily page count. It fluctuates with your server's response health and the perceived importance of your content. A slow time to first byte (TTFB) can cut Googlebot's crawl capacity on any given day, independent of your page count, which is one reason server performance and crawl efficiency are linked.

What crawl capacity limit means in practice

Crawl capacity is governed by how your server responds. Fast, reliable responses let Googlebot crawl more. Slow responses, timeouts, and 5xx errors cause Googlebot to back off to avoid degrading your site for real users. This is why improving server response times, and the broader work of improving Core Web Vitals, compounds beyond user experience into crawl efficiency.

What crawl demand means in practice

Crawl demand reflects how much Google wants your content. Popular URLs are crawled more often to keep them fresh. URLs Google considers stale or low-value are crawled less. Perceived inventory matters too: if Google believes your site has a large set of URLs worth tracking, it allocates demand to discovering and refreshing them. Junk URLs inflate perceived inventory and pull demand toward pages that should not exist.

One distinction worth holding onto: crawl budget governs whether a page gets crawled, and a separate set of signals governs whether a crawled page gets indexed. A page can be crawled and still not indexed if Google judges it low quality or duplicative. That difference becomes the whole game in the diagnostic framework below.

How to check crawl budget in Google Search Console

To check crawl budget, open Google Search Console, go to Settings, then Crawl Stats under Crawling. The report shows total crawl requests, response codes, file types crawled, and Googlebot's crawl breakdown by purpose, with up to 90 days of history. The key number is not total requests. It is the ratio of successful crawls to errors and redirects, and the trend line in daily crawl volume.

Here is what to read, in order:

  • Total crawl requests over time. A steadily declining trend is a flag worth investigating, especially alongside the next signal.
  • By response. A high or rising share of 404s and redirects (301/302) means Googlebot is spending capacity on URLs that lead nowhere useful.
  • By file type and by Googlebot type. Disproportionate crawling of one file type or one bot category can point to a specific leak.
  • Cross-reference the Pages (Index Coverage) report. A large or growing "Discovered - currently not indexed" count next to declining crawl volume is the pattern that confirms a capacity constraint.

For SaaS companies, watch for property anomalies. If you have separate GSC properties for app.yourproduct.com and docs.yourproduct.com, check each one. A leak usually shows up as a surprising indexed URL count on a property the marketing team forgot existed.

One limitation to plan around: GSC Crawl Stats does not break crawl activity down by URL path. To see exactly which paths Googlebot is spending capacity on, you need server log analysis. GSC tells you that a problem exists. Server logs tell you where it lives.

The SaaS crawl budget diagnostic framework

Before applying any fix, determine which problem you have. Most SaaS crawl budget issues trace to one of three root causes: URL architecture generating junk at scale, technical debt from migrations and infrastructure changes, and server capacity consumed by the wrong bots. Each requires a different fix, and applying the wrong one is how remediation efforts burn months without moving the needle.

Step 1: Read the GSC crawl stats signal

Look at Crawl Stats alongside the Pages report and match your situation to one of three signal combinations:

  • Stable crawl volume, rising "Discovered - currently not indexed": Google is finding your URLs but choosing not to prioritize them. This points to URL architecture or page authority, not raw capacity.
  • Declining crawl volume, rising "Discovered - currently not indexed": a capacity constraint. Googlebot is being throttled by server load, URL bloat, or bot competition, and important pages are queued behind junk.
  • Both stable: crawl budget is not your current problem. Spend your time elsewhere.

Export the crawl stats for trend analysis so you can see the direction of travel rather than a single snapshot. The trend is the signal.
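
As a rough illustration of the three signal combinations above, here is a minimal Python sketch that labels each exported series as rising, declining, or stable and maps the pair to a diagnosis. The half-versus-half comparison, the 10% tolerance, and the function names are assumptions made for the example, not anything GSC defines.

```python
from statistics import mean


def trend(series, tolerance=0.10):
    """Label a series by comparing the mean of its second half to its first half.

    The 10% tolerance is an arbitrary assumption; tune it to your own data.
    """
    half = len(series) // 2
    if half == 0:
        raise ValueError("need at least two data points")
    first, second = mean(series[:half]), mean(series[half:])
    if first == 0:
        return "rising" if second > 0 else "stable"
    change = (second - first) / first
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "declining"
    return "stable"


def classify_signal(crawl_requests, discovered_not_indexed):
    """Map the two trends to the signal combinations described in Step 1."""
    crawl = trend(crawl_requests)
    backlog = trend(discovered_not_indexed)
    if backlog == "rising" and crawl == "declining":
        return "capacity constraint"
    if backlog == "rising" and crawl == "stable":
        return "architecture or authority"
    if backlog == "stable" and crawl == "stable":
        return "not a crawl budget problem"
    return "mixed signal: check server logs"
```

Feed it the daily crawl request counts and the "Discovered - currently not indexed" counts over the same period. Anything that falls outside the three named combinations is returned as a mixed signal rather than forced into a bucket.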

Step 2: Identify your URL waste pattern

Pull a crawl with Screaming Frog or a server log export and filter for URLs that are being crawled but should not be. On SaaS sites, these almost always come from one of six sources: app subdomain leakage, per-session URLs, infinite pagination, multi-tenant subdomains, auto-generated docs, or staging environments. Identifying the source determines the fix, so name it before you touch anything.

Server logs give the most complete picture, because GSC does not show every URL Googlebot requests. Filter by path prefix, check which paths have robots.txt coverage, and look for URL patterns that repeat across sessions or tenants. The next section describes each of the six sources in detail.
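
The path-prefix filtering described above can be sketched in a few lines of Python. This assumes your access logs use the common "combined" format; the sample lines, IP addresses, and paths are made up for illustration. Matching on the user-agent string alone can be spoofed, so treat the counts as a first pass rather than verified Googlebot traffic.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer, and user agent of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Made-up sample lines; the IPs come from a documentation-only range.
SAMPLE_LINES = [
    '192.0.2.1 - - [01/Jan/2025:00:00:01 +0000] "GET /dashboard/session/a1b2c3d4e5/reports HTTP/1.1" 302 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 - - [01/Jan/2025:00:00:02 +0000] "GET /dashboard/settings?tab=billing HTTP/1.1" 401 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 - - [01/Jan/2025:00:00:03 +0000] "GET /blog/crawl-budget HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.9 - - [01/Jan/2025:00:00:04 +0000] "GET /blog/crawl-budget HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"',
]


def googlebot_hits_by_prefix(lines, depth=1):
    """Count requests whose user agent mentions Googlebot, grouped by path prefix."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("ua"):
            continue
        path = match.group("path").split("?", 1)[0]  # ignore the query string
        segments = [s for s in path.split("/") if s]
        counts["/" + "/".join(segments[:depth])] += 1
    return counts


print(googlebot_hits_by_prefix(SAMPLE_LINES).most_common())
```

A prefix such as /dashboard dominating the counts is the kind of pattern worth checking against robots.txt coverage. Raising depth to 2 or 3 narrows the grouping once you know which top-level path to look inside.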

Step 3: Map the root cause to a fix

Once you have named the URL waste pattern, the fix is usually one of four mechanisms. Match the fix to the source, not to the symptom.

Root cause | Fix mechanism | Priority
App subdomain, staging, auto-generated docs reachable by Googlebot | robots.txt Disallow at the source (or CDN-level block) | High
Near-duplicate programmatic or parameter pages | Canonical tags consolidating to the clean URL | Medium
Junk URLs reachable through accidental internal links | Internal link pruning to remove the crawl pathway | Medium
Non-Googlebot crawlers consuming server capacity | Rate limiting by bot category at the CDN | Situational

One trap to avoid: applying noindex to pages that should never have been crawled wastes Googlebot's time twice, once on the crawl request and once on processing the directive. For URLs that should not be crawled at all, block at the source with robots.txt. The full noindex versus robots.txt distinction has its own section below.

The six SaaS-specific crawl budget killers

Every generic crawl budget guide covers duplicate content and redirect chains. What they miss are the six URL patterns unique to SaaS product architectures. None of these appear in e-commerce or publishing SEO guides, because they come from product engineering decisions rather than content management. They are: app subdomain leakage, per-session and auth-gated URLs, infinite dashboard pagination, multi-tenant subdomain fragmentation, auto-generated documentation trees, and accessible staging environments.

App subdomain leakage

If your product lives at app.yourproduct.com and that subdomain has no robots.txt blocking Googlebot, every URL inside your authenticated product interface is potentially in Google's crawl queue. Session-specific dashboard views, settings pages, and dynamically generated reports are all crawlable. Google will not index most of them, because they return 401 or redirect to a login, but it will spend crawl budget trying.

Crawl budget is defined per hostname, so app.yourproduct.com has its own separate budget from www.yourproduct.com. To check exposure, look for the app subdomain as a separate property in GSC and inspect its indexed and discovered URL counts. The fix is a robots.txt with Disallow: / on the app subdomain, or a rule at the CDN that blocks Googlebot for that host. Verify with the robots.txt report in Search Console (which replaced the retired robots.txt Tester) and a few URL Inspection checks afterward.
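
A quick local check of that rule, before it ships, might look like the Python sketch below. It uses the standard library's urllib.robotparser, which does simple prefix matching and does not implement Google's wildcard extensions, so it suits a blanket Disallow: / but not pattern rules. The app URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# The blanket block described above, as it would appear in the app subdomain's robots.txt.
APP_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt body disallows `url` for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Placeholder URL on the app subdomain.
print(is_blocked(APP_ROBOTS_TXT, "https://app.yourproduct.com/dashboard/reports"))
```

This only confirms what the file says. It does not confirm that the file is actually being served on the app host, which is what the Search Console checks are for.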

Per-session and auth-gated URLs

Some SaaS products put session tokens, user IDs, or auth parameters directly in the URL path rather than in query strings. A URL like /dashboard/session/a1b2c3d4e5/reports is technically unique and crawlable, even though it returns nothing useful once the session expires. At scale, these fill the crawl queue silently, because each appearance in an accessible link is a new entry for Googlebot to chase.

These are harder to handle than query-string parameters, because path-based tokens cannot be managed with parameter rules. Identify the pattern in server logs, then fix it with path-based robots.txt rules, canonical tags pointing to clean URLs where pages must stay crawlable, and ideally an engineering change to remove session tokens from paths entirely.
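
To spot path-based tokens in a log export, one option is to collapse token-like segments into a placeholder and count how many distinct URLs share each resulting template. The sketch below is a heuristic under stated assumptions: it treats any purely alphanumeric segment of eight or more characters containing at least one digit as a token, which will miss hyphenated UUIDs and may flag some legitimate IDs.

```python
import re

# Heuristic (an assumption): alphanumeric, 8+ characters, at least one digit.
TOKEN = re.compile(r"^(?=.*\d)[A-Za-z0-9]{8,}$")


def path_template(path):
    """Replace token-like path segments with a {token} placeholder."""
    return "/".join("{token}" if TOKEN.match(s) else s for s in path.split("/"))


def repeated_templates(paths, min_variants=2):
    """Return templates that several distinct crawled paths collapse into."""
    variants = {}
    for path in set(paths):
        template = path_template(path)
        if template != path:
            variants.setdefault(template, set()).add(path)
    return {t: len(v) for t, v in variants.items() if len(v) >= min_variants}


crawled = [
    "/dashboard/session/a1b2c3d4e5/reports",
    "/dashboard/session/f6g7h8i9j0/reports",
    "/pricing",
]
print(repeated_templates(crawled))
```

A template with many variants, such as /dashboard/session/{token}/reports, points to the fixed prefix (/dashboard/session/ here) that a path-based robots.txt rule would need to cover.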

Infinite pagination in dashboards

A reporting dashboard that accepts page, sort, and filter parameters can generate thousands of unique URL combinations from a handful of real reports. A dashboard with 10 sortable columns and 5 page options produces 50 URL variants from a single report view. Multiply by report types and you have hundreds of crawlable URLs from one product feature, all of which Googlebot may follow if any link in the chain is public.

This is the faceted navigation combinatorial explosion problem, and it applies to SaaS filter and search pages, not just e-commerce. The fix is layered: parameter handling for query-string pagination, robots.txt path rules for path-based pagination, and canonical tags on paginated views pointing to the first page so ranking signals consolidate.
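
The arithmetic behind that explosion is easy to reproduce. The Python sketch below enumerates every combination of parameter values for one report view; the base path, parameter names, and values are hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def url_variants(base, params):
    """Build every URL produced by combining the given parameter values."""
    keys = list(params)
    return [
        f"{base}?{urlencode(dict(zip(keys, combo)))}"
        for combo in product(*(params[k] for k in keys))
    ]


sort_columns = [f"col{i}" for i in range(1, 11)]  # 10 sortable columns
pages = range(1, 6)                               # 5 page options

variants = url_variants("/reports/usage", {"sort": sort_columns, "page": pages})
print(len(variants))  # 50
```

Each additional parameter multiplies the total: adding a filter with four values to the same view takes the count from 50 to 200.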

Multi-tenant subdomain fragmentation

If your product gives each customer a subdomain (customer1.yourproduct.com, customer2.yourproduct.com), each subdomain carries its own crawl budget from Google's perspective. A product with 500 active tenant subdomains, all publicly reachable and none blocking Googlebot, is effectively hosting 500 separate crawl budget accounts, most of which consume crawl capacity without contributing to the marketing site's authority.

The fixes depend on intent. For tenants that are not meant to be public, use password protection or a robots.txt block at the tenant level. For products where public tenant pages have value, consider whether subpaths (/customer/) instead of subdomains would consolidate authority and crawl signals onto one hostname. Fragmented subdomains dilute both crawl budget and brand entity consolidation.

Auto-generated documentation URLs

API documentation generators create large URL trees that are often crawlable and often thin. An OpenAPI spec with 200 endpoints can generate 200 or more documentation URLs, many containing nearly identical boilerplate with only the endpoint path and parameter name changing. A documented REST API can produce anywhere from a few hundred to several thousand such URLs, and if they sit under docs.yourproduct.com with no robots.txt, Googlebot will crawl all of them.

The distinction to draw is between valuable documentation (conceptual guides, tutorials, quickstarts) and thin auto-generated reference pages. Check docs.yourproduct.com/robots.txt first. Then block the auto-generated reference paths, canonicalize near-duplicate reference pages to a primary page, and reserve noindex for thin pages that must stay crawlable for other tools. Thin docs pages consume crawl and index capacity without earning anything back.

Staging and development environment leaks

Staging environments without password protection and without a robots.txt Disallow: / are crawlable by Googlebot. If staging.yourproduct.com, dev.yourproduct.com, or a numbered deploy-preview URL is linked anywhere public, Googlebot can find it. Near-duplicate staging content dilutes crawl budget and can create duplicate content signals against your production site.

Check exposure by visiting the staging robots.txt and seeing whether it returns Disallow: /. Googlebot discovers staging URLs through developer previews shared in public docs, links posted in indexed forums, and accidental inclusion in sitemaps. The fixes are basic auth on all non-production environments, a full robots.txt block, CDN IP restriction to office or VPN ranges, and removing staging links from anywhere public. Confirm with URL Inspection that the staging host is not indexed.

URL parameters, faceted navigation, and session IDs for SaaS

URL parameters that generate unique crawlable URLs without unique content are one of the most common crawl budget drains. For SaaS sites, the parameters that matter are filter states on search and directory pages, sort parameters in feature lists, pagination parameters in dashboards, tracking parameters appended by marketing tools (such as utm_source), and session parameters (such as sessionid).

There are three parameter categories to handle, and they call for different treatments:

  • Tracking and marketing parameters (utm_*, click IDs): they create endless URL variants of the same page. Consolidate with a self-referencing canonical tag on the clean URL.
  • Session parameters: they should not be crawlable at all. Block patterns with robots.txt and remove them from any public links.
  • UI state parameters (sort, filter, page): handle by canonical to the representative URL, with robots.txt wildcard rules (for example Disallow: /*?page=) where the variants have no standalone value.

Canonical tags are the primary tool: set the canonical on each parameter URL to point at the clean version so ranking signals consolidate. Use robots.txt for parameters that should never be crawled. Test that handling works using URL Inspection, which shows Google's chosen canonical for any URL.
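
One way to generate the canonical target consistently is to strip the noise parameters in code and keep the ones that define the page. This is a minimal Python sketch; the strip list and the choice to keep everything else (a category filter, for example) are assumptions to adapt to your own URL scheme.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed noise parameters; adjust for your own site.
STRIP_PREFIXES = ("utm_",)
STRIP_NAMES = {"gclid", "fbclid", "sessionid", "sort", "page"}


def canonical_url(url, keep=()):
    """Drop tracking, session, and UI-state parameters; keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep
        or not (key.lower().startswith(STRIP_PREFIXES) or key.lower() in STRIP_NAMES)
    ]
    # The fragment is dropped as well, since it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url(
    "https://www.yourproduct.com/integrations?category=crm&utm_source=newsletter&sort=name"
))
```

The keep argument overrides the strip list for pages where a normally noisy parameter carries real content, for instance keep=("page",) on a paginated archive you want indexed page by page.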

One SaaS-specific nuance worth stating plainly: do not blanket-block filter and search parameters that generate genuinely unique, high-value pages. An integration directory filtered by category, for example, is often a valuable landing page in its own right. Blocking it to save crawl budget would cost you rankings. Distinguish parameters that create value from parameters that create noise before you write the rule.

From "Discovered but not indexed" to fixed

"Discovered - currently not indexed" means Google found a URL but has not crawled it yet. It is queued and deprioritized. That usually means one of three things: Google does not consider the page important enough to crawl (low internal link equity), crawl budget for your site is being consumed by higher-priority or junk URLs, or the page was discovered too recently to have been reached.

The important reframe is this: a page stuck in "Discovered - currently not indexed" is usually a crawl prioritization problem, not a content quality problem. Rewriting the page will not move it. Reducing URL waste elsewhere, or increasing internal link equity to the page, will. (Contrast this with "Crawled - currently not indexed," which does point to quality or duplication, because Google crawled the page and still declined to index it.)

The fix pathway has three moves, usually run together:

  • Reduce URL waste using the six-killer diagnosis above, to free crawl budget for the pages you care about.
  • Add strong internal links from high-authority pages to raise crawl demand on the stuck URL.
  • Confirm the URL is in the XML sitemap with an accurate lastmod timestamp, which is a crawl demand signal rather than a guarantee.

The sitemap clarification matters. A sitemap signals which pages you consider important, but Googlebot still makes its own crawl decisions. An updated lastmod nudges crawl demand; it does not force a crawl. Internal links from pages Google already values consistently do more work, which is why orphaned content struggles even when it sits in the sitemap. For the mechanics of finding and re-linking those pages, see the guide on how to find orphan pages.
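
For teams that generate the sitemap themselves, the lastmod value should come from the real modification date of each page rather than the build time. A minimal Python sketch, assuming you can supply (URL, date) pairs from your CMS; the URLs and dates below are placeholders:

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build a sitemap string from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod uses the W3C date format, e.g. 2025-01-15
        ET.SubElement(url, "lastmod").text = last_modified.isoformat()
    return ET.tostring(urlset, encoding="unicode")


# Placeholder pages and dates for illustration.
sitemap_xml = build_sitemap([
    ("https://www.yourproduct.com/pricing", date(2025, 1, 15)),
    ("https://www.yourproduct.com/blog/crawl-budget", date(2025, 2, 3)),
])
print(sitemap_xml)
```

Only canonical, indexable URLs belong in the input list, so filter out noindex and parameter variants before calling it.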

noindex vs. robots.txt disallow: which one and when

Use robots.txt Disallow to stop Googlebot from crawling URLs it should never see. Use noindex to stop Googlebot from indexing pages it is allowed to crawl. The critical distinction: noindex only works on a page Googlebot can fetch, so Googlebot still crawls the page, sees the noindex tag, and then drops it from the index. The page stays out of search results, but the crawl request is spent either way, so noindex saves no crawl budget.

This is not interpretation, it is Google's stated guidance. The crawl budget documentation says directly: do not use noindex to manage crawl budget, because Google will still request the page, then drop it when it sees the noindex meta tag or header, wasting crawling time. For crawl efficiency, the source-level block is what you want.

The mapping for SaaS URL types:

URL type | Recommended approach | Why
App subdomain, staging, dev environments | robots.txt Disallow (plus basic auth) | Should never be crawled; block at the source
Session and auth-gated path URLs | robots.txt path rules | No value to any crawler; stop the request
Internal search results you want reachable but not indexed | noindex (kept crawlable) | Needs crawling for link equity or UX, but not indexing
Near-duplicate pages with one preferred version | Canonical tag | Consolidate ranking signals to one URL
Paginated series where only page 1 should rank | Canonical to page 1 (crawlable) | Let Googlebot follow the chain, index the head

The combination matters. For a parameterized or paginated series you want Googlebot to traverse but not index beyond the first page, canonical (not robots.txt) is correct, because a disallowed URL cannot pass or consolidate signals. Reserve robots.txt for URLs that have no business being crawled at all.

How internal linking and sitemaps protect crawl budget

Blocking junk URLs frees up crawl budget; internal linking and sitemaps direct that freed capacity toward the pages that matter. A page with no internal links pointing to it depends entirely on its sitemap entry for crawl attention. A page with strong internal links from high-traffic pages gets crawled and indexed whether or not it is in the sitemap. Both halves of the work matter: stop the waste, then steer the capacity.

Crawl depth is the lever most teams underuse. Pages buried more than three clicks from the homepage are significantly less likely to be crawled and indexed, regardless of content quality. Flattening your link architecture so important pages sit within three clicks is a direct crawl budget intervention, and it is the technical SEO outcome that internal linking for SaaS topic clusters is designed to control.

A few rules that hold up across SaaS sites:

  • Keep every page that matters within three clicks of the homepage. Audit depth with Screaming Frog and pull anything important that sits at depth 4 or deeper.
  • Treat the sitemap as a signal, not a guarantee. Keep lastmod accurate and exclude noindex and non-canonical URLs so the sitemap stays a clean statement of what you want crawled.
  • Hunt orphan pages. Pages with zero internal links are discoverable only via sitemap and tend to languish in "Discovered - currently not indexed."
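
The depth and orphan checks in the rules above amount to a breadth-first search over the internal link graph. Screaming Frog reports this directly; the Python sketch below shows the underlying idea, assuming you have exported the links as a mapping from each page to the pages it links to. The example graph is made up.

```python
from collections import deque


def click_depths(links, home="/"):
    """Breadth-first search from the homepage; returns {page: clicks from home}."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def audit(links, known_pages, home="/", max_depth=3):
    """Return (pages deeper than max_depth, pages unreachable through internal links)."""
    depths = click_depths(links, home)
    too_deep = sorted(p for p, d in depths.items() if d > max_depth)
    orphans = sorted(set(known_pages) - set(depths))
    return too_deep, orphans


# Made-up link graph: page -> pages it links to.
links = {
    "/": ["/blog", "/pricing"],
    "/blog": ["/blog/a"],
    "/blog/a": ["/blog/a/b"],
    "/blog/a/b": ["/features/x"],
}
# known_pages would typically come from the XML sitemap.
known_pages = ["/", "/blog", "/pricing", "/features/x", "/features/y"]
print(audit(links, known_pages))
```

Here /features/x sits at depth 4 and /features/y is never linked at all, so both would go on the re-linking list.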

One pattern is specific to SaaS: product feature pages are often built by the product team and added only to the top nav, so they receive almost no contextual internal links from the marketing site. Link them from relevant blog content and from the hub architecture so they earn crawl demand commensurate with their commercial value.

The new crawl budget threat: AI crawlers

AI crawlers from OpenAI, Anthropic, Perplexity, and others now consume server capacity that used to be Googlebot's alone. Automated traffic has crossed a threshold: Cloudflare reported in June 2026 that bots account for roughly 57.5% of HTTP requests to web content worldwide, against 42.5% from people, and AI crawlers are one of the fastest-growing slices of that automated traffic (Cloudflare, 2026).

The mechanism that connects this to crawl budget runs through your server. Googlebot's crawl capacity limit responds to how fast and reliably your server answers. If AI crawlers are consuming a significant share of your server's resources, Googlebot competes for what remains, and a server under load will see Googlebot throttle its crawl rate to avoid making things worse. The same slow TTFB that hurts users hurts every crawler at once.

The good news is that the major AI crawlers are controllable. GPTBot, ClaudeBot, PerplexityBot, and OAI-SearchBot all document support for robots.txt directives, so you can disallow the ones you do not want by user agent. To see what is actually hitting your site, filter server access logs by those user-agent strings and measure their share of requests and the paths they target.

The strategic call is not all-or-nothing. Blocking every AI crawler trades away AEO citation opportunity for server headroom. The more nuanced approach is selective: rate-limit AI crawlers at the CDN (a Cloudflare rule, for instance) rather than blocking them outright, which preserves crawl capacity for Googlebot while keeping your content reachable for the AI engines you do want citing you. Block by path or by specific bot where a crawler is misbehaving, and keep the well-behaved search-oriented bots.
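
Measuring each crawler's share of requests, as described above, can start from nothing more than the user-agent column of your access logs. A minimal Python sketch follows; the sample user-agent strings are simplified placeholders, and matching on the string alone does not prove a request really came from the named bot.

```python
from collections import Counter

# Crawler tokens named in this guide; extend the tuple for others you see in logs.
BOT_TOKENS = ("Googlebot", "GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot")


def classify_user_agent(user_agent):
    """Return the first known bot token found in the user agent, else 'other'."""
    lowered = user_agent.lower()
    for token in BOT_TOKENS:
        if token.lower() in lowered:
            return token
    return "other"


def request_share(user_agents):
    """Return each category's share of total requests as a fraction of 1."""
    counts = Counter(classify_user_agent(ua) for ua in user_agents)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()} if total else {}


# Simplified placeholder user agents for illustration.
sample = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ExampleBrowser/1.0",
]
print(request_share(sample))
```

If one AI crawler accounts for a large share of requests, that is the bot to rate-limit first.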

Crawl budget and AI search visibility: the connection no one is talking about

If a page is not indexed by Google, it generally cannot be cited by AI search platforms. ChatGPT, Perplexity, and Google AI Mode draw most of their citations from the indexed web. A crawl budget problem that leaves your best content stuck in "Discovered - currently not indexed" is therefore also an AEO problem, and fixing crawl budget is a prerequisite for AI search visibility rather than a separate workstream.

The compounding effect cuts both ways. A SaaS blog post that answers a real buyer question cannot appear in AI answers if it never gets indexed. Reverse it and the math gets attractive: one technical fix produces two returns. First Google indexes the page, then AI platforms become able to cite it. For a PropTech or FinTech company publishing authoritative content on compliance, product capability, or industry trends, junk URL waste in the crawl queue is the quiet reason that content never shows up where buyers are now asking.

The actionable version: use GSC to find which of your best content candidates are sitting in "Discovered - currently not indexed," and treat those as AEO-critical pages to rescue first. Indexation is the gate. Everything in an AEO strategy assumes the page is on the other side of it, and the work of measuring AI visibility only makes sense once your priority pages are actually crawlable and indexed.

What we check in a SaaS crawl budget audit

When PropSaaS Growth audits a SaaS site's crawl budget, we run the same five-layer sequence regardless of company size or tech stack. The sequence surfaces the same patterns across very different SaaS architectures, because the engineering choices that create crawl waste are consistent. Here is what each layer covers.

  • Layer 1, GSC signal review. Crawl Stats and Pages report over a 60-to-90-day trend: "Discovered - currently not indexed" volume and direction, error and redirect ratios, and crawl-volume trajectory.
  • Layer 2, server log analysis. Googlebot crawl paths versus AI-crawler paths by user agent, which paths consume the most capacity, and TTFB by path so slow templates surface.
  • Layer 3, app and staging check. robots.txt presence and contents on every subdomain, password protection on non-production environments, which hosts exist as GSC properties, and their indexed URL counts.
  • Layer 4, XML sitemap audit. Submitted URL count versus indexed count, any noindex or non-canonical URLs sitting in the sitemap, and lastmod accuracy.
  • Layer 5, internal link depth map. A Screaming Frog crawl to find pages at depth 4 or deeper and to surface orphan pages with zero internal links.

The output is a ranked list of crawl waste by URL volume and category, addressed in order of impact rather than order of discovery. This crawl budget assessment is part of the technical foundation in a PropSaaS Growth technical SEO engagement, alongside structured data, rendering, and internal linking.

Crawl budget tools for SaaS teams

Five tools cover the full crawl budget diagnostic workflow, and you rarely need all five at once. Start with Google Search Console, which is free and always on, and add tools as the diagnosis narrows. GSC is the only one that reports what Googlebot actually did; every other crawl tool simulates crawl behavior.

Tool | Free / paid | SaaS-specific use
Google Search Console | Free | Crawl Stats, Pages report, URL Inspection
Screaming Frog SEO Spider | Free to 500 URLs; paid | URL crawl, link-depth mapping, redirect-chain finder
Server access logs | Infrastructure access | Googlebot path analysis, AI-crawler identification, TTFB by path
Cloudflare / CDN analytics | Varies by plan | Bot traffic breakdown, rate-limit rules for AI crawlers
Botify | Enterprise, paid | Large-scale log analysis; overkill for most Series A to B SaaS

A note on site-audit crawlers such as Semrush Site Audit: they are good for surface-level issue discovery, but they do not provide actual Googlebot crawl data. Use them to supplement GSC, not to replace it. The question that only GSC and server logs can answer is what Googlebot is really spending its time on.

How to prioritize which pages to protect

You cannot rescue every page at once, so when crawl budget is constrained, prioritize by business impact. The pages that most need crawl protection are the ones that generate pipeline (high-intent landing pages, product features, integrations, comparison pages), the ones actively losing rankings as crawl frequency declines, and the ones that are your strongest candidates for AI citation.

A simple three-tier triage:

  • Tier 1, protect first: commercial-intent pages (pricing, features, integrations, alternatives) and pages ranking in positions 5 to 20 with upward potential. Pages in that range improve most from increased crawl frequency and already carry a traffic signal.
  • Tier 2, protect second: content published in the last 90 days that has not yet indexed, and cornerstone pages with high internal link equity worth defending.
  • Tier 3, lower priority: informational posts with no current rankings and pages with zero impressions in GSC over 90 days.

Layer the AEO consideration on top: for any page you want cited in AI answers, confirm it is indexed before investing in further content work, because an unindexed page earns nothing from being improved. Use the GSC Performance report to find Tier 1 candidates by filtering for pages around positions 5 to 20 with recent impression growth, then make sure crawl budget is working for them rather than against them.
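
The three-tier triage can be written down as a small rule set so it is applied the same way across a page export. This Python sketch is one possible encoding; the Page fields are hypothetical and would be filled from GSC and your CMS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    url: str
    commercial_intent: bool = False       # pricing, features, integrations, alternatives
    avg_position: Optional[float] = None  # from the GSC Performance report
    days_since_publish: Optional[int] = None
    indexed: bool = True
    cornerstone: bool = False             # high internal link equity


def tier(page):
    """Return 1, 2, or 3 following the triage tiers described above."""
    in_range = page.avg_position is not None and 5 <= page.avg_position <= 20
    if page.commercial_intent or in_range:
        return 1
    recent_unindexed = (
        page.days_since_publish is not None
        and page.days_since_publish <= 90
        and not page.indexed
    )
    if recent_unindexed or page.cornerstone:
        return 2
    return 3


pages = [
    Page("/pricing", commercial_intent=True),
    Page("/blog/new-post", days_since_publish=30, indexed=False),
    Page("/blog/old-post"),
]
for page in sorted(pages, key=tier):
    print(tier(page), page.url)
```

Sorting by tier gives the order in which to check indexation and add internal links.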

Crawl budget, for most SaaS companies, is not a scale problem. It is a hygiene problem: junk URLs from systems built for other purposes, leaking into a queue meant for your marketing content. Diagnose the signal, name the source, block it where it lives, and steer the freed capacity to the pages that earn pipeline and AI citations. That is the whole job.

Frequently asked questions

Does crawl budget matter for a SaaS company with under 10,000 pages?

Usually not, unless app URLs, staging environments, or multi-tenant subdomains are accessible to Googlebot. Google's guidance is aimed at very large or frequently changing sites. A 5,000-page marketing site on a healthy server is generally fine. A 5,000-page marketing site with 50,000 leaked app URLs has a serious problem regardless of the marketing site's size, because the junk competes for the same crawl capacity.

How does a SaaS app's dynamic URL structure affect crawl budget?

Dynamic URLs from SaaS product interfaces generate unique, crawlable addresses for dashboard states, user sessions, and filter combinations. A dashboard that accepts sort, filter, and page parameters can produce thousands of URL variants from a handful of real views. If any of these are linked from a public page, Googlebot can crawl the chain, spending capacity on pages that return nothing useful once a session expires.

What is the difference between "Discovered" and "Crawled" but not indexed?

"Discovered - currently not indexed" means Google knows the URL exists but has not crawled it yet, usually a crawl prioritization signal. "Crawled - currently not indexed" means Google did crawl the page but chose not to index it, usually a content quality or duplication signal. The first is often fixed by reducing URL waste and adding internal links. The second is fixed by improving or consolidating the page itself.

Should I use noindex or robots.txt to fix crawl budget?

Use robots.txt Disallow to stop Googlebot from crawling URLs it should never see, such as app subdomains and staging environments. Use noindex only for pages that must stay crawlable for another reason. Google states that using noindex to manage crawl budget wastes crawling time, because Googlebot still requests the page before it sees the noindex directive and drops it. Block at the source with robots.txt instead.

Do AI crawlers affect Google crawl budget?

Indirectly, through server load. Googlebot's crawl capacity limit responds to how fast and reliably your server answers. If AI crawlers such as GPTBot, ClaudeBot, and PerplexityBot consume a large share of server resources, Googlebot competes for what remains and may slow its crawl rate. The usual fix is rate limiting at the CDN rather than full blocking, which preserves both crawl capacity and AI citation eligibility.

How does crawl budget affect AI search visibility?

AI search platforms including ChatGPT, Google AI Mode, and Perplexity draw most citations from the indexed web. A page that Googlebot cannot crawl and index is effectively invisible to AI search, regardless of how good the content is. A crawl budget problem that leaves your best content stuck in "Discovered - currently not indexed" is therefore also an AEO problem, and fixing it produces both SEO and AI-citation returns.

Gemma Smith, Founder, PropSaaS Growth

Gemma builds organic and AI visibility programs for B2B SaaS companies in PropTech, FinTech, and vertical software categories. 10+ years in PropTech. AirOps Champion.