Taming the Spiders: Optimize Your Crawl Budget to Boost Indexation and Rankings
Crawl budget is one of those SEO concepts that doesn't seem to get enough attention. Many of us have heard about it, but we mostly tend to accept crawl budget as it is, presuming we've been assigned a certain crawl quota that we have little to no control over.
Or do we? Most webmasters don't need to worry much about crawl budget. But if you run a large-scale website, crawl budget is something you can, and should, optimize for SEO success.
Of course, as things go with SEO, the relationship between crawl budget and rankings isn't straightforward. In January 2017, Google published a post on the Webmaster Central Blog making it clear that crawling itself is not a ranking factor. Still, crawl budget matters for SEO in its own way.
In this guide, I'm going to walk you through the basic crawling-related concepts, the mechanics behind how search engines assign crawl budgets to websites, and tips to help you make the best use of your crawl budget to maximize rankings and organic traffic.
Web spiders: the good and the bad
Web spiders, crawlers, or bots, are computer programs that continuously "visit" and crawl web pages to collect certain information from and about them.
Depending on the purpose of the crawling, one may distinguish the following types of spiders:
- Search engine spiders,
- Web services' spiders,
- Hacker spiders.
Search engine spiders are managed by the search engines like Google, Yahoo, or Bing. Such spiders download whatever webpages they can find, and feed them to the search engine's index.
Many web services, such as SEO tools, shopping, travel, and coupon websites, have their own web indexes and spiders. For example, WebMeUp has a spider named BLEXBot. BLEXBot crawls up to 15 billion pages daily to collect backlink data and feed that data into its link index (the one used in SEO SpyGlass).
Hackers breed spiders too. They use the spiders to test websites against various vulnerabilities. Once they find a loophole, they may try to get access to your website or server.
You might hear people talk about good and bad spiders. I distinguish them this way: any spider that collects information for illegitimate purposes is bad. All the rest are good.
Most spiders identify themselves with the help of the user agent string and provide the URL where you can learn more about the spider:
- Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) or
- Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/).
In this article, I'll focus on search engine spiders and how they crawl websites.
Understanding crawl budget
Crawl budget is the number of times a search engine spider hits your website during a given period of time. For example, if Googlebot typically hits my site about 1,000 times a month, I can say that 1K is my monthly crawl budget for Google. Keep in mind that there is no universal limit on the number and frequency of these crawls; we'll get to the factors that form your crawl budget in a moment.
Why does crawl budget matter?
Quite logically, you should be concerned with crawl budget because you want Google to discover as many of your site's important pages as possible. You also want it to find new content on your site quickly. The bigger your crawl budget (and the smarter your management of it), the faster this will happen.
Determining your crawl budget
Let's say you need to determine your Google crawl budget. Log in to your Search Console account and go to Crawl -> Crawl Stats. Here, you'll see the average number of your site's pages crawled per day.
From the report above, I can see that on average, Google crawls 32 pages of my site per day. From that, I can figure out that my monthly crawl budget is 32*30=960.
Of course, that number is prone to change and fluctuation. But it'll give you a solid idea of how many pages of your site you can expect to be crawled in a given time period.
If you need a more detailed breakdown of your crawl stats by individual pages, you'll have to analyze the spiders' footprints in your server logs. The location of the log files depends on the server configuration; Apache, for instance, typically keeps its access logs under /var/log/apache2/ or /var/log/httpd/, depending on the distribution.
If you're not sure how to get access to the server logs, seek help from your system administrator or hosting provider.
Raw log files are hard to read and analyze. To make sense of them, you'll either need Jedi-level regular expression skills or specialized tools. I prefer to use WebLogExpert (they have a 30-day trial version).
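If you'd rather script it yourself, here's a minimal sketch of extracting Googlebot hits from an Apache access log in the combined log format. The sample log lines are made up for illustration; in practice you'd read your real log file instead.

```python
# Count Googlebot requests per URL in an Apache combined-format access log.
# The sample below is hypothetical data standing in for a real log file.
import re
from collections import Counter

SAMPLE_LOG = """\
66.249.66.1 - - [10/Oct/2017:13:55:36 +0000] "GET /blog/post-1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.1 - - [10/Oct/2017:13:57:02 +0000] "GET /blog/post-1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.7 - - [10/Oct/2017:14:01:11 +0000] "GET /blog/post-2 HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/61.0"
"""

def googlebot_hits(log_text):
    """Return a Counter of {URL: number of Googlebot hits}."""
    hits = Counter()
    for line in log_text.splitlines():
        if "Googlebot" not in line:
            continue  # skip regular visitors and other bots
        match = re.search(r'"(?:GET|POST|HEAD) (\S+)', line)
        if match:
            hits[match.group(1)] += 1
    return hits

print(googlebot_hits(SAMPLE_LOG))  # Counter({'/blog/post-1': 2})
```

For real logs, replace SAMPLE_LOG with the contents of your access log file; note that a serious setup should also verify the bot's IP, since the user agent string can be spoofed.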
How is crawl budget assigned?
As with most things in SEO, we don't know exactly how search engines form a site's crawl budget. According to Google, the search engine takes two factors into account when determining crawl budget:
- Popularity — more popular pages get crawled more often, and
- Staleness — Google doesn't let the information about pages get stale. For webmasters, this means that if a page's content is updated often, Google attempts to crawl it more frequently.
It looks like Google uses the term popularity in place of the now-obsolete PageRank.
Back in 2010, Google's Matt Cutts said the following on the subject:
Though PageRank is no longer publicly updated, it is still safe to assume that a site's crawl budget is largely proportional to the number of backlinks and the site's importance in Google's eyes — it's only logical that Google is looking to ensure the most important pages remain the freshest in its index.
What about internal links? Can you increase the crawl rate of a particular page by pointing more internal links to it?
In order to answer these questions, I decided to check the correlation between both internal and external links and crawl stats. I collected data for 11 websites and performed a simple analysis. Briefly, here's what I did.
With Website Auditor, I created projects for the 11 sites I was going to analyze. I calculated the number of internal links pointing to every page of each of these sites. Next, I ran SEO Spyglass and created projects for the same 11 sites. In every project, I checked Statistics and copied the Anchor URLs with the number of external links pointing to every page. Then, I analyzed the crawl stats in the server logs to see how often Googlebot hits each page. Finally, I put all that data into a spreadsheet and calculated the correlation between internal links and crawl budget, and external links and crawl budget.
I found something pretty interesting. Here's an example spreadsheet for one of the sites I analyzed:
My data set showed a strong correlation (0.978) between the number of spider visits and the number of external links. At the same time, the correlation between spider hits and internal links proved to be very weak (0.154). This suggests that backlinks are a lot more important for website crawling than internal linking.
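If you'd like to run the same check on your own data, the math is a plain Pearson correlation. Here's a small sketch with made-up per-page numbers (the link counts and spider hits below are hypothetical, just to show the calculation):

```python
# Pearson correlation between per-page link counts and spider hits.
# All numbers below are hypothetical illustration data.
def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

external_links = [120, 15, 300, 8, 45]   # hypothetical backlink counts per page
internal_links = [10, 25, 5, 30, 12]     # hypothetical internal link counts
spider_hits    = [95, 12, 240, 7, 38]    # hypothetical Googlebot hits per page

print("external vs hits:", round(pearson(external_links, spider_hits), 3))
print("internal vs hits:", round(pearson(internal_links, spider_hits), 3))
```

With real data, you'd pull the link counts from WebSite Auditor and SEO SpyGlass exports and the hit counts from your server logs, then feed the three columns into the same function.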
Does it mean that the only way to boost your crawl budget is to build links and publish fresh content? If we're talking about the budget for your entire site, I'd say yes: grow your link profile and update the website often, and your site's crawl budget will grow proportionally. But when we take individual pages, that's where it gets interesting. As you'll find out in the how-tos below, you might be wasting a lot of your crawl budget without even realizing it. By managing your budget in a smart way, you can often double the crawl count for individual pages — but it'll still be proportional to each page's number of backlinks.
How to: make the most of your crawl budget
Now that we've figured out that crawling is important for indexation, isn't it time to focus on the best ways to manage your crawl budget for the ultimate SEO joy?
There are quite a few things you should (or should not) do to let search spiders consume more pages of your website, and do it more often. Here is an action list for maximizing the power of your crawl budget:
1. Make sure important pages are crawlable, and content that won't provide value if found in search is blocked.
Website Auditor is great for creating and managing robots.txt files.
Here's a quick how-to:
- Run the tool (if you still don't have Website Auditor, you can download it for free here) and create or open a project.
- Navigate to the Pages tab, and click on the Robots.txt icon. You'll see the current contents of your robots.txt file.
- To add a new rule to your robots.txt, click Add rule. The software will let you choose an instruction (Disallow or Allow), a spider (you can either enter its name manually or select from a list of the most widespread search bots), and a URL or directory that you need to block.
- Similarly, you can delete and edit the existing rules, too.
- When you're done editing, click Next and either save the file to your hard drive or upload it to your site via FTP right away.
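The result of the steps above is an ordinary robots.txt file. Here's a hypothetical example of what such a file might look like (the blocked paths and the domain are made up for illustration):

```
# Block low-value areas for all crawlers
User-agent: *
Disallow: /search/
Disallow: /cart/

# An extra rule targeting one specific bot
User-agent: Googlebot
Disallow: /tmp/

# Point crawlers at the sitemap
Sitemap: https://www.example.com/sitemap.xml
```

Rules are grouped by user agent, and each crawler follows the most specific group that matches it.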
Back in the Pages module, you'll also get lots of crawling-related stats, such as cache date for Google, Bing, and Yahoo, robots.txt instructions, and HTTP status code.
Keep in mind that the search engine spiders do not always respect the instructions contained in robots.txt. Have you ever seen a snippet like this in Google?
Though this page is blocked in robots.txt, Google does know about it. It doesn't cache it or create a standard snippet for it. Still, it occasionally hits it. Here's what Google says on the matter:
Also, if you disallow large areas of your website by blocking folders or using wildcard instructions, Googlebot may assume you've done it by mistake and still crawl some pages from the restricted areas.
So if you're trying to save crawl budget by blocking individual pages you don't consider important, use robots.txt. But if you don't want Google to know about a page at all, use the robots meta tag instead.
2. Avoid long redirect chains.
If there's an unreasonable number of 301 and 302 redirects in a row on your site, the search spiders will stop following the redirects at some point, and the destination page may not get crawled. What's more, each redirected URL wastes a "unit" of your crawl budget. Make sure you use redirects no more than twice in a row, and only when absolutely necessary.
You can get a full list of pages with redirects in WebSite Auditor.
- Open your project and go to the Site Audit module.
- Click on Pages with 302 redirect and Pages with 301 redirect for a full list of redirected pages.
- Click on Pages with long redirect chains to get a list of URLs with more than 2 redirects.
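You can also sketch the chain-following logic yourself. The example below uses a hypothetical in-memory redirect map; in a real script you'd build that map from HTTP responses (for instance, the `requests` library exposes the chain a request went through in `response.history`).

```python
# Follow a URL through a {source: destination} redirect map and report
# how many hops it takes. The map below is hypothetical illustration data.
def redirect_chain(redirects, url, limit=10):
    """Return the list of URLs visited, starting with the original one."""
    chain = [url]
    while url in redirects and len(chain) <= limit:
        url = redirects[url]
        chain.append(url)
    return chain

redirects = {
    "/ancient-page": "/old-page",     # 3 hops to the final page: too long
    "/old-page": "/interim-page",     # 2 hops: borderline acceptable
    "/interim-page": "/new-page",
}

for start in ("/old-page", "/ancient-page"):
    hops = len(redirect_chain(redirects, start)) - 1
    print(start, "->", hops, "redirect(s)")
```

Any starting URL with more than 2 hops is a candidate for pointing its inbound links straight at the final destination.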
3. Manage URL parameters.
Popular content management systems generate lots of dynamic URLs that in fact lead to one and the same page. By default, search engine bots will treat these URLs as separate pages; as a result, you may be both wasting your crawl budget and, potentially, breeding content duplication concerns.
If your website's engine or CMS adds parameters to URLs that do not influence the content of the pages, make sure you let Googlebot know about it by adding these parameters in your Google Search Console account, under Crawl -> URL Parameters.
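To see why these parameters waste budget, it helps to normalize a few URLs by hand. Here's a small sketch that strips parameters that don't change the page content; the parameter names in the ignore list are common tracking and session parameters, used here as an example.

```python
# Collapse duplicate URLs by dropping query parameters that don't affect
# page content. The ignore list below is an example, not an exhaustive set.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def canonical_url(url):
    """Return the URL with ignorable query parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://example.com/shoes?utm_source=mail&color=red"))
# -> https://example.com/shoes?color=red
```

Both `...?utm_source=mail&color=red` and `...?color=red&sessionid=123` collapse into the same page, which is exactly what you're telling Googlebot when you register those parameters in Search Console.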
4. Find and fix HTTP errors.
Any URL that Google fetches, including CSS and JavaScript files, consumes one unit of your crawl budget. You don't want to waste it on 404 or 503 pages, do you? Take a moment to test your site for broken links and server errors, and fix them as soon as you can.
- In your Website Auditor project, go to Site Structure > Site Audit.
- Click on the Broken links factor. In the right-hand pane, you'll see a list of broken links on your site to fix, if any.
- Then click on Resources with 4xx status code and Resources with 5xx status code to get a list of resources that return HTTP errors.
5. Make use of RSS.
From what I observe, RSS feeds are among the pages the Google spider visits most often. If a certain section of your website is updated often (a blog, a featured products page, a new arrivals section), make sure to create an RSS feed for it and submit it to Google FeedBurner. Remember to keep RSS feeds free of non-canonical URLs, pages blocked from indexation, and 404 pages.
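For reference, an RSS feed is just an XML file listing your latest items. A minimal, hypothetical example (the domain and titles are made up):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://www.example.com/blog/</link>
    <description>Latest posts from the Example blog</description>
    <item>
      <title>A fresh post</title>
      <link>https://www.example.com/blog/a-fresh-post/</link>
      <pubDate>Tue, 10 Oct 2017 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
```

Each `<item>` is one canonical, indexable URL; anything you wouldn't want crawled shouldn't appear in the feed.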
6. Keep your sitemap clean and up-to-date.
XML sitemaps are important for proper website crawling. They tell search engines about the organization of your content, and let search bots discover new content faster. Your XML sitemap should be regularly updated and free from garbage (4xx pages, non-canonical pages, URLs that redirect to other pages, and pages that are blocked from indexation).
You can get a list of such URLs in Website Auditor and easily exclude them from your sitemap.
- In your WebSite Auditor project, go to the Site Audit module.
- Click on Pages with 4xx status code for a list of 4xx pages, if any. Copy the URLs to a separate file (a spreadsheet or any regular text editor will do).
- Click on Pages with 301 redirect for a list of 301 pages. Copy those, too.
- Do the same for Pages with 302 redirect.
- Click on Pages with rel='canonical' for a list of canonical and non-canonical pages. Add these URLs to your list as well.
Website Auditor also has a handy XML sitemap generator. Just click on Sitemap to start building your XML sitemap.
- Use the quick filter to find the 4xx, 3xx, and non-canonical URLs you have just copied, and uncheck the boxes next to those pages.
- Adjust Priority and Change Frequency. These settings are optional, but they may help you direct search bots to the more important and more frequently updated pages of your site. For example, you'd typically give the highest priority to your home page, then category pages, then subcategories.
Change frequency describes how often your page is updated, giving crawlers an idea of when each page or directory is likely to change and should be revisited.
If you run a large website that has many subsections, it is useful to create a separate sitemap for each subsection. This will make managing your sitemap easier and let you quickly detect areas of the website where crawling issues occur. For example, you may have a sitemap for the discussion board, another sitemap for the blog, and one more sitemap to cover main website pages. For e-commerce websites, it's wise to create individual sitemaps for large product categories.
Make sure all the sitemaps are discoverable by the spiders. You can include links to the sitemaps in robots.txt and register them in the Search Console.
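A minimal XML sitemap follows the sitemaps.org protocol; here's a hypothetical example with the optional priority and change-frequency hints discussed above (the URLs are made up):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/category/shoes/</loc>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

To make it discoverable, add a line like `Sitemap: https://www.example.com/sitemap.xml` to your robots.txt, and submit the same URL in Search Console.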
7. Take care of your site structure and internal linking.
Though internal linking doesn't have a direct correlation with your crawl budget, site structure is still an important factor in making your content discoverable by search bots. A logical, tree-like website structure has many benefits, such as better user experience and more time spent on your site, and improved crawling is definitely one of them.
In general, keeping important areas of your site no farther than 3 clicks away from any page is good advice. Include the most important pages and categories in your site menu or footer. For bigger sites, like blogs and e-commerce websites, sections with related posts/products and featured posts/products can be a great help in putting your landing pages out there, both for users and search engine bots.
If you need detailed instructions, I highly recommend reading through this internal linking guide.
As you can see, SEO is not all about 'valuable content' and 'reputable links'. When the front of your website looks polished, it may be time to go down to the cellar and do some spider hunting; it's sure to work wonders for your site's performance in search.
Now that you have all the necessary instruments and knowledge for taming search engine spiders, go on and test it on your own site, and please share the results in the comments!
P.S.: Oh, and here's a cute baby spider to brighten up your day:
Head of SEO at SEO PowerSuite