Robots.txt and XML sitemaps are two SEO techniques that don’t get enough air time, in our opinion.
By now you probably get how Google —and the other search engines— work, right? Google’s bots crawl your website, page by page, to determine who you are, what your website is about, and where you should sit in the search engines for the keywords you align with.
Keywords, content, and even internal and outbound linking — the on-page SEO techniques — are definitely the cool kids of the SEO world. We talk about them a lot, and they’re easy to dress up and analogise into fun metaphors and quips.
The less appealing but just as important side of SEO is the technical side. We’re looking at Robots.txt and XML sitemaps in this blog — what they are, why they matter, and how you can use them to get better SEO results.
The TL;DR (too long, didn't read) 👇
- Google is a business and crawling websites takes time and costs money.
- Making your website easier to crawl helps your pages get indexed and generally pleases the bots.
- XML sitemaps give search engine crawlers a map to follow to crawl your website logically.
- Robots.txt gives the crawlers a list of instructions to make sure your priority pages are crawled first.
- Internal linking makes crawling and navigating around your website easier.
- Schema markup gives search engine crawlers context on your website's pages so they can make quicker, easier decisions.
Google is a business and crawling costs money 🕷
When we think about Google, we tend to think that the tech giant has endless resources to do its thing. In reality, though, Google is a business just like ours and yours.
Crawling costs money, so Google assigns what is known as a crawl budget to your website. Each “crawl” might only cost 0.0000001 cents, but with over 130 trillion pages to get through, it adds up.
Your crawl budget is determined by the size of your website and how important Google perceives it to be.
Take a local Brisbane plumber and The New York Times, for example. Google’s whole thing is providing timely, correct, and relevant information to its users. So it makes sense that it will prioritise crawling The New York Times for its latest news stories ahead of a plumber in Brisbane.
XML sitemaps & Robots.txt save Google time crawling your website & get you better results 📊
With that crawl budget in mind, this is where XML sitemaps and Robots.txt come into play.
We’ll explain this in more detail in the next section, but your XML sitemap is how you let Google know which pages exist on your website and which pages it should spend time crawling and indexing so they show up in the search results.
Your website might have loads of pages, but not all of them are relevant to Google.
For example: you might exclude the login page customers use to access your dashboard, your privacy policy, or a thank-you page.
Your XML sitemap is referenced in your Robots.txt. Robots.txt is a document that holds instructions for bots to help them prioritise pages to crawl and shows them how to find their way around your site.
We’ll go into more detail about Robots.txt further down in this post. What you need to know at this stage is that XML sitemaps and Robots.txt work together to help Google crawl your website easily and index the pages logically.
What is an XML Sitemap? 🗺
An XML sitemap is a map of your website’s pages (URLs) that gives Google all the information it needs to know to index them correctly.
XML sitemaps can specify which pages exist on a website, which should be indexed and which shouldn’t, and they can offer Google extra context to use, as shown in the example below:
- When the page was last updated
- How important the page is relative to other pages on the same website. This is related to crawl depth (the more clicks away from the homepage a page is, the less important it’s assumed to be)
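To give you an idea of what that looks like under the hood, here’s a minimal sketch of a standard XML sitemap entry. The URL, date, and values are placeholders, not recommendations:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- The page's full URL -->
    <loc>https://www.example.com/services/</loc>
    <!-- When the page was last updated -->
    <lastmod>2024-01-15</lastmod>
    <!-- How often the page tends to change -->
    <changefreq>monthly</changefreq>
    <!-- How important the page is relative to the rest of the site (0.0 to 1.0) -->
    <priority>0.8</priority>
  </url>
</urlset>
```

Most SEO plugins and website builders generate this file for you, so you’ll rarely write it by hand.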
How it helps search engines 🤝
XML sitemaps basically save Google (or other search engines like Bing) time when they’re crawling your website.
When search engines crawl your website, it’s a big exercise in understanding what each page is and where it fits into the broader website.
Your XML sitemap says:
- here are all the pages
- these are the pages that matter
- save yourself the trouble with these pages
By helping Google become more efficient and effective in crawling your website, you can ensure the right pages get indexed and boost your rankings.
Types of XML Sitemaps
There are five main types of XML sitemaps:
- Standard Sitemaps
- Image Sitemaps
- Video Sitemaps
- News Sitemaps
- Mobile Sitemaps
Standard sitemaps
This is the standard XML sitemap we’ve been talking about for your website’s pages, and it’s what we generally use on our own website and our clients’.
Image sitemaps
An image sitemap is a sitemap for your website’s images. These allow you to describe the image and its location to help it rank in Google Images.
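As a rough sketch, an image sitemap entry adds Google’s image extension to the standard format. The URLs below are placeholders:

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <!-- The page the image appears on -->
    <loc>https://www.example.com/our-work/</loc>
    <image:image>
      <!-- The image file itself -->
      <image:loc>https://www.example.com/images/bathroom-renovation.jpg</image:loc>
    </image:image>
  </url>
</urlset>
```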
Video sitemaps
Similar to the image sitemap, this is for videos. It can help your video rank in the ‘Videos’ section of Google or in a Featured Snippet.
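A video sitemap entry works the same way, using Google’s video extension. This is just a sketch; the URLs, title, and description are placeholders:

```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <!-- The page the video is embedded on -->
    <loc>https://www.example.com/how-to-fix-a-leaky-tap/</loc>
    <video:video>
      <video:thumbnail_loc>https://www.example.com/thumbs/leaky-tap.jpg</video:thumbnail_loc>
      <video:title>How to fix a leaky tap</video:title>
      <video:description>A two-minute walkthrough of fixing a dripping tap at home.</video:description>
      <video:content_loc>https://www.example.com/videos/leaky-tap.mp4</video:content_loc>
    </video:video>
  </url>
</urlset>
```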
News sitemaps
This kind of sitemap is designed for news sites and helps Google pick up new articles quickly. You might think it’s worth having if you run a blog. However, these sitemaps don’t allow URLs that are older than two days, so your blog will need to be pretty up to date.
News sitemaps help articles show up in the ‘News’ section of Google.
Mobile sitemaps
A mobile sitemap is only necessary for websites that serve a completely separate version of their site for mobile devices.
Generally speaking, if your website is responsive and works well on mobile, this sitemap is unnecessary.
How to create an XML Sitemap
So, by now, you’re probably convinced of the magic of the XML sitemap, right? ✨
But how do you create it?
If your website is built on WordPress and you’re using an SEO plugin like RankMath or Yoast, you’ll be able to use your tool to generate it.
If your website lives somewhere like Webflow or Shopify, these website builders will have built-in tools you can use to generate your sitemap.
Generating your XML sitemap with Yoast
- Install and activate the Yoast SEO plugin.
- Go to Yoast SEO > General > Features.
- Toggle the XML sitemaps feature to ‘On’.
- Click the question mark next to XML sitemaps. Then click “See the XML sitemap” to view it.
How to submit an XML Sitemap
To get all the good benefits of your XML sitemap, you’ll need to submit it to search engines. This is how you get your pages crawled and indexed sooner.
Submitting your XML sitemap to Google or Bing is pretty easy.
Google Search Console
- Sign in to Google Search Console.
- Select your website property.
- Go to Sitemaps in the menu on the left.
- Enter the URL of your sitemap in the ‘Add a new sitemap’ section. Eg: https://example.com/sitemap.xml
- Submit the sitemap.
- Double-check it’s been crawled. (It might be rejected if it has an error.)
Bing Webmaster Tools
- Sign in to Bing Webmaster Tools.
- Select your site from the dashboard.
- Navigate to ‘Sitemaps’ in the ‘Configure My Site’ menu.
- Add the URL of your sitemap and submit it. Eg: https://example.com/sitemap.xml
Common problems with XML Sitemaps 🤯
Sitemap errors ❌
Duplicate content 👯♀️
Your XML sitemap shouldn’t have duplicate URLs.
If you have this problem, you should first check whether you have listed the same page twice.
If you have two pages that are largely the same, make sure you only include the page you’d like to be indexed and set up a canonical tag.
Make sure each URL in your sitemap is unique.
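If you’re not sure what a canonical tag looks like, it’s a single line of HTML in the &lt;head&gt; of the duplicate page pointing to your preferred URL. The URL here is a placeholder:

```html
<link rel="canonical" href="https://www.example.com/preferred-page/" />
```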
Invalid URLs
Make sure all your URLs are formatted properly, including all elements of a URL from the ‘https’ to the final ‘/page/’:
https://www.example.com.au/page/
Incorrect file format
The sitemap needs to be an XML file that’s formatted with the correct XML sitemap protocol.
Sitemap too large
Your sitemap shouldn’t exceed 50MB or 50,000 URLs.
If you have to, split your sitemap into multiple sitemaps and then create a sitemap index file.
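A sitemap index file is just a small XML file that lists your individual sitemaps. Something like this, with placeholder filenames:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <sitemap> entry points to one of your smaller sitemaps -->
  <sitemap>
    <loc>https://www.example.com/sitemap-pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```

You can then submit the index file to Google Search Console rather than each sitemap individually.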
HTTP Errors
URLs with 4xx or 5xx HTTP status codes should be corrected. 4xx codes, like 404s or 403s, generally mean the page can’t be accessed, so make sure the page exists and is publicly viewable.
5xx codes are to do with the server. Make sure your website is healthy and up to date and check your server logs for issues.
Non-canonical URLs
Include only canonical (preferred) URLs to prevent duplicate content issues.
Outdated URLs
Keep your sitemap up to date so old or removed content doesn’t stay listed.
Wrong use of priority and changefreq
Misusing <priority> and <changefreq> tags can mislead search engines about how important your pages are or how frequently they’re updated.
This can mean your pages aren’t indexed correctly and your crawl budget is wasted on the wrong pages.
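As a rough illustration (these values are made up, not official recommendations), the idea is to differentiate your pages rather than marking everything as top priority. If every page is set to 1.0, the tag tells Google nothing:

```xml
<url>
  <loc>https://www.example.com/</loc>
  <changefreq>weekly</changefreq>
  <priority>1.0</priority>
</url>
<url>
  <loc>https://www.example.com/privacy-policy/</loc>
  <changefreq>yearly</changefreq>
  <priority>0.2</priority>
</url>
```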
Don't forget to maintain your sitemap file & keep it up to date
What is a Robots.txt File? 🤖
Robots.txt is a text file that lives on your website. It gives instructions to search engine crawlers about which pages they should and shouldn’t access.
It lives in the root directory of your site, for example: example.com/robots.txt.
The file uses specific commands, like “User-agent”, to specify the crawler and “Disallow” to block access to specific pages or directories.
This protects specific pages or information, directs the crawler to the most important pages, and makes sure unimportant content is ignored.
What are the components of Robots.txt? 🧱
User-agent
User-agent specifies which search engine’s crawlers should follow the rules you’re providing.
Each search engine has a unique user-agent name. For example: Google’s is Googlebot.
If you use ‘*’, then the rules apply to any search engine crawler.
Example text: User-agent: *
Disallow
This tells the crawler not to access the page or directory. Keep in mind that disallowing a page stops it from being crawled, but it won’t always stop it from being indexed if other websites link to it.
Example text: Disallow: /admin/
Allow
Now, say you used ‘Disallow’ to say a search engine can’t crawl a directory on your website.
But then, there is a page inside that directory you still want it to crawl. You would then use ‘Allow’ to specify that it can access that page.
Example text: Allow: /specific-page/
Sitemap
This is where you provide the XML sitemap to help the search engine navigate your website easily.
Example text: Sitemap: https://www.example.com/sitemap.xml
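Putting those pieces together, a simple Robots.txt file might look something like this. The paths and sitemap URL are placeholders for your own:

```
# Applies to all crawlers
User-agent: *
# Keep crawlers out of the admin area, except the public help pages
Disallow: /admin/
Allow: /admin/help/
# No need to crawl the thank-you page
Disallow: /thank-you/

# Tell crawlers where your XML sitemap lives
Sitemap: https://www.example.com/sitemap.xml
```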
How to create a Robots.txt File 🤓
Creating your website’s Robots.txt file is pretty straightforward.
- Open a Text Editor: Use any text editor like your computer’s built-in notepad.
- Specify User-Agents: Indicate which web crawlers the rules apply to. Use * for all crawlers.
- Add Disallow and Allow instructions: Specify which pages on your website crawlers can and cannot access.
- Specify sitemap location: Include the URL of your XML sitemap so the search engines can find it.
- Save the file: Save the file as robots.txt.
- Upload to your root directory: Place the robots.txt file in the root directory of your website. For example: https://www.example.com/robots.txt
What is Robots.txt used for?
Robots.txt directs search engine crawlers around your website.
Here are some specific ways this comes in handy:
Blocking specific folders or files ⛔️
You can make sure the crawler doesn’t look at unnecessary pages.
This is done using ‘Disallow:’.
🚩 It’s worth noting: you can disallow pages that hold sensitive or confidential information, but putting those links in your Robots.txt file can point *bad* bots in the right direction to hack your website or access your information.
Preventing duplicate content 👯♀️
Sometimes, other pages on your website can accidentally create duplicate content issues.
Some websites will use their Robots.txt file to ‘Disallow’ these pages. We’d recommend using a canonical tag for your preferred page instead.
Specifying crawl delay 🛑
You can manage how frequently search engine crawlers crawl your website and pages.
This is a good idea for websites with a lot of traffic. Limiting how often crawlers can make requests to your website reduces the pressure on your server.
It’s also a good idea if you want to make the most of the crawl budget we talked about at the beginning of this blog. Setting a crawl delay helps the crawlers spread their budget sensibly across your website’s pages. (Keep in mind that Googlebot ignores the Crawl-delay directive, though Bing and other crawlers respect it.)
You would use:
User-agent: *
Crawl-delay: 10
How to test & validate Robots.txt 🧪
There is, of course, the manual way of testing your Robots.txt file: go to your robots.txt URL yourself, make sure it’s formatted correctly, and double-check all of its links.
But, like just about any aspect of SEO, there’s a tool for it, too.
Google’s Robots.txt Tester
Head over to Google Search Console, sign in, and select your website.
- Find the Robots.txt Tester under ‘Legacy tools and reports’.
- Paste your robots.txt content into the editor.
- You can then test specific URLs. You’ll be able to see if they’re allowed or blocked with your specified rules.
- Fix any errors based on your testing.
Common errors and how to fix them 🪛
A few errors can come up with Robots.txt. Here’s what they are and how to fix them.
Syntax errors
Spelling or formatting mistakes that can make your instructions invalid. Double-check your Robots.txt for any errors.
Blocking important pages
Make sure you’re not accidentally disallowing pages that you actually do want crawled and indexed.
Double-check that every ‘Disallow’ is correct.
Allowing sensitive content
Similarly, make sure you don’t accidentally ‘Allow’ content you don’t want indexed.
Double-check that every ‘Allow’ is correct.
Nonexistent paths
Specifying instructions for pages that don’t exist.
This might be from a typo in your URLs or from pages that are listed in your Robots.txt but have since been removed from the website.
Mixed directories
Make sure your rules make sense. You can’t ‘Allow’ a page and ‘Disallow’ the same page — that’s just confusing.
Incorrect sitemap URL
Make sure your XML sitemap link is written correctly and is able to be accessed.
How else can you help search engines crawl your website? 🕸
Remember at the start of this article when we talked about crawl budgets? Your Robots.txt and XML sitemap are basically exercises in helping Google spend less time and money crawling your website.
By spending less time, it indexes the right pages on your website sooner.
It’s important to note that Robots.txt and XML sitemaps aren’t the only way to help Google, or the other search engines out.
Internal linking 🔗
Using internal linking, where you link from one page on your website to another, can help guide search engines around your website.
Internal links will help the bot or crawler understand the context of the page it’s visiting, the relationship between the pages on your website, and the hierarchy and importance of the pages. The importance of a page is measured by PageRank, which is one of the algorithms Google uses to determine where your pages show up in the SERP.
By linking between pages in a useful and intuitive way, you can make sure the search engine crawls the most important pages on your website, making the most of that crawl budget.
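For example, a simple internal link is just an anchor tag with descriptive text. The URL and wording here are placeholders:

```html
<p>If you've got water where it shouldn't be, our
<a href="/services/emergency-plumbing/">emergency plumbing service</a>
is available 24/7.</p>
```

Descriptive anchor text like this tells the crawler what the linked page is about before it even visits it.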