SEO | 8 min read | 8 December 2024 | By ToolFocus

Robots.txt: The Complete Guide for SEO and Web Developers

A complete guide to robots.txt — how it works, how to write it correctly, common mistakes that tank rankings, and how to use it strategically for SEO.

The robots.txt file is one of the most misunderstood files in web development. It sits at the root of your domain, often barely 10 lines long, but it tells search engine crawlers how to navigate your site. Used correctly, it helps search engines focus their crawl budget on your most important pages. Misused, it can accidentally block your entire site from Google. This guide covers everything you need to know.

> Create yours now: Use our free [Robots.txt Generator](/tools/robots-txt-generator) to create a correctly formatted robots.txt file for your site in seconds — no technical knowledge required.

What is Robots.txt?

Robots.txt is a plain text file placed at the root of a website (for example, https://example.com/robots.txt) that follows the Robots Exclusion Protocol. It contains instructions for web crawlers (primarily search engine bots) about which pages or sections they are allowed to crawl. It controls crawling only; whether a page ends up indexed is a separate question, covered in the noindex comparison later in this guide.

When a well-behaved crawler visits a website, it fetches robots.txt before requesting any other page. If the file tells the crawler to avoid certain pages or directories, a compliant crawler will honour those instructions.

Important: Robots.txt is a courtesy protocol, not a security mechanism. Malicious scrapers and bots will ignore it entirely. Never use robots.txt to protect sensitive information — use server-side authentication and access controls for that.

Basic Robots.txt Syntax

A robots.txt file consists of one or more records, each beginning with a User-agent directive and followed by Allow and/or Disallow directives.

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

Sitemap: https://example.com/sitemap.xml

User-agent: Specifies which crawler the following rules apply to. The asterisk (*) means all crawlers. You can target specific crawlers by name (Googlebot, Bingbot, GPTBot).

Disallow: Specifies paths the crawler should not access. An empty Disallow value means "allow everything."

Allow: Explicitly allows a path, even within a disallowed directory.

Sitemap: Points crawlers to your XML sitemap. Always include this.
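
As a rough illustration of how these directives are read, the sketch below feeds the example file above to Python's standard-library `urllib.robotparser` and asks whether a few URLs may be fetched. The crawler name `ExampleBot` and the test URLs are assumptions made for illustration, not part of any real site.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory; no network request

# "ExampleBot" is a hypothetical crawler name; it falls under the "*" group.
for url in (
    "https://example.com/admin/users",
    "https://example.com/public/index.html",
    "https://example.com/blog/",
):
    verdict = "allowed" if parser.can_fetch("ExampleBot", url) else "blocked"
    print(url, "->", verdict)

# Sitemap URLs declared in the file (site_maps() needs Python 3.8+).
print(parser.site_maps())
```

Paths that match no rule, such as `/blog/` here, are allowed by default.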

Path Matching in Robots.txt

Rules apply to URL paths, not full URLs. Some matching behaviours:

Disallow: /private/ blocks /private/, /private/page1, /private/data/file.html
Disallow: / blocks the entire site (be very careful!)
Disallow: /*.pdf$ blocks all PDF files (Googlebot supports wildcards)
Disallow: /search?* blocks all search result pages
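
To make the matching behaviour above concrete, here is a minimal sketch that translates a rule path into a regular expression: `*` matches any run of characters, a trailing `$` pins the match to the end of the URL, and everything else is a plain prefix match. The helper names `rule_to_regex` and `rule_matches` are hypothetical, and real crawlers may differ in edge cases such as percent-encoding.

```python
import re


def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = rule_path.endswith("$")  # trailing "$" means "URL must end here"
    if anchored:
        rule_path = rule_path[:-1]
    # "*" matches any run of characters; every other character is literal.
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(rule_path: str, url_path: str) -> bool:
    """url_path is the path plus query string, e.g. "/search?q=shoes"."""
    # re.match anchors at the start of the string, giving prefix semantics.
    return rule_to_regex(rule_path).match(url_path) is not None


print(rule_matches("/private/", "/private/data/file.html"))    # True
print(rule_matches("/*.pdf$", "/docs/report.pdf"))             # True
print(rule_matches("/*.pdf$", "/docs/report.pdf?download=1"))  # False
print(rule_matches("/search?*", "/search?q=shoes"))            # True
```

Because rules are prefix matches already, the trailing `*` in `/search?*` is optional: `/search?` matches the same URLs.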

What to Block with Robots.txt

Typically block:

Admin interfaces (/admin/, /wp-admin/, /dashboard/)
Search result pages (/search?q=) — search-on-search creates low-quality duplicate content
Cart and checkout pages (/cart/, /checkout/)
User account areas (/account/, /my-profile/)

Do NOT block:

CSS and JavaScript files — Googlebot needs these to render your pages fully. If Googlebot cannot access CSS/JS, it cannot accurately assess your pages.
Any page you want indexed — this sounds obvious, but blocking a directory in robots.txt and then wondering why those pages are not indexed is a surprisingly common mistake.

The Crawl Budget Concept

Every website has a "crawl budget" — the number of pages Google will crawl in a given period. For small sites, crawl budget is rarely an issue. For large sites (millions of pages), managing crawl budget matters.

Blocking low-value pages (parameter-based duplicates, faceted navigation combinations, utility pages) with robots.txt can help Google spend more crawl budget on your high-value pages. This is especially relevant for large e-commerce and news sites.

Robots.txt vs. Noindex — Critical Difference

| | Robots.txt Disallow | Noindex Meta Tag |
|---|---|---|
| Controls | Crawling (fetching the page) | Indexing (appearing in search results) |
| Googlebot accesses page? | No | Yes |
| Page can appear in search? | Possibly (if linked) | No |
| Use when | You don't want page fetched at all | You want page fetched but not shown in results |

Key mistake to avoid: If you Disallow a page in robots.txt, Google cannot crawl it to see the noindex meta tag either. If you want a page excluded from search results, use the noindex meta tag — and leave the page accessible for crawling.
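
The conflict described above can be detected mechanically. The following sketch, using only the Python standard library, flags a page whose noindex tag can never be seen because robots.txt blocks the fetch. The helper names, the `/thank-you/` path, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a page has <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True


def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


def noindex_is_hidden(robots_txt: str, user_agent: str, url: str, html: str) -> bool:
    """True when the page asks for noindex but the crawler may not fetch it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return has_noindex(html) and not rules.can_fetch(user_agent, url)


# Hypothetical page and rule sets for illustration.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
BLOCKING = "User-agent: *\nDisallow: /thank-you/\n"
OPEN = "User-agent: *\nDisallow:\n"
URL = "https://example.com/thank-you/"

print(noindex_is_hidden(BLOCKING, "Googlebot", URL, PAGE))  # True: tag is never seen
print(noindex_is_hidden(OPEN, "Googlebot", URL, PAGE))      # False: tag can be read
```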

A Production-Ready Example

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /search?*
Allow: /wp-admin/admin-ajax.php

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml

Note the GPTBot group: GPTBot is OpenAI's web crawler, and many site owners block it because they prefer that their content not be used for AI training without consent.
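
Before shipping a file like the one above, a small automated check can confirm that key pages stay crawlable. The sketch below is one possible guard, assuming a hypothetical `check_robots` helper and made-up URLs in `MUST_ALLOW`. It uses a trimmed version of the example because Python's `urllib.robotparser` does not implement wildcard matching and applies rules in file order, so treat it as a coarse safety net rather than a model of Googlebot.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt, must_allow, user_agent="Googlebot"):
    """Return the URLs from must_allow that the file blocks for user_agent."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [url for url in must_allow if not rules.can_fetch(user_agent, url)]


# Trimmed version of the example file (wildcard rule left out on purpose).
PRODUCTION = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/

User-agent: GPTBot
Disallow: /
"""

# A typical staging file that must never reach production.
STAGING = "User-agent: *\nDisallow: /\n"

# Hypothetical pages that should always be crawlable.
MUST_ALLOW = ["https://example.com/", "https://example.com/blog/hello-world"]

print(check_robots(PRODUCTION, MUST_ALLOW))            # [] means nothing important is blocked
print(check_robots(STAGING, MUST_ALLOW))               # both URLs are reported as blocked
print(check_robots(PRODUCTION, MUST_ALLOW, "GPTBot"))  # GPTBot is blocked everywhere
```

In a deployment pipeline, a non-empty result for the search engine crawlers you care about would be a reason to stop the release.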

Common Mistakes

Disallow: / in production: Blocks your entire site from all crawlers. This is the number one robots.txt disaster — a staging site's robots.txt accidentally copied to production.

Blocking CSS and JS: Prevents proper page rendering by Googlebot.

Thinking robots.txt provides security: Anyone can view your robots.txt and learn exactly which paths you are trying to hide.

Not including a Sitemap reference: Always include the Sitemap directive pointing to your XML sitemap.

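Conflicting Allow/Disallow rules: When rules conflict, Google and other crawlers that follow the current Robots Exclusion Protocol standard (RFC 9309) apply the most specific rule, meaning the one with the longest matching path, regardless of its position in the file. Google has retired the standalone robots.txt Tester, so check real URLs with the URL Inspection tool in Google Search Console and review the robots.txt report there for fetch or parse errors.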
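
The "most specific rule wins" behaviour can be sketched in a few lines. This simplified version handles plain prefixes only (no wildcards) and, when an Allow and a Disallow match with equal length, prefers Allow, which is how Google documents its tie-break. The `is_allowed` helper and the rule list format are assumptions for illustration.

```python
def is_allowed(rules, path):
    """rules is a list of (directive, prefix) pairs; directive is "allow" or "disallow"."""
    best_len = -1
    verdict = True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        # An empty prefix (e.g. "Disallow:") restricts nothing, so skip it.
        if not prefix or not path.startswith(prefix):
            continue
        allow = directive == "allow"
        # The longest matching prefix wins; on a tie, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len = len(prefix)
            verdict = allow
    return verdict


RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(RULES, "/wp-admin/admin-ajax.php"))  # True: the longer Allow wins
print(is_allowed(RULES, "/wp-admin/options.php"))     # False: only the Disallow matches
print(is_allowed(RULES, "/blog/"))                    # True: no rule matches
```

Parsers that stop at the first matching line instead would give a different answer for `admin-ajax.php`, which is one reason to test rules against the crawler you actually care about.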

Frequently Asked Questions

Q: Where should I put my robots.txt file?

At the root domain: https://yourdomain.com/robots.txt. It must be at this exact location — subfolders like /public/robots.txt will not be found. Create yours with our [Robots.txt Generator](/tools/robots-txt-generator).
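
As a small sketch of the location rule, the function below derives the robots.txt URL that governs any given page. The helper name `robots_txt_url` is hypothetical. Note that the file is looked up per host, so a subdomain such as `shop.example.com` needs its own robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?id=7#top"))
# https://example.com/robots.txt
print(robots_txt_url("https://shop.example.com/cart/"))
# https://shop.example.com/robots.txt
```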

Q: Does robots.txt affect Google rankings?

Not directly. However, it controls which pages Google can crawl. Wasting crawl budget on low-value pages can indirectly reduce how often Google crawls your important pages, which can slow down the indexing of new content.

Q: Will blocking a page in robots.txt remove it from Google search results?

Not necessarily. If the blocked page has inbound links, Google may still list the URL in search results with "No information available for this page." Use a noindex meta tag (on a crawlable page) to properly exclude a page from search results.

Q: How do I test if my robots.txt is working correctly?

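Open yoursite.com/robots.txt in a browser to confirm the file is live and contains what you expect. Then check the robots.txt report in Google Search Console for fetch or parse errors, and run specific URLs through the URL Inspection tool to see whether your current rules allow or block them. The older standalone robots.txt Tester has been retired.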

Conclusion

Robots.txt is simple but consequential. A few lines can direct search engine crawlers efficiently, preserve crawl budget, and keep crawlers away from low-value pages. Use our [Robots.txt Generator](/tools/robots-txt-generator) to create a properly formatted file for your site, and always test it before deploying, then confirm in Google Search Console that Google can read it.