Question 1

What is a robots.txt file?

Accepted Answer

robots.txt is a plain-text file placed at the root of a website (e.g., https://example.com/robots.txt) that communicates crawling instructions to web robots, primarily search engine bots. It follows the Robots Exclusion Protocol (REP), a de facto convention introduced in 1994 and formalized as RFC 9309 in 2022. A compliant crawler fetches robots.txt before requesting any other URL on the site, and the file tells it which URLs it may or may not request. Note that robots.txt is a request, not an access control: malicious or non-compliant bots may ignore it entirely.
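As a rough illustration of what a compliant crawler does with the file, the sketch below uses Python's standard urllib.robotparser to check two URLs. The file contents, the ExampleBot name, and the URLs are made up for the example; a real crawler would download the file from the site root rather than parse a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is an invented crawler name; it falls under the * group.
print(parser.can_fetch("ExampleBot", "https://example.com/admin/login"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/blog/post"))    # True
```

The check is purely voluntary: nothing stops a client that never calls can_fetch from requesting /admin/login anyway.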

Question 2

Does robots.txt prevent pages from being indexed by Google?

Accepted Answer

No. This is one of the most common misconceptions. A Disallow directive prevents Googlebot from crawling a URL, but it does not prevent that URL from being indexed. If other pages link to a disallowed URL, Google can still discover and index it from those links alone, without ever fetching the page. To prevent indexing, use a noindex robots meta tag or an X-Robots-Tag: noindex HTTP header on the page itself, and leave the page crawlable. Disallowing a page you want to noindex is counterproductive: Googlebot cannot read the noindex directive if it cannot crawl the page.
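For the header route, one possible sketch is a tiny WSGI application that serves a page with X-Robots-Tag: noindex. The function name and page body are invented for illustration, and most sites would set this header in their web server or framework configuration instead.

```python
def noindex_app(environ, start_response):
    """Minimal WSGI app: the page stays crawlable but asks not to be indexed."""
    body = b"<html><body>Internal report</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Compliant search engines read this header when they fetch the page.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]
```

For this to work, the URL must not be disallowed in robots.txt, otherwise the crawler never sees the header.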

Question 3

What does 'User-agent: *' mean?

Accepted Answer

The asterisk (*) is a wildcard that matches all web crawlers. A User-agent: * block sets default rules that apply to any bot not covered by a more specific User-agent block. When multiple blocks exist, a crawler applies only the most specific one that matches its user-agent string; the * block is a fallback, and its rules are not merged into the specific block. You can combine a specific block for, say, Googlebot with a catch-all * block that sets different rules for all other crawlers.
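The fallback behaviour can be seen with Python's standard urllib.robotparser. The rules, paths, and the OtherBot name below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one group for Googlebot, one catch-all group.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/search/results"
# Googlebot uses only its own group, which does not mention /search/.
print(parser.can_fetch("Googlebot", url))  # True
# Any other crawler falls back to the * group.
print(parser.can_fetch("OtherBot", url))   # False
```

Because the specific group replaces the catch-all rather than extending it, any rule meant for every crawler has to be repeated in each specific group, as /drafts/ is here.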

Question 4

What is the difference between Allow and Disallow?

Accepted Answer

Disallow: /path/ tells the crawler not to fetch that path or any URL that begins with it. Allow: /path/ explicitly permits crawling of a path, and is useful to carve out exceptions within a broader Disallow rule. For example, Disallow: /private/ with Allow: /private/public-page.html allows crawling of that specific page while blocking everything else under /private/. When both apply to the same URL, the longer (more specific) matching rule wins.
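The longest-match rule can be sketched in a few lines of Python. This is a simplified illustration, not any crawler's actual code: is_allowed is a hypothetical helper, it ignores the * and $ wildcards, and it assumes that Allow wins when two matching rules have the same length. Python's built-in urllib.robotparser is not used here because it applies the first matching rule in file order rather than the longest one.

```python
def is_allowed(path, rules):
    """Return True if `path` may be crawled under longest-match precedence.

    `rules` is a list of (directive, path_prefix) pairs taken from a single
    User-agent group. Wildcards are not handled in this sketch.
    """
    best_len = -1
    allowed = True  # No matching rule means the URL is crawlable.
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # Longer prefix wins; on a tie, Allow wins (assumed tie-break).
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


RULES = [
    ("Disallow", "/private/"),
    ("Allow", "/private/public-page.html"),
]

print(is_allowed("/private/public-page.html", RULES))  # True
print(is_allowed("/private/secret.html", RULES))       # False
print(is_allowed("/blog/", RULES))                     # True
```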

Question 5

Should I add a Sitemap directive to robots.txt?

Accepted Answer

Yes, it is best practice. The Sitemap: directive gives search engines a direct pointer to your XML sitemap, so they can discover your pages without relying solely on following links. Each Sitemap: line takes a full absolute URL and applies to all crawlers, regardless of which User-agent block it sits near. You can include multiple Sitemap: lines, one per sitemap (e.g., separate sitemaps for pages, images, and news). Google also accepts sitemap submissions in Google Search Console, but listing the sitemap in robots.txt lets any crawler, not just Google, find it.
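To show how a crawler might pick these lines up, the sketch below reads two Sitemap: entries with Python's standard urllib.robotparser (site_maps() requires Python 3.8 or later). The sitemap URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with two sitemap declarations.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-images.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the declared URLs in file order, or None if there are none.
sitemaps = parser.site_maps()
for sitemap_url in sitemaps or []:
    print(sitemap_url)
```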

Question 6

What is Crawl-delay and does Google respect it?

Accepted Answer

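Crawl-delay is a non-standard directive that asks a crawler to wait a specified number of seconds between consecutive requests to your server. It is useful for telling aggressive crawlers to slow down and reduce server load. However, Google ignores Crawl-delay in robots.txt. Search Console's legacy crawl rate limiter, once the alternative, was retired in early 2024; Googlebot now adjusts its rate automatically based on how your server responds, and Google's documented way to slow it down temporarily is to return 500, 503, or 429 status codes. Some other crawlers, Bingbot among them, do honour Crawl-delay, but support varies, so check each crawler's documentation.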
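From the crawler's side, honouring the directive could look roughly like the sketch below, which uses Python's standard urllib.robotparser. The polite_delay helper, the one-second default, and the file contents are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a delay for Bingbot, no delay for anyone else.
ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: /tmp/
"""


def polite_delay(parser, user_agent, default=1.0):
    """Seconds to wait between requests for `user_agent` (hypothetical helper)."""
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(polite_delay(parser, "Bingbot"))     # 10.0
print(polite_delay(parser, "ExampleBot"))  # 1.0
```

A crawler would then sleep for that many seconds between consecutive requests to the same host.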

Question 7

How do I block AI training bots like GPTBot or CCBot?

Accepted Answer

Add a specific User-agent block for each AI crawler with Disallow: /. Commonly listed tokens include: GPTBot (OpenAI), CCBot (Common Crawl, whose data is used in many AI training datasets), ClaudeBot (Anthropic; anthropic-ai is an older token that still appears in many lists), PerplexityBot (Perplexity), Google-Extended (a control token for Google's AI training use, not a separate crawler, and independent of Googlebot's search crawling), and FacebookBot (Meta). Token names change over time, so confirm them against each vendor's documentation. Note that blocking these bots only prevents future crawling: it does not remove your content from any existing training dataset, and non-compliant bots will ignore it.
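One way to produce those blocks is to generate them from a list of tokens. The sketch below uses a few of the tokens named above, and build_ai_block_rules is a hypothetical helper; it then verifies the result with Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# A subset of the tokens mentioned above; extend the list as needed.
AI_CRAWLER_TOKENS = ["GPTBot", "CCBot", "PerplexityBot", "Google-Extended"]


def build_ai_block_rules(tokens):
    """Return robots.txt text with one 'Disallow: /' group per token."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(groups) + "\n"


robots_txt = build_ai_block_rules(AI_CRAWLER_TOKENS)
print(robots_txt)

# Sanity check: the listed bots are blocked, ordinary crawlers are not.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "https://example.com/"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/"))  # True
```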

Question 8

Can robots.txt files hurt my SEO?

Accepted Answer

Yes, misconfigured robots.txt files are a surprisingly common cause of SEO problems. The most frequent mistakes are: (1) accidentally disallowing crawling of CSS, JavaScript, or image files that Google needs to render pages correctly; (2) blocking your entire site with Disallow: / during development and forgetting to update it for production; (3) disallowing pages you actually want indexed; (4) using robots.txt to try to prevent indexing, which does not work (use noindex instead). Always check your file after making changes, for example with the robots.txt report in Google Search Console, which replaced the retired robots.txt Tester.
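Mistakes (1) and (2) can be caught with a quick automated check before deployment. The sketch below is a rough heuristic, not a validator: find_risky_rules and the ASSET_HINTS list are invented for the example, and the path hints are guesses about where a site keeps its CSS and JavaScript.

```python
# Substrings that suggest a rule touches stylesheets or scripts (assumed layout).
ASSET_HINTS = (".css", ".js", "/css/", "/js/", "/assets/", "/static/")


def find_risky_rules(robots_txt):
    """Return warnings for Disallow rules that deserve a second look."""
    warnings = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() != "disallow":
            continue
        if value == "/":
            warnings.append(f"line {number}: 'Disallow: /' blocks the whole site")
        elif any(hint in value.lower() for hint in ASSET_HINTS):
            warnings.append(f"line {number}: '{value}' may block CSS or JavaScript")
    return warnings


SAMPLE = """\
User-agent: *
Disallow: /
Disallow: /assets/css/
Disallow: /admin/
"""

for warning in find_risky_rules(SAMPLE):
    print(warning)
```

The check does not look at which User-agent group a rule belongs to, so an intentional Disallow: / for a single bot is also reported; treat the output as hints for review.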

