UtilityStack

Robots.txt Generator — control crawlers, AI bots, indexing

Pick a preset (allow all, block AI bots, standard) or build per-bot rules manually. The robots.txt content is generated live, ready to paste at your domain root.

Example output (standard preset, which disallows /admin/ and query-string URLs):

User-agent: *
Disallow: /admin/
Disallow: /?

Sitemap: https://example.com/sitemap.xml


    What is robots.txt?

    robots.txt is a plain-text file at the root of your domain (https://example.com/robots.txt) that asks crawlers what they may and may not fetch. Honest crawlers — Googlebot, Bingbot, well-behaved researchers — read it before indexing. The Allow/Disallow rules are advisory: they don't prevent malicious bots, but they do prevent legitimate ones from indexing the wrong things.

    Since 2023, robots.txt has become the standard mechanism for opting out of AI training crawlers (GPTBot, ClaudeBot, CCBot, PerplexityBot…). Each company publishes the user-agent string they crawl with; you simply add a User-agent + Disallow: / pair per company. This generator includes presets for the major AI bots and the standard SEO defaults.
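
For example, the AI-bot preset emits one group per crawler. The sketch below uses the user-agent strings these companies have published; double-check their documentation before relying on it:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /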

    How to use this tool

    1. Pick a preset that matches your goal: allow everyone, block everyone, block only AI bots, or the standard pattern that hides /admin and query-string URLs.
    2. Refine the rules: add or remove user-agents, list paths to allow or disallow, paste your sitemap URL. Add a crawl-delay if a particular bot is hammering your server.
    3. Copy the generated robots.txt and place it at the root of your domain so it's accessible at https://yourdomain.com/robots.txt. Test with Google Search Console's robots.txt Tester to confirm the rules behave as expected.
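
As a sketch of what a finished file from these steps can look like (the paths, the throttled bot's name, and the sitemap URL are placeholders for your own values):

User-agent: *
Disallow: /admin/
Disallow: /?

User-agent: GPTBot
Disallow: /

User-agent: ExampleBot
Crawl-delay: 10

Sitemap: https://yourdomain.com/sitemap.xml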

    Frequently asked questions

    Does robots.txt block scraping?

    No. It's a polite request that well-behaved crawlers follow. Malicious or aggressive scrapers ignore it entirely. To enforce blocking, you need rate-limiting, IP firewalls, or authentication — robots.txt is for cooperation, not security.

    Will blocking GPTBot affect my Google ranking?

    No. GPTBot is OpenAI's training crawler; Google ranking uses Googlebot, which is a different user-agent. You can block AI training crawlers without affecting search visibility — just keep Googlebot allowed.

    Where do I put the file?

    At the root of your domain, served as plain text at https://example.com/robots.txt. It applies to that specific subdomain — robots.txt at the apex doesn't cover blog.example.com, which needs its own file.

    Can I use wildcards?

Most major crawlers (Google, Bing) support * to match any sequence of characters and $ to anchor a rule to the end of the URL. So Disallow: /*.pdf$ blocks PDF files, and Disallow: /admin/* blocks anything inside /admin/ (though the trailing * is redundant, since rules are prefix matches). Some older or smaller crawlers have more limited wildcard support.
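
As a sketch (the paths are invented), a group using both wildcards could look like:

User-agent: *
Disallow: /*.pdf$    # any URL ending in .pdf
Disallow: /*?        # any URL with a query string
Disallow: /admin/    # prefix match already covers everything under /admin/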

    What's Crawl-delay for?

    It tells the bot to wait N seconds between requests. Useful when a single crawler is overwhelming a small server. Google ignores Crawl-delay (use Search Console's crawl rate setting instead); Bing and Yandex do honor it.
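
A delay is just another line inside the relevant bot's group; for example, to ask Bing's crawler to wait ten seconds between requests:

User-agent: Bingbot
Crawl-delay: 10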

    Common use cases

    Where a thoughtful robots.txt prevents real problems.

    Hide internal areas

    Disallow /admin/, /staging/, /preview/, and search-result URLs (/search?q=). These pages should not appear in Google; if they do, that's wasted crawl budget plus possible duplicate-content issues.
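
A sketch for this case, assuming those paths exist on your site:

User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /preview/
Disallow: /search?    # internal search-result URLs like /search?q=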

    Opt out of AI training

If you don't want your content used to train LLMs, the easiest mitigation is the AI-bot block preset (GPTBot, ClaudeBot, CCBot, PerplexityBot). Companies that publish a crawler user-agent generally respect the block, and an explicit opt-out strengthens your position if a crawler that claims to honor robots.txt turns out to ignore it.

    Reference your sitemap

    Adding a Sitemap: URL in robots.txt is the lowest-friction way to point all crawlers at your sitemap. Search Console picks up the sitemap reference automatically — no manual submission needed.
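
The directive sits outside any User-agent group, takes a full URL, and can appear more than once if you have several sitemaps (URLs below are placeholders):

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog/sitemap.xml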

    Throttle aggressive bots

If a single bot is generating a large share of your traffic while delivering zero referral value, add a Crawl-delay rule for it (or block it outright with Disallow: /). Server CPU is finite; not every crawl is worth serving.
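
Both options, sketched with made-up bot names:

User-agent: HungryCrawler       # throttle it, if it honors Crawl-delay
Crawl-delay: 30

User-agent: UselessScraperBot   # or shut it out entirely
Disallow: /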

    Tips and shortcuts

    Habits that keep robots.txt useful and safe.

    Always allow CSS and JS

    Googlebot needs to fetch CSS and JavaScript to render your pages. Blocking them via Disallow: /static/ or similar can break ranking. Verify using the URL Inspection tool in Search Console before disallowing static asset directories.
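
If you do need to disallow part of a static directory, a more specific Allow wins over a broader Disallow under Google's longest-match rule; a sketch with hypothetical directory names:

User-agent: *
Disallow: /static/
Allow: /static/css/
Allow: /static/js/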

    Don't use robots.txt for hiding secrets

    Adding a path to Disallow makes it discoverable — anyone can read your robots.txt and see exactly what you're trying to hide. Sensitive URLs need authentication, not robots.txt entries.

    Test before deploying

Use Google Search Console's robots.txt Tester to paste your draft and check that specific URLs would be allowed or disallowed as you expect. A single typo can block crawlers from half your site overnight.
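
For a quick local check alongside Search Console, Python's standard library can evaluate a draft against specific URLs. Note that urllib.robotparser implements the original exclusion standard and does not interpret Google-style * and $ wildcards, so keep the rules you test simple. A rough sketch:

from urllib.robotparser import RobotFileParser

# Draft robots.txt to sanity-check before deploying.
draft = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
"""

parser = RobotFileParser()
parser.parse(draft.splitlines())

# How would Googlebot treat these URLs under the draft rules?
for url in ("https://example.com/", "https://example.com/admin/users"):
    verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
    print(url, "->", verdict)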

    Pair with noindex for soft-deletion

If you want a page out of the index, use a meta robots noindex tag and keep the page crawlable so Google can actually see that tag; don't Disallow it. Disallowing prevents the crawler from ever seeing the noindex, so the URL can stay indexed indefinitely.
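
In HTML the tag goes in the page's <head>:

<meta name="robots" content="noindex">

For non-HTML files (PDFs, images), the equivalent is an X-Robots-Tag: noindex HTTP response header.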
