Allowing or Denying AI Bot Browsing of Your Website

Website Tools & SEO | Updated June 2026

AI crawlers from ChatGPT, Claude, Perplexity, Google's Gemini training pipeline, and a growing list of others now make up a measurable share of bot traffic across the web. This guide walks through the two practical ways to control whether they read your site on Ultra Web Hosting: a robots.txt file for cooperative bots, and .htaccess rules for hard enforcement when robots.txt is not enough.

When robots.txt Is Not Enough: htaccess for Hard Enforcement

robots.txt is a request, not a wall. If you see repeated hits from a user-agent that claims to honor robots.txt but does not, or from a scraper that openly ignores it, an .htaccess rule returns 403 Forbidden at the server before any page renders. Section 06 has the exact directive.

  • Returns 403 to the listed user-agents, bypassing the politeness contract entirely
  • Adds negligible overhead for legitimate visitors (one pattern match against the User-Agent header per request)
  • Easy to extend as new bots appear

01. What AI Bots Are and Why They Visit Your Site

AI bots are crawlers operated by companies building large language models or AI-powered search products. They fetch the text on your pages either to (a) train the next generation of a model or (b) answer a user's real-time question by reading your site in the moment. They look like ordinary HTTP clients to your server. The thing distinguishing them is the User-Agent header they send.

The major operators and their declared user-agents as of mid-2026:

  • OpenAI - GPTBot (training), ChatGPT-User and OAI-SearchBot (live answers)
  • Anthropic - ClaudeBot, anthropic-ai, Claude-Web
  • Google - Google-Extended (controls AI training use of Googlebot's crawl)
  • Perplexity - PerplexityBot, Perplexity-User
  • ByteDance / TikTok - Bytespider
  • Meta - FacebookBot, Meta-ExternalAgent
  • Apple - Applebot-Extended
  • Amazon - Amazonbot
  • Common Crawl - CCBot (builds a public dataset that many AI training pipelines draw on)
  • Cohere - cohere-ai
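  • Diffbot - Diffbot (structured-data extraction)

Tip

Google-Extended is the odd one out. It is a robots.txt token rather than a separate crawler: it controls whether content fetched by Google's existing crawlers can be used for training Gemini, separate from search indexing, and no request ever arrives with Google-Extended in its User-Agent header. Apple's Applebot-Extended works the same way. Blocking Google-Extended does not hide you from Google Search. Blocking Googlebot does.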
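As a rough illustration of how a server or log script can tell these crawlers apart, the sketch below checks a User-Agent string for the tokens listed above. The function name identify_ai_operator and the sample header strings are made up for this example; real bots send longer strings that change between versions. Google-Extended and Applebot-Extended are left out because they are robots.txt tokens and are not sent in the User-Agent header.

```python
# Hypothetical helper: map a User-Agent header to the operator of a known AI crawler.
# The tokens are copied from the operator list above.
AI_BOT_TOKENS = {
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "anthropic-ai": "Anthropic",
    "Claude-Web": "Anthropic",
    "PerplexityBot": "Perplexity",
    "Perplexity-User": "Perplexity",
    "Bytespider": "ByteDance",
    "FacebookBot": "Meta",
    "Meta-ExternalAgent": "Meta",
    "Amazonbot": "Amazon",
    "CCBot": "Common Crawl",
    "cohere-ai": "Cohere",
    "Diffbot": "Diffbot",
}


def identify_ai_operator(user_agent):
    """Return the operator name if the header contains a known token, else None."""
    lowered = user_agent.lower()
    for token, operator in AI_BOT_TOKENS.items():
        if token.lower() in lowered:
            return operator
    return None


# Sample headers are illustrative, not exact copies of what each bot sends.
print(identify_ai_operator("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(identify_ai_operator("Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"))
```

The match is a case-insensitive substring test, which is also how the .htaccess rules later in this guide treat the header.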

02. robots.txt vs htaccess: When to Use Which

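There are two layers to bot control, and most sites should use both. The first is robots.txt, a plain-text file in your document root that asks crawlers to stay out and relies on them to cooperate. The second is optional.

htaccess: Add This If Needed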

An Apache config file that lets you return 403 Forbidden based on the User-Agent header. The block happens at the server level before WordPress or any other application runs. This is hard enforcement.

  • Cannot be ignored - the 403 is sent regardless of bot intent
  • Slight risk - a malformed rule can block legitimate visitors
  • Less universal - matches only the user-agent string, not behavior, so a bot that sends a browser user-agent gets through
  • Best for: scrapers, persistent ignorers, custom enforcement

The typical sequence: start with robots.txt for the politeness path, then add .htaccess only for the bots you observe ignoring robots.txt in your logs.

03. Creating Your robots.txt File

  1. Log in to cPanel for the domain you want to update. From the dashboard, open File Manager.
  2. Navigate to public_html, or whichever directory is the document root for the domain. Addon domains use a subfolder.
  3. Check for an existing robots.txt. If it is already there, click it and choose Edit. If not, click + File at the top, name the new file robots.txt (lowercase, no extension other than .txt), and open it for editing.
  4. Add your rules from Section 04 or Section 05. Each rule is a User-agent line followed by one or more Allow or Disallow lines, with a blank line between groups.
  5. Save and close. The file is live immediately.
  6. Verify by visiting https://yourdomain.com/robots.txt in your browser. You should see the file's contents.

Important

The file must be named exactly robots.txt, all lowercase, in the document root of the domain you want to control. Bots only check the root path. A robots.txt inside /wp-content/ or any subfolder is ignored.
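To make the root-path rule concrete, here is a small sketch that derives the one robots.txt address a crawler would consult for any given page. The function name robots_url and the example URLs are placeholders for illustration. It also shows that a subdomain is a separate host and needs its own file.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the only robots.txt location a crawler consults for this page."""
    parts = urlsplit(page_url)
    # Keep scheme and host (with any port), replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yourdomain.com/wp-content/uploads/report.pdf"))
# https://yourdomain.com/robots.txt
print(robots_url("https://shop.yourdomain.com/cart?item=3"))
# https://shop.yourdomain.com/robots.txt
```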

04. robots.txt Rules to Block Common AI Bots

The block-list approach. Search engines (Googlebot, Bingbot, etc.) still crawl normally. Only the AI agents listed below are asked to stay out.

# Block AI crawlers, allow everything else
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

# Allow all other bots
User-agent: *
Allow: /

Save that as your robots.txt. The final User-agent: * block makes the intent explicit for search engines and other crawlers.
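Before uploading, you can dry-run the rules locally. The sketch below uses Python's standard-library robots.txt parser on a shortened copy of the block list; the helper name is_allowed is an assumption for this example. Python's parser is only one implementation, so treat the result as a sanity check on your syntax, not a guarantee of how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the block list above; paste your full file here to test it.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url="https://yourdomain.com/"):
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "ClaudeBot", "Googlebot", "Bingbot"):
    print(agent, "allowed" if is_allowed(ROBOTS_TXT, agent) else "blocked")
```

With this sample, the two AI agents come back blocked and the two search engines come back allowed.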

05. robots.txt Rules to Allow Only Specific Bots

The opposite approach: deny everything, then carve out exceptions for the bots you actually want. This is stricter and lower-maintenance (no need to add each new AI crawler to a block list), but you have to remember every legitimate bot you want to allow.

# Allow only major search engines, block everything else
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: DuckDuckBot
Allow: /

User-agent: Slurp
Allow: /

User-agent: Baiduspider
Allow: /

User-agent: YandexBot
Allow: /

User-agent: *
Disallow: /

Tip

The order of blocks matters less than the specificity. Bots match the most specific User-agent block that applies to them. Googlebot will follow the User-agent: Googlebot block and ignore User-agent: *.
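A quick way to see specificity winning over order is to write the catch-all group first and check the outcome. This sketch again uses Python's standard-library parser as a stand-in for a well-behaved crawler, which is an assumption; each real bot uses its own parser.

```python
from urllib.robotparser import RobotFileParser

# Same allow-list idea as above, but with the catch-all group written first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_ok = parser.can_fetch("Googlebot", "https://yourdomain.com/")
gptbot_ok = parser.can_fetch("GPTBot", "https://yourdomain.com/")
print("Googlebot:", googlebot_ok)  # True
print("GPTBot:", gptbot_ok)        # False
```

Googlebot follows its own group even though the catch-all appears earlier in the file, while GPTBot has no group of its own and falls back to the catch-all.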

06. Blocking AI Bots with htaccess (Hard Block)

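If a bot is hitting your site despite robots.txt, an .htaccess rule returns 403 Forbidden at the Apache level. Add the following to the .htaccess file in public_html (create the file if it does not exist). On a WordPress site, put it above the # BEGIN WordPress block, because WordPress's own rules stop rewrite processing for most page requests and anything below them is often never reached.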

<IfModule mod_rewrite.c>
RewriteEngine On

# Block AI training and retrieval bots
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|OAI-SearchBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (ClaudeBot|anthropic-ai|Claude-Web) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (PerplexityBot|Perplexity-User) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Bytespider|CCBot|cohere-ai|Diffbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (FacebookBot|Meta-ExternalAgent) [NC,OR]
# Google-Extended and Applebot-Extended are left out: they are robots.txt tokens and never appear in a User-Agent header
RewriteCond %{HTTP_USER_AGENT} (Amazonbot) [NC]
RewriteRule .* - [F,L]
</IfModule>

What each piece does:

  • RewriteCond %{HTTP_USER_AGENT} - checks the requesting client's User-Agent header against the pattern in parentheses.
  • [NC] - case-insensitive match.
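  • [OR] - combines this condition with the next one as a logical OR; without it, consecutive conditions are combined with AND. The final condition omits [OR] because no condition follows it.
  • RewriteRule .* - [F,L] - if any condition matched, return 403 Forbidden ([F]) and stop processing ([L]). [F] already implies [L], so the L is redundant but harmless.

Test Carefully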

A bad .htaccess rule can produce a 500 Internal Server Error and take your site offline. After saving changes, immediately load your homepage in a private browser window. If you get a 500, edit the file and revert. See our Complete Guide to htaccess for the full syntax reference.
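If you want to reason about which user-agents the chain would catch before touching the live file, the matching logic can be approximated in Python. This is a sketch, not Apache itself: the function name would_get_403 is an assumption, and the pattern lists the tokens from the rule above that actually arrive in a User-Agent header (Google-Extended and Applebot-Extended never do, so they are omitted here).

```python
import re

# The alternations from the RewriteCond lines, joined into one pattern.
# re.IGNORECASE plays the role of [NC]; the "|" between groups plays the role of [OR].
BLOCKED = re.compile(
    r"GPTBot|ChatGPT-User|OAI-SearchBot"
    r"|ClaudeBot|anthropic-ai|Claude-Web"
    r"|PerplexityBot|Perplexity-User"
    r"|Bytespider|CCBot|cohere-ai|Diffbot"
    r"|FacebookBot|Meta-ExternalAgent"
    r"|Amazonbot",
    re.IGNORECASE,
)


def would_get_403(user_agent):
    """True if the pattern matches anywhere in the header, as an unanchored RewriteCond does."""
    return BLOCKED.search(user_agent) is not None


# Sample headers are illustrative.
for ua in ("gptbot/1.0", "Mozilla/5.0 (compatible; Googlebot/2.1)", "Mozilla/5.0 (Macintosh) Safari/605.1.15"):
    print(would_get_403(ua), ua)
```

Because the match is unanchored, a token appearing anywhere in the header triggers the block, which is why the real rule works even though bots wrap their names in longer Mozilla-style strings.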

07. Allowing Only Search Engines with htaccess

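A stricter pattern: block any user-agent that looks like a bot but is not on your allow-list. This catches many future AI bots automatically, because most crawlers identify themselves with "bot", "crawler", or "spider" in the user-agent string. Agents that use none of those words, such as ChatGPT-User, still need an explicit rule like the one in Section 06.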

<IfModule mod_rewrite.c>
RewriteEngine On

# Block anything that claims to be a bot or crawler,
# unless it is one of the search engines on the allow-list
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider|scrape|fetch) [NC]
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Bingbot|DuckDuckBot|Slurp|Baiduspider|YandexBot|Applebot) [NC]
RewriteRule .* - [F,L]
</IfModule>

The two conditions are combined with AND: a request is refused only if its user-agent contains a bot-like word and is not on the allow-list. The allow-list is written as a negated condition rather than a separate pass-through rule ending in [L], because [L] would stop rewrite processing for the trusted bots and any rules further down the file, such as WordPress permalink rules, would never run for them. Real browser visitors are not affected because they do not have "bot" or "crawler" in their user-agent.

Note

This pattern is more aggressive. It will also block tools like uptime monitors, RSS readers, link checkers, and any legitimate service whose user-agent includes "bot" or "crawler". If you use those services, add their user-agents to the allow-list in the first rule.
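To preview which of your own tools this pattern would catch, the allow-list and block pattern can be approximated in Python. This is a sketch of the decision only, under the assumption that the two patterns are exactly those shown above; the function name is_forbidden and the sample user-agent strings are illustrative.

```python
import re

ALLOW = re.compile(
    r"Googlebot|Bingbot|DuckDuckBot|Slurp|Baiduspider|YandexBot|Applebot",
    re.IGNORECASE,
)
BOT_LIKE = re.compile(r"bot|crawler|spider|scrape|fetch", re.IGNORECASE)


def is_forbidden(user_agent):
    """Allow-listed agents pass; any other agent containing a bot-like word gets 403."""
    if ALLOW.search(user_agent):
        return False
    return BOT_LIKE.search(user_agent) is not None


samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1)",     # allow-listed search engine
    "Mozilla/5.0 (compatible; GPTBot/1.0)",        # bot-like, not allow-listed
    "Mozilla/5.0 (compatible; UptimeRobot/2.0)",   # monitoring tool caught by "bot"
    "Mozilla/5.0 (Windows NT 10.0) Chrome/125.0",  # ordinary browser
    "ChatGPT-User/1.0",                            # AI agent with no bot-like word
]
for ua in samples:
    print("403 " if is_forbidden(ua) else "pass", ua)
```

The third and fifth samples show both edges of the pattern: a monitoring tool is refused because "Robot" contains "bot", while an AI agent without any of the trigger words passes untouched.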

08. Verifying Your Rules Are Working

After deploying the .htaccess rules, test from the command line by sending requests with the blocked user-agents. robots.txt does not change what the server returns, so these checks only exercise .htaccess. From any machine with curl installed:

# Should return 403 Forbidden after .htaccess block
curl -A "GPTBot" -I https://yourdomain.com/

# Should return 200 OK (Googlebot is allowed)
curl -A "Googlebot" -I https://yourdomain.com/

# Should return 200 OK (regular browser)
curl -I https://yourdomain.com/

The first command should return HTTP/2 403 or HTTP/1.1 403 Forbidden. The other two should return a 200 status.
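If you have many user-agents to check, the same test can be scripted. The sketch below sends a HEAD request per agent, much as curl -A ... -I does, using only the Python standard library. The function names and the yourdomain.com placeholder are assumptions for this example, and connection errors are deliberately left unhandled to keep it short.

```python
import urllib.error
import urllib.request


def build_head_request(url, user_agent):
    """Build a HEAD request carrying the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent}, method="HEAD")


def status_for(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends to this User-Agent."""
    try:
        with urllib.request.urlopen(build_head_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is what we want.
        return err.code


def check_site(url, agents):
    """Print one line per agent with the status code it receives."""
    for agent in agents:
        print(agent, status_for(url, agent))
```

Calling check_site("https://yourdomain.com/", ["GPTBot", "ClaudeBot", "CCBot", "Googlebot"]) with your own domain should print 403 next to each blocked agent and 200 next to Googlebot.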

For robots.txt verification, simply visit https://yourdomain.com/robots.txt in a browser. The file should display as plain text with your rules. Then check your access logs in cPanel (Metrics > Raw Access) over the next several days to confirm the bot user-agents are absent or returning 403.
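Reading the raw log by eye gets tedious, so a short script can tally status codes per bot. The sketch below assumes the log is in Apache's combined format; the pattern, the WATCHED list, and the sample lines are illustrative and should be adjusted to match your own log.

```python
import re
from collections import Counter

# Assumes Apache "combined" log format: request line, status, size, referrer, user-agent.
LINE = re.compile(r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"')
WATCHED = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")


def tally(lines, tokens=WATCHED):
    """Count (token, status) pairs for log lines whose User-Agent contains a token."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        for token in tokens:
            if token.lower() in agent:
                counts[(token, match.group("status"))] += 1
    return counts


# Invented sample lines using documentation IP ranges.
sample = [
    '203.0.113.5 - - [10/Jun/2026:12:00:01 +0000] "GET / HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [10/Jun/2026:12:00:09 +0000] "GET /blog/ HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/Jun/2026:12:01:30 +0000] "GET /about/ HTTP/1.1" 200 5120 "-" "CCBot/2.0"',
]
for (token, status), n in sorted(tally(sample).items()):
    print(token, status, n)
```

A 200 next to a bot you meant to block is the signal to look at: either the bot is ignoring robots.txt, or the .htaccess rule is not being reached.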

Tip

Google publishes a robots.txt report inside Search Console: Settings > robots.txt > Open report. It shows you how Googlebot reads your file. There is no equivalent tool for AI bots, so curl (for .htaccess blocks) and your access logs (for robots.txt compliance) are the most reliable checks.

09. Should You Block AI Bots At All?

There is a legitimate trade-off here. Reasons to allow AI bots:

  • Visibility in AI answers. When someone asks Perplexity or ChatGPT about your product or topic, your content can appear in the answer. Blocking the retrieval bots removes you from that surface.
  • Brand authority. Being cited as a source in AI responses builds the same kind of reputation that being cited in Wikipedia or major publications does.
  • Future search. Google's AI Overviews and similar features pull from indexed content. Blocking too aggressively can hurt your visibility in search results, not just in AI summaries.

Reasons to block them:

  • You do not want your content reused for training. AI models trained on your work may produce derivative output that competes with you, without attribution or compensation.
  • Bandwidth and server load. Aggressive crawlers can eat a meaningful share of your hosting resources on busy sites. See Resource Limit Reached errors for what happens at the limit.
  • You have a content moat. Paid newsletters, premium guides, or proprietary research are typically blocked to keep value behind your paywall.

A common middle ground is to allow live-retrieval bots (so you appear in real-time AI answers) but block training bots (so your content is not memorized into a model). With that goal, block GPTBot, Google-Extended, ClaudeBot, anthropic-ai, Bytespider, CCBot, Applebot-Extended, Meta-ExternalAgent, Amazonbot, cohere-ai, and Diffbot, and allow ChatGPT-User, OAI-SearchBot, Claude-Web, and Perplexity-User. PerplexityBot and FacebookBot from Section 01 are not assigned to either group here; check each operator's current documentation before deciding how to treat them.
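One way to produce a robots.txt for this middle ground is to generate it from the block list, so the file and the list cannot drift apart. The sketch below is a minimal version under that assumption; build_robots_txt is a hypothetical helper, and the final check uses Python's standard-library parser, which may not match every crawler's behavior.

```python
from urllib.robotparser import RobotFileParser

# Training crawlers to block, copied from the paragraph above.
TRAINING_BOTS = [
    "GPTBot", "Google-Extended", "ClaudeBot", "anthropic-ai", "Bytespider",
    "CCBot", "Applebot-Extended", "Meta-ExternalAgent", "Amazonbot",
    "cohere-ai", "Diffbot",
]


def build_robots_txt(blocked):
    """Emit one Disallow group per blocked agent, then an allow-all default group."""
    groups = [f"User-agent: {name}\nDisallow: /\n" for name in blocked]
    groups.append("User-agent: *\nAllow: /\n")
    return "\n".join(groups)


robots_txt = build_robots_txt(TRAINING_BOTS)

# Sanity check with Python's own parser before uploading the file.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
for agent in ("GPTBot", "CCBot", "ChatGPT-User", "Perplexity-User", "Googlebot"):
    print(agent, parser.can_fetch(agent, "https://yourdomain.com/"))
```

The training crawlers come back False and the live-retrieval agents and Googlebot come back True, because only the listed agents have a Disallow group and everything else falls through to the allow-all default.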

Going Further

If you are on a busy site and getting hammered by scrapers that ignore robots.txt and slip past .htaccess rules by spoofing their user-agent, Cloudflare's free plan includes bot-blocking features that run at the edge and look at more than the user-agent string. See our Cloudflare Setup Guide for how to put your site behind it.

Need Help Configuring This?

If you would like a hand customizing the rules for your specific site or want us to deploy them for you, open a support ticket. Our team handles robots.txt and .htaccess setups on shared, VPS, and dedicated plans.


Quick Recap: AI Bot Control in Five Steps

If you only do five things from this guide, do these:

  1. Decide your goal - block training crawlers only, block all AI bots, or block everything except major search engines.
  2. Start with robots.txt - drop the block list from Section 04 into public_html/robots.txt. Covers cooperative bots with zero risk.
  3. Add htaccess only if needed - if logs show bots ignoring robots.txt, deploy the rule from Section 06. Test the homepage in a private window immediately after saving.
  4. Verify with curl - send a request with each blocked user-agent and confirm you get 403. Real browsers should still get 200 OK.
  5. Re-check quarterly - new AI bots appear every few months. Audit your logs for unfamiliar user-agents and extend the rules as needed.

Last updated June 2026
