AI crawlers from ChatGPT, Claude, Perplexity, Google's Gemini training pipeline, and a growing list of others now make up a measurable share of bot traffic across the web. This guide walks through the two practical ways to control whether they read your site on Ultra Web Hosting: a robots.txt file for cooperative bots, and .htaccess rules for hard enforcement when robots.txt is not enough.
- What AI Bots Are and Why They Visit Your Site
- robots.txt vs htaccess: When to Use Which
- Creating Your robots.txt File
- robots.txt Rules to Block Common AI Bots
- robots.txt Rules to Allow Only Specific Bots
- Blocking AI Bots with htaccess (Hard Block)
- Allowing Only Search Engines with htaccess
- Verifying Your Rules Are Working
- Should You Block AI Bots At All?
Drop-in robots.txt for AI Bots
Most AI bot operators publish their user-agent string and honor robots.txt directives. A single text file in your web root covers the major AI training crawlers in seconds, with no server changes required. Section 04 has a copy-paste block list you can use as-is.
- Works for GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider, CCBot, and more
- No restart, no downtime, no risk of breaking your site
- Takes effect on the bot's next crawl cycle (typically within 24 hours)
htaccess for Hard Enforcement
robots.txt is a request, not a wall. If you see repeated hits from a user-agent that claims to honor robots.txt but does not, or from a scraper that openly ignores it, an .htaccess rule returns 403 Forbidden at the server before any page renders. Section 06 has the exact directive.
- Returns 403 to the listed user-agents, bypassing the politeness contract entirely
- Adds negligible overhead for legitimate visitors (the match runs in microseconds)
- Easy to extend as new bots appear
01. What AI Bots Are and Why They Visit Your Site
AI bots are crawlers operated by companies building large language models or AI-powered search products. They fetch the text on your pages either to (a) train the next generation of a model or (b) answer a user's real-time question by reading your site in the moment. They look like ordinary HTTP clients to your server. The thing distinguishing them is the User-Agent header they send.
The major operators and their declared user-agents as of mid-2026:
- OpenAI - GPTBot (training), ChatGPT-User and OAI-SearchBot (live answers)
- Anthropic - ClaudeBot, anthropic-ai, Claude-Web
- Google - Google-Extended (controls AI training use of Googlebot's crawl)
- Perplexity - PerplexityBot, Perplexity-User
- ByteDance / TikTok - Bytespider
- Meta - FacebookBot, Meta-ExternalAgent
- Apple - Applebot-Extended
- Amazon - Amazonbot
- Common Crawl - CCBot (a public dataset that most AI training pulls from)
- Cohere - cohere-ai
- Diffbot - Diffbot (structured-data extraction)
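Google-Extended is the odd one out: it controls whether content from Google's existing Googlebot crawl can be used for training Gemini, separate from search indexing. It is a robots.txt token only and does not send its own User-Agent header, so robots.txt is the only place it has any effect. Blocking Google-Extended does not hide you from Google Search. Blocking Googlebot does.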
02. robots.txt vs htaccess: When to Use Which
There are two layers to bot control, and most sites should use both.
robots.txt
A text file at the root of your domain (yourdomain.com/robots.txt) that lists user-agents and the paths they may or may not crawl. It is a convention, not a security measure. Reputable bot operators check it before crawling and respect the directives.
- Easy to maintain - it is a plain text file
- No server config risk - a typo cannot break your site
- Universal - every documented AI bot supports it
- But: bots that ignore it (or spoof a different user-agent) get through
htaccess
An Apache config file that lets you return 403 Forbidden based on the User-Agent header. The block happens at the server level before WordPress or any other application runs. This is hard enforcement.
- Cannot be ignored - the 403 is sent regardless of bot intent
- Slight risk - a malformed rule can block legitimate visitors
- Less universal - blocks only by user-agent string, not by behavior
- Best for: scrapers, persistent ignorers, custom enforcement
The typical sequence: start with robots.txt for the politeness path, then add .htaccess only for the bots you observe ignoring robots.txt in your logs.
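One rough way to spot bots that keep fetching pages is to tally requests per bot token in a downloaded access log. The sketch below assumes the Apache "combined" log format; the function name `ai_bot_hits`, the `AI_BOTS` variable, and the log filename in the usage line are hypothetical.

```shell
# Sketch: count page requests (anything except /robots.txt) per AI bot token.
# Assumes Apache "combined" log format: when a line is split on double
# quotes, field 2 is the request line and field 6 is the user-agent.
# Google-Extended and Applebot-Extended are left out on the assumption that
# they are robots.txt tokens and never appear as a request user-agent.
AI_BOTS="GPTBot ChatGPT-User OAI-SearchBot ClaudeBot anthropic-ai Claude-Web PerplexityBot Perplexity-User Bytespider CCBot FacebookBot Meta-ExternalAgent Amazonbot cohere-ai Diffbot"

ai_bot_hits() {
  # Reads log lines on stdin; prints "<count> <bot>", busiest bot first.
  awk -F'"' -v bots="$AI_BOTS" '
    BEGIN { n = split(bots, bot, " ") }
    {
      split($2, req, " ")                 # req[2] is the requested path
      if (req[2] == "/robots.txt") next   # fetching robots.txt is expected
      ua = tolower($6)
      for (i = 1; i <= n; i++)
        if (index(ua, tolower(bot[i])) > 0) hits[bot[i]]++
    }
    END { for (b in hits) print hits[b], b }
  ' | sort -rn
}

# Usage (hypothetical filename):
#   ai_bot_hits < yourdomain.com-access.log
```

A bot that has a Disallow: / group in robots.txt but still shows page requests several days later is a candidate for the .htaccess block.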
03. Creating Your robots.txt File
- Log in to cPanel for the domain you want to update. From the dashboard, open File Manager.
- Navigate to public_html (or whichever directory is the document root for the domain; addon domains use a subfolder).
- Check for an existing robots.txt. If it is already there, click it and choose Edit. If not, click + File at the top, name the new file robots.txt (lowercase, no extension other than .txt), and open it for editing.
- Add your rules from Section 04 or Section 05. Each rule is a User-agent line followed by one or more Allow or Disallow lines, with a blank line between groups.
- Save and close. The file is live immediately.
- Verify by visiting https://yourdomain.com/robots.txt in your browser. You should see the file's contents.
The file must be named exactly robots.txt, all lowercase, in the document root of the domain you want to control. Bots only check the root path. A robots.txt inside /wp-content/ or any subfolder is ignored.
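If you have shell access (SSH or the cPanel Terminal, which this guide does not otherwise assume), a short sketch like the one below can flag copies of the file that bots would never read. The function name `check_robots_location` and the path in the usage line are hypothetical.

```shell
# Sketch: list every file under a document root whose name matches
# robots.txt in any letter case, and mark the ones bots will not read.
check_robots_location() {
  # $1 = document root, without a trailing slash
  local docroot="$1"
  find "$docroot" -type f -iname 'robots.txt' | sort | while IFS= read -r f; do
    if [ "$f" = "$docroot/robots.txt" ]; then
      echo "OK      $f"      # exact name, at the root
    else
      echo "IGNORED $f"      # wrong letter case or inside a subfolder
    fi
  done
}

# Usage (hypothetical path):
#   check_robots_location "$HOME/public_html"
```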
04. robots.txt Rules to Block Common AI Bots
The block-list approach. Search engines (Googlebot, Bingbot, etc.) still crawl normally. Only the AI agents listed below are asked to stay out.
# Block AI crawlers, allow everything else
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
# Allow all other bots
User-agent: *
Allow: /
Save that as your robots.txt. The final User-agent: * block makes the intent explicit for search engines and other crawlers.
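As the list of AI crawlers grows, it can help to check an existing robots.txt against the bot names you care about. The following is a minimal sketch, assuming a local copy of the file with one directive per line and no trailing whitespace; the function name `missing_bots` is hypothetical.

```shell
# Sketch: print every bot token that has no "User-agent:" line in a robots.txt.
missing_bots() {
  # $1 = path to robots.txt, remaining args = bot tokens to look for
  local file="$1"; shift
  local bot
  for bot in "$@"; do
    # -i: user-agent tokens are matched case-insensitively
    # -x: the whole line must match; -F: treat the token as a literal string
    if ! grep -qixF "User-agent: $bot" "$file"; then
      echo "$bot"
    fi
  done
}

# Usage:
#   missing_bots robots.txt GPTBot ClaudeBot CCBot Bytespider
```

An empty result means every token you passed already has its own group.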
05. robots.txt Rules to Allow Only Specific Bots
The opposite approach: deny everything, then carve out exceptions for the bots you actually want. This is stricter and lower-maintenance (no need to add each new AI crawler to a block list), but you have to remember every legitimate bot you want to allow.
# Allow only major search engines, block everything else
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: DuckDuckBot
Allow: /
User-agent: Slurp
Allow: /
User-agent: Baiduspider
Allow: /
User-agent: YandexBot
Allow: /
User-agent: *
Disallow: /
The order of the groups does not matter; specificity does. A compliant bot follows the most specific User-agent group that matches its name and falls back to User-agent: * only when no named group matches. Googlebot will follow the User-agent: Googlebot block and ignore User-agent: *.
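To see the group selection in action, the sketch below prints the rules a given bot would follow from a local robots.txt. It is a simplified approximation (exact, case-insensitive token match with a fallback to the * group), not a full parser, and the function name `rules_for` is hypothetical.

```shell
# Sketch: show which Allow/Disallow rules apply to one bot token.
# Simplification: a named group wins if its token equals the bot name;
# otherwise the "*" group is used. Output keys are lowercased.
rules_for() {
  # $1 = path to robots.txt, $2 = bot token such as Googlebot
  awk -v want="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')" '
    {
      sub(/#.*$/, "")                      # drop comments
      pos = index($0, ":")
      if (pos == 0) next                   # blank or malformed line
      key = tolower(substr($0, 1, pos - 1))
      val = substr($0, pos + 1)
      gsub(/^[ \t]+|[ \t\r]+$/, "", key)
      gsub(/^[ \t]+|[ \t\r]+$/, "", val)
    }
    key == "user-agent" {
      # A User-agent line that follows rule lines starts a new group.
      if (in_rules) { is_want = 0; is_star = 0; in_rules = 0 }
      if (tolower(val) == want) { is_want = 1; found = 1 }
      if (val == "*") is_star = 1
      next
    }
    key == "allow" || key == "disallow" {
      in_rules = 1
      if (is_want) named = named key ": " val "\n"
      if (is_star) star  = star  key ": " val "\n"
    }
    END { printf "%s", (found ? named : star) }
  ' "$1"
}

# Usage:
#   rules_for robots.txt Googlebot
#   rules_for robots.txt GPTBot
```

Run against the allow-list file above, the first call reports the Googlebot group and the second falls through to the * group.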
06. Blocking AI Bots with htaccess (Hard Block)
If a bot is hitting your site despite robots.txt, an .htaccess rule returns 403 Forbidden at the Apache level. Add this to your .htaccess file in public_html (create one if it does not exist):
<IfModule mod_rewrite.c>
RewriteEngine On
# Block AI training and retrieval bots
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|OAI-SearchBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (ClaudeBot|anthropic-ai|Claude-Web) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (PerplexityBot|Perplexity-User) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Bytespider|CCBot|cohere-ai|Diffbot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (FacebookBot|Meta-ExternalAgent) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Applebot-Extended|Google-Extended|Amazonbot) [NC]
RewriteRule .* - [F,L]
</IfModule>
What each piece does:
- RewriteCond %{HTTP_USER_AGENT} - checks the requesting client's User-Agent header against the pattern in parentheses.
- [NC] - case-insensitive match.
- [OR] - combines this condition with the next one as a logical OR (without it, consecutive RewriteCond lines are ANDed). The final condition omits OR so the chain terminates.
- RewriteRule .* - [F,L] - if any condition matched, return 403 Forbidden ([F]) and stop processing ([L]).
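Before touching the live file, you can preview which user-agent strings the conditions would reject. The sketch below approximates the rule with grep; it assumes that grep's extended regex behaves the same as Apache's regex engine for a plain alternation like this one. The names `BLOCK_PATTERN` and `would_block` are hypothetical.

```shell
# Sketch: approximate the RewriteCond chain locally.
# The alternation joins every token from the rule; -i plays the role of [NC].
BLOCK_PATTERN='GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Claude-Web|PerplexityBot|Perplexity-User|Bytespider|CCBot|cohere-ai|Diffbot|FacebookBot|Meta-ExternalAgent|Applebot-Extended|Google-Extended|Amazonbot'

would_block() {
  # $1 = full User-Agent string; prints the status the rule should produce
  if printf '%s\n' "$1" | grep -Eiq "$BLOCK_PATTERN"; then
    echo 403
  else
    echo 200
  fi
}

# Usage:
#   would_block "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
#   would_block "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```

Because the match is an unanchored substring test, any user-agent that merely contains one of the tokens is rejected as well.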
A bad .htaccess rule can produce a 500 Internal Server Error and take your site offline. After saving changes, immediately load your homepage in a private browser window. If you get a 500, edit the file and revert. See our Complete Guide to htaccess for the full syntax reference.
07. Allowing Only Search Engines with htaccess
A stricter pattern: block any user-agent that looks like a bot but is not on your allow-list. This catches future AI bots automatically because they generally identify themselves with "bot" or "crawler" in the user-agent string.
<IfModule mod_rewrite.c>
RewriteEngine On
# Let legitimate search engines through
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Bingbot|DuckDuckBot|Slurp|Baiduspider|YandexBot|AppleBot) [NC]
RewriteRule .* - [L]
# Block anything else that claims to be a bot or crawler
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider|scrape|fetch) [NC]
RewriteRule .* - [F,L]
</IfModule>
The first rule short-circuits (with [L]) for trusted search engines, so the block rule that follows only applies to bots that did not match. Real browser visitors are normally unaffected because mainstream browser user-agents do not contain "bot", "crawler", or the other matched words.
This pattern is more aggressive. It will also block tools like uptime monitors, RSS readers, link checkers, and any legitimate service whose user-agent includes "bot" or "crawler". If you use those services, add their user-agents to the allow-list in the first rule.
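The two-stage logic can be previewed locally before you deploy it. The sketch below mirrors the allow rule and the generic block rule with grep, on the assumption that grep's extended regex matches these simple alternations the same way Apache does. The names `ALLOW_PATTERN`, `GENERIC_PATTERN`, and `allowlist_status` are hypothetical.

```shell
# Sketch: approximate the allow-list rules locally. -i plays the role of [NC].
ALLOW_PATTERN='Googlebot|Bingbot|DuckDuckBot|Slurp|Baiduspider|YandexBot|AppleBot'
GENERIC_PATTERN='bot|crawler|spider|scrape|fetch'

allowlist_status() {
  # $1 = full User-Agent string; prints the status the rules should produce
  local ua="$1"
  if printf '%s\n' "$ua" | grep -Eiq "$ALLOW_PATTERN"; then
    echo 200        # first rule: trusted search engine, stop processing
  elif printf '%s\n' "$ua" | grep -Eiq "$GENERIC_PATTERN"; then
    echo 403        # second rule: generic bot-like user-agent
  else
    echo 200        # no rule matched: ordinary visitor
  fi
}

# Usage:
#   allowlist_status "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
#   allowlist_status "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)"
```

Feeding it the user-agent of each monitoring or feed service you rely on shows which ones need adding to the allow-list. Note that any string containing an allow-listed name, including a spoofed one, passes the first rule.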
08. Verifying Your Rules Are Working
After deploying either approach, test from the command line by sending requests with the offending user-agents. From any machine with curl installed:
# Should return 403 Forbidden after .htaccess block
curl -A "GPTBot" -I https://yourdomain.com/
# Should return 200 OK (Googlebot is allowed)
curl -A "Googlebot" -I https://yourdomain.com/
# Should return 200 OK (regular browser)
curl -I https://yourdomain.com/
The first command should return HTTP/2 403 or HTTP/1.1 403 Forbidden. The other two should return 200 OK.
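Checking many user-agents one at a time gets tedious, so the same curl call can be wrapped in a loop. This is a minimal sketch; the function name `check_agents` is hypothetical, and yourdomain.com is the same placeholder used above.

```shell
# Sketch: print the HTTP status code returned for each user-agent string.
check_agents() {
  # $1 = URL to request, remaining args = user-agent strings to send
  local url="$1"; shift
  local ua code
  for ua in "$@"; do
    # -s: no progress output, -I: HEAD request, -o /dev/null: discard headers,
    # -w '%{http_code}': print only the status code, -A: set the User-Agent
    code=$(curl -s -I -o /dev/null -w '%{http_code}' -A "$ua" "$url")
    printf '%s %s\n' "$code" "$ua"
  done
}

# Usage:
#   check_agents https://yourdomain.com/ GPTBot ClaudeBot CCBot Bytespider Googlebot
```

Each output line pairs a status code with the user-agent that produced it, so a blocked bot that still shows 200 stands out immediately.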
For robots.txt verification, simply visit https://yourdomain.com/robots.txt in a browser. The file should display as plain text with your rules. Then check your access logs in cPanel (Metrics > Raw Access) over the next several days to confirm the bot user-agents are absent or returning 403.
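Once you have downloaded a raw access log, a short tally of status codes for one bot shows whether it is absent, still receiving 200, or being refused with 403. The sketch below assumes the Apache "combined" log format; the function name `bot_status_counts` and the log filename are hypothetical.

```shell
# Sketch: count responses per HTTP status code for one bot token.
# Assumes "combined" format: split on double quotes, field 3 holds
# " <status> <bytes> " and field 6 holds the user-agent.
bot_status_counts() {
  # $1 = bot token (case-insensitive); log lines on stdin
  awk -F'"' -v bot="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" '
    index(tolower($6), bot) > 0 {
      split($3, f, " ")        # f[1] = status code, f[2] = response size
      count[f[1]]++
    }
    END { for (s in count) print s, count[s] }
  ' | sort
}

# Usage (hypothetical filename):
#   bot_status_counts GPTBot < yourdomain.com-access.log
```

No output means the bot did not appear in the log at all.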
Google provides a robots.txt report inside Search Console: Settings > robots.txt > Open report. It shows how Google fetched and parsed your file, including any warnings or errors. There is no equivalent tool for AI bots, so curl is the most reliable check.
09. Should You Block AI Bots At All?
There is a legitimate trade-off here. Reasons to allow AI bots:
- Visibility in AI answers. When someone asks Perplexity or ChatGPT about your product or topic, your content can appear in the answer. Blocking the retrieval bots removes you from that surface.
- Brand authority. Being cited as a source in AI responses builds the same kind of reputation that being cited in Wikipedia or major publications does.
- Future search. Google's AI Overviews and similar features pull from indexed content. Blocking too aggressively can hurt your visibility in search results, not just in AI summaries.
Reasons to block them:
- You do not want your content reused for training. AI models trained on your work may produce derivative output that competes with you, without attribution or compensation.
- Bandwidth and server load. Aggressive crawlers can eat a meaningful share of your hosting resources on busy sites. See Resource Limit Reached errors for what happens at the limit.
- You have a content moat. Paid newsletters, premium guides, or proprietary research are typically blocked to keep value behind your paywall.
A common middle ground is to allow live-retrieval bots (so you appear in real-time AI answers) but block training bots (so your content is not memorized into a model). With that goal, block: GPTBot, Google-Extended, ClaudeBot, anthropic-ai, Bytespider, CCBot, Applebot-Extended, Meta-ExternalAgent, Amazonbot, cohere-ai, and Diffbot. Allow: ChatGPT-User, OAI-SearchBot, Claude-Web, Perplexity-User.
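The middle-ground file can be generated from the training-bot list rather than typed by hand. The sketch below uses exactly the block list named in the paragraph above; the names `TRAINING_BOTS` and `make_training_blocklist` are hypothetical.

```shell
# Sketch: write a robots.txt that disallows the training crawlers only.
# Bots that are not listed (the live-retrieval agents and search engines)
# fall under the final "*" group and stay allowed.
TRAINING_BOTS="GPTBot Google-Extended ClaudeBot anthropic-ai Bytespider CCBot Applebot-Extended Meta-ExternalAgent Amazonbot cohere-ai Diffbot"

make_training_blocklist() {
  local bot
  echo "# Block AI training crawlers, allow everything else"
  for bot in $TRAINING_BOTS; do
    printf 'User-agent: %s\nDisallow: /\n\n' "$bot"
  done
  printf 'User-agent: *\nAllow: /\n'
}

# Usage:
#   make_training_blocklist > robots.txt
```

Review the generated file before uploading it, since it replaces any rules you already had.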
If you are on a busy site and getting hammered by scrapers that ignore robots.txt and slip past .htaccess rules by spoofing their user-agent, Cloudflare's free plan includes automated bot management at the edge, which is more sophisticated than user-agent matching. See our Cloudflare Setup Guide for how to put your site behind it.
Need Help Configuring This?
If you would like a hand customizing the rules for your specific site or want us to deploy them for you, open a support ticket. Our team handles robots.txt and .htaccess setups on shared, VPS, and dedicated plans.
Quick Recap: AI Bot Control in Five Steps
If you only do five things from this guide, do these:
- Decide your goal - block training crawlers only, block all AI bots, or block everything except major search engines.
- Start with robots.txt - drop the block list from Section 04 into public_html/robots.txt. Covers cooperative bots with zero risk.
- Add htaccess only if needed - if logs show bots ignoring robots.txt, deploy the rule from Section 06. Test the homepage in a private window immediately after saving.
- Verify with curl - send a request with each blocked user-agent and confirm you get 403. Real browsers should still get 200 OK.
- Re-check quarterly - new AI bots appear every few months. Audit your logs for unfamiliar user-agents and extend the rules as needed.
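For the quarterly audit, a rough sketch like the one below lists bot-like user-agents in an access log that are not on your known list. It assumes the Apache "combined" log format; the names `KNOWN` and `unknown_bots` are hypothetical, and the `KNOWN` list shown is a shortened example that you would extend with every bot you already handle.

```shell
# Sketch: surface bot-like user-agents that are not yet on a known list.
# Assumes "combined" format: split on double quotes, field 6 is the user-agent.
KNOWN="GPTBot ClaudeBot PerplexityBot Bytespider CCBot Googlebot Bingbot"

unknown_bots() {
  # Log lines on stdin; prints "<count> <user-agent>", most frequent first.
  awk -F'"' -v known="$KNOWN" '
    BEGIN { n = split(tolower(known), k, " ") }
    {
      ua = tolower($6)
      if (ua !~ /bot|crawler|spider/) next          # not bot-like
      for (i = 1; i <= n; i++)
        if (index(ua, k[i]) > 0) next               # already known
      seen[$6]++
    }
    END { for (u in seen) print seen[u], u }
  ' | sort -rn
}

# Usage (hypothetical filename):
#   unknown_bots < yourdomain.com-access.log
```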
Last updated June 2026
