1 Jul 2025 - Ian Lebbern

Q&A on How Web Application Firewalls Help Stop AI Bot Traffic

We’ve seen a steep increase in AI bot traffic across client websites. In many cases, it’s not just a mild nuisance. It creates performance slowdowns, distorts analytics, causes fake conversions, and introduces new security concerns. Many organizations are now turning to web application firewalls as their first line of defense.

Engine Room’s SVP of Technology and Information Security, Ian Lebbern, has answered the most common questions we hear to help you understand what these bots are doing, why they’re different, and how you can take action. 

What kinds of issues are you seeing on client websites due to AI bot traffic? 

AI bot traffic can cause a variety of issues for websites, impacting everything from performance and analytics to security. Bots consume significant server resources and bandwidth, leading to higher operational costs for website owners. The sheer volume of bot traffic can overwhelm web servers, resulting in slower load times, reduced responsiveness, and even complete website outages. 

Bot traffic can also heavily skew website analytics, making it difficult to accurately measure key metrics like page views, bounce rate, session duration, and conversion rates. This “statistical noise” makes it challenging for marketers to understand true user behavior and make informed decisions about site improvements such as A/B testing or conversion rate optimization. 

Fake conversions are also a concern today. Advanced spam bots generate phony lead sign-ups and form submissions, cluttering data with junk and consuming valuable resources.

From a security standpoint, AI bots are frequently used for data scraping, illicitly gathering information from websites. Once collected, this data can be used for various malicious purposes, including content theft that can harm SEO, or to build fake storefronts and phishing sites. Malicious bots are also frequently used to launch Distributed Denial-of-Service (DDoS) attacks that flood a website with so much superfluous traffic that it becomes unavailable.

How does traffic from AI crawlers differ from legitimate traffic like search engine bots or human visitors? 

Understanding the differences between AI crawler traffic, legitimate search engine bots, and human visitors is crucial for website owners to accurately analyze their data, manage resources, and maintain security. At a high level, human visitors exhibit diverse and unpredictable behavior, i.e., they tend to navigate websites in varied, often non-linear ways. 

Legitimate search engine bots, such as Googlebot and Bingbot, crawl websites in an orderly, rules-abiding fashion, often respecting crawl delays set by website owners to avoid overwhelming servers. They identify themselves with specific user-agent strings, and while they crawl many pages, their patterns are often more predictable than those of malicious bots, allowing for easier identification and management.

In contrast, AI crawler traffic, including bots like GPTBot, ClaudeBot, and PerplexityBot, often requests large batches of pages in short bursts, disregarding typical bandwidth-saving guidelines or crawl delays. They might also be seen crawling the same pages repeatedly. Some AI crawlers do identify themselves, but many others may use generic, outdated, or even spoofed user-agent strings to evade detection. 
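One practical way to separate a genuine search engine crawler from a bot that merely spoofs a Googlebot or Bingbot user-agent is a reverse-and-forward DNS check. The Python sketch below is a minimal illustration of that idea, assuming the hostname suffixes that Google and Microsoft currently publish for their crawlers; verify those against the operators’ documentation before relying on them.

```python
import socket

# Hostname suffixes published by Google and Microsoft for their crawlers.
# Assumptions for illustration only; confirm against current operator docs.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_genuine_search_bot(ip: str) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then
    forward-resolve the hostname and confirm it maps back to the IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
        if not hostname.endswith(TRUSTED_SUFFIXES):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward DNS
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

# Usage: check the client IP behind a request whose User-Agent claims "Googlebot".
# 203.0.113.10 is a placeholder documentation IP, so this prints False.
print(is_genuine_search_bot("203.0.113.10"))
```

A bot that passes this check is almost certainly the crawler it claims to be; one that fails it while presenting a search engine user-agent is a strong candidate for blocking.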

Can you explain how AI bots collect content and how that content is typically used? 

AI bots collect content primarily through web scraping and API integration, often at a much larger scale and with greater sophistication than traditional methods. The collected content then serves various purposes, most notably training large language models (LLMs) and powering AI-driven applications. 

Web scraping involves automated, recursive crawling where AI bots navigate the internet by following links. Unlike traditional scrapers, they often use headless browsers (browsers without a graphical user interface) to fully render web pages, including JavaScript-generated content that static scrapers frequently miss. Once they receive the HTML content of a page, they parse it to extract text, images, videos, and other data. 
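To make the headless-browser approach concrete, here is a minimal sketch in Python using Playwright. It is illustrative only: the URL is a placeholder, and real AI crawlers run this kind of rendering pipeline at a vastly larger scale on their own infrastructure.

```python
# Minimal sketch of headless-browser content collection (illustrative only).
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def fetch_rendered_page(url: str) -> tuple[str, list[str]]:
    """Load a page in a headless browser so JavaScript-generated content
    is rendered, then extract the visible text and the outbound links
    a crawler would follow next."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        text = page.inner_text("body")                 # fully rendered text
        links = page.eval_on_selector_all(
            "a[href]", "els => els.map(e => e.href)"   # candidate next pages
        )
        browser.close()
    return text, links

# text, links = fetch_rendered_page("https://example.com/")  # placeholder URL
```

This is why purely static defenses that only inspect raw HTML requests often underestimate how much of a site an AI crawler can actually read.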

API integration connects AI bots to different systems and platforms, allowing them to access data, perform actions, and enhance user experiences. 

The collected content is primarily used as training data for various AI models, particularly Large Language Models and generative AI. 

What are some of the risks website owners face when AI bots crawl their sites excessively or without permission? 

In addition to the performance, analytics, and security issues already discussed, excessive or unauthorized AI bot crawling poses risks related to content and business models. Data scraping by AI bots raises copyright concerns because much website content is protected. Content creators, news organizations, and artists are increasingly suing AI companies for using copyrighted material without permission or compensation. 

Legal and ethical risks also arise. Many websites have Terms of Service that explicitly prohibit automated scraping. Excessive or unauthorized AI bot crawling violates these terms, and in some jurisdictions those terms can be legally enforced.

Can you share an example of a real-world impact caused by unregulated AI bot activity? 

Yes, there was a recent incident where a client website triggered multiple uptime alerts and experienced frequent intermittent outages. The downtime was directly caused by over-consumption of server resources by an AI bot. 

Which AI bots are the most active right now, and what do we know about who operates them and why? 

Popular generative AI bots include ChatGPT (OpenAI), Microsoft Copilot, Google Gemini, PerplexityBot (Perplexity), Claude AI (Anthropic), and Llama (Meta). These bots operate for different reasons. OpenAI develops ChatGPT to advance AI technology, while Anthropic operates Claude AI with a focus on ethical AI. 

What role does a Web Application Firewall (WAF) play in managing or blocking AI bot traffic? 

Web application firewalls play a crucial role in managing and blocking AI bot traffic by acting as a first line of defense for web applications. Originally designed to protect against common web vulnerabilities like SQL injection and cross-site scripting, modern web application firewalls now include sophisticated bot management capabilities tailored to the challenges posed by AI bots. 

These firewalls can apply rule-based blocking, such as identifying and blocking requests based on their User-Agent strings; known AI bot user-agents like “GPTBot” can be blocked outright. They also use rate limiting to restrict the number of requests from a single IP address or user within a given timeframe. The excessive request volumes typical of aggressive AI crawlers trip these limits, resulting in blocking or throttling.

In fact, as of this writing, Cloudflare has just announced that it will block AI bots from scraping content by default.
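As a rough illustration of those first two techniques, user-agent blocking and per-IP rate limiting, here is what the same idea can look like at the web-server layer in nginx. This is a minimal sketch, not a full WAF configuration: the bot tokens, rate, server name, and upstream are placeholders, and spoofed or rotating user-agents will slip past rules this simple.

```nginx
# Minimal sketch: user-agent blocking plus per-IP rate limiting.
# The map and limit_req_zone directives belong in the http {} context.

map $http_user_agent $is_ai_bot {
    default           0;
    "~*GPTBot"        1;   # example tokens; extend as needed
    "~*ClaudeBot"     1;
    "~*PerplexityBot" 1;
}

# Roughly 10 requests/second per client IP, with a small burst allowance.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

server {
    listen 80;
    server_name example.com;          # placeholder

    if ($is_ai_bot) {
        return 403;                   # block self-identified AI crawlers
    }

    location / {
        limit_req zone=per_ip burst=20 nodelay;
        proxy_pass http://backend;    # placeholder upstream
    }
}
```

A commercial WAF layers managed bot signatures and behavioral signals on top of simple rules like these, which is what catches the crawlers that do not announce themselves.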

What are "bot management capabilities," and how do they work in practical terms? 

Bot management capabilities are features within security solutions designed to detect, analyze, and control automated bot traffic interacting with websites, APIs, and mobile apps. The goal is to differentiate beneficial bots (like search engine crawlers) from malicious or unwanted bots (like scrapers, spammers, or credential stuffers), allowing the good and blocking the bad. 

Practically speaking, an e-commerce site facing inventory hoarding during a flash sale might see hundreds of add-to-cart requests for a single product arriving from different IPs within seconds. Even if the User-Agent is generic, behavioral analysis can detect the absence of mouse movement or an unnaturally direct path to the “add to cart” button. The bot management solution can then block the offending IPs or serve fake inventory data to the bot, tricking it into thinking it succeeded while preventing actual purchases.
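A toy version of one such behavioral signal, request velocity without normal browsing, might look like the Python sketch below. Real bot management platforms combine many more signals (mouse telemetry, device fingerprinting, TLS characteristics), and the thresholds here are arbitrary placeholders.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # placeholder threshold
MAX_CART_ADDS = 5     # placeholder threshold

cart_adds = defaultdict(deque)   # ip -> timestamps of add-to-cart requests
page_views = defaultdict(int)    # ip -> count of ordinary page views

def record_page_view(ip: str) -> None:
    page_views[ip] += 1

def is_suspicious_cart_add(ip: str) -> bool:
    """Flag an IP that goes straight to 'add to cart' with no browsing,
    or exceeds a burst threshold inside a short sliding window."""
    now = time.time()
    window = cart_adds[ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    no_browsing = page_views[ip] == 0     # never viewed a product page
    bursty = len(window) > MAX_CART_ADDS  # too many adds, too fast
    return no_browsing or bursty
```

The point is not the specific thresholds but the principle: bots are identified by how they behave, not just by what they claim to be.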

Is it possible to selectively allow or block certain bots? How should website operators make those decisions? 

Yes, selectively allowing or blocking bots is both possible and necessary. Blocking legitimate bots can harm the site, while allowing malicious bots can cause damage. 

Decisions to allow or block should be based on understanding the bot’s identity, purpose, behavior, and potential impact on the website and business. 

There’s growing concern over AI companies scraping content for training without permission. What should website owners know about their rights or obligations? 

The practice of AI companies scraping content without permission raises significant legal and operational issues. Most original website content is copyrighted, and AI scraping for training may infringe that copyright, despite ongoing debates around “fair use.” Legal challenges, such as The New York Times v. OpenAI, highlight this struggle.

If AI bots scrape personally identifiable information without consent, this can violate privacy regulations like CCPA, leading to fines. 

Most websites include Terms of Service prohibiting automated scraping. Violating these can lead to breach of contract claims, especially if harm is shown. 

Key best practices include publishing clear Terms of Service that explicitly ban automated scraping and AI training use, using a robots.txt file to disallow known AI crawlers, and employing a robust web application firewall with advanced bot management.
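For the robots.txt item above, a minimal example is shown below. The user-agent tokens are the ones the respective operators have published for their crawlers as of this writing (GPTBot for OpenAI, ClaudeBot for Anthropic, Google-Extended for Google’s AI training use, CCBot for Common Crawl, PerplexityBot for Perplexity); check current documentation before deploying.

```
# robots.txt - signal that AI training crawlers are not welcome.
# Token names as published by their operators; verify current docs.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Because compliance with robots.txt is voluntary, treat it as a statement of intent and back it up with WAF enforcement.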

Are there policy changes or standards emerging to regulate AI crawler behavior? 

Yes, policy changes and standards are emerging but remain dynamic and fragmented. Efforts focus on giving website owners granular control over AI’s use of their content. 

Extensions to the Robots Exclusion Protocol (REP) have been proposed, including new directives like “DisallowAITraining” that explicitly instruct AI crawlers not to use content for training. However, enforcement and adoption vary.

What’s one first step website managers should take to audit or improve bot defenses? 

The first step is to analyze existing website traffic data for anomalies and suspicious patterns. This “data-first” approach is crucial since you cannot defend against what you don’t understand. Examining traffic gives insights into bot types, volume, behavior, and impacts. 
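As a starting point for that audit, a short script like the sketch below can surface the busiest client IPs and user-agents in a standard combined-format web server access log, and count requests from self-identifying AI crawlers. The file path and token list are placeholders; adapt them to your environment.

```python
import re
from collections import Counter

LOG_PATH = "access.log"   # placeholder; point at your real access log

# Tokens of self-identifying AI crawlers to watch for (examples only).
AI_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider")

# Combined log format: ip - - [time] "request" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

ip_counts, agent_counts, ai_hits = Counter(), Counter(), Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
    for line in f:
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, _, agent = m.groups()
        ip_counts[ip] += 1
        agent_counts[agent] += 1
        for token in AI_TOKENS:
            if token.lower() in agent.lower():
                ai_hits[token] += 1

print("Top client IPs:", ip_counts.most_common(10))
print("Top user-agents:", agent_counts.most_common(10))
print("Self-identified AI crawler requests:", dict(ai_hits))
```

Even a quick pass like this usually reveals whether the problem is a handful of aggressive crawlers or a broad pattern of anonymous, spoofed traffic, which in turn shapes the right mitigation.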

What kinds of tools or services can help with this issue? 

Technical consulting and strategy services that include website performance analysis and bot mitigation recommendations are essential. Website development services tailored to improve bot defenses also play a critical role. 

How can someone tell if they’re being hit by an AI bot — and what should they do? 

Detecting unwanted AI bot traffic requires vigilance, technical understanding, and the right tools. Deep analysis of web traffic, server logs, and application logs helps identify unusual patterns. 

Once AI bot traffic is confirmed, the response depends on whether the bot is legitimate but aggressive or outright malicious. Appropriate mitigation steps, including configuring web application firewall rules, should then be taken.

Engine Room Makes Managing AI Bot Traffic Easier 

Engine Room provides expert analysis and tailored solutions to help detect and control AI bot traffic. Our services include website performance reviews and implementing web application firewalls with advanced bot management. We help protect your site, keep your data accurate, and maintain smooth user experiences. 

If AI bots are disrupting your website, Engine Room can help you take back control. Schedule a free consultation with our experts to get started. 
