Introduction
In a recent article, Cloudflare CEO Matthew Prince predicted that bot traffic will soon overtake human traffic on the internet, calling the future of the web "pay to crawl." If that happens, websites may need payment systems or stronger security measures to tell legitimate users apart from automated bots. In this beginner-friendly tutorial, we'll build a basic bot detection system using Python, simple web scraping techniques, and user agent analysis. It is a starting point for understanding how to protect your website from excessive bot traffic.
Prerequisites
- A basic understanding of Python programming
- Python installed on your computer
- Basic knowledge of web scraping concepts
- The following Python libraries (installed in Step 1): requests, beautifulsoup4, and user-agents
Step-by-step instructions
Step 1: Setting up Your Python Environment
Before we start coding, we need to install the required Python libraries. Open your terminal or command prompt and run the following command:
pip install requests beautifulsoup4 user-agents
Why: These libraries will help us make HTTP requests, parse HTML content, and analyze user agent strings to detect bot behavior.
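Note that the name you install with pip is not always the name you import: beautifulsoup4 is imported as bs4, and user-agents as user_agents. As a quick sanity check, the sketch below uses only the standard library to report which of these modules Python can find. The helper name check_installed is made up for this illustration and is not part of any of the libraries.

```python
import importlib.util

def check_installed(module_names):
    """Return a dict mapping each module name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}

# Import names (not pip package names) for the libraries used in this tutorial
status = check_installed(['requests', 'bs4', 'user_agents'])
for name, found in status.items():
    print(f'{name}: {"installed" if found else "missing"}')
```

If any module is reported as missing, re-run the pip command above in the same Python environment you use to run the scripts.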
Step 2: Creating a Basic Web Scraper
Let's start by creating a simple web scraper that fetches content from a website:
import requests
from bs4 import BeautifulSoup
# Define the URL to scrape
url = 'https://httpbin.org/user-agent'
# Make a GET request to the URL (the timeout keeps the script from hanging)
response = requests.get(url, timeout=10)
# Print the response content
print(response.text)
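Why: This code shows how to make a request to a website and retrieve its content. The httpbin.org/user-agent endpoint echoes back the User-Agent header it received, so the output shows how your script identifies itself to a server. We'll use this as the foundation for detecting bot behavior.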
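Because the client chooses the User-Agent header, it can be set to any value. The sketch below builds a request with a browser-like header using the standard library's urllib, without sending anything over the network. The variable names are illustrative. With requests, the equivalent is passing a headers dictionary to requests.get.

```python
from urllib.request import Request

# Build (but do not send) a request with a browser-like User-Agent header.
# The value is the Chrome example string used elsewhere in this tutorial.
fake_browser_ua = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
)
request = Request(
    'https://httpbin.org/user-agent',
    headers={'User-Agent': fake_browser_ua},
)

# urllib stores header names capitalized as 'User-agent'
print(request.get_header('User-agent'))
```

A script that sends this header looks like Chrome to any check based only on the user agent string, which is worth keeping in mind for the detection steps that follow.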
Step 3: Analyzing User Agents
Bot traffic often uses different user agents than human browsers. Let's create a script to analyze user agents:
from user_agents import parse
# Example user agent strings
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]
for ua in user_agents:
    user_agent = parse(ua)
    print(f'User Agent: {ua}')
    print(f'Browser: {user_agent.browser.family} {user_agent.browser.version_string}')
    print(f'OS: {user_agent.os.family} {user_agent.os.version_string}')
    print(f'Device: {user_agent.device.family}')
    print(f'Is Bot: {user_agent.is_bot}')
    print('---')
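Why: By analyzing user agents, we can identify requests from bots that declare themselves, such as search engine crawlers. The user agent is supplied by the client and can be faked, so treat it as a first signal rather than proof. It is still a crucial first step in detecting bot traffic.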
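A parser's bot flag may not cover every automated client, so a simple keyword check can serve as a complementary signal. The following is a minimal sketch using only the standard library. The keyword list and the function name looks_automated are assumptions made for illustration, not part of the user-agents library, and the list should be tuned to your own traffic.

```python
import re

# Hypothetical keyword list for illustration; extend it for your own traffic
BOT_KEYWORDS = re.compile(r'bot|crawl|spider|curl|wget|python-requests', re.IGNORECASE)

def looks_automated(user_agent_string):
    """Return True if the user agent is empty or contains an automation keyword."""
    if not user_agent_string:
        # Browsers normally send a user agent, so a missing one is suspicious
        return True
    return bool(BOT_KEYWORDS.search(user_agent_string))

samples = [
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'python-requests/2.31.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
]
for ua in samples:
    print(f'{looks_automated(ua)}: {ua}')
```

Like any check based on the user agent string, this only catches clients that do not disguise themselves.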
Step 4: Building a Bot Detection Function
Now, let's create a function that can detect bot traffic based on user agent analysis:
from user_agents import parse
def is_bot(user_agent_string):
    # Parse the string and return the library's bot flag
    parsed_ua = parse(user_agent_string)
    return parsed_ua.is_bot
# Test the function
bot_ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
human_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
print(f'Is bot (Googlebot): {is_bot(bot_ua)}')
print(f'Is bot (Human): {is_bot(human_ua)}')
Why: This function will help us quickly identify whether a request is coming from a bot or a human user, which is essential for managing website traffic.
Step 5: Simulating Bot Traffic Detection
Let's simulate how a website might detect bot traffic by creating a simple traffic logger:
import time
from datetime import datetime
class TrafficLogger:
    def __init__(self):
        self.bot_count = 0
        self.human_count = 0
    def log_visit(self, user_agent):
        if is_bot(user_agent):
            self.bot_count += 1
            print(f'[BOT] {datetime.now()}: Detected bot traffic')
        else:
            self.human_count += 1
            print(f'[HUMAN] {datetime.now()}: Detected human traffic')
# Create a traffic logger instance
logger = TrafficLogger()
# Simulate some visits
visits = [
'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'
]
for visit in visits:
    logger.log_visit(visit)
    time.sleep(1)  # Simulate time delay
print(f'Total bots: {logger.bot_count}')
print(f'Total humans: {logger.human_count}')
Why: This logger simulates how a website might track and analyze traffic to identify bot activity, which is a key part of protecting your site from excessive bot traffic.
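Raw counts are easier to act on when expressed as a share of total traffic. The sketch below summarizes a list of visit labels with collections.Counter. The function name summarize_traffic and the 'bot'/'human' labels are illustrative assumptions, and the sample list mirrors the three simulated visits above, assuming both crawlers are flagged as bots.

```python
from collections import Counter

def summarize_traffic(labels):
    """Count visits per label and return (counts, bot share as a percentage)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Guard against division by zero when no visits were logged
    bot_share = 100 * counts['bot'] / total if total else 0.0
    return counts, bot_share

# Labels for the three simulated visits: Googlebot, Chrome, bingbot
labels = ['bot', 'human', 'bot']
counts, bot_share = summarize_traffic(labels)
print(f"Bots: {counts['bot']}, humans: {counts['human']}, bot share: {bot_share:.1f}%")
```

A rising bot share over time is one possible trigger for tightening the protection measures in the next step.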
Step 6: Implementing Basic Protection Measures
Finally, let's add a simple protection mechanism that blocks known bot traffic:
def protect_website(user_agent):
    if is_bot(user_agent):
        print('Access denied: Bot traffic detected')
        return False
    else:
        print('Access granted: Human traffic detected')
        return True
# Test the protection mechanism
protect_website('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')
protect_website('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
Why: This protection mechanism demonstrates how a website can block bot traffic to prevent resource exhaustion and maintain a better user experience for real users.
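Blocking every bot also blocks search engine crawlers, which can hurt how your site appears in search results. One possible refinement, sketched below, is to keep serving an allowlist of crawlers and refuse the rest with an HTTP 403 status code. The allowlist, the function name decide_response, and the flagged_as_bot parameter (the result of whatever bot check you use) are all assumptions for illustration. A user agent that claims to be Googlebot can be faked, so real deployments usually confirm a crawler's identity by other means as well.

```python
# Hypothetical allowlist of crawler names you want to keep serving
ALLOWED_CRAWLERS = ('googlebot', 'bingbot')

def decide_response(user_agent_string, flagged_as_bot):
    """Return an HTTP status code: 200 to serve the page, 403 to refuse it."""
    ua = user_agent_string.lower()
    if flagged_as_bot and not any(name in ua for name in ALLOWED_CRAWLERS):
        return 403
    return 200

googlebot_ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
scraper_ua = 'ExampleScraper/1.0'  # made-up user agent for an unknown bot
chrome_ua = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
)

print(decide_response(googlebot_ua, True))   # allowlisted crawler
print(decide_response(scraper_ua, True))     # unknown bot
print(decide_response(chrome_ua, False))     # not flagged as a bot
```

Returning a status code instead of printing a message makes the decision easier to plug into a web framework later.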
Summary
In this tutorial, we've learned how to detect bot traffic using Python. We created a simple bot detection system that analyzes user agents and simulates traffic logging and protection mechanisms. This is a basic implementation: it relies only on the user agent string, which clients can fake, but it demonstrates the core concepts behind protecting websites from excessive bot traffic. As the internet becomes increasingly dominated by bots, understanding how to identify and manage this traffic is crucial for maintaining a healthy online environment.



