Introduction
In response to growing security concerns in AI systems, OpenAI has introduced Lockdown Mode and Elevated Risk labels in ChatGPT. These features are designed to help organizations protect against prompt injection attacks and AI-driven data exfiltration. This tutorial will guide you through implementing similar security measures in your own AI applications, focusing on detecting and preventing prompt injection attempts in chat interfaces.
Understanding these security mechanisms is crucial for developers building AI-powered applications, as prompt injection attacks can compromise system integrity and lead to unauthorized data access. By learning how to implement these protective measures, you'll be better equipped to create secure AI applications.
Prerequisites
- Basic knowledge of Python programming
- Understanding of AI chat interfaces and prompt engineering
- Python standard libraries: re, json, datetime
- Development environment with Python 3.7+
Step-by-Step Instructions
1. Create a Basic Chat Interface Framework
First, we'll establish a foundation for our chat interface that can detect suspicious patterns. This will serve as our base system before adding security features.
```python
import re
import json
from datetime import datetime

class SecureChatInterface:
    def __init__(self):
        self.conversation_history = []
        self.security_level = "normal"

    def process_input(self, user_input):
        # Store the conversation
        self.conversation_history.append({
            "timestamp": datetime.now().isoformat(),
            "user": user_input,
            "response": "",
            "security_status": "normal"
        })

        # Process the input
        response = self.generate_response(user_input)

        # Update the conversation
        self.conversation_history[-1]["response"] = response
        return response

    def generate_response(self, user_input):
        # Simple response generator (placeholder for a real model call)
        return f"I received your message: {user_input}"
```
2. Implement Basic Prompt Injection Detection
Now we'll add the core detection logic for identifying potentially malicious prompts. This includes checking for common injection patterns and suspicious keywords.
```python
class SecureChatInterface:
    def __init__(self):
        self.conversation_history = []
        self.security_level = "normal"
        # Keep patterns specific: a catch-all pattern would flag every message
        self.suspicious_patterns = [
            r'\b(inject|execute|run|launch|trigger)\b',  # Command injection words
            r'\b(system|admin|root|superuser)\b',        # Privilege escalation
            r'\b(password|secret|key|token)\b',          # Credential keywords
        ]

    def detect_suspicious_input(self, user_input):
        # Convert to lowercase for case-insensitive matching
        lower_input = user_input.lower()

        # Check for suspicious keyword patterns
        for pattern in self.suspicious_patterns:
            if re.search(pattern, lower_input):
                return True

        # Check for runs of punctuation or unusual formatting
        if re.search(r'[!@#$%^&*()_+\-=\[\]{};\\:"|,.<>?]{3,}', lower_input):
            return True

        return False
```
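Before wiring the detector into the class, it helps to sanity-check the patterns in isolation. The sketch below reproduces the same keyword and punctuation checks as a standalone function (the names `SUSPICIOUS_PATTERNS` and `is_suspicious` are illustrative, not part of the class API):

```python
import re

# Keyword patterns mirroring the class above
SUSPICIOUS_PATTERNS = [
    r'\b(inject|execute|run|launch|trigger)\b',
    r'\b(system|admin|root|superuser)\b',
    r'\b(password|secret|key|token)\b',
]
# Three or more consecutive punctuation characters
PUNCTUATION_RUN = r'[!@#$%^&*()_+\-=\[\]{};\\:"|,.<>?]{3,}'

def is_suspicious(text: str) -> bool:
    lowered = text.lower()
    if any(re.search(p, lowered) for p in SUSPICIOUS_PATTERNS):
        return True
    return re.search(PUNCTUATION_RUN, lowered) is not None

print(is_suspicious("Hello, how are you?"))       # False
print(is_suspicious("Please run this as admin"))  # True (keyword match)
print(is_suspicious("!!!URGENT!!!"))              # True (punctuation run)
```

Keep in mind that keyword matching is coarse: "run" in "Please run this as admin" trips the detector even in benign requests, which is exactly why the later steps escalate gradually instead of blocking on a single hit.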
3. Add Lockdown Mode Implementation
Lockdown Mode severely restricts input processing when repeated suspicious activity is detected. Once activated, it refuses all further input.
```python
class SecureChatInterface:
    def __init__(self):
        self.conversation_history = []
        self.security_level = "normal"
        self.suspicious_count = 0
        self.lockdown_threshold = 3

    def check_lockdown_mode(self):
        if self.suspicious_count >= self.lockdown_threshold:
            self.security_level = "lockdown"
            return True
        return False

    def process_input(self, user_input):
        # Refuse all input while in lockdown mode
        if self.security_level == "lockdown":
            return "System is in lockdown mode. Input processing suspended."

        # Check for suspicious input (detect_suspicious_input from Step 2)
        if self.detect_suspicious_input(user_input):
            self.suspicious_count += 1
            self.conversation_history.append({
                "timestamp": datetime.now().isoformat(),
                "user": user_input,
                "response": "Security alert: Suspicious input detected",
                "security_status": "elevated_risk"
            })

            # Activate lockdown once the threshold is reached
            if self.check_lockdown_mode():
                return "LOCKDOWN MODE ACTIVATED: System is now restricted."
            return "Security alert: Suspicious input detected."

        # Normal processing
        response = self.generate_response(user_input)
        self.conversation_history.append({
            "timestamp": datetime.now().isoformat(),
            "user": user_input,
            "response": response,
            "security_status": "normal"
        })
        return response
```
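The counter-and-threshold behavior is easiest to see stripped of everything else. This minimal sketch (hypothetical names, not the class API) simulates a session where every input counts as a strike:

```python
LOCKDOWN_THRESHOLD = 3  # same threshold as the class above

def run_session(inputs):
    """Simulate suspicious-input strikes against a lockdown threshold."""
    suspicious_count = 0
    security_level = "normal"
    transcript = []
    for _text in inputs:
        if security_level == "lockdown":
            # Once locked down, every subsequent input is refused
            transcript.append("refused")
            continue
        suspicious_count += 1  # pretend every input here is suspicious
        if suspicious_count >= LOCKDOWN_THRESHOLD:
            security_level = "lockdown"
            transcript.append("lockdown activated")
        else:
            transcript.append("alert")
    return transcript

print(run_session(["a", "b", "c", "d"]))
# ['alert', 'alert', 'lockdown activated', 'refused']
```

Note that lockdown is sticky: nothing in this design resets `security_level`, so a real system would also need an administrative unlock path.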
4. Enhance with Elevated Risk Labels
Elevated Risk labels help identify inputs that are suspicious but not yet triggering lockdown. This provides a warning system for administrators.
```python
class SecureChatInterface:
    def __init__(self):
        self.conversation_history = []
        self.security_level = "normal"
        self.suspicious_count = 0
        self.lockdown_threshold = 3
        self.elevated_risk_threshold = 1

    def analyze_input_security(self, user_input):
        risk_score = 0
        risk_indicators = []
        lower_input = user_input.lower()

        # Check for command injection and privilege escalation patterns
        command_patterns = [r'\b(inject|execute|run|launch|trigger)\b',
                            r'\b(system|admin|root|superuser)\b']
        for pattern in command_patterns:
            if re.search(pattern, lower_input):
                risk_score += 2
                risk_indicators.append("command_injection")

        # Check for credential keywords
        credential_patterns = [r'\b(password|secret|key|token|credential)\b']
        for pattern in credential_patterns:
            if re.search(pattern, lower_input):
                risk_score += 3
                risk_indicators.append("credential_reference")

        # Check for runs of punctuation
        if re.search(r'[!@#$%^&*()_+\-=\[\]{};\\:"|,.<>?]{3,}', user_input):
            risk_score += 2
            risk_indicators.append("excessive_punctuation")

        # Map the score to a risk level
        if risk_score >= 5:
            return "high", risk_indicators
        elif risk_score >= 3:
            return "medium", risk_indicators
        else:
            return "low", risk_indicators
5. Implement Complete Security System
Combine all components into a single class: detect_suspicious_input (Step 2), check_lockdown_mode (Step 3), and analyze_input_security (Step 4), along with generate_response from Step 1, so the system can monitor, detect, and respond to suspicious inputs.
```python
class SecureChatInterface:
    def __init__(self):
        self.conversation_history = []
        self.security_level = "normal"
        self.suspicious_count = 0
        self.lockdown_threshold = 3
        self.elevated_risk_threshold = 1

    def process_input(self, user_input):
        # Refuse all input while in lockdown mode
        if self.security_level == "lockdown":
            return "System is in lockdown mode. Input processing suspended."

        # Analyze input security (analyze_input_security from Step 4)
        risk_level, indicators = self.analyze_input_security(user_input)

        # Log the security analysis
        security_log = {
            "timestamp": datetime.now().isoformat(),
            "user_input": user_input,
            "risk_level": risk_level,
            "indicators": indicators
        }

        # Handle suspicious input
        if risk_level in ["high", "medium"]:
            self.suspicious_count += 1
            self.conversation_history.append({
                "timestamp": datetime.now().isoformat(),
                "user": user_input,
                "response": "Security alert: Suspicious input detected",
                "security_status": risk_level,
                "security_log": security_log
            })

            # Activate lockdown if needed (check_lockdown_mode from Step 3)
            if self.check_lockdown_mode():
                return "LOCKDOWN MODE ACTIVATED: System is now restricted."
            return f"Security alert: {risk_level} risk detected."

        # Normal processing
        response = self.generate_response(user_input)
        self.conversation_history.append({
            "timestamp": datetime.now().isoformat(),
            "user": user_input,
            "response": response,
            "security_status": "normal",
            "security_log": security_log
        })
        return response

    def get_security_report(self):
        return {
            "current_security_level": self.security_level,
            "suspicious_input_count": self.suspicious_count,
            "conversation_count": len(self.conversation_history)
        }
```
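Since every history entry carries a security_log, the conversation can be persisted for audit with the json module imported back in Step 1. A sketch, with a hand-built sample entry and an assumed filename:

```python
import json
from datetime import datetime

# A sample conversation entry shaped like the ones the class records
history = [{
    "timestamp": datetime.now().isoformat(),
    "user": "Hello, how are you?",
    "response": "I received your message: Hello, how are you?",
    "security_status": "normal",
    "security_log": {"risk_level": "low", "indicators": []},
}]

# Serialize the audit trail; indent=2 keeps the log human-readable
with open("security_audit.json", "w") as fh:
    json.dump(history, fh, indent=2)

# Round-trip check: reload and inspect an entry
with open("security_audit.json") as fh:
    restored = json.load(fh)
print(restored[0]["security_status"])  # normal
```

In production you would likely append entries to a log stream rather than rewrite a file, but JSON round-tripping is enough to make the audit trail inspectable.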
6. Test Your Security Implementation
Finally, test your implementation with various inputs to ensure it properly detects and responds to suspicious activity.
```python
# Test the security implementation
chat = SecureChatInterface()

# Normal input
print(chat.process_input("Hello, how are you?"))

# Suspicious input (command and privilege keywords)
print(chat.process_input("Please execute system commands"))

# Credential reference
print(chat.process_input("My password is 123456"))

# Excessive punctuation
print(chat.process_input("!!!WARNING!!!"))

# Check the security report
print(chat.get_security_report())
```
Summary
This tutorial demonstrated how to implement security measures similar to OpenAI's Lockdown Mode and Elevated Risk labels in AI chat interfaces. By creating a security monitoring system that detects suspicious input patterns, you've built a foundation for protecting AI applications against prompt injection attacks and data exfiltration.
The key components include: (1) Basic input detection and analysis, (2) Lockdown mode activation when suspicious activity exceeds thresholds, (3) Elevated risk labeling for medium-level threats, and (4) Comprehensive logging and reporting. These measures provide multiple layers of protection that can be adapted based on your specific security requirements.
Remember that security is an ongoing process. Regular updates to your detection patterns and thresholds will help maintain effectiveness against evolving threats.
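One practical way to keep detection patterns current is to load them from a configuration file instead of hard-coding them, so they can be revised without redeploying the application. A sketch of that idea, with the config inlined as a string to stay self-contained (in practice it would live in a versioned file such as a hypothetical patterns.json):

```python
import json
import re

# Stand-in for the contents of a versioned config file
config_text = json.dumps({
    "suspicious_patterns": [
        r"\b(inject|execute|run|launch|trigger)\b",
        r"\b(password|secret|key|token)\b"
    ],
    "lockdown_threshold": 3
})

config = json.loads(config_text)
# Compile once at load time so per-message matching stays cheap
compiled = [re.compile(p) for p in config["suspicious_patterns"]]

def matches_any(text):
    return any(p.search(text.lower()) for p in compiled)

print(matches_any("please execute this"))  # True
print(matches_any("nice weather today"))   # False
print(config["lockdown_threshold"])        # 3
```

Reloading the config on a schedule, or on an admin signal, then updates the live pattern set without touching the detection code.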