Introducing Lockdown Mode and Elevated Risk labels in ChatGPT

Learn to implement Lockdown Mode and Elevated Risk labels in AI chat interfaces to defend against prompt injection attacks and data exfiltration, similar to OpenAI's new security features.

Introduction

In response to growing security concerns in AI systems, OpenAI has introduced Lockdown Mode and Elevated Risk labels in ChatGPT. These features are designed to help organizations protect against prompt injection attacks and AI-driven data exfiltration. This tutorial will guide you through implementing similar security measures in your own AI applications, focusing on detecting and preventing prompt injection attempts in chat interfaces.

Understanding these security mechanisms is crucial for developers building AI-powered applications, as prompt injection attacks can compromise system integrity and lead to unauthorized data access. By learning how to implement these protective measures, you'll be better equipped to create secure AI applications.

Prerequisites

Basic knowledge of Python programming
Understanding of AI chat interfaces and prompt engineering
Python libraries: re, json, datetime
Development environment with Python 3.7+

Step-by-Step Instructions

1. Create a Basic Chat Interface Framework

First, we'll establish a foundation for our chat interface that can detect suspicious patterns. This will serve as our base system before adding security features.

import re
import json
from datetime import datetime

class SecureChatInterface:
    def __init__(self):
        self.conversation_history = []
        self.security_level = "normal"
        
    def process_input(self, user_input):
        # Store the conversation
        self.conversation_history.append({
            "timestamp": datetime.now().isoformat(),
            "user": user_input,
            "response": "",
            "security_status": "normal"
        })
        
        # Process the input
        response = self.generate_response(user_input)
        
        # Update the conversation
        self.conversation_history[-1]["response"] = response
        
        return response
        
    def generate_response(self, user_input):
        # Simple response generator
        return f"I received your message: {user_input}"

2. Implement Basic Prompt Injection Detection

Now we'll add the core detection logic for identifying potentially malicious prompts. This includes checking for common injection patterns and suspicious keywords.

class SecureChatInterface:
    def __init__(self):
        self.conversation_history = []
        self.security_level = "normal"
        self.suspicious_patterns = [
            r'\b(\w*\b)\s*\b(\w*)\b\s*\b(\w*)\b',  # Generic pattern detection
            r'\b(inject|execute|run|launch|trigger)\b',   # Command injection words
            r'\b(system|admin|root|superuser)\b',        # Privilege escalation
            r'\b(password|secret|key|token)\b',          # Credential keywords
        ]
        
    def detect_suspicious_input(self, user_input):
        # Convert to lowercase for easier matching
        lower_input = user_input.lower()
        
        # Check for suspicious patterns
        for pattern in self.suspicious_patterns:
            if re.search(pattern, lower_input):
                return True
        
        # Check for excessive punctuation or unusual formatting
        if re.search(r'[!@#$%^&*()_+\-=\[\]{};\\:"|,.<>?]{3,}', lower_input):
            return True
            
        return False

3. Add Lockdown Mode Implementation

Lockdown Mode is designed to severely restrict input processing when suspicious activity is detected. This mode will prevent most inputs from being processed.

class SecureChatInterface:
    def __init__(self):
        self.conversation_history = []
        self.security_level = "normal"
        self.suspicious_count = 0
        self.lockdown_threshold = 3
        
    def check_lockdown_mode(self):
        if self.suspicious_count >= self.lockdown_threshold:
            self.security_level = "lockdown"
            return True
        return False
        
    def process_input(self, user_input):
        # Check if we're in lockdown mode
        if self.security_level == "lockdown":
            return "System is in lockdown mode. Input processing suspended."
            
        # Check for suspicious input
        if self.detect_suspicious_input(user_input):
            self.suspicious_count += 1
            self.conversation_history.append({
                "timestamp": datetime.now().isoformat(),
                "user": user_input,
                "response": "Security alert: Suspicious input detected",
                "security_status": "elevated_risk"
            })
            
            # Check if we need to activate lockdown
            if self.check_lockdown_mode():
                return "LOCKDOWN MODE ACTIVATED: System is now restricted."
            
            return "Security alert: Suspicious input detected."
        
        # Normal processing
        response = self.generate_response(user_input)
        
        self.conversation_history.append({
            "timestamp": datetime.now().isoformat(),
            "user": user_input,
            "response": response,
            "security_status": "normal"
        })
        
        return response

4. Enhance with Elevated Risk Labels

Elevated Risk labels help identify inputs that are suspicious but not yet triggering lockdown. This provides a warning system for administrators.

class SecureChatInterface:
    def __init__(self):
        self.conversation_history = []
        self.security_level = "normal"
        self.suspicious_count = 0
        self.lockdown_threshold = 3
        self.elevated_risk_threshold = 1
        
    def analyze_input_security(self, user_input):
        risk_score = 0
        risk_indicators = []
        
        # Check for command injection patterns
        command_patterns = [r'\b(\w*\b)\s*\b(\w*)\b\s*\b(\w*)\b',
                           r'\b(inject|execute|run|launch|trigger)\b']
        
        for pattern in command_patterns:
            if re.search(pattern, user_input.lower()):
                risk_score += 2
                risk_indicators.append("command_injection")
                
        # Check for credential keywords
        credential_patterns = [r'\b(password|secret|key|token|credential)\b']
        
        for pattern in credential_patterns:
            if re.search(pattern, user_input.lower()):
                risk_score += 3
                risk_indicators.append("credential_reference")
                
        # Check for excessive punctuation
        if re.search(r'[!@#$%^&*()_+\-=\[\]{};\\:"|,.<>?]{3,}', user_input):
            risk_score += 2
            risk_indicators.append("excessive_punctuation")
            
        # Determine risk level
        if risk_score >= 5:
            return "high", risk_indicators
        elif risk_score >= 3:
            return "medium", risk_indicators
        else:
            return "low", risk_indicators

5. Implement Complete Security System

Combine all components into a complete security system that can monitor, detect, and respond to suspicious inputs.

class SecureChatInterface:
    def __init__(self):
        self.conversation_history = []
        self.security_level = "normal"
        self.suspicious_count = 0
        self.lockdown_threshold = 3
        self.elevated_risk_threshold = 1
        
    def process_input(self, user_input):
        # Check if we're in lockdown mode
        if self.security_level == "lockdown":
            return "System is in lockdown mode. Input processing suspended."
            
        # Analyze input security
        risk_level, indicators = self.analyze_input_security(user_input)
        
        # Log the security analysis
        security_log = {
            "timestamp": datetime.now().isoformat(),
            "user_input": user_input,
            "risk_level": risk_level,
            "indicators": indicators
        }
        
        # Check for suspicious input
        if risk_level in ["high", "medium"]:
            self.suspicious_count += 1
            
            # Update conversation history with security status
            self.conversation_history.append({
                "timestamp": datetime.now().isoformat(),
                "user": user_input,
                "response": "Security alert: Suspicious input detected",
                "security_status": risk_level,
                "security_log": security_log
            })
            
            # Check if we need to activate lockdown
            if self.check_lockdown_mode():
                return "LOCKDOWN MODE ACTIVATED: System is now restricted."
            
            return f"Security alert: {risk_level} risk detected."
        
        # Normal processing
        response = self.generate_response(user_input)
        
        self.conversation_history.append({
            "timestamp": datetime.now().isoformat(),
            "user": user_input,
            "response": response,
            "security_status": "normal",
            "security_log": security_log
        })
        
        return response
        
    def get_security_report(self):
        return {
            "current_security_level": self.security_level,
            "suspicious_input_count": self.suspicious_count,
            "conversation_count": len(self.conversation_history)
        }

6. Test Your Security Implementation

Finally, test your implementation with various inputs to ensure it properly detects and responds to suspicious activity.

# Test the security implementation
chat = SecureChatInterface()

# Test normal input
print(chat.process_input("Hello, how are you?"))

# Test suspicious input
print(chat.process_input("Please execute system commands"))

# Test credential reference
print(chat.process_input("My password is 123456"))

# Test excessive punctuation
print(chat.process_input("!!!WARNING!!!"))

# Check security report
print(chat.get_security_report())

Summary

This tutorial demonstrated how to implement security measures similar to OpenAI's Lockdown Mode and Elevated Risk labels in AI chat interfaces. By creating a security monitoring system that detects suspicious input patterns, you've built a foundation for protecting AI applications against prompt injection attacks and data exfiltration.

The key components include: (1) Basic input detection and analysis, (2) Lockdown mode activation when suspicious activity exceeds thresholds, (3) Elevated risk labeling for medium-level threats, and (4) Comprehensive logging and reporting. These measures provide multiple layers of protection that can be adapted based on your specific security requirements.

Remember that security is an ongoing process. Regular updates to your detection patterns and thresholds will help maintain effectiveness against evolving threats.