Introduction
In response to growing concerns about AI safety, particularly regarding young users, OpenAI has open-sourced its teen safety policies. This tutorial will guide you through creating a basic safety monitoring system for AI applications, similar to what OpenAI has shared. You'll learn how to implement safeguards that help protect vulnerable users, especially teenagers, when they interact with AI systems.
This is a beginner-friendly tutorial that will teach you to build a simple safety filter that can detect potentially harmful content and trigger appropriate warnings or responses.
Prerequisites
To follow this tutorial, you'll need:
- A computer with internet access
- Basic understanding of Python programming
- Python 3.8 or higher installed on your system (recent NLTK releases require at least Python 3.8)
- Access to a Python IDE or text editor (like VS Code or PyCharm)
- Basic understanding of how AI models work (no advanced knowledge required)
Step-by-Step Instructions
1. Set Up Your Python Environment
First, we need to create a Python environment for our project. Open your terminal or command prompt and create a new directory for this project:
mkdir ai_safety_monitor
cd ai_safety_monitor
Next, create a virtual environment to keep our project dependencies isolated:
python -m venv safety_env
source safety_env/bin/activate # On Windows use: safety_env\Scripts\activate
Now, install the required packages:
pip install nltk
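To confirm the installation worked, you can run a quick check from the command line (the version number you see may differ):
python -c "import nltk; print(nltk.__version__)"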
2. Create the Safety Policy Framework
Open your code editor and create a new file called safety_monitor.py. This file will contain our core safety monitoring logic:
import re

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon used by the sentiment analyzer
nltk.download('vader_lexicon', quiet=True)

# Define safety policies
SAFETY_POLICIES = {
    'harmful_keywords': [
        'kill', 'die', 'suicide', 'hurt', 'harm', 'death', 'kill myself',
        'end my life', 'want to die', 'self-harm', 'hurt myself'
    ],
    'sensitive_topics': [
        'mental health', 'depression', 'anxiety', 'trauma', 'abuse'
    ],
    # Reserved for a future age-verification step (see Step 6)
    'age_restriction': 18
}

# Initialize sentiment analyzer
sia = SentimentIntensityAnalyzer()


class SafetyMonitor:
    def __init__(self):
        self.violations = []

    def check_text_safety(self, text):
        """Check if text violates safety policies; returns True when safe."""
        lowered = text.lower()

        # Check for harmful keywords. The word-boundary pattern (\b) avoids
        # false positives such as 'kill' matching 'skill' or 'die' matching
        # 'diet'. We stop at the first violation found.
        for keyword in SAFETY_POLICIES['harmful_keywords']:
            if re.search(r'\b' + re.escape(keyword) + r'\b', lowered):
                self.violations.append(f"Harmful keyword detected: {keyword}")
                return False

        # Check for sensitive topics
        for topic in SAFETY_POLICIES['sensitive_topics']:
            if topic in lowered:
                self.violations.append(f"Sensitive topic detected: {topic}")
                return False

        # Analyze sentiment; VADER's compound score ranges from -1 (most
        # negative) to +1 (most positive)
        sentiment = sia.polarity_scores(text)
        if sentiment['compound'] < -0.5:  # Strongly negative sentiment
            self.violations.append("Very negative sentiment detected")
            return False

        return True

    def get_violations(self):
        return self.violations

    def reset_violations(self):
        self.violations = []
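Before building the interactive loop in the next step, you can sanity-check the class directly. This quick snippet (run at the bottom of safety_monitor.py, or in a Python shell after importing the module) exercises one safe and one unsafe input:

monitor = SafetyMonitor()

print(monitor.check_text_safety("Hello, how are you?"))    # True: nothing triggered
print(monitor.check_text_safety("I want to hurt myself"))  # False: harmful keyword
print(monitor.get_violations())  # ['Harmful keyword detected: hurt']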
3. Implement User Input Handling
Now we need to create a simple way to test our safety monitor with user input:
def main():
    monitor = SafetyMonitor()
    print("AI Safety Monitor - Type 'quit' to exit")
    print("This system checks for potentially harmful content.")
    print("\n" + "="*50)

    while True:
        user_input = input("\nEnter text to check: ")
        if user_input.lower() == 'quit':
            break

        # Check safety
        is_safe = monitor.check_text_safety(user_input)

        if is_safe:
            print("✅ Content is safe for all users.")
        else:
            print("⚠️ Potential safety violation detected!")
            violations = monitor.get_violations()
            for violation in violations:
                print(f"  - {violation}")
            print("  Suggestion: Redirect to mental health resources or counselor.")

        # Reset violations for next check
        monitor.reset_violations()

if __name__ == "__main__":
    main()
4. Run the Safety Monitor
Save your safety_monitor.py file and run it in your terminal:
python safety_monitor.py
You should see the program start and prompt you for input. Try entering different types of text:
- Normal text like "Hello, how are you?"
- Potentially harmful text like "I want to die"
- Sensitive topics like "I'm feeling depressed"
5. Test the Safety Policies
Try running some tests to see how your safety monitor responds:
# Test 1: Normal text
Input: "How are you today?"
Output: ✅ Content is safe for all users.
# Test 2: Harmful keyword
Input: "I want to kill myself"
Output: ⚠️ Potential safety violation detected!
- Harmful keyword detected: kill
Suggestion: Redirect to mental health resources or counselor.
# Test 3: Sensitive topic
Input: "I'm having anxiety problems"
Output: ⚠️ Potential safety violation detected!
- Sensitive topic detected: anxiety
Suggestion: Redirect to mental health resources or counselor.
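Manual testing works for exploration, but automated tests make the policy checks repeatable. Here is a minimal sketch using Python's built-in unittest module (assuming it lives in a file named test_safety_monitor.py next to safety_monitor.py):

import unittest
from safety_monitor import SafetyMonitor

class TestSafetyMonitor(unittest.TestCase):
    def setUp(self):
        self.monitor = SafetyMonitor()

    def test_normal_text_is_safe(self):
        self.assertTrue(self.monitor.check_text_safety("How are you today?"))
        self.assertEqual(self.monitor.get_violations(), [])

    def test_harmful_keyword_is_flagged(self):
        self.assertFalse(self.monitor.check_text_safety("I want to kill myself"))
        self.assertIn("Harmful keyword detected: kill", self.monitor.get_violations())

    def test_sensitive_topic_is_flagged(self):
        self.assertFalse(self.monitor.check_text_safety("I'm having anxiety problems"))
        self.assertIn("Sensitive topic detected: anxiety", self.monitor.get_violations())

if __name__ == "__main__":
    unittest.main()

Run the tests with python test_safety_monitor.py; all three should pass.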
6. Enhance the Safety Monitor
For a more advanced version, you can enhance the safety monitor by adding:
- Integration with external APIs for mental health resources
- More sophisticated natural language processing
- User age verification systems
- Logging and reporting features (a minimal sketch follows below)
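For example, the logging enhancement can start as small as this, a sketch using Python's standard logging module (the log file name and format are arbitrary choices):

import logging

# Write safety events to a local file; in production you would likely
# send these to a centralized logging or alerting system instead.
logging.basicConfig(
    filename="safety_events.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def log_violations(violations):
    """Record each detected violation for later review."""
    for violation in violations:
        logging.warning(violation)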
Here's an example of how to add a basic resource redirect. Add this method to the SafetyMonitor class:

    def suggest_resources(self):
        """Suggest mental health resources for users"""
        resources = [
            "988 Suicide & Crisis Lifeline: call or text 988",
            "Crisis Text Line: Text HOME to 741741",
            "Crisis Chat: https://www.crisischat.org"
        ]
        return resources
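With suggest_resources() in place, you can wire it into the else branch of main() so flagged users see the resources immediately (a sketch of the updated branch):

        else:
            print("⚠️ Potential safety violation detected!")
            for violation in monitor.get_violations():
                print(f"  - {violation}")
            print("  Suggestion: Redirect to mental health resources or counselor.")
            for resource in monitor.suggest_resources():
                print(f"    {resource}")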
Summary
In this tutorial, you've created a basic AI safety monitoring system that mimics some of the safety policies OpenAI has open-sourced. This system helps detect potentially harmful content and suggests appropriate actions when safety violations are found.
The key concepts covered include:
- Creating a safety policy framework with keyword detection
- Using sentiment analysis to identify negative content
- Building a user-friendly interface for testing
- Implementing basic safety violation handling
While this is a simplified version, it demonstrates the core principles behind AI safety monitoring. Real-world applications would require more sophisticated approaches, including integration with actual AI models, more comprehensive databases of harmful content, and proper user verification systems.
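As a sketch of that integration point, the monitor can act as a gate in front of a model call. Here generate_ai_response() is a hypothetical stand-in for whatever model API your application uses, and suggest_resources() is the method added in Step 6:

def safe_chat(monitor, user_message):
    """Screen a user's message before it reaches the AI model."""
    if not monitor.check_text_safety(user_message):
        monitor.reset_violations()
        # Route the user to support resources instead of the model
        resources = "\n".join(monitor.suggest_resources())
        return ("It sounds like you may be going through something difficult.\n"
                "Please consider reaching out to one of these resources:\n" + resources)
    return generate_ai_response(user_message)  # hypothetical model call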
This foundation can be expanded upon to build more robust safety systems that protect vulnerable users while maintaining the utility of AI applications.