Introduction
In this tutorial, you'll learn how to detect and handle invisible Unicode characters in code repositories - a technique that attackers have exploited in recent supply-chain attacks. Understanding these invisible characters is crucial for maintaining code security and preventing malicious modifications that can go unnoticed. We'll walk through practical steps to identify these hidden characters in your code files and create a simple detection tool.
Prerequisites
- A basic understanding of programming concepts
- Python installed on your computer
- A text editor or IDE (like VS Code or Sublime Text)
- Access to a code repository or text files to examine
Step 1: Understanding Invisible Unicode Characters
What Are They?
Invisible Unicode characters are special symbols that appear as empty space or look identical to regular characters but have different code points. Attackers use these to hide malicious code or create subtle modifications that bypass normal code reviews.
Why This Matters
These characters can be used in supply-chain attacks to inject malicious code that looks identical to legitimate code. They're particularly dangerous because they're invisible to the human eye, making them extremely hard to detect during manual code reviews.
Step 2: Setting Up Your Environment
Create a Test Directory
First, create a new directory for our experiment:
mkdir unicode_detector
cd unicode_detector
Install Required Tools
We'll use Python for our detection tool. No special libraries are needed for basic detection, but we'll install the 'unicodedata' module which comes with Python by default.
Step 3: Creating a Sample Test File
Creating a Test File with Invisible Characters
Let's create a test file that contains both regular and invisible Unicode characters:
touch test_file.py
Open the file in your text editor and add this content:
# This is a test file
# Regular code
print("Hello World")
# This line contains invisible Unicode characters
print("Hello World" + "\u200B") # Zero-width space character
# Another invisible character
x = "test\u00A0" # Non-breaking space
Why We're Using These Characters
The characters we're using are common invisible Unicode characters that can be used in attacks:
\u200B- Zero-width space (invisible but takes up space)\u00A0- Non-breaking space (looks like regular space but is different)
Step 4: Writing the Detection Script
Creating the Detection Tool
Create a Python script called detect_unicode.py:
touch detect_unicode.py
Open the file and add this code:
#!/usr/bin/env python3
import unicodedata
import sys
# List of invisible Unicode characters to detect
INVISIBLE_CHARS = [
'\u0000', # Null character
'\u0001', # Start of Header
'\u0002', # Start of Text
'\u0003', # End of Text
'\u0004', # End of Transmission
'\u0005', # Enquiry
'\u0006', # Acknowledge
'\uu0007', # Bell
'\u0008', # Backspace
'\u000B', # Vertical Tab
'\u000C', # Form Feed
'\u000E', # Shift Out
'\u000F', # Shift In
'\u0010', # Data Link Escape
'\u0011', # Device Control 1
'\u0012', # Device Control 2
'\u0013', # Device Control 3
'\u0014', # Device Control 4
'\u0015', # Negative Acknowledge
'\u0016', # Synchronous Idle
'\u0017', # End of Transmission Block
'\u0018', # Cancel
'\u0019', # End of Medium
'\u001A', # Substitute
'\u001B', # Escape
'\u001C', # File Separator
'\u001D', # Group Separator
'\u001E', # Record Separator
'\u001F', # Unit Separator
'\u00A0', # Non-breaking space
'\u2000', # En Quad
'\u2001', # Em Quad
'\u2002', # En Space
'\u2003', # Em Space
'\u2004', # Three-Per-Em Space
'\u2005', # Four-Per-Em Space
'\u2006', # Six-Per-Em Space
'\u2007', # Figure Space
'\u2008', # Punctuation Space
'\u2009', # Thin Space
'\u200A', # Hair Space
'\u200B', # Zero-width space
'\u200C', # Zero-width non-joiner
'\u200D', # Zero-width joiner
'\u200E', # Left-to-right mark
'\u200F', # Right-to-left mark
'\u2028', # Line separator
'\u2029', # Paragraph separator
'\u202A', # Left-to-right embedding
'\u202B', # Right-to-left embedding
'\u202C', # Pop directional formatting
'\u202D', # Left-to-right override
'\u202E', # Right-to-left override
'\u2060', # Word joiner
'\u2061', # Function application
'\u2062', # Invisible times
'\u2063', # Invisible separator
'\u2064', # Invisible plus
'\u2065', # Medium mathematical space
'\u2066', # Left-to-right isolate
'\u2067', # Right-to-left isolate
'\u2068', # First strong isolate
'\u2069', # Pop directional isolate
'\u206A', # Inhibit symmetric swapping
'\u206B', # Activate symmetric swapping
'\u206C', # Inhibit arabic form shaping
'\u206D', # Activate arabic form shaping
'\u206E', # National digit shapes
'\u206F', # Nominal digit shapes
'\uFEFF', # Zero width no-break space (BOM)
]
def is_invisible_char(char):
"""Check if a character is an invisible Unicode character"""
return char in INVISIBLE_CHARS
def detect_invisible_chars(filename):
"""Detect invisible Unicode characters in a file"""
print(f"Analyzing file: {filename}")
print("""\nInvisible characters found:""")
with open(filename, 'r', encoding='utf-8') as file:
lines = file.readlines()
line_number = 0
for line in lines:
line_number += 1
for char_index, char in enumerate(line):
if is_invisible_char(char):
# Get character name for display
try:
char_name = unicodedata.name(char)
except ValueError:
char_name = "Unknown character"
print(f" Line {line_number}, Position {char_index}: '{char}' ({char_name})")
def main():
if len(sys.argv) != 2:
print("Usage: python detect_unicode.py ")
sys.exit(1)
filename = sys.argv[1]
detect_invisible_chars(filename)
if __name__ == "__main__":
main()
Why This Code Works
This script creates a comprehensive list of invisible Unicode characters and checks each character in a file against this list. When it finds an invisible character, it displays its position in the file and its Unicode name for identification.
Step 5: Running the Detection Tool
Make the Script Executable
chmod +x detect_unicode.py
Test the Tool
Run the detection script on our test file:
python detect_unicode.py test_file.py
What to Expect
You should see output showing the positions of invisible characters in your test file. The tool will display line numbers, character positions, and the actual invisible characters with their Unicode names.
Step 6: Understanding the Results
Interpreting the Output
The output will show something like:
Analyzing file: test_file.py
Invisible characters found:
Line 7, Position 22: '' (ZERO WIDTH SPACE)
Line 11, Position 10: ' ' (NO-BREAK SPACE)
How to Handle These Findings
When you detect invisible characters in code:
- Review the context to determine if the characters are legitimate or malicious
- Remove or replace the invisible characters with visible equivalents
- Implement automated checks in your CI/CD pipeline to prevent such issues
- Consider using linters that can detect these characters
Step 7: Preventive Measures
Adding Automated Checks
Integrate this detection into your development workflow by adding it to your pre-commit hooks or CI/CD pipeline. This helps catch invisible Unicode characters before they're committed to repositories.
Using Linters
Modern code editors and linters can be configured to highlight invisible characters. In VS Code, for example, you can enable the 'Show Invisible Characters' option in the settings.
Summary
In this tutorial, you've learned how to detect invisible Unicode characters in code files - a crucial skill for preventing supply-chain attacks. You've created a detection tool that can identify these hidden characters, understood why they're dangerous, and learned how to prevent them from entering your code repositories. By implementing these techniques, you're better equipped to maintain code security and detect potential malicious modifications that could otherwise go unnoticed.
Remember that invisible Unicode characters are just one type of attack vector in supply-chain security. Always maintain good security practices, including code reviews, automated testing, and keeping dependencies up to date.



