Supply-chain attack using invisible code hits GitHub and other repositories

Learn to detect invisible Unicode characters in code repositories that attackers use in supply-chain attacks. This tutorial teaches you to identify and prevent malicious code injection using Python detection tools.

Introduction

In this tutorial, you'll learn how to detect and handle invisible Unicode characters in code repositories - a technique that attackers have exploited in recent supply-chain attacks. Understanding these invisible characters is crucial for maintaining code security and preventing malicious modifications that can go unnoticed. We'll walk through practical steps to identify these hidden characters in your code files and create a simple detection tool.

Prerequisites

A basic understanding of programming concepts
Python installed on your computer
A text editor or IDE (like VS Code or Sublime Text)
Access to a code repository or text files to examine

Step 1: Understanding Invisible Unicode Characters

What Are They?

Invisible Unicode characters are special symbols that appear as empty space or look identical to regular characters but have different code points. Attackers use these to hide malicious code or create subtle modifications that bypass normal code reviews.

Why This Matters

These characters can be used in supply-chain attacks to inject malicious code that looks identical to legitimate code. They're particularly dangerous because they're invisible to the human eye, making them extremely hard to detect during manual code reviews.

Step 2: Setting Up Your Environment

Create a Test Directory

First, create a new directory for our experiment:

mkdir unicode_detector
 cd unicode_detector

Install Required Tools

We'll use Python for our detection tool. No special libraries are needed for basic detection, but we'll install the 'unicodedata' module which comes with Python by default.

Step 3: Creating a Sample Test File

Creating a Test File with Invisible Characters

Let's create a test file that contains both regular and invisible Unicode characters:

touch test_file.py

Open the file in your text editor and add this content:

# This is a test file

# Regular code
print("Hello World")

# This line contains invisible Unicode characters
print("Hello World" + "\u200B")  # Zero-width space character

# Another invisible character
x = "test\u00A0"  # Non-breaking space

Why We're Using These Characters

The characters we're using are common invisible Unicode characters that can be used in attacks:

\u200B - Zero-width space (invisible but takes up space)
\u00A0 - Non-breaking space (looks like regular space but is different)

These characters are chosen because they're commonly used in malicious code injection attempts.

Step 4: Writing the Detection Script

Creating the Detection Tool

Create a Python script called detect_unicode.py:

touch detect_unicode.py

Open the file and add this code:

#!/usr/bin/env python3

import unicodedata
import sys

# List of invisible Unicode characters to detect
INVISIBLE_CHARS = [
    '\u0000',  # Null character
    '\u0001',  # Start of Header
    '\u0002',  # Start of Text
    '\u0003',  # End of Text
    '\u0004',  # End of Transmission
    '\u0005',  # Enquiry
    '\u0006',  # Acknowledge
    '\uu0007',  # Bell
    '\u0008',  # Backspace
    '\u000B',  # Vertical Tab
    '\u000C',  # Form Feed
    '\u000E',  # Shift Out
    '\u000F',  # Shift In
    '\u0010',  # Data Link Escape
    '\u0011',  # Device Control 1
    '\u0012',  # Device Control 2
    '\u0013',  # Device Control 3
    '\u0014',  # Device Control 4
    '\u0015',  # Negative Acknowledge
    '\u0016',  # Synchronous Idle
    '\u0017',  # End of Transmission Block
    '\u0018',  # Cancel
    '\u0019',  # End of Medium
    '\u001A',  # Substitute
    '\u001B',  # Escape
    '\u001C',  # File Separator
    '\u001D',  # Group Separator
    '\u001E',  # Record Separator
    '\u001F',  # Unit Separator
    '\u00A0',  # Non-breaking space
    '\u2000',  # En Quad
    '\u2001',  # Em Quad
    '\u2002',  # En Space
    '\u2003',  # Em Space
    '\u2004',  # Three-Per-Em Space
    '\u2005',  # Four-Per-Em Space
    '\u2006',  # Six-Per-Em Space
    '\u2007',  # Figure Space
    '\u2008',  # Punctuation Space
    '\u2009',  # Thin Space
    '\u200A',  # Hair Space
    '\u200B',  # Zero-width space
    '\u200C',  # Zero-width non-joiner
    '\u200D',  # Zero-width joiner
    '\u200E',  # Left-to-right mark
    '\u200F',  # Right-to-left mark
    '\u2028',  # Line separator
    '\u2029',  # Paragraph separator
    '\u202A',  # Left-to-right embedding
    '\u202B',  # Right-to-left embedding
    '\u202C',  # Pop directional formatting
    '\u202D',  # Left-to-right override
    '\u202E',  # Right-to-left override
    '\u2060',  # Word joiner
    '\u2061',  # Function application
    '\u2062',  # Invisible times
    '\u2063',  # Invisible separator
    '\u2064',  # Invisible plus
    '\u2065',  # Medium mathematical space
    '\u2066',  # Left-to-right isolate
    '\u2067',  # Right-to-left isolate
    '\u2068',  # First strong isolate
    '\u2069',  # Pop directional isolate
    '\u206A',  # Inhibit symmetric swapping
    '\u206B',  # Activate symmetric swapping
    '\u206C',  # Inhibit arabic form shaping
    '\u206D',  # Activate arabic form shaping
    '\u206E',  # National digit shapes
    '\u206F',  # Nominal digit shapes
    '\uFEFF',  # Zero width no-break space (BOM)
]

def is_invisible_char(char):
    """Check if a character is an invisible Unicode character"""
    return char in INVISIBLE_CHARS


def detect_invisible_chars(filename):
    """Detect invisible Unicode characters in a file"""
    print(f"Analyzing file: {filename}")
    print("""\nInvisible characters found:""")
    
    with open(filename, 'r', encoding='utf-8') as file:
        lines = file.readlines()
        
    line_number = 0
    for line in lines:
        line_number += 1
        for char_index, char in enumerate(line):
            if is_invisible_char(char):
                # Get character name for display
                try:
                    char_name = unicodedata.name(char)
                except ValueError:
                    char_name = "Unknown character"
                
                print(f"  Line {line_number}, Position {char_index}: '{char}' ({char_name})")


def main():
    if len(sys.argv) != 2:
        print("Usage: python detect_unicode.py ")
        sys.exit(1)
    
    filename = sys.argv[1]
    detect_invisible_chars(filename)


if __name__ == "__main__":
    main()

Why This Code Works

This script creates a comprehensive list of invisible Unicode characters and checks each character in a file against this list. When it finds an invisible character, it displays its position in the file and its Unicode name for identification.

Step 5: Running the Detection Tool

Make the Script Executable

chmod +x detect_unicode.py

Test the Tool

Run the detection script on our test file:

python detect_unicode.py test_file.py

What to Expect

You should see output showing the positions of invisible characters in your test file. The tool will display line numbers, character positions, and the actual invisible characters with their Unicode names.

Step 6: Understanding the Results

Interpreting the Output

The output will show something like:

Analyzing file: test_file.py

Invisible characters found:
  Line 7, Position 22: '' (ZERO WIDTH SPACE)
  Line 11, Position 10: ' ' (NO-BREAK SPACE)

How to Handle These Findings

When you detect invisible characters in code:

Review the context to determine if the characters are legitimate or malicious
Remove or replace the invisible characters with visible equivalents
Implement automated checks in your CI/CD pipeline to prevent such issues
Consider using linters that can detect these characters

Step 7: Preventive Measures

Adding Automated Checks

Integrate this detection into your development workflow by adding it to your pre-commit hooks or CI/CD pipeline. This helps catch invisible Unicode characters before they're committed to repositories.

Using Linters

Modern code editors and linters can be configured to highlight invisible characters. In VS Code, for example, you can enable the 'Show Invisible Characters' option in the settings.

Summary

In this tutorial, you've learned how to detect invisible Unicode characters in code files - a crucial skill for preventing supply-chain attacks. You've created a detection tool that can identify these hidden characters, understood why they're dangerous, and learned how to prevent them from entering your code repositories. By implementing these techniques, you're better equipped to maintain code security and detect potential malicious modifications that could otherwise go unnoticed.

Remember that invisible Unicode characters are just one type of attack vector in supply-chain security. Always maintain good security practices, including code reviews, automated testing, and keeping dependencies up to date.