Meet MemPrivacy: An Edge-Cloud Framework that Uses Local Reversible Pseudonymization to Protect User Data Without Breaking Memory Utility

Learn how to build a basic privacy-preserving system using reversible pseudonymization, inspired by the MemPrivacy framework.

Introduction

Keeping user data private while still making it useful to systems such as large language models (LLMs) is a major challenge. This tutorial guides you through building a simple privacy-preserving system inspired by MemPrivacy, which, as its title describes, applies reversible pseudonymization locally, on the edge side of an edge-cloud setup. The goal is to keep data usable by downstream systems without exposing the original values.

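By the end of this tutorial, you will have built a basic system that stores keyed pseudonyms in place of raw user data and checks candidate values against them. Note that the hash-based approach used here is one-way: it can confirm a match but cannot recover the original text, so it is a simplified stand-in for true reversible pseudonymization.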

Prerequisites

Basic understanding of Python programming
Python 3.6 or higher installed on your computer
Optional: Familiarity with concepts like data encryption and hashing

Step-by-Step Instructions

1. Set Up Your Python Environment

First, create a new directory for this project and navigate into it:

mkdir memprivacy_demo
cd memprivacy_demo

Next, create a Python file called memprivacy.py:

touch memprivacy.py

This file will contain all our code for this demonstration.

2. Import Required Libraries

Open memprivacy.py in your favorite text editor and start by importing the necessary Python libraries:

import hashlib
import base64
from typing import Dict, List

We'll use hashlib to compute SHA-256 hashes and base64 to encode the resulting digests as text. The typing module supplies the Dict and List annotations used in later steps.

3. Create a Pseudonymization Function

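The core idea of reversible pseudonymization is to replace sensitive data with a stand-in that can later be mapped back to the original by whoever holds the right secret. As a first building block, let's implement a keyed hash, which produces a stable pseudonym but is one-way on its own: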

def pseudonymize_data(data: str, key: str) -> str:
    """Pseudonymize data using a key and hashing."""
    # Combine data and key
    combined = data + key
    # Generate a hash of the combined string
    hash_object = hashlib.sha256(combined.encode())
    # Return the hash as a base64-encoded string
    return base64.b64encode(hash_object.digest()).decode()

Why? This function produces a deterministic pseudonym that depends on both the data and a secret key. The same input and key always yield the same output, so a pseudonym can later be matched against a candidate value. The hash itself cannot be inverted, and without the key an attacker cannot easily test guesses against it.
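One caveat with plain concatenation: data + key is ambiguous, because "ab" + "c" and "a" + "bc" produce the same combined string and therefore the same pseudonym. A common way to avoid this is HMAC, the standard keyed-hash construction available in Python's hmac module. The following is a minimal sketch, not part of the original tutorial code; the names pseudonymize_hmac and matches are hypothetical.

```python
import base64
import hashlib
import hmac


def pseudonymize_hmac(data: str, key: str) -> str:
    """Return a base64-encoded HMAC-SHA256 tag of data under key."""
    tag = hmac.new(key.encode("utf-8"), data.encode("utf-8"), hashlib.sha256)
    return base64.b64encode(tag.digest()).decode("ascii")


def matches(pseudonym: str, data: str, key: str) -> bool:
    """Check in constant time whether data hashes to pseudonym under key."""
    return hmac.compare_digest(pseudonym, pseudonymize_hmac(data, key))


# Usage: the same data and key always give the same pseudonym.
tag = pseudonymize_hmac("alice", "demo-key")
print(matches(tag, "alice", "demo-key"))  # True
print(matches(tag, "bob", "demo-key"))    # False
```

Like the concatenation version, this is still one-way: it supports matching, not recovery of the original value.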

4. Create a Reversal Function

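Because the hash is one-way, the closest we can get to reversal here is verification: given a candidate original value and the key, we recompute the pseudonym and compare. The function below keeps the name reverse_pseudonymize, but it checks a match and does not recover the data: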

def reverse_pseudonymize(pseudonymized_data: str, key: str, original_data: str) -> bool:
    """Check if pseudonymized data matches original data with given key."""
    # Re-generate the pseudonymized version
    regenerated = pseudonymize_data(original_data, key)
    # Compare with the provided pseudonymized data
    return regenerated == pseudonymized_data

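Why? This function lets us confirm that a pseudonymized entry corresponds to a given original value, which is useful for systems that need to confirm identity or data integrity. It cannot recover the original from the pseudonym alone; the caller must already hold the candidate value.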
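For comparison, here is one way pseudonymization can be made genuinely reversible using only the standard library: keep a private lookup table that maps random tokens back to the original values. This is a sketch under stated assumptions, not MemPrivacy's actual mechanism; the class name PseudonymVault and the <PII_...> token format are hypothetical.

```python
import secrets
from typing import Dict


class PseudonymVault:
    """Hypothetical local store mapping random tokens to original values."""

    def __init__(self) -> None:
        self._token_to_value: Dict[str, str] = {}
        self._value_to_token: Dict[str, str] = {}

    def pseudonymize(self, value: str) -> str:
        """Return a stable random token for value, creating one if needed."""
        if value not in self._value_to_token:
            token = f"<PII_{secrets.token_hex(8)}>"
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def restore(self, token: str) -> str:
        """Return the original value for token; raises KeyError if unknown."""
        return self._token_to_value[token]


# Usage: the token reveals nothing about the value, yet the vault can undo it.
vault = PseudonymVault()
token = vault.pseudonymize("123-456-7890")
print(vault.restore(token))  # 123-456-7890
```

The trade-off is that the vault itself now holds the sensitive values, so it has to stay on the trusted side; only the tokens should leave it.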

5. Simulate a User Data System

Now let's simulate a simple user data system where we store pseudonymized data:

class UserMemory:
    def __init__(self, key: str):
        self.key = key
        self.memory: Dict[str, str] = {}

    def store(self, user_id: str, data: str):
        """Store pseudonymized data for a user."""
        pseudonymized = pseudonymize_data(data, self.key)
        self.memory[user_id] = pseudonymized
        print(f"Stored pseudonymized data for {user_id}")

    def retrieve(self, user_id: str, data: str) -> bool:
        """Check if data matches stored pseudonymized data."""
        if user_id in self.memory:
            return reverse_pseudonymize(self.memory[user_id], self.key, data)
        return False

    def get_all_users(self) -> List[str]:
        """Return list of all stored user IDs."""
        return list(self.memory.keys())

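Why? This class simulates how a system might store user data while maintaining privacy: the memory dictionary holds only pseudonyms, never raw text. Despite its name, retrieve does not return the stored data; it reports whether a candidate value matches the stored pseudonym, and a correct check requires the same key that was used in store.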
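To connect this to the edge-cloud idea in the title, the sketch below masks sensitive spans in free text locally, hands only the masked text to a simulated remote service, and restores the originals in the reply. Everything here is an illustrative assumption: the regular expressions are deliberately simplistic, fake_cloud_reply stands in for a real remote call, and the function names and <LABEL_n> placeholder format are hypothetical rather than taken from MemPrivacy.

```python
import re
from typing import Dict, Tuple

# Simplified illustrative patterns; real PII detection needs far more care.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def mask(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace detected spans with typed placeholders; return text and mapping."""
    mapping: Dict[str, str] = {}   # placeholder -> original value
    assigned: Dict[str, str] = {}  # original value -> placeholder
    counters: Dict[str, int] = {}  # label -> number of placeholders issued

    def make_substitute(label: str):
        def substitute(match) -> str:
            value = match.group()
            if value not in assigned:
                counters[label] = counters.get(label, 0) + 1
                placeholder = f"<{label}_{counters[label]}>"
                assigned[value] = placeholder
                mapping[placeholder] = value
            return assigned[value]
        return substitute

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_substitute(label), text)
    return text, mapping


def unmask(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back wherever their placeholders appear."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


def fake_cloud_reply(masked_text: str) -> str:
    """Stand-in for a remote model call; it only ever sees placeholders."""
    return f"Noted: {masked_text}"


message = "Reach Alice at alice@example.com or 123-456-7890."
masked, mapping = mask(message)
print(masked)  # Reach Alice at <EMAIL_1> or <PHONE_1>.
print(unmask(fake_cloud_reply(masked), mapping))
```

Because the mapping never leaves the local side, the remote service can still reason about "an email" and "a phone number" as distinct entities without seeing their values.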

6. Test Your System

Let's now test the system with some example data:

def main():
    # Create a system with a secret key
    key = "my_secret_key"
    user_system = UserMemory(key)

    # Store some user data
    user_system.store("user_001", "John's email is [email protected]")
    user_system.store("user_002", "Alice's phone number is 123-456-7890")

    # Try to retrieve and validate data
    print("\nValidating stored data:")
    print(user_system.retrieve("user_001", "John's email is [email protected]"))  # Should be True
    print(user_system.retrieve("user_002", "Alice's phone number is 123-456-7890"))  # Should be True
    print(user_system.retrieve("user_001", "Wrong data"))  # Should be False

    # List all stored users
    print("\nStored users:", user_system.get_all_users())

if __name__ == "__main__":
    main()

Why? This test verifies that our system works correctly, showing how pseudonymized data can be stored and validated without exposing the original data.
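The hard-coded key in main() is fine for a demo, but anyone who can read the source can recompute pseudonyms for guessed values. A minimal sketch of one alternative is to read the key from the environment and fall back to a random one; the function name load_key and the variable name MEMPRIVACY_DEMO_KEY are hypothetical choices for this example.

```python
import os
import secrets


def load_key(env_var: str = "MEMPRIVACY_DEMO_KEY") -> str:
    """Return the key from the environment, or a fresh random one if unset."""
    key = os.environ.get(env_var)
    if key:
        return key
    # 32 random bytes rendered as 64 hexadecimal characters
    return secrets.token_hex(32)


key = load_key()
print(len(key) > 0)  # True
```

Note that a freshly generated key only works while the pseudonyms live in the same process, as they do in this demo; pseudonyms created under one key cannot be verified under another.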

7. Run Your Code

Save your file and run it:

python memprivacy.py

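You should see two "Stored pseudonymized data" lines, followed by True, True, and False from the three validation checks, and finally the list of stored user IDs (user_001 and user_002). This simple system demonstrates how stored pseudonyms can stand in for raw data while still supporting a useful operation, here an equality check.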

Summary

In this tutorial, we built a basic demonstration of a privacy-preserving system inspired by the MemPrivacy framework. We created functions to pseudonymize data with a keyed hash and to verify a candidate value against a stored pseudonym, and simulated a user data system that stores only pseudonyms.

This system shows how privacy can be maintained even when data is used for utility purposes. It is a simplified example, and its one-way hashing is not the same as the reversible mapping that the MemPrivacy title describes, but it illustrates the underlying idea of keeping raw user data out of downstream storage in edge-cloud environments.

As you continue learning, consider how this approach might be extended to work with more complex data types, or integrated with real-world systems for enhanced privacy and utility.