Introduction
In today's data-driven world, processing and extracting meaningful information from documents is crucial. This tutorial will guide you through building an advanced document intelligence pipeline using Google's LangExtract, OpenAI models, and structured extraction techniques. By the end of this tutorial, you'll have a working pipeline that can process documents and extract structured data in a reusable format.
This pipeline will:
- Take unstructured text as input
- Use OpenAI's powerful language models to extract key information
- Transform that information into a structured, machine-readable format
- Visualize the results in an interactive way
This is a beginner-friendly tutorial that assumes no prior experience with these tools. We'll walk through each step carefully, explaining why we do what we do.
Prerequisites
Before we begin, you'll need the following:
- Python 3.10 or higher installed on your machine (LangExtract requires a recent Python)
- An OpenAI API key (get one from OpenAI's website)
- Basic understanding of Python (functions, lists, dictionaries)
Step-by-Step Instructions
1. Install Required Libraries
First, we need to install the necessary Python libraries. Open your terminal or command prompt and run:
pip install langextract openai pandas plotly python-dotenv
Why? These libraries provide the core functionality we need:
- langextract: Google's open-source library for extracting structured data from text
- openai: The official client for OpenAI's language models
- pandas: For organizing and managing extracted data
- plotly: For interactive data visualization
- python-dotenv: For loading your API key from a .env file
2. Set Up Your OpenAI API Key
Next, we'll securely store your OpenAI API key. Create a new Python file (e.g., config.py) and add the following code:
import os
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()
# Get the API key from environment variable
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
Now, create a file named .env in the same folder with this content:
OPENAI_API_KEY=your_openai_api_key_here
Why? Storing your API key in a separate file and using environment variables is a secure practice. Never hardcode API keys in your scripts.
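Before moving on, it's worth a quick sanity check that the key actually loads. Here is a minimal standalone sketch; the os.environ.setdefault line only stands in for what load_dotenv() does in your real config.py:

```python
import os

# Stand-in for load_dotenv(): in your project, the value comes from .env
os.environ.setdefault("OPENAI_API_KEY", "sk-demo-not-a-real-key")

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
print(api_key[:3])  # OpenAI keys normally begin with "sk-"
```

If this raises, your .env file is missing, misnamed, or in the wrong directory.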
3. Create the Extraction Pipeline
Now, we'll build the main extraction logic. Create a new file called document_pipeline.py and add the following code:
import langextract as lx
from openai import OpenAI
# Load the API key
from config import OPENAI_API_KEY
# Configure the OpenAI client (openai>=1.0 replaced the old
# module-level openai.api_key / openai.ChatCompletion interface)
client = OpenAI(api_key=OPENAI_API_KEY)
def extract_structured_data(text):
    """Extract structured data from unstructured text using OpenAI"""
    prompt = (
        "Extract the following information from the text and "
        "return it as a JSON object:\n"
        "- Name\n- Email\n- Phone Number\n- Company\n"
        f"Text: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},  # ask for strict JSON output
        max_tokens=200,
        temperature=0.3,
    )
    return response.choices[0].message.content
def process_document(text):
    """Process a document and return structured data"""
    # LangExtract needs a task description plus at least one worked
    # example; it cannot be called on raw text alone
    examples = [
        lx.data.ExampleData(
            text="Jane Doe works at Acme. Her email is jane@acme.com.",
            extractions=[
                lx.data.Extraction(extraction_class="name", extraction_text="Jane Doe"),
                lx.data.Extraction(extraction_class="company", extraction_text="Acme"),
                lx.data.Extraction(extraction_class="email", extraction_text="jane@acme.com"),
            ],
        )
    ]
    # First, use LangExtract to get grounded entity extractions
    # (the OpenAI-specific flags follow LangExtract's documented setup;
    # check its README for your installed version)
    langextract_result = lx.extract(
        text_or_documents=text,
        prompt_description="Extract names, emails, phone numbers, and companies.",
        examples=examples,
        model_id="gpt-4o-mini",
        api_key=OPENAI_API_KEY,
        fence_output=True,
        use_schema_constraints=False,
    )
    # Then, use OpenAI directly to extract specific fields
    extracted_data = extract_structured_data(text)
    # Combine results
    return {
        "langextract": langextract_result,
        "structured": extracted_data,
    }
Why? This function uses two tools:
- LangExtract: Extracts entities with source grounding, so each result can be traced back to a span of the original text
- OpenAI: Returns the specific fields (name, email, etc.) as a single JSON object
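One practical wrinkle: even when you ask for JSON, some models wrap their reply in Markdown code fences. A small helper keeps downstream parsing robust — this is a sketch, and parse_model_json is a name introduced here, not part of either library:

```python
import json

def parse_model_json(reply):
    """Parse a JSON object from a model reply, tolerating ```json fences."""
    text = reply.strip()
    if text.startswith("```"):
        # Drop the opening fence line (with its optional language tag)...
        text = text.split("\n", 1)[1]
        # ...and everything from the closing fence onward
        text = text.rsplit("```", 1)[0]
    return json.loads(text)

# Works on both fenced and bare replies
fenced = '```json\n{"Name": "John Smith"}\n```'
print(parse_model_json(fenced)["Name"])  # → John Smith
```

You can call this on the "structured" value before visualizing or storing it.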
4. Test the Pipeline
Let's test our pipeline with a sample document. Create a file called test_pipeline.py with this code:
from document_pipeline import process_document
# Sample document text
sample_text = "John Smith works at Google. His email is john.smith@google.com and his phone number is (555) 123-4567."
# Process the document
result = process_document(sample_text)
print("LangExtract Result:")
print(result['langextract'])
print("\nStructured Data:")
print(result['structured'])
Why? This test ensures our pipeline works as expected before we move on to visualization.
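A run like this costs an API call every time. One way to exercise the pipeline's plumbing for free is to pass the extractor in as a parameter and stub it out. The sketch below is simplified — the real process_document above also calls LangExtract, which is omitted here:

```python
import json

def process_document(text, extractor):
    """Pipeline with the extractor injected, so tests can stub it"""
    return {"structured": extractor(text)}

# Stub that returns canned JSON instead of calling OpenAI
def fake_extractor(text):
    return json.dumps({"Name": "John Smith", "Company": "Google"})

result = process_document("John Smith works at Google.", fake_extractor)
print(json.loads(result["structured"])["Company"])  # → Google
```

This lets you verify the surrounding logic before spending tokens on real model calls.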
5. Visualize the Results
Now, let's create an interactive visualization. Add this to your document_pipeline.py file:
import json
import plotly.graph_objects as go
def visualize_extraction(data):
    """Show the extracted fields in an interactive table"""
    # The OpenAI step returns a JSON string, so parse it into a dict first
    fields = json.loads(data) if isinstance(data, str) else data
    fig = go.Figure(data=[go.Table(
        header=dict(values=["Field", "Value"]),
        cells=dict(values=[list(fields.keys()), list(fields.values())]),
    )])
    fig.update_layout(title="Document Extraction Results")
    fig.show()
Why? Visualization helps us understand and present the extracted data in an easy-to-read format.
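Beyond charts, you will often want the extracted fields in a tabular file. Here is a standard-library-only sketch that flattens a parsed result into CSV rows; io.StringIO stands in for a real file on disk:

```python
import csv
import io
import json

structured = '{"Name": "John Smith", "Email": "john.smith@google.com", "Company": "Google"}'
fields = json.loads(structured)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Field", "Value"])  # header row
for key, value in fields.items():
    writer.writerow([key, value])

print(buffer.getvalue().splitlines()[0])  # → Field,Value
```

The same two-column shape loads directly into pandas with pd.read_csv if you want richer analysis later.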
6. Run the Complete Pipeline
Finally, let's run everything together. Update your test_pipeline.py file:
from document_pipeline import process_document, visualize_extraction
# Sample document text
sample_text = "John Smith works at Google. His email is john.smith@google.com and his phone number is (555) 123-4567."
# Process the document
result = process_document(sample_text)
print("LangExtract Result:")
print(result['langextract'])
print("\nStructured Data:")
print(result['structured'])
# Visualize the results
visualize_extraction(result['structured'])
Why? This combines all our components into a full working pipeline.
Summary
Congratulations! You've built a complete document intelligence pipeline that:
- Processes unstructured text
- Uses Google's LangExtract for initial structure
- Uses OpenAI models for specific field extraction
- Visualizes the results interactively
This pipeline can be extended to handle more complex document types, additional fields, and more sophisticated visualizations. You've now learned the basics of building advanced document intelligence systems using modern AI tools.



