Introduction
In this tutorial, we'll explore how to leverage IBM's Granite 4.0 3B Vision model for enterprise-grade document data extraction. This vision-language model combines visual reasoning with language understanding to extract structured information from documents. We'll build a practical application that demonstrates document processing capabilities using the Hugging Face Transformers library and the granite-4.0-3b-vision model.
Prerequisites
- Python 3.8 or higher
- Basic understanding of machine learning and computer vision concepts
- Installed packages: transformers, torch, pillow, requests
- Access to a GPU (recommended) for optimal performance
Step-by-Step Instructions
1. Setting Up the Environment
1.1 Install Required Dependencies
We need to install the necessary libraries for working with the vision-language model. The transformers library provides easy access to pre-trained models, while torch handles the computational operations.
pip install transformers torch pillow requests
Why: These libraries provide the foundation for loading and running the model (transformers, torch), handling image input (pillow), and downloading sample documents over HTTP (requests).
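Before proceeding, it can help to confirm the environment is ready. The sketch below uses only the standard library's importlib.metadata to report which of the required packages are installed (the check_packages helper is a convenience for this tutorial, not part of any library):

```python
from importlib.metadata import version, PackageNotFoundError

def check_packages(names):
    """Return a dict mapping each package name to its installed version, or None if missing."""
    report = {}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report

# Report on the packages this tutorial depends on
print(check_packages(["transformers", "torch", "pillow", "requests"]))
```

Any entry that comes back as None should be installed with pip before continuing.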
1.2 Import Required Libraries
Start by importing the necessary modules for our implementation:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests
from io import BytesIO
Why: These imports give us access to the processor for preparing combined text and image inputs, the vision-to-text model class for document analysis, and image loading capabilities.
2. Loading the Granite 4.0 3B Vision Model
2.1 Initialize Model and Processor
Load the pre-trained Granite 4.0 3B Vision model from Hugging Face. This model is specifically optimized for document data extraction tasks. Vision-language models pair a processor (rather than a plain tokenizer) with the model, because inputs combine images and text.
model_name = "ibm/granite-4.0-3b-vision"
# Load processor and model
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
print(f"Model loaded on {device.upper()}")
Why: The AutoProcessor and AutoModelForVision2Seq classes automatically configure the correct preprocessing pipeline and model architecture for vision-to-text tasks. The float16 data type roughly halves memory usage with minimal impact on accuracy.
2.2 Configure Model for Inference
Set the model to evaluation mode and configure it for optimal performance:
model.eval()
# Set max new tokens for generation
max_new_tokens = 200
Why: Evaluation mode ensures the model behaves correctly during inference, and setting max_new_tokens controls the length of generated text responses.
3. Preparing Document Images
3.1 Load Sample Document Images
Prepare document images for processing. For demonstration, we'll use a sample document image:
# Download a sample document image
image_url = "https://example.com/sample_document.png"
response = requests.get(image_url, timeout=30)
response.raise_for_status()
image = Image.open(BytesIO(response.content))
# Display the image
image.show()
Why: This step demonstrates how to load document images from URLs, which is typical for enterprise applications where documents may be stored remotely.
3.2 Preprocess Images
Prepare images for model input by resizing and converting to the appropriate format:
# Resize image to a consistent working size
image = image.resize((768, 768))
# Convert to RGB if needed
if image.mode != "RGB":
    image = image.convert("RGB")
Why: Preprocessing ensures a consistent input format and size; converting to RGB guards against grayscale or palette-mode images that the model cannot consume directly, and resizing keeps very large scans from exhausting memory.
4. Performing Document Data Extraction
4.1 Create Input Prompt
Construct a prompt that guides the model to extract specific information from the document:
# Define extraction prompt
prompt = "Extract all the key information from this document. Focus on dates, names, amounts, and important identifiers."
# Prepare combined image and text inputs for the model
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
Why: The prompt guides the model's attention to specific data types, improving extraction accuracy for enterprise use cases.
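Extraction prompts tend to work better when the requested fields are listed explicitly and the output format is pinned down. As an illustration (the build_extraction_prompt helper below is a hypothetical convenience, not part of the model's API), a small function can assemble a field-specific prompt that asks for JSON output, which is much easier to post-process:

```python
def build_extraction_prompt(fields):
    """Assemble an extraction prompt that names the desired fields explicitly."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from this document: "
        f"{field_list}. "
        "Return the result as a JSON object with one key per field; "
        "use null for any field that is not present."
    )

# Example: a prompt tailored to invoice processing
prompt = build_extraction_prompt(["invoice_date", "vendor_name", "total_amount"])
print(prompt)
```

Pinning the output format this way makes downstream parsing of the model's response far more reliable than free-form extraction.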
4.2 Generate Text Response
Run the model to extract information from the document:
# Generate response
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_beams=4,
    )
# Decode the response
generated_text = processor.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)
Why: This generates the extracted information from the document. Deterministic beam search is used rather than temperature sampling, since extraction tasks benefit from reproducible output.
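The model returns free-form text, so a post-processing pass is usually needed to turn it into machine-readable values. The sketch below uses stdlib regular expressions to pick out ISO-style dates and dollar amounts from generated text; the patterns are illustrative only and would need tuning for real document formats and locales:

```python
import re

def extract_fields(text):
    """Pull ISO-format dates and dollar amounts out of free-form extraction output."""
    return {
        "dates": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
        "amounts": re.findall(r"\$\s?\d[\d,]*(?:\.\d{2})?", text),
    }

sample = "Invoice dated 2024-03-15, total due $1,234.56 by 2024-04-15."
print(extract_fields(sample))
```

For production use, prompting the model to emit JSON directly (and parsing it with json.loads, with a regex fallback) is generally more robust than scraping free text.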
5. Processing Multiple Documents
5.1 Create Document Processing Function
Build a reusable function for processing multiple documents:
def process_document(image_path, prompt):
    # Load and preprocess the image
    image = Image.open(image_path)
    if image.mode != "RGB":
        image = image.convert("RGB")
    image = image.resize((768, 768))
    # Prepare combined image and text inputs
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    # Generate and decode the response
    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            num_beams=4,
        )
    return processor.decode(generated_ids[0], skip_special_tokens=True)
Why: This reusable function allows batch processing of documents, which is essential for enterprise applications handling large volumes of data.
5.2 Process Batch of Documents
Process multiple documents with the same function:
# Process multiple documents
document_paths = ["document1.png", "document2.png", "document3.png"]
extraction_prompt = "Extract all the key financial data including amounts, dates, and transaction IDs."
for i, path in enumerate(document_paths):
try:
result = process_document(path, extraction_prompt)
print(f"Document {i+1} extraction result:")
print(result)
print("-" * 50)
except Exception as e:
print(f"Error processing {path}: {str(e)}")
Why: Batch processing demonstrates scalability for enterprise applications where multiple documents need automated extraction.
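In practice the per-document results need to be persisted rather than just printed. A minimal sketch (assuming results are plain strings keyed by file path; the save_results helper and file names are illustrative) writes them to a JSON file using only the standard library:

```python
import json

def save_results(results, output_path):
    """Write a {document_path: extracted_text} mapping to a JSON file."""
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

# Example: collect and persist extraction output for two documents
results = {
    "document1.png": "Invoice total: $250.00, dated 2024-01-10",
    "document2.png": "Contract signed on 2024-02-05",
}
save_results(results, "extraction_results.json")
```

A JSON artifact like this can then feed downstream systems (databases, review queues) without re-running the model.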
6. Optimizing Performance
6.1 Implement Model Caching
Cache model outputs to avoid redundant processing:
import hashlib

# Simple in-memory cache of extraction results
cache = {}

def get_image_hash(image_path):
    # Hash the raw file bytes so identical documents map to the same cache entry
    with open(image_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def get_cache_key(prompt, image_hash):
    return hashlib.md5((prompt + image_hash).encode()).hexdigest()

def process_document_cached(image_path, prompt):
    # Check cache before running the model
    cache_key = get_cache_key(prompt, get_image_hash(image_path))
    if cache_key in cache:
        return cache[cache_key]
    result = process_document(image_path, prompt)
    cache[cache_key] = result
    return result
Why: Caching improves performance by avoiding redundant processing of identical documents, crucial for enterprise applications.
Summary
In this tutorial, we've learned how to work with IBM's Granite 4.0 3B Vision model for enterprise document data extraction. We covered setting up the environment, loading the model, preprocessing document images, and extracting structured information from documents. The implementation demonstrates key concepts like prompt engineering, batch processing, and performance optimization that are essential for real-world enterprise applications.
Key takeaways include understanding how to leverage vision-language models for document analysis, implementing efficient processing pipelines, and optimizing for enterprise-scale deployments. This approach can be extended to various document types including invoices, contracts, and reports, making it valuable for businesses seeking automated data extraction solutions.



