Introduction
In this tutorial, you'll learn how to implement a data extraction pipeline similar to Google's Groundsource methodology. The goal is to build a system that can process unstructured news reports and extract structured data about natural disasters. This tutorial focuses on using large language models (LLMs) to transform raw text into structured formats that can be used for analysis and machine learning applications.
While the full Groundsource system is proprietary, we'll recreate the core concept using publicly available tools and open-source datasets to demonstrate how structured data can be extracted from unstructured news sources.
Prerequisites
- Basic understanding of Python programming
- Intermediate knowledge of natural language processing (NLP)
- Access to a Google Cloud account with Gemini API access
- Python libraries: requests, pandas, and google-generativeai (the json module ships with the Python standard library)
- Basic familiarity with data processing and JSON manipulation
Step-by-Step Instructions
1. Set Up Your Environment
First, create a new Python virtual environment and install the required packages:
```shell
python -m venv groundsource_env
source groundsource_env/bin/activate  # On Windows: groundsource_env\Scripts\activate
pip install google-generativeai pandas requests
```
Why: This ensures you have a clean environment with all necessary dependencies. The google-generativeai package provides access to Gemini models, while pandas helps with data manipulation.
2. Configure Your Gemini API Access
Obtain your API key from the Google Cloud Console and set it as an environment variable:
```shell
export GOOGLE_API_KEY="your_api_key_here"
```
Then in your Python script, initialize the client:
```python
import os
import google.generativeai as genai

# Read the key from the environment; never hard-code credentials in source.
genai.configure(api_key=os.getenv('GOOGLE_API_KEY'))
```
Why: The Gemini API key is required to access the language model. Setting it as an environment variable keeps your credentials secure.
3. Prepare Sample News Data
Create a sample news article about a flash flood event to test your extraction pipeline:
```python
sample_news = '''
On June 15, 2023, a severe flash flood struck downtown Seattle. The event lasted approximately 3 hours and affected 15,000 residents. Emergency services responded within 30 minutes. No casualties were reported. The flood caused $2.3 million in property damage and disrupted transportation for 2 days.
'''
```
Why: This sample represents the type of unstructured text that would be processed in a real-world scenario. It contains key information that we'll extract into structured data.
4. Define Your Data Extraction Prompt
Create a prompt that instructs the Gemini model to extract specific information from news articles:
```python
prompt_template = '''
Extract the following information from the news article:
1. Date of the event (format: YYYY-MM-DD)
2. Location (city, state, country)
3. Event type (e.g., flash flood, earthquake, hurricane)
4. Duration (in hours or days)
5. Number of affected residents
6. Emergency response time (in minutes)
7. Casualties (number)
8. Property damage (in USD)
9. Transportation disruption (in days)
Return only a JSON object with these fields. Do not include any additional text.
News article: {article}
'''
```
Why: This structured prompt guides the LLM to extract specific, consistent data points that can be easily parsed and stored in a database or CSV file.
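For the sample article above, a well-behaved model response might look like the following. The field names and values here are illustrative assumptions, not guaranteed output; the model chooses its own JSON keys unless the prompt pins them down exactly.

```python
import json

# Hypothetical response text for the sample article; actual key names
# depend on how the model interprets the numbered list in the prompt.
example_response = '''{
  "date": "2023-06-15",
  "location": "Seattle, Washington, USA",
  "event_type": "flash flood",
  "duration_hours": 3,
  "affected_residents": 15000,
  "response_time_minutes": 30,
  "casualties": 0,
  "property_damage_usd": 2300000,
  "transportation_disruption_days": 2
}'''

record = json.loads(example_response)
print(record["event_type"])  # flash flood
```

If you need stable column names downstream, consider listing the exact JSON keys in the prompt itself rather than relying on the model's naming.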
5. Implement the Extraction Function
Write a function that sends the prompt to the Gemini model and processes the response:
```python
import json

def extract_event_data(article):
    model = genai.GenerativeModel('gemini-pro')
    response = model.generate_content(prompt_template.format(article=article))
    # Try to parse the JSON response
    try:
        return json.loads(response.text)
    except json.JSONDecodeError:
        print(f"Failed to parse JSON: {response.text}")
        return None
```
Why: This function encapsulates the core logic of extracting structured data from unstructured text using the Gemini model. The error handling ensures that malformed responses don't break your entire pipeline.
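In practice, models sometimes wrap JSON in Markdown code fences even when asked not to. A small pre-parse cleanup step (a sketch, assuming fences of the form ```` ```json … ``` ````) makes parsing noticeably more robust:

```python
import json

def clean_json_text(text):
    """Strip optional Markdown code fences the model may add around JSON."""
    text = text.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # Drop the closing fence if present, then the opening fence line.
        if lines[-1].strip() == "```":
            lines = lines[:-1]
        text = "\n".join(lines[1:])
    return text.strip()

print(clean_json_text('```json\n{"casualties": 0}\n```'))  # {"casualties": 0}
```

You could call this on response.text before json.loads to reduce spurious JSONDecodeError failures.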
6. Test Your Extraction Pipeline
Run your extraction function with the sample news article:
```python
extracted_data = extract_event_data(sample_news)
if extracted_data:
    print(json.dumps(extracted_data, indent=2))
```
Why: Testing with a known input helps verify that your pipeline works correctly before scaling to larger datasets.
7. Process Multiple Articles
Create a function to process multiple news articles and store results in a structured format:
```python
import pandas as pd

def process_news_batch(articles):
    results = []
    for article in articles:
        data = extract_event_data(article)
        if data:
            results.append(data)
    return results

# Example usage (another_news_article is a second article string you supply)
articles = [sample_news, another_news_article]
batch_results = process_news_batch(articles)
df = pd.DataFrame(batch_results)
df.to_csv('flash_flood_events.csv', index=False)
```
Why: This step demonstrates how to scale your extraction process to handle multiple articles, which is essential for building a comprehensive dataset like Groundsource's.
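Batch runs against a hosted API will occasionally hit transient failures or rate limits. A minimal pause-and-retry wrapper (a sketch; call_with_retry is a hypothetical helper, not part of the google-generativeai API) keeps one bad call from aborting the whole batch:

```python
import time

def call_with_retry(fn, *args, retries=3, delay=2.0):
    """Call fn(*args), retrying with a fixed delay on any exception."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# Demo with a toy function standing in for extract_event_data:
calls = {"n": 0}
def flaky(x):
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient error")
    return x.upper()

print(call_with_retry(flaky, "ok", delay=0.01))  # OK
```

In process_news_batch you would wrap each call as call_with_retry(extract_event_data, article); exponential backoff is a common refinement over the fixed delay used here.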
8. Validate and Refine Results
After processing multiple articles, review the extracted data for consistency and accuracy:
```python
# Check for missing or inconsistent data
print(df.isnull().sum())
print(df.describe())

# Look for patterns in data quality
# (the column name depends on the JSON keys the model actually returned)
print(df['event_type'].value_counts())
```
Why: Data validation ensures that your extracted information is reliable and can be used for further analysis or machine learning models.
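Beyond pandas summaries, a lightweight per-record check can filter out incomplete extractions before they reach the CSV. This is a sketch; the field names and types below assume the schema discussed above and should be adjusted to whatever keys your prompt actually produces:

```python
# Assumed schema: required field name -> expected Python type.
REQUIRED_FIELDS = {
    "date": str,
    "event_type": str,
    "casualties": int,
}

def is_valid_record(record):
    """Return True if all required fields are present with the expected types."""
    return all(
        field in record and isinstance(record[field], expected)
        for field, expected in REQUIRED_FIELDS.items()
    )

good = {"date": "2023-06-15", "event_type": "flash flood", "casualties": 0}
bad = {"date": "2023-06-15"}
print(is_valid_record(good), is_valid_record(bad))  # True False
```

Calling is_valid_record inside process_news_batch before appending to results would keep malformed rows out of the DataFrame entirely.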
Summary
In this tutorial, you've learned how to implement a data extraction pipeline using Gemini models to transform unstructured news reports into structured data. You've created a system that can extract key information about natural disasters from text, similar to Google's Groundsource methodology. This approach demonstrates how large language models can be used to automate data collection and preprocessing, which is crucial for building comprehensive historical datasets.
The pipeline you've built can be extended to process larger datasets, integrate with databases, or be used as part of a larger AI system for disaster prediction and response planning. This is a foundational technique that many AI researchers and data scientists use when working with public datasets and real-world problem-solving.
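As a sketch of one such extension, the batch results could be written to SQLite instead of CSV. The table and column names here are assumptions chosen to match the illustrative schema used earlier:

```python
import sqlite3

# Records as produced by the extraction step (illustrative values).
records = [
    {"date": "2023-06-15", "event_type": "flash flood", "casualties": 0},
]

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    "CREATE TABLE events (date TEXT, event_type TEXT, casualties INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (:date, :event_type, :casualties)", records
)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 1
```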



