Building Smart AI Workflows: From PDFs to Structured Data
If you've been playing with ChatGPT or other AI tools, you know they're great at generating text. But what if you need more than just text? What if you need data in a specific format, ready to drop into a database? That's where structured output comes in.
Let me walk you through a real project that processes invoices from PDF files and extracts structured data using AI. Along the way, I'll show you some practical techniques that make AI workflows actually useful in production.
The Problem: PDFs Are Messy, Databases Need Structure
Say you run a business and receive dozens of PDF invoices every week. Each invoice has vendor info, customer details, amounts, dates, and tax information. You want all this data in a database for analysis, but manually typing it in? That's a waste of time.
The challenge: PDFs are just pixels and text. Databases need structured data with specific fields and types. We need AI to bridge that gap.
The Two-Step Approach: Python First, AI Second
Here's a key insight from this project: don't use AI for everything. Use it where it adds value.
from pypdf import PdfReader


def get_pdf_content(pdf_path: str) -> str:
    """Extract the text of every page in a PDF, with no AI involved."""
    with open(pdf_path, "rb") as f:
        reader = PdfReader(f)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
    return text

This code uses a Python library called pypdf to extract text from PDFs. No AI needed. For PDFs that contain a text layer, it's faster, cheaper, and more reliable than asking AI to read the file.
The workflow:
- Python extracts text from PDF (deterministic, cheap)
- AI extracts structured data from text (intelligent, necessary)
This is what I call a hybrid approach: use traditional code for mechanical tasks, use AI for tasks that need understanding.
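To make the split concrete, here is a minimal sketch of the two-step flow. The names `process_document`, `read_text`, and `extract_fields` are my own placeholders, not functions from the project; the two callables stand in for the PDF reader and the AI call, so the example runs without a PDF or an API key.

```python
from typing import Callable, Optional


def process_document(
    path: str,
    read_text: Callable[[str], str],
    extract_fields: Callable[[str], dict],
) -> Optional[dict]:
    """Run the cheap deterministic step first, then the AI step.

    read_text and extract_fields are hypothetical stand-ins for the
    PDF reader and the structured-output call.
    """
    text = read_text(path).strip()
    if not text:
        # Nothing to analyze (e.g. a scanned PDF with no text layer),
        # so skip the AI call entirely.
        return None
    return extract_fields(text)


# Stub functions in place of the real PDF reader and model call
def fake_reader(path: str) -> str:
    return "Invoice INV-1 from Acme, total 120.50"


def fake_extractor(text: str) -> dict:
    return {"invoiceNumber": "INV-1", "totalAmount": 120.50}


result = process_document("invoice.pdf", fake_reader, fake_extractor)
print(result)
```

Passing the two steps in as functions is just one way to keep the mechanical part and the AI part separate and easy to test on their own.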
Structured Output: Teaching AI to Speak JSON
Once we have the text, we need to tell the AI exactly what format we want. This is where OpenAI's structured output API shines.
Here's the magic part:
import os

import requests

response = requests.post(
    "https://api.openai.com/v1/responses",
    # Assumes OPENAI_API_KEY has already been loaded into the environment
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "input": prompt,
        "text": {
            "format": {
                "name": "invoice",
                "type": "json_schema",
                "schema": {
                    "type": "object",
                    "properties": {
                        "vendor": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "address": {"type": "string"},
                                "taxId": {"type": "string"}
                            },
                            "required": ["name", "address", "taxId"],
                            # Strict mode expects this on every object
                            "additionalProperties": False
                        },
                        "customer": {
                            # ... similar structure
                        },
                        "invoiceNumber": {"type": "string"},
                        "date": {"type": "string"},
                        "totalAmount": {"type": "number"},
                        "tax": {"type": "number"}
                    },
                    "required": ["vendor", "customer", "invoiceNumber", "date", "totalAmount", "tax"],
                    "additionalProperties": False
                },
                "strict": True
            }
        }
    }
)

That "strict": True parameter is crucial. It tells OpenAI: "Don't give me anything that doesn't match this schema." No extra commentary, no "Here's what I found..." Just pure JSON that conforms to the schema.
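Once the call returns, the JSON still has to be pulled out of the response body. The sketch below assumes the raw Responses API payload carries an "output" list whose message items hold "output_text" parts; that shape is my reading of the API, so check it against a real response before relying on it. `parse_structured_output` and the sample payload are made up for illustration.

```python
import json


def parse_structured_output(payload: dict) -> dict:
    """Return the first output_text part of a response payload as a dict."""
    for item in payload.get("output", []):
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                # With a JSON schema format, the text is a JSON document
                return json.loads(part["text"])
    raise ValueError("no output_text part found in response")


# Hand-written payload in the assumed shape, not a real API response
sample_payload = {
    "output": [
        {
            "type": "message",
            "content": [
                {
                    "type": "output_text",
                    "text": '{"invoiceNumber": "INV-1", "totalAmount": 120.5, "tax": 20.5}',
                }
            ],
        }
    ]
}

invoice = parse_structured_output(sample_payload)
print(invoice["invoiceNumber"], invoice["totalAmount"])
```

In the real workflow you would pass `response.json()` instead of the sample payload.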
The schema acts like a contract. You define the structure once, and every response follows it. Because every required field is always present, your database insertion code can stay simple:
def insert_invoice_data(conn, invoice_data):
    cursor = conn.cursor()
    cursor.execute('''
        INSERT INTO invoices (
            vendor_name, vendor_address, vendor_tax_id,
            customer_name, customer_address, customer_tax_id,
            invoice_number, "date", total_amount, tax
        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    ''', (
        invoice_data.get("vendor", {}).get("name"),
        invoice_data.get("vendor", {}).get("address"),
        invoice_data.get("vendor", {}).get("taxId"),
        # ... rest of the fields
    ))
    conn.commit()

Multi-Step AI Workflows: The Web Content Pipeline
The second part of this project shows a more complex pattern: chaining multiple AI calls together, each building on the previous one.
The pipeline:
URL → Fetch HTML → Extract Core Content → Summarize → Generate Social Post

Each step refines the output:
Step 1: Extract Core Content
def extract_core_website_content(html: str) -> str:
    response = get_ai_response(
        model="gemma3:4b-it-qat",
        prompt=f"""
        You are an expert web content extractor.
        Extract the core content from this HTML, excluding navigation, footers, scripts, etc.
        <html>
        {html}
        </html>
        """,
        ctx=20000  # Large context window for HTML
    )
    return response

Notice the ctx=20000 parameter? HTML pages can be huge. This tells the local Ollama model to use a context window large enough for big inputs, at the cost of more memory.
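The article never shows `get_ai_response` itself, so here is a guess at what such a helper could look like. It assumes `ctx` is forwarded to Ollama's `num_ctx` option on the `/api/generate` endpoint, and it uses the standard library's `urllib` instead of `requests` so the sketch has no dependencies. `build_payload` is a name I made up.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, ctx: Optional[int] = None) -> dict:
    """Build the request body; ctx (if given) becomes Ollama's num_ctx option."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if ctx is not None:
        payload["options"] = {"num_ctx": ctx}
    return payload


def get_ai_response(model: str, prompt: str, ctx: Optional[int] = None) -> str:
    """Send one non-streaming generate request to a local Ollama server."""
    data = json.dumps(build_payload(model, prompt, ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        # With "stream": False the generated text is in the "response" field
        return json.loads(resp.read())["response"]


# Building the payload needs no running server
print(build_payload("gemma3:4b-it-qat", "Hello", ctx=20000))
```

Calling `get_ai_response(...)` needs Ollama running locally with the model already pulled.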
Step 2: Summarize
def summarize_content(content: str) -> str:
    response = get_ai_response(
        model="gemma3:1b-it-qat",  # Smaller model
        prompt=f"""
        Summarize this content into concise bullet points.
        Avoid unnecessary explanations.
        <content>
        {content}
        </content>
        """
    )
    return response

Key technique: Use a smaller model (1b) for straightforward tasks like summarization. It's faster and uses less memory. Save the big models for complex work.
Step 3: Generate Social Media Post
import json


def generate_x_post(summary: str) -> str:
    # Load the few-shot examples that define the desired style
    with open("post-examples.json", "r") as f:
        examples = json.load(f)
    examples_str = ""
    for i, example in enumerate(examples, 1):
        examples_str += f"""
        <example-{i}>
        <topic>{example['topic']}</topic>
        <generated-post>{example['post']}</generated-post>
        </example-{i}>
        """
    prompt = f"""
    You are an expert social media manager.
    Generate an engaging X post based on this summary.
    <summary>
    {summary}
    </summary>
    Here are examples of your style:
    <examples>
    {examples_str}
    </examples>
    Use the same tone and structure, but don't reuse the content.
    """
    response = get_ai_response(
        model="gemma3:27b-it-qat",  # Largest model for creative work
        prompt=prompt
    )
    return response

This demonstrates several advanced techniques:
Technique 1: Dynamic Prompt Engineering
The prompt isn't static. It dynamically includes:
- The summary (generated in the previous step)
- Examples loaded from a JSON file
- Specific instructions about tone and style
This is prompt composition: building prompts programmatically based on context.
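As a rough illustration of prompt composition, the helper below assembles a prompt from an instruction plus any number of tagged sections, skipping sections that are empty. `compose_prompt` and the section names are hypothetical; the project builds its prompts inline with f-strings instead.

```python
def compose_prompt(instructions: str, sections: dict[str, str]) -> str:
    """Join instructions with XML-style tagged sections, dropping empty ones."""
    parts = [instructions.strip()]
    for tag, body in sections.items():
        if not body or not body.strip():
            continue  # Leave out sections with nothing in them
        parts.append(f"<{tag}>\n{body.strip()}\n</{tag}>")
    return "\n".join(parts)


prompt = compose_prompt(
    "Generate an engaging X post based on this summary.",
    {"summary": "AI turns PDF invoices into database rows.", "examples": ""},
)
print(prompt)
```

Keeping the assembly in one function makes it easy to add or drop context (examples, tone notes, previous step output) without rewriting the prompt text.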
Technique 2: Few-Shot Learning
By including examples from post-examples.json, we're showing the AI what "good" looks like. This is called few-shot prompting: instead of explaining what you want, you show examples.
From those examples, the model picks up:
- Your writing style
- Your formatting preferences
- Your tone
Technique 3: Model Selection Strategy
Look at the progression:
- 4b model for extraction (mechanical task)
- 1b model for summarization (simple task)
- 27b model for creative writing (complex task)
This is resource optimization in action. The 27b model would work for summarization, but it's overkill. The 1b model handles the job and runs much faster.
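One simple way to keep that choice in a single place is a lookup table from task to model. The model tags come from the code above; the task names and `pick_model` are my own labels.

```python
# Task names are hypothetical; model tags are the ones used in this project
MODEL_FOR_TASK = {
    "extract": "gemma3:4b-it-qat",
    "summarize": "gemma3:1b-it-qat",
    "write_post": "gemma3:27b-it-qat",
}


def pick_model(task: str) -> str:
    """Return the model configured for a task, failing loudly on typos."""
    try:
        return MODEL_FOR_TASK[task]
    except KeyError:
        raise ValueError(f"no model configured for task {task!r}") from None


print(pick_model("summarize"))
```

Swapping a model then means editing one dictionary entry rather than hunting through each step.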
Technique 4: Progressive Refinement
Each step narrows the focus:
- Raw HTML (10,000+ words, lots of noise)
- Core content (2,000 words, relevant info)
- Summary (200 words, key points)
- Social post (50 words, punchy and engaging)
This is better than asking AI to go directly from HTML to a social post. Each step is simpler and more reliable.
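Here is a small sketch of that chaining idea, with plain Python functions standing in for the AI steps so it runs without any model. `run_pipeline` and the three stub steps are invented for the example; in the real project each step would call `get_ai_response`.

```python
import re
from typing import Callable, Iterable

Step = Callable[[str], str]


def run_pipeline(initial: str, steps: Iterable[Step]) -> str:
    """Feed each step's output into the next, stopping on empty output."""
    result = initial
    for step in steps:
        result = step(result)
        if not result.strip():
            name = getattr(step, "__name__", repr(step))
            raise ValueError(f"step {name} returned empty output")
    return result


# Stub steps standing in for extract -> summarize -> generate post
def strip_tags(html: str) -> str:
    return re.sub(r"<[^>]+>", " ", html)


def first_sentence(text: str) -> str:
    return text.split(".")[0].strip() + "."


def as_post(summary: str) -> str:
    return f"New post: {summary}"


html = "<p>Structured output is useful. More text.</p>"
post = run_pipeline(html, [strip_tags, first_sentence, as_post])
print(post)
```

Checking for empty output between steps is one cheap way to catch a failed stage early instead of passing garbage down the chain.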
The Architecture That Ties It Together
The project uses two different AI approaches:
OpenAI API (cloud-based):
requests.post(
    "https://api.openai.com/v1/responses",
    json={
        "model": "gpt-4o-mini",
        "text": {"format": {"type": "json_schema", ...}}
    }
)

Ollama API (local):
requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:1b-it-qat",
        "prompt": prompt,
        "stream": False
    }
)

OpenAI is great for structured output and complex extraction. Ollama runs locally, giving you privacy and no per-request costs. The developer chose the right tool for each job.
The Database Layer: Completing the Loop
After all the AI magic, we end up with structured data in SQLite:
import sqlite3


def setup_database():
    conn = sqlite3.connect("invoices.db")
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS invoices (
            id INTEGER PRIMARY KEY,
            vendor_name TEXT,
            vendor_address TEXT,
            vendor_tax_id TEXT,
            customer_name TEXT,
            customer_address TEXT,
            customer_tax_id TEXT,
            invoice_number TEXT,
            date TEXT,
            total_amount REAL,
            tax REAL
        )
    ''')
    conn.commit()
    return conn

Now you can run SQL queries:

SELECT vendor_name, SUM(total_amount)
FROM invoices
GROUP BY vendor_name;

The AI workflow ends with actionable data, not just text sitting in a chat window.
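To see the whole loop in one place, the sketch below flattens nested invoice dictionaries into rows, stores them in an in-memory SQLite database, and runs the per-vendor total. The sample invoices are made up, `flatten_invoice` is my own helper, and I'm assuming the customer object mirrors the vendor one (name, address, taxId), which the article implies but never spells out.

```python
import sqlite3


def flatten_invoice(invoice: dict) -> tuple:
    """Turn the nested invoice dict into one row, in table column order."""
    vendor = invoice.get("vendor", {})
    customer = invoice.get("customer", {})  # assumed to mirror vendor
    return (
        vendor.get("name"), vendor.get("address"), vendor.get("taxId"),
        customer.get("name"), customer.get("address"), customer.get("taxId"),
        invoice.get("invoiceNumber"), invoice.get("date"),
        invoice.get("totalAmount"), invoice.get("tax"),
    )


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute("""
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY,
        vendor_name TEXT, vendor_address TEXT, vendor_tax_id TEXT,
        customer_name TEXT, customer_address TEXT, customer_tax_id TEXT,
        invoice_number TEXT, date TEXT, total_amount REAL, tax REAL
    )
""")

# Made-up sample data in the shape the schema describes
invoices = [
    {"vendor": {"name": "Acme"}, "invoiceNumber": "A-1", "totalAmount": 100.0, "tax": 20.0},
    {"vendor": {"name": "Acme"}, "invoiceNumber": "A-2", "totalAmount": 50.5, "tax": 10.1},
    {"vendor": {"name": "Globex"}, "invoiceNumber": "G-1", "totalAmount": 20.0, "tax": 4.0},
]
conn.executemany(
    """
    INSERT INTO invoices (
        vendor_name, vendor_address, vendor_tax_id,
        customer_name, customer_address, customer_tax_id,
        invoice_number, date, total_amount, tax
    ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """,
    [flatten_invoice(inv) for inv in invoices],
)
conn.commit()

totals = conn.execute(
    "SELECT vendor_name, SUM(total_amount) FROM invoices "
    "GROUP BY vendor_name ORDER BY vendor_name"
).fetchall()
print(totals)
```

Missing fields simply become NULL here, which is a design choice for the demo; a stricter pipeline might reject incomplete invoices instead.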
Key Takeaways
If you're building AI workflows, here's what this project teaches us:
- Hybrid approach wins: Use traditional code where it's faster/cheaper, use AI where understanding is needed
- Structured output is essential: For production systems, you need predictable data formats, not free-form text
- Chain AI calls thoughtfully: Break complex tasks into steps, each with a clear input and output
- Match model size to task complexity: Small models for simple tasks, big models for creative/complex work
- Compose prompts dynamically: Build prompts from examples, context, and structured data
- Few-shot learning is powerful: Show the AI examples of what you want instead of just describing it
- Progressive refinement beats one-shot: Going from "messy input → refined output 1 → refined output 2 → final output" is more reliable than trying to do it all at once
The Bottom Line
AI workflows aren't about having a conversation with ChatGPT. They're about building systems that:
- Take real inputs (PDFs, websites, files)
- Process them through multiple steps
- Produce structured, usable outputs (databases, APIs, reports)
This invoice processor isn't just a demo. It's a production-ready pattern you can adapt for:
- Processing receipts
- Extracting data from contracts
- Parsing resumes
- Analyzing customer feedback
- Converting documents to structured data
The code is straightforward. The techniques are practical. And the results are immediately useful.
That's how you build AI workflows that actually work.
Tech Stack Used:
- Python 3.11
- OpenAI API (gpt-4o-mini)
- Ollama (gemma3 models: 1b, 4b, 27b)
- pypdf for PDF text extraction
- SQLite for data storage
- requests for HTTP calls
Setup:
# Install dependencies
uv sync
# Set API key
echo "OPENAI_API_KEY=your-key-here" > .env
# Run invoice processor
python main.py /path/to/invoices/
# Run web content processor
python open.py

The full code is available here: https://github.com/vvduth/ai-basic/tree/main/02-project