Building an AI Agent for Invoice Processing: Architecture Deep Dive
Table of Contents
- The Problem
- Is This RAG?
- System Category: Agentic Pipeline
- High-Level Architecture
- The Technology Stack
- Layer 1 — The Acting Pipeline (11 Stages)
- How LLM Calls Are Structured
- Layer 2 — The Investigation Layer
- Layer 3 — The Adaptive Learning Framework (ALF)
- The Learning Mode: Teaching the Agent
- Streaming the Pipeline in Real Time
- Provider Agnosticism: One Interface, Any Model
- Why Not LangChain, LlamaIndex, or RAG?
- Data Flow: End to End
- Key Design Decisions
1. The Problem
Companies receive hundreds of invoices every month. Each one is a PDF that a human finance team member must manually:
- Read the numbers, dates, and line items out of the PDF
- Verify that the maths adds up correctly
- Check that the invoice complies with the pre-authorized work order (the WAF — Work Authorization Form)
- Decide: pay it or send it back with a rejection reason
This is repetitive, slow, and error-prone at scale. I built an AI agent that automates the entire workflow — upload a PDF, get a structured ACCEPT or REJECT verdict in seconds, with a full explanation and audit trail.
What makes it interesting architecturally is not just the automation, but how I designed the system to:
- Be transparent about its own reasoning (the investigation layer audits the agent itself)
- Learn from human corrections without retraining (the adaptive learning framework)
- Work with any LLM provider by changing one environment variable
- Never become a black box — every stage writes a structured JSON artifact to disk
2. Is This RAG?
No. This is one of the most important clarifications I can make about the architecture.
RAG (Retrieval-Augmented Generation) is a pattern where, for each query, you:
- Convert the query into a vector embedding
- Search a vector database for the closest matching chunks of stored knowledge
- Inject those chunks as context into the LLM prompt
- Let the LLM answer using that retrieved context
RAG pattern:
Query → Embed → Vector DB search → Top-K chunks → LLM prompt → Answer
This system does none of that. There is no vector database, no embedding model, no similarity search.
Instead, this is a document processing pipeline combined with a rule-based correction engine. The "knowledge" in this system is:
| Knowledge type | Where it lives | How it's used |
|---|---|---|
| Business rules | reconstructed_rules_book.md | Loaded once, fits in one LLM context window |
| Correction rules | rule_base.json | Deterministic JSON rules evaluated at runtime |
| Invoice data | Uploaded PDFs | Parsed fresh per run by pdfplumber |
| Historical cases | Filesystem (agent_output/) | Read by the Learning mode for impact assessment |
The rules book is small enough to fit entirely inside an LLM prompt — so there is no need to retrieve chunks of it. That is a deliberate architectural choice that keeps the system simpler and faster than RAG would be.
If the rules book grew to thousands of pages, I would add RAG at the investigation layer only. The acting pipeline would remain unchanged.
3. System Category: Agentic Pipeline
The correct category for this architecture is an agentic pipeline with structured LLM calls.
┌────────────────────────────────────────────────────┐
│                Architectural Pattern               │
│                                                    │
│  Agentic Pipeline — NOT:                           │
│   ✗ RAG (no vector retrieval)                      │
│   ✗ ReAct agent (no tool-calling loop)             │
│   ✗ Multi-agent debate (one pipeline, one agent)   │
│                                                    │
│  IS:                                               │
│   ✓ Sequential LLM pipeline                        │
│   ✓ Structured output extraction (Instructor)      │
│   ✓ Deterministic validation gates                 │
│   ✓ LLM-audited compliance layer                   │
│   ✓ Rule-based correction engine (ALF)             │
│   ✓ Human-in-the-loop learning mode                │
└────────────────────────────────────────────────────┘
The LLM is not "thinking freely" or "choosing what to do next" — each call has a strict Pydantic schema it must return, enforced by the Instructor library with automatic retries. The LLM is used as a structured data extractor and semantic reasoner, not as an autonomous agent.
4. High-Level Architecture
┌──────────────────────────────────────────────────────────────────┐
│                            User / SME                            │
│     Upload PDF                       Review cases, write rules   │
└───────────┬───────────────────────────────────────┬──────────────┘
            │ multipart/form-data                   │ REST JSON
            ▼                                       ▼
┌──────────────────────────────────────────────────────────────────┐
│                     FastAPI (async, uvicorn)                     │
│   POST /api/inference/run                 /api/learning/*        │
│   GET  /api/inference/stream/{id}         SSE events             │
└──────────────────┬───────────────────────────────┬───────────────┘
                   │                               │
                   ▼                               ▼
       ┌───────────────────────┐     ┌───────────────────────────┐
       │    Acting Pipeline    │     │       Learning Mode       │
       │      (11 stages)      │     │   SafeRule Orchestrator   │
       └───────────┬───────────┘     └───────────────────────────┘
                   │
      ┌────────────┼────────────┐
      ▼            ▼            ▼
┌───────────┐ ┌──────────┐ ┌──────────┐
│  LiteLLM  │ │pdfplumber│ │rule_base │
│+Instructor│ │ PDF parse│ │  .json   │
└───────────┘ └──────────┘ └──────────┘
      │
┌─────┴──────────┐
│  OpenAI        │
│  Anthropic     │  ← swapped via SMART_MODEL / FAST_MODEL env vars
│  Gemini        │
│  Ollama        │
└────────────────┘
The frontend is a React 19 + Vite SPA that communicates with the FastAPI backend over JSON REST and Server-Sent Events for the live pipeline stream.
5. The Technology Stack
I deliberately kept this stack minimal and replaceable. The philosophy: use open-source libraries for what they solve well; do not add frameworks just to have a framework.
| Concern | Library | Why |
|---|---|---|
| Web framework | FastAPI | Async, OpenAPI docs, Pydantic-native |
| LLM provider abstraction | LiteLLM | One interface, 100+ providers — the Python equivalent of Laravel Prism |
| Structured LLM output | Instructor | Wraps LiteLLM with Pydantic validation + auto-retry on bad JSON |
| Settings | pydantic-settings | Typed .env loading with zero boilerplate |
| PDF parsing | pdfplumber | Reliable text extraction from invoices |
| Logging | structlog | JSON-structured logs per stage |
| Frontend | React 19 + Vite + TypeScript | React Compiler, useActionState, native SSE |
What I deliberately excluded:
| Library | Why excluded |
|---|---|
| LangChain | No chains or tool-calling needed; plain async def is cleaner |
| LlamaIndex | No vector retrieval; rules book fits in context |
| Vector DB | No RAG; deterministic rules + LLM reasoning is sufficient |
| Celery / Redis | Out of scope for v1; add only if pipeline becomes background-queued |
6. Layer 1 — The Acting Pipeline (11 Stages)
The core of the system. When a PDF is uploaded, it passes through 11 sequential stages:
PDF files
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│                       ACTING PIPELINE                       │
│                                                             │
│  Stage 1  🗂️  Classification     LLM (fast model)           │
│     ↓     Which PDF is the invoice? Which is the WAF?       │
│                                                             │
│  Stage 2  🔍  Extraction         LLM (smart model)          │
│     ↓     Extract all fields into Pydantic models           │
│                                                             │
│  Stage 3  📌  Phase 1 Validation Deterministic              │
│     ↓     Are mandatory fields present?                     │
│                                                             │
│  Stage 4  📋  Phase 2 Validation Deterministic              │
│     ↓     Are line items present and valid?                 │
│                                                             │
│  Stage 5  📅  Phase 3 Validation Deterministic              │
│     ↓     Are dates valid? Is tax ID format correct?        │
│                                                             │
│  Stage 6  🔢  Phase 4 Validation Deterministic              │
│     ↓     Do totals balance? Do hours match the WAF?        │
│                                                             │
│  Stage 7  🔄  Transformation     LLM (fast model)           │
│     ↓     Map line items to standard 9-code taxonomy        │
│                                                             │
│  Stage 8  ⚖️  Decision           Deterministic              │
│     ↓     ACCEPT or REJECT                                  │
│                                                             │
│  Stage 9  📝  Audit Log          Deterministic              │
│     ↓     Write all artifacts to disk                       │
│                                                             │
│  Stage 10 🔎  Investigation      LLM (smart model) — opt.   │
│     ↓     Did the agent follow the rules book?              │
│                                                             │
│  Stage 11 🤖  ALF Rule Engine    Deterministic + opt. LLM   │
│     ↓     Apply learned correction rules                    │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
InferenceResponse (Pydantic model → JSON to client)
Short-circuit logic
Validation stages 3–6 can short-circuit the pipeline. If Phase 1 validation fails (e.g. the invoice is missing a vendor name), the pipeline jumps directly to Stage 8 (Decision = REJECT), skipping stages 4–7. This mimics how a human reviewer would stop reading after finding a fundamental problem.
Stage 3 FAIL ──────────────────────────────────→ Stage 8 REJECT
Stage 4 FAIL ───────────────────────────────────→ Stage 8 REJECT
Stage 5 FAIL ────────────────────────────────────→ Stage 8 REJECT
Stage 6 FAIL ─────────────────────────────────────→ Stage 8 REJECT
All pass → Stage 7 → Stage 8 ACCEPT
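In code, the short-circuit is just an early return. A minimal, self-contained sketch (the phase checks and field names are illustrative stand-ins, not the real validation logic):

from dataclasses import dataclass

@dataclass
class PhaseResult:
    name: str
    passed: bool

def decide(results: list[PhaseResult]) -> str:
    """Stage 8 sketch: deterministic verdict from the validation results."""
    return "REJECT" if any(not r.passed for r in results) else "ACCEPT"

def run_validations(data: dict) -> str:
    """Stages 3-6 sketch: evaluate phases in order, stop at the first failure."""
    phases = [
        ("phase1", lambda d: bool(d.get("vendor_name"))),
        ("phase2", lambda d: len(d.get("line_items", [])) > 0),
        ("phase3", lambda d: bool(d.get("invoice_date"))),
        ("phase4", lambda d: abs(d.get("total", 0.0) - sum(d.get("line_totals", []))) < 0.01),
    ]
    results: list[PhaseResult] = []
    for name, check in phases:
        result = PhaseResult(name=name, passed=bool(check(data)))
        results.append(result)
        if not result.passed:
            return decide(results)  # jump straight to the decision, skip the rest
    return decide(results)

# An invoice with no vendor name fails phase 1 and never reaches phase 2.
assert run_validations({"line_items": [1], "invoice_date": "2024-01-01"}) == "REJECT"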
LLM stages vs deterministic stages
Only 3 of the 9 core acting stages use LLM calls (stages 10 and 11 belong to the investigation and ALF layers, covered in their own sections):
| Stage | Type | Model used | Why LLM? |
|---|---|---|---|
| Classification | LLM | Fast model | Reading PDF content semantically to identify document type |
| Extraction | LLM | Smart model | Pulling structured fields from unstructured PDF text |
| Transformation | LLM | Fast model | Mapping vendor line-item descriptions to a standard taxonomy |
The other 6 stages are pure Python — they check conditions like "does this field exist?" or "does 38.50 + 5.78 = 44.28 within tolerance?" These do not need an LLM.
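For instance, the totals check in Phase 4 boils down to a few lines of arithmetic (the one-cent tolerance here is an assumption; the real value may differ):

from decimal import Decimal

TOLERANCE = Decimal("0.01")  # assumed: one cent

def totals_balance(total_ex_tax: Decimal, tax: Decimal, total_inc_tax: Decimal) -> bool:
    """Phase 4 style check: the ex-tax total plus tax must equal the inc-tax total."""
    return abs((total_ex_tax + tax) - total_inc_tax) <= TOLERANCE

assert totals_balance(Decimal("38.50"), Decimal("5.78"), Decimal("44.28"))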
7. How LLM Calls Are Structured
Every LLM call in the system goes through a single 50-line wrapper:
# app/llm/client.py
from typing import TypeVar

import instructor
from litellm import acompletion
from pydantic import BaseModel

from app.config import settings  # assumed location of the pydantic-settings object

T = TypeVar("T", bound=BaseModel)

# Instructor wraps LiteLLM's async entry point so response_model is enforced.
_aclient = instructor.from_litellm(acompletion)

async def complete(
    *,
    system: str,
    user: str,
    response_model: type[T] | None = None,
    model: str | None = None,
    temperature: float = 0.0,
) -> T | str:
    """Single entry point for ALL LLM calls in the codebase."""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
    if response_model is not None:
        return await _aclient.chat.completions.create(
            model=model or settings.smart_model,
            messages=messages,
            response_model=response_model,  # Instructor enforces this Pydantic model
            temperature=temperature,
            max_retries=2,
        )
    # Free-text path: plain LiteLLM call, no schema enforcement.
    response = await acompletion(
        model=model or settings.smart_model,
        messages=messages,
        temperature=temperature,
    )
    return response.choices[0].message.content
The key insight is the response_model parameter. Instead of getting free-form text from the LLM, every call returns a validated Pydantic model. Instructor handles the enforcement loop: if the LLM returns malformed JSON or a schema mismatch, it automatically re-prompts up to 2 times.
┌────────────────────────────────────────────┐
│         Instructor enforcement loop        │
│                                            │
│  1. Build prompt with schema description   │
│  2. Call LLM                               │
│  3. Parse LLM response as Pydantic model   │
│  4. If validation error → re-prompt (×2)   │
│  5. Return validated model                 │
└────────────────────────────────────────────┘
For example, Stage 2 (Extraction) calls the LLM and asks it to fill in this exact Pydantic model:
class InvoiceData(BaseModel):
    invoice_number: str = ""
    invoice_date: str = ""  # YYYY-MM-DD
    invoice_total_inc_tax: float = 0.0
    invoice_total_ex_tax: float = 0.0
    vendor_name: str = ""
    tax_id: str = ""
    line_items: list[LineItem] = []  # LineItem is another Pydantic model (fields omitted here)
    currency: str = "AUD"
The LLM cannot return a bare string or a mistyped payload: the response must be a JSON object that validates against this schema, or Instructor retries. Extra fields the model invents are dropped by Pydantic's default extra-field handling, so they never leak into the pipeline.
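A pipeline stage then reduces to a single call. A sketch of the extraction step (the prompt text is illustrative):

async def extract_invoice(pdf_text: str) -> InvoiceData:
    """Stage 2 sketch: one structured call, one typed result."""
    return await complete(
        system="Extract the invoice fields from the document text. "
               "Use empty defaults for anything that is not present.",
        user=pdf_text,  # raw text produced by pdfplumber
        response_model=InvoiceData,  # Instructor guarantees this type back
    )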
8. Layer 2 — The Investigation Layer
After the acting pipeline produces a verdict, the investigation layer audits the agent itself. It is an independent compliance check that asks: "Did the agent follow the rules, and did it make the right call?"
This is a key architectural decision: the system is self-auditing. It catches cases where the LLM might have missed a rule, made a borderline call, or applied inconsistent logic.
┌───────────────────────────────────────────────────────────┐
│                    INVESTIGATION LAYER                    │
│                                                           │
│  Layer 1: Phase Assessment         Deterministic          │
│  ─────────────────────────────────────────────            │
│  For each acting phase (1–4):                             │
│    Grade: COMPLIANT / AMBIGUOUS / VIOLATION / SKIPPED     │
│  No LLM — pure rule checking against the decision log     │
│                                                           │
│  Layer 2: Rule Discovery           LLM (smart) + SHA cache│
│  ─────────────────────────────────────────────            │
│  Parse reconstructed_rules_book.md into 7 rule groups     │
│  SHA-256 of the rules book → cache key                    │
│  Only calls LLM once per rules book version               │
│                                                           │
│  Layer 3: Per-Group Validation     LLM (smart)            │
│  ─────────────────────────────────────────────            │
│  For each rule group, ask LLM:                            │
│    "Did the agent's actions comply with these rules?"     │
│  Only confirm a violation at confidence ≥ 0.95            │
└───────────────────────────────────────────────────────────┘
                             │
                             ▼
                   InvestigationOutput:
                     status: PASSED | MINOR_VIOLATION | MAJOR_VIOLATION
                     compliance_score: 0–100
                     rejection_justified: bool | None
                     summary: str
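The confidence gate deserves a note: a suspected violation below the threshold is treated as ambiguity rather than a finding. A small sketch (the model fields are assumptions based on the output shape above):

from pydantic import BaseModel

class GroupFinding(BaseModel):
    rule_group: str
    violation: bool
    confidence: float  # 0.0-1.0, the LLM's self-reported certainty

def confirmed_violations(findings: list[GroupFinding]) -> list[GroupFinding]:
    """Keep only violations at confidence >= 0.95; everything below stays AMBIGUOUS."""
    return [f for f in findings if f.violation and f.confidence >= 0.95]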
The caching trick for rule discovery
Rule discovery is the most expensive operation — it asks the LLM to parse and structure the entire rules book. To avoid paying this cost on every run, I hash the rules book file with SHA-256 and use that as a cache key:
rules_book.md → SHA-256 → "a3f8c1..."
                   │
        ┌──────────┴──────────┐
        │                     │
    cache hit             cache miss
        │                     │
 load from                   call LLM once,
 rule_discovery_cache.json  save to cache
Every subsequent run with the same rules book is a zero-cost cache hit. Only when the rules book changes does the LLM get called again.
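The cache itself is a few lines of code. A sketch, assuming a flat JSON dict keyed by digest; llm_parse_rules is a hypothetical stand-in for the actual discovery call:

import hashlib
import json
from pathlib import Path

CACHE_PATH = Path("rule_discovery_cache.json")

def rules_book_key(rules_book: Path) -> str:
    """Cache key = SHA-256 digest of the rules book bytes."""
    return hashlib.sha256(rules_book.read_bytes()).hexdigest()

async def discover_rules(rules_book: Path) -> dict:
    key = rules_book_key(rules_book)
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    if key in cache:
        return cache[key]  # cache hit: zero LLM cost
    groups = await llm_parse_rules(rules_book.read_text())  # hypothetical: the one LLM call
    cache[key] = groups
    CACHE_PATH.write_text(json.dumps(cache))
    return groups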
9. Layer 3 — The Adaptive Learning Framework (ALF)
ALF is the most architecturally interesting component. It is a deterministic rule engine that runs after the investigation layer and can override the acting agent's decision.
┌───────────────────────────────────────────────────────────┐
│                         ALF ENGINE                        │
│                                                           │
│  Load rule_base.json                                      │
│        │                                                  │
│        ▼                                                  │
│  For each enabled rule (sorted by priority):              │
│    Evaluate conditions (AND-joined)                       │
│    If all pass → execute actions                          │
│    Per-scope mutex: only one rule fires per scope         │
│        │                                                  │
│        ▼                                                  │
│  Deterministic actions:                                   │
│    override_decision    → force ACCEPT / REJECT           │
│    set_field            → patch any dotted-path field     │
│    override_validation  → correct a phase result          │
│    recalculate_field    → recompute a sum                 │
│    append_note          → add a reason string             │
│                                                           │
│  LLM actions (rare):                                      │
│    llm_patch_fields        → surgically fix fields via LLM│
│    llm_continue_processing → re-run a phase via LLM       │
└───────────────────────────────────────────────────────────┘
Why a rule engine instead of fine-tuning?
ALF embodies a key design philosophy: do not retrain to fix a mistake, encode a correction rule instead.
When the agent makes a systematic error — for example, rejecting invoices from a vendor who has a non-standard tax ID format that the validation logic doesn't recognise — a Subject Matter Expert can write one ALF rule that fixes all future cases of that type. Immediately. No retraining, no prompt engineering, no downtime.
Rule example (JSON):
{
"id": "R001",
"name": "Accept known-good vendor despite tax-ID format flag",
"scope": "decision",
"priority": 10,
"conditions": [
{ "field": "final_decision", "operator": "equals", "value": "REJECT" },
{ "field": "rejection_phase", "operator": "equals", "value": "phase3" },
{ "field": "stages.extraction.invoice_data.vendor_name",
"operator": "equals", "value": "FastTrack Logistics" }
],
"actions": [
{ "type": "override_decision", "value": "ACCEPT" },
{ "type": "append_note", "target": "rejection_reason",
"value": "Overridden: known vendor, non-standard ABN format approved." }
]
}
Scope mutual exclusion
Rules are scoped: global, phase1–phase4, transformer, decision. Only one scoped rule fires per scope per run (the highest priority matching rule). Global rules always fire alongside scoped ones. This prevents cascading rule conflicts.
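Put together, the evaluation loop is small. A sketch of the engine core, assuming higher priority numbers are evaluated first and leaving the action dispatcher (apply_action) as a hypothetical helper:

def get_path(case: dict, dotted: str):
    """Resolve a dotted path such as 'stages.extraction.invoice_data.vendor_name'."""
    node = case
    for part in dotted.split("."):
        node = node[part]
    return node

def evaluate(cond: dict, case: dict) -> bool:
    """Minimal condition evaluator; only 'equals' is shown here."""
    if cond["operator"] == "equals":
        return get_path(case, cond["field"]) == cond["value"]
    raise ValueError(f"unsupported operator: {cond['operator']}")

def run_alf(rules: list[dict], case: dict) -> list[dict]:
    """Per-scope mutex: at most one scoped rule fires per scope; global rules
    are exempt; conditions are AND-joined."""
    fired: list[dict] = []
    taken: set[str] = set()
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if not rule.get("enabled", True):
            continue
        scope = rule["scope"]
        if scope != "global" and scope in taken:
            continue
        if all(evaluate(c, case) for c in rule["conditions"]):
            for action in rule["actions"]:
                apply_action(action, case)  # hypothetical action dispatcher
            fired.append(rule)
            if scope != "global":
                taken.add(scope)
    return fired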
10. The Learning Mode: Teaching the Agent
The Learning tab in the UI exposes a full feedback cycle for a Subject Matter Expert (SME):
┌────────────────────────────────────────────────────────────────┐
│                         LEARNING LOOP                          │
│                                                                │
│  1. SME browses processed cases                                │
│     GET /api/learning/cases                                    │
│        │                                                       │
│        ▼                                                       │
│  2. SME finds a case where the agent was wrong                 │
│     "This invoice should have been ACCEPTED"                   │
│        │                                                       │
│        ▼                                                       │
│  3. SME describes the fix in plain English                     │
│     POST /api/learning/rules/discover                          │
│        │                                                       │
│        ▼                                                       │
│  4. SafeRule Orchestrator (LLM-powered)                        │
│     a. LLM drafts an ALF rule from the description             │
│     b. Validates the rule against the ALF schema               │
│     c. Runs impact assessment on a sample of historical cases  │
│     d. If collateral effects detected → auto-tighten           │
│        conditions (up to 3 attempts)                           │
│        │                                                       │
│        ▼                                                       │
│  5. SME reviews the proposed rule and approves                 │
│     POST /api/learning/rules (persists to rule_base.json)      │
│        │                                                       │
│        ▼                                                       │
│  6. Next pipeline run picks up the new rule — no restart       │
└────────────────────────────────────────────────────────────────┘
The impact assessment step is crucial for safety. Before showing the SME a proposed rule, the system runs that rule against a random sample of historical cases and checks whether it would have changed any decisions that the SME did not flag as wrong. If it finds collateral damage, it asks the LLM to tighten the rule's conditions and retries up to 3 times.
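In pseudocode terms, the assessment is a dry-run diff over sampled history (load_historical_cases and simulate_rule are hypothetical helpers, and the sample size is arbitrary):

import random

def assess_impact(candidate_rule: dict, flagged_ids: set[str], sample_size: int = 25) -> list[str]:
    """Replay a candidate rule over sampled historical cases; report any
    decision flips on cases the SME did not flag as wrong."""
    cases = load_historical_cases("agent_output/")  # hypothetical loader
    sample = random.sample(cases, min(sample_size, len(cases)))
    collateral = []
    for case in sample:
        before = case["final_decision"]
        after = simulate_rule(candidate_rule, case)  # hypothetical dry run
        if after != before and case["id"] not in flagged_ids:
            collateral.append(case["id"])
    return collateral  # non-empty: tighten the rule's conditions and retry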
11. Streaming the Pipeline in Real Time
The React frontend shows a live timeline of the pipeline as it runs. This is implemented with Server-Sent Events (SSE) — a single persistent HTTP connection from the browser to the server over which the backend pushes events as each stage completes.
Browser                                FastAPI
   │                                      │
   │── GET /api/inference/stream/{id} ──▶│
   │                                      │
   │◀── event: pipeline_start ───────────│
   │                                      │
   │◀── event: stage_start               │   ← Stage 1 begins
   │     name: "classification"          │
   │                                      │
   │◀── event: stage_complete            │   ← Stage 1 done
   │     name: "classification"          │
   │     summary: {...}                  │
   │                                      │
   │     ... (stages 2–9 similarly) ...  │
   │                                      │
   │◀── event: pipeline_complete         │   ← Full result
   │     final: InferenceResponse        │
   │                                      │
   │◀── event: end                       │   ← Close signal
The backend uses an async generator that yields events as data: {...}\n\n formatted strings. The frontend uses the browser's native EventSource API — no WebSocket library needed.
Each stage event carries enough data to update the timeline in place: the stage name, status, timing, and a brief summary. The full structured result only arrives in the final pipeline_complete event.
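A condensed sketch of what such an endpoint can look like in FastAPI (the two-stage loop stands in for the real pipeline):

import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def sse(event: str, data: dict) -> str:
    """Format a single Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

@app.get("/api/inference/stream/{run_id}")
async def stream(run_id: str) -> StreamingResponse:
    async def events():
        yield sse("pipeline_start", {"run_id": run_id})
        for name in ("classification", "extraction"):  # real pipeline: stages 1-9
            yield sse("stage_start", {"name": name})
            await asyncio.sleep(0)  # stand-in for the actual stage work
            yield sse("stage_complete", {"name": name, "summary": {}})
        yield sse("pipeline_complete", {"final": {}})
        yield sse("end", {})
    return StreamingResponse(events(), media_type="text/event-stream")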
12. Provider Agnosticism: One Interface, Any Model
The entire LLM integration is behind a single complete() function in app/llm/client.py. The function signature is the same regardless of whether the model is GPT-4o, Claude, Gemini, or a local Ollama model.
Environment variables:
SMART_MODEL=gpt-4o → used for: extraction, investigation, ALF LLM actions
FAST_MODEL=gpt-4o-mini → used for: classification, transformation
To switch to Claude:
SMART_MODEL=claude-3-5-sonnet-20241022
FAST_MODEL=claude-3-5-haiku-20241022
To switch to Gemini:
SMART_MODEL=gemini/gemini-2.0-flash
FAST_MODEL=gemini/gemini-1.5-flash
LiteLLM translates the model string to the correct SDK and API. The rest of the codebase never imports openai, anthropic, or google.generativeai directly. This also means I can run a local Ollama model for development and switch to a cloud model for production — without touching a single line of pipeline code.
app/llm/client.py (50 lines)
        │
        ▼
LiteLLM (provider router)
        │
        ├──► openai API
        ├──► anthropic API
        ├──► gemini API
        └──► ollama (local)
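The environment-variable plumbing behind the model tiers is plain pydantic-settings. A minimal sketch (the module path and defaults are assumptions; the field names mirror the settings.smart_model reference in the client code):

# app/config.py (assumed location)
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    smart_model: str = "gpt-4o"      # read from SMART_MODEL (matching is case-insensitive)
    fast_model: str = "gpt-4o-mini"  # read from FAST_MODEL

settings = Settings()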
13. Why Not LangChain, LlamaIndex, or RAG?
This is a question I get often, so let me explain the decision in detail.
LangChain
LangChain's value is in tool-calling agents (ReAct loops, function calling), LCEL chains, and abstractions over retrieval. None of those are needed here.
The pipeline is a sequential async function — 9 await calls, one after another. That is cleaner and more debuggable written as plain Python than it would be as an LCEL graph. There is no dynamic routing, no tool selection, no planning loop.
LlamaIndex
LlamaIndex is primarily a RAG framework — it excels at ingesting documents, splitting them into chunks, embedding them, and retrieving relevant ones. There is no such requirement here.
The rules book is a single Markdown file that fits entirely in an LLM's context window (~15k tokens). There is no benefit to chunking it, embedding it, or retrieving parts of it — just pass the whole thing to the LLM.
RAG
RAG adds retrieval when you have too much knowledge to fit in context. In this system:
- The rules book is small → no retrieval needed
- The invoice PDF is processed inline → no retrieval needed
- Historical cases are read from the filesystem for impact assessment → no vector similarity needed (we sample randomly, not by semantic relevance)
If the system later needed to search across thousands of historical cases to find semantically similar past decisions, I would add a vector store at the impact_assessor service level. The rest of the system would be unchanged.
14. Data Flow: End to End
Invoice PDF + WAF PDF
          │
          ▼
pdfplumber extracts text
          │
          ▼
┌──────────────────────────────────────────────────────┐
│                    ACTING PIPELINE                   │
│                                                      │
│  classify → extract → validate×4 → transform →       │
│  decide → audit_log                                  │
│                                                      │
│  Writes: 01_classification.json ... 09_audit_log.json│
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│            INVESTIGATION LAYER (optional)            │
│                                                      │
│  Layer 1: phase assessment (deterministic)           │
│  Layer 2: rule discovery (LLM + SHA cache)           │
│  Layer 3: per-group validation (LLM)                 │
│                                                      │
│  Writes: 10_investigation.json                       │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│                 ALF ENGINE (optional)                │
│                                                      │
│  Load rule_base.json                                 │
│  Evaluate conditions (deterministic)                 │
│  Execute actions (deterministic or LLM)              │
│                                                      │
│  Writes: 11_alf.json, Postprocessing_Data.json       │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
            InferenceResponse
  (JSON to client + SSE events during run)
Every intermediate artifact is a versioned JSON file on disk. This means:
- The Learning mode can replay any historical case
- The investigation layer can audit without re-running the pipeline
- ALF rules can be developed and tested against real historical outputs
15. Key Design Decisions
1. Pure Python functions, not agent frameworks
Every pipeline stage is an async def that takes a Pydantic model and returns a Pydantic model. There are no base classes to inherit from, no @tool decorators, no framework-specific primitives. This makes stages trivially unit-testable — pass in a model, assert the output.
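To make the testability claim concrete, here is the shape of a stage and its test (names and fields are illustrative, not the real module):

import asyncio
from pydantic import BaseModel

class Phase1Input(BaseModel):
    vendor_name: str = ""
    invoice_number: str = ""

class Phase1Result(BaseModel):
    passed: bool
    missing_fields: list[str] = []

async def validate_phase1(data: Phase1Input) -> Phase1Result:
    """A stage is just an async function: Pydantic model in, Pydantic model out."""
    missing = [f for f in ("vendor_name", "invoice_number") if not getattr(data, f)]
    return Phase1Result(passed=not missing, missing_fields=missing)

# The unit test: construct a model, run the stage, assert on the output.
result = asyncio.run(validate_phase1(Phase1Input(invoice_number="INV-001")))
assert not result.passed and result.missing_fields == ["vendor_name"]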
2. Instructor for zero-tolerance LLM output
LLM responses are not parsed manually. Instructor enforces Pydantic schemas with automatic re-prompting. If the LLM returns garbage JSON, the system retries up to 2 times before raising. This eliminates an entire class of production bugs that would otherwise require defensive parsing code everywhere.
3. Two LLM tiers, not one
Using a fast (cheaper, lower-latency) model for classification and transformation, and a smart (more capable) model for extraction and investigation, cuts cost and latency significantly. The 9-stage acting path makes only 3–4 LLM calls, and most of the wall-clock time is spent waiting on provider responses rather than on local compute.
4. Correction rules instead of fine-tuning
ALF rules are committed to a JSON file, version-controlled, reviewable, and immediately effective. Fine-tuning would require data preparation, training time, evaluation, and deployment. For the types of systematic errors an invoice agent makes (wrong taxonomy, edge-case validation logic), a precise correction rule is almost always the right fix.
5. Every stage writes an artifact
Every stage's output is persisted to disk as a JSON file with a consistent naming convention (01_classification.json … 11_alf.json). This is not just for debugging — the Learning mode reads these artifacts to show SMEs exactly what the agent saw and decided at each step. Transparency is a product feature.
6. FastAPI + React, not a monolith
The original system was a CLI-based Google ADK agent. Converting it to a FastAPI backend with a React SPA separates concerns cleanly: the backend owns all AI logic, the frontend owns all presentation. The HTTP boundary also makes it possible to run the acting pipeline independently via curl — useful for batch processing and CI integration.
The full source code including the FastAPI backend, React frontend, and all pipeline stages is part of a private AI engineering project portfolio. The architecture described here represents the complete working system.