Building an AI Agent for Invoice Processing: Architecture Deep Dive
Table of Contents
- The Problem
- Is This RAG?
- System Category: Agentic Pipeline
- High-Level Architecture
- The Technology Stack
- Layer 1 — The Acting Pipeline (11 Stages)
- How LLM Calls Are Structured
- Layer 2 — The Investigation Layer
- Layer 3 — The Adaptive Learning Framework (ALF)
- The Learning Mode: Teaching the Agent
- Streaming the Pipeline in Real Time
- Provider Agnosticism: One Interface, Any Model
- Why Not LangChain, LlamaIndex, or RAG?
- Data Flow: End to End
- Key Design Decisions
1. The Problem
Companies receive hundreds of invoices every month. Each one is a PDF that a human finance team member must manually:
- Read the numbers, dates, and line items out of the PDF
- Verify that the maths adds up correctly
- Check that the invoice complies with the pre-authorized work order (the WAF — Work Authorization Form)
- Decide: pay it or send it back with a rejection reason
This is repetitive, slow, and error-prone at scale. I built an AI agent that automates the entire workflow — upload a PDF, get a structured ACCEPT or REJECT verdict in seconds, with a full explanation and audit trail.
What makes it interesting architecturally is not just the automation, but how I designed the system to:
- Be transparent about its own reasoning (the investigation layer audits the agent itself)
- Learn from human corrections without retraining (the adaptive learning framework)
- Work with any LLM provider by changing one environment variable
- Never become a black box — every stage writes a structured JSON artifact to disk
2. Is This RAG?
No. This is one of the most important clarifications I can make about the architecture.
RAG (Retrieval-Augmented Generation) is a pattern where, for each query, you:
- Convert the query into a vector embedding
- Search a vector database for the closest matching chunks of stored knowledge
- Inject those chunks as context into the LLM prompt
- Let the LLM answer using that retrieved context
RAG pattern:
Query → Embed → Vector DB search → Top-K chunks → LLM prompt → Answer
This system does none of that. There is no vector database, no embedding model, no similarity search.
Instead, this is a document processing pipeline combined with a rule-based correction engine. The "knowledge" in this system is:
| Knowledge type | Where it lives | How it's used |
|---|---|---|
| Business rules | reconstructed_rules_book.md | Loaded once, fits in one LLM context window |
| Correction rules | rule_base.json | Deterministic JSON rules evaluated at runtime |
| Invoice data | Uploaded PDFs | Parsed fresh per run by pdfplumber |
| Historical cases | Filesystem (agent_output/) | Read by the Learning mode for impact assessment |
The rules book is small enough to fit entirely inside an LLM prompt — so there is no need to retrieve chunks of it. That is a deliberate architectural choice that keeps the system simpler and faster than RAG would be.
If the rules book grew to thousands of pages, I would add RAG at the investigation layer only. The acting pipeline would remain unchanged.
3. System Category: Agentic Pipeline
The correct category for this architecture is an agentic pipeline with structured LLM calls.
┌────────────────────────────────────────────────────┐
│                Architectural Pattern               │
│                                                    │
│  Agentic Pipeline — NOT:                           │
│   ✗ RAG (no vector retrieval)                      │
│   ✗ ReAct agent (no tool-calling loop)             │
│   ✗ Multi-agent debate (one pipeline, one agent)   │
│                                                    │
│  IS:                                               │
│   ✓ Sequential LLM pipeline                        │
│   ✓ Structured output extraction (Instructor)      │
│   ✓ Deterministic validation gates                 │
│   ✓ LLM-audited compliance layer                   │
│   ✓ Rule-based correction engine (ALF)             │
│   ✓ Human-in-the-loop learning mode                │
└────────────────────────────────────────────────────┘
The LLM is not "thinking freely" or "choosing what to do next" — each call has a strict Pydantic schema it must return, enforced by the Instructor library with automatic retries. The LLM is used as a structured data extractor and semantic reasoner, not as an autonomous agent.
4. High-Level Architecture
┌──────────────────────────────────────────────────────────────────┐
│                            User / SME                            │
│     Upload PDF                       Review cases, write rules   │
└───────────┬───────────────────────────────────────┬──────────────┘
            │ multipart/form-data                   │ REST JSON
            ▼                                       ▼
┌──────────────────────────────────────────────────────────────────┐
│                     FastAPI (async, uvicorn)                     │
│   POST /api/inference/run                 /api/learning/*        │
│   GET  /api/inference/stream/{id}         SSE events             │
└──────────────────┬───────────────────────────────┬───────────────┘
                   │                               │
                   ▼                               ▼
       ┌───────────────────────┐     ┌───────────────────────────┐
       │    Acting Pipeline    │     │       Learning Mode       │
       │      (11 stages)      │     │   SafeRule Orchestrator   │
       └───────────┬───────────┘     └───────────────────────────┘
                   │
      ┌────────────┼────────────┐
      ▼            ▼            ▼
┌───────────┐ ┌──────────┐ ┌──────────┐
│  LiteLLM  │ │pdfplumber│ │rule_base │
│+Instructor│ │ PDF parse│ │  .json   │
└───────────┘ └──────────┘ └──────────┘
      │
┌─────┴──────────┐
│  OpenAI        │
│  Anthropic     │  ← swapped via SMART_MODEL / FAST_MODEL env vars
│  Gemini        │
│  Ollama        │
└────────────────┘
The frontend is a React 19 + Vite SPA that communicates with the FastAPI backend over JSON REST and Server-Sent Events for the live pipeline stream.
5. The Technology Stack
I deliberately kept this stack minimal and replaceable. The philosophy: use open-source libraries for what they solve well; do not add frameworks just to have a framework.
| Concern | Library | Why |
|---|---|---|
| Web framework | FastAPI | Async, OpenAPI docs, Pydantic-native |
| LLM provider abstraction | LiteLLM | One interface, 100+ providers — the Python equivalent of Laravel Prism |
| Structured LLM output | Instructor | Wraps LiteLLM with Pydantic validation + auto-retry on bad JSON |
| Settings | pydantic-settings | Typed .env loading with zero boilerplate |
| PDF parsing | pdfplumber | Reliable text extraction from invoices |
| Logging | structlog | JSON-structured logs per stage |
| Frontend | React 19 + Vite + TypeScript | React Compiler, useActionState, native SSE |
What I deliberately excluded:
| Library | Why excluded |
|---|---|
| LangChain | No chains or tool-calling needed; plain async def is cleaner |
| LlamaIndex | No vector retrieval; rules book fits in context |
| Vector DB | No RAG; deterministic rules + LLM reasoning is sufficient |
| Celery / Redis | Out of scope for v1; add only if pipeline becomes background-queued |
6. Layer 1 — The Acting Pipeline (11 Stages)
The core of the system. When a PDF is uploaded, it passes through 11 sequential stages:
PDF files
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│                       ACTING PIPELINE                       │
│                                                             │
│  Stage 1  🗂️  Classification     LLM (fast model)           │
│     ↓     Which PDF is the invoice? Which is the WAF?       │
│                                                             │
│  Stage 2  🔍  Extraction         LLM (smart model)          │
│     ↓     Extract all fields into Pydantic models           │
│                                                             │
│  Stage 3  📌  Phase 1 Validation Deterministic              │
│     ↓     Are mandatory fields present?                     │
│                                                             │
│  Stage 4  📋  Phase 2 Validation Deterministic              │
│     ↓     Are line items present and valid?                 │
│                                                             │
│  Stage 5  📅  Phase 3 Validation Deterministic              │
│     ↓     Are dates valid? Is tax ID format correct?        │
│                                                             │
│  Stage 6  🔢  Phase 4 Validation Deterministic              │
│     ↓     Do totals balance? Do hours match the WAF?        │
│                                                             │
│  Stage 7  🔄  Transformation     LLM (fast model)           │
│     ↓     Map line items to standard 9-code taxonomy        │
│                                                             │
│  Stage 8  ⚖️  Decision           Deterministic              │
│     ↓     ACCEPT or REJECT                                  │
│                                                             │
│  Stage 9  📝  Audit Log          Deterministic              │
│     ↓     Write all artifacts to disk                       │
│                                                             │
│  Stage 10 🔎  Investigation      LLM (smart model) — opt.   │
│     ↓     Did the agent follow the rules book?              │
│                                                             │
│  Stage 11 🤖  ALF Rule Engine    Deterministic + opt. LLM   │
│     ↓     Apply learned correction rules                    │
└─────────────────────────────────────────────────────────────┘
    │
    ▼
InferenceResponse (Pydantic model → JSON to client)
Short-circuit logic
Validation stages 3–6 can short-circuit the pipeline. If Phase 1 validation fails (e.g. the invoice is missing a vendor name), the pipeline jumps directly to Stage 8 (Decision = REJECT), skipping stages 4–7. This mimics how a human reviewer would stop reading after finding a fundamental problem.
Stage 3 FAIL ──────────────────────────────────→ Stage 8 REJECT
Stage 4 FAIL ───────────────────────────────────→ Stage 8 REJECT
Stage 5 FAIL ────────────────────────────────────→ Stage 8 REJECT
Stage 6 FAIL ─────────────────────────────────────→ Stage 8 REJECT
All pass → Stage 7 → Stage 8 ACCEPT
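In code, the short-circuit is just an early return. A minimal, self-contained sketch (the phase checks and field names are illustrative stand-ins, not the real validation logic):

from dataclasses import dataclass

@dataclass
class PhaseResult:
    name: str
    passed: bool

def decide(results: list[PhaseResult]) -> str:
    """Stage 8 sketch: deterministic verdict from the validation results."""
    return "REJECT" if any(not r.passed for r in results) else "ACCEPT"

def run_validations(data: dict) -> str:
    """Stages 3-6 sketch: evaluate phases in order, stop at the first failure."""
    phases = [
        ("phase1", lambda d: bool(d.get("vendor_name"))),
        ("phase2", lambda d: len(d.get("line_items", [])) > 0),
        ("phase3", lambda d: bool(d.get("invoice_date"))),
        ("phase4", lambda d: abs(d.get("total", 0.0) - sum(d.get("line_totals", []))) < 0.01),
    ]
    results: list[PhaseResult] = []
    for name, check in phases:
        result = PhaseResult(name=name, passed=bool(check(data)))
        results.append(result)
        if not result.passed:
            return decide(results)  # jump straight to the decision, skip the rest
    return decide(results)

# An invoice with no vendor name fails phase 1 and never reaches phase 2.
assert run_validations({"line_items": [1], "invoice_date": "2024-01-01"}) == "REJECT"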
LLM stages vs deterministic stages
Only 3 of the 9 core acting stages use LLM calls (stages 10 and 11 belong to the investigation and ALF layers, covered in their own sections):
| Stage | Type | Model used | Why LLM? |
|---|---|---|---|
| Classification | LLM | Fast model | Reading PDF content semantically to identify document type |
| Extraction | LLM | Smart model | Pulling structured fields from unstructured PDF text |
| Transformation | LLM | Fast model | Mapping vendor line-item descriptions to a standard taxonomy |
The other 6 stages are pure Python — they check conditions like "does this field exist?" or "does 38.50 + 5.78 = 44.28 within tolerance?" These do not need an LLM.
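For instance, the totals check in Phase 4 boils down to a few lines of arithmetic (the one-cent tolerance here is an assumption; the real value may differ):

from decimal import Decimal

TOLERANCE = Decimal("0.01")  # assumed: one cent

def totals_balance(total_ex_tax: Decimal, tax: Decimal, total_inc_tax: Decimal) -> bool:
    """Phase 4 style check: the ex-tax total plus tax must equal the inc-tax total."""
    return abs((total_ex_tax + tax) - total_inc_tax) <= TOLERANCE

assert totals_balance(Decimal("38.50"), Decimal("5.78"), Decimal("44.28"))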
7. How LLM Calls Are Structured
Every LLM call in the system goes through a single 50-line wrapper:
# app/llm/client.py
from typing import TypeVar

import instructor
from litellm import acompletion
from pydantic import BaseModel

from app.config import settings  # assumed location of the pydantic-settings object

T = TypeVar("T", bound=BaseModel)

# Instructor wraps LiteLLM's async entry point so response_model is enforced.
_aclient = instructor.from_litellm(acompletion)

async def complete(
    *,
    system: str,
    user: str,
    response_model: type[T] | None = None,
    model: str | None = None,
    temperature: float = 0.0,
) -> T | str:
    """Single entry point for ALL LLM calls in the codebase."""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
    if response_model is not None:
        return await _aclient.chat.completions.create(
            model=model or settings.smart_model,
            messages=messages,
            response_model=response_model,  # Instructor enforces this Pydantic model
            temperature=temperature,
            max_retries=2,
        )
    # Free-text path: plain LiteLLM call, no schema enforcement.
    response = await acompletion(
        model=model or settings.smart_model,
        messages=messages,
        temperature=temperature,
    )
    return response.choices[0].message.content
The key insight is the response_model parameter. Instead of getting free-form text from the LLM, every call returns a validated Pydantic model. Instructor handles the enforcement loop: if the LLM returns malformed JSON or a schema mismatch, it automatically re-prompts up to 2 times.
┌────────────────────────────────────────────┐
│         Instructor enforcement loop        │
│                                            │
│  1. Build prompt with schema description   │
│  2. Call LLM                               │
│  3. Parse LLM response as Pydantic model   │
│  4. If validation error → re-prompt (×2)   │
│  5. Return validated model                 │
└────────────────────────────────────────────┘
For example, Stage 2 (Extraction) calls the LLM and asks it to fill in this exact Pydantic model:
class InvoiceData(BaseModel):
    invoice_number: str = ""
    invoice_date: str = ""  # YYYY-MM-DD
    invoice_total_inc_tax: float = 0.0
    invoice_total_ex_tax: float = 0.0
    vendor_name: str = ""
    tax_id: str = ""
    line_items: list[LineItem] = []  # LineItem is another Pydantic model (fields omitted here)
    currency: str = "AUD"
The LLM cannot return a bare string or a mistyped payload: the response must be a JSON object that validates against this schema, or Instructor retries. Extra fields the model invents are dropped by Pydantic's default extra-field handling, so they never leak into the pipeline.
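A pipeline stage then reduces to a single call. A sketch of the extraction step (the prompt text is illustrative):

async def extract_invoice(pdf_text: str) -> InvoiceData:
    """Stage 2 sketch: one structured call, one typed result."""
    return await complete(
        system="Extract the invoice fields from the document text. "
               "Use empty defaults for anything that is not present.",
        user=pdf_text,  # raw text produced by pdfplumber
        response_model=InvoiceData,  # Instructor guarantees this type back
    )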
8. Layer 2 — The Investigation Layer
After the acting pipeline produces a verdict, the investigation layer audits the agent itself. It is an independent compliance check that asks: "Did the agent follow the rules, and did it make the right call?"
This is a key architectural decision: the system is self-auditing. It catches cases where the LLM might have missed a rule, made a borderline call, or applied inconsistent logic.
┌───────────────────────────────────────────────────────────┐
│                    INVESTIGATION LAYER                    │
│                                                           │
│  Layer 1: Phase Assessment         Deterministic          │
│  ─────────────────────────────────────────────            │
│  For each acting phase (1–4):                             │
│    Grade: COMPLIANT / AMBIGUOUS / VIOLATION / SKIPPED     │
│  No LLM — pure rule checking against the decision log     │
│                                                           │
│  Layer 2: Rule Discovery           LLM (smart) + SHA cache│
│  ─────────────────────────────────────────────            │
│  Parse reconstructed_rules_book.md into 7 rule groups     │
│  SHA-256 of the rules book → cache key                    │
│  Only calls LLM once per rules book version               │
│                                                           │
│  Layer 3: Per-Group Validation     LLM (smart)            │
│  ─────────────────────────────────────────────            │
│  For each rule group, ask LLM:                            │
│    "Did the agent's actions comply with these rules?"     │
│  Only confirm a violation at confidence ≥ 0.95            │
└───────────────────────────────────────────────────────────┘
                             │
                             ▼
                   InvestigationOutput:
                     status: PASSED | MINOR_VIOLATION | MAJOR_VIOLATION
                     compliance_score: 0–100
                     rejection_justified: bool | None
                     summary: str
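The confidence gate deserves a note: a suspected violation below the threshold is treated as ambiguity rather than a finding. A small sketch (the model fields are assumptions based on the output shape above):

from pydantic import BaseModel

class GroupFinding(BaseModel):
    rule_group: str
    violation: bool
    confidence: float  # 0.0-1.0, the LLM's self-reported certainty

def confirmed_violations(findings: list[GroupFinding]) -> list[GroupFinding]:
    """Keep only violations at confidence >= 0.95; everything below stays AMBIGUOUS."""
    return [f for f in findings if f.violation and f.confidence >= 0.95]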
The caching trick for rule discovery
Rule discovery is the most expensive operation — it asks the LLM to parse and structure the entire rules book. To avoid paying this cost on every run, I hash the rules book file with SHA-256 and use that as a cache key:
rules_book.md → SHA-256 → "a3f8c1..."
                   │
        ┌──────────┴──────────┐
        │                     │
    cache hit             cache miss
        │                     │
 load from                   call LLM once,
 rule_discovery_cache.json  save to cache
Every subsequent run with the same rules book is a zero-cost cache hit. Only when the rules book changes does the LLM get called again.
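The cache itself is a few lines of code. A sketch, assuming a flat JSON dict keyed by digest; llm_parse_rules is a hypothetical stand-in for the actual discovery call:

import hashlib
import json
from pathlib import Path

CACHE_PATH = Path("rule_discovery_cache.json")

def rules_book_key(rules_book: Path) -> str:
    """Cache key = SHA-256 digest of the rules book bytes."""
    return hashlib.sha256(rules_book.read_bytes()).hexdigest()

async def discover_rules(rules_book: Path) -> dict:
    key = rules_book_key(rules_book)
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    if key in cache:
        return cache[key]  # cache hit: zero LLM cost
    groups = await llm_parse_rules(rules_book.read_text())  # hypothetical: the one LLM call
    cache[key] = groups
    CACHE_PATH.write_text(json.dumps(cache))
    return groups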
9. Layer 3 — The Adaptive Learning Framework (ALF)
ALF is the most architecturally interesting component. It is a deterministic rule engine that runs after the investigation layer and can override the acting agent's decision.
┌───────────────────────────────────────────────────────────┐
│                         ALF ENGINE                        │
│                                                           │
│  Load rule_base.json                                      │
│        │                                                  │
│        ▼                                                  │
│  For each enabled rule (sorted by priority):              │
│    Evaluate conditions (AND-joined)                       │
│    If all pass → execute actions                          │
│    Per-scope mutex: only one rule fires per scope         │
│        │                                                  │
│        ▼                                                  │
│  Deterministic actions:                                   │
│    override_decision    → force ACCEPT / REJECT           │
│    set_field            → patch any dotted-path field     │
│    override_validation  → correct a phase result          │
│    recalculate_field    → recompute a sum                 │
│    append_note          → add a reason string             │
│                                                           │
│  LLM actions (rare):                                      │
│    llm_patch_fields        → surgically fix fields via LLM│
│    llm_continue_processing → re-run a phase via LLM       │
└───────────────────────────────────────────────────────────┘
Why a rule engine instead of fine-tuning?
ALF embodies a key design philosophy: do not retrain to fix a mistake, encode a correction rule instead.
When the agent makes a systematic error — for example, rejecting invoices from a vendor who has a non-standard tax ID format that the validation logic doesn't recognise — a Subject Matter Expert can write one ALF rule that fixes all future cases of that type. Immediately. No retraining, no prompt engineering, no downtime.
Rule example (JSON):
{
"id": "R001",
"name": "Accept known-good vendor despite tax-ID format flag",
"scope": "decision",
"priority": 10,
"conditions": [
{ "field": "final_decision", "operator": "equals", "value": "REJECT" },
{ "field": "rejection_phase", "operator": "equals", "value": "phase3" },
{ "field": "stages.extraction.invoice_data.vendor_name",
"operator": "equals", "value": "FastTrack Logistics" }
],
"actions": [
{ "type": "override_decision", "value": "ACCEPT" },
{ "type": "append_note", "target": "rejection_reason",
"value": "Overridden: known vendor, non-standard ABN format approved." }
]
}
Scope mutual exclusion
Rules are scoped: global, phase1–phase4, transformer, decision. Only one scoped rule fires per scope per run (the highest priority matching rule). Global rules always fire alongside scoped ones. This prevents cascading rule conflicts.
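Put together, the evaluation loop is small. A sketch of the engine core, assuming higher priority numbers are evaluated first and leaving the action dispatcher (apply_action) as a hypothetical helper:

def get_path(case: dict, dotted: str):
    """Resolve a dotted path such as 'stages.extraction.invoice_data.vendor_name'."""
    node = case
    for part in dotted.split("."):
        node = node[part]
    return node

def evaluate(cond: dict, case: dict) -> bool:
    """Minimal condition evaluator; only 'equals' is shown here."""
    if cond["operator"] == "equals":
        return get_path(case, cond["field"]) == cond["value"]
    raise ValueError(f"unsupported operator: {cond['operator']}")

def run_alf(rules: list[dict], case: dict) -> list[dict]:
    """Per-scope mutex: at most one scoped rule fires per scope; global rules
    are exempt; conditions are AND-joined."""
    fired: list[dict] = []
    taken: set[str] = set()
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if not rule.get("enabled", True):
            continue
        scope = rule["scope"]
        if scope != "global" and scope in taken:
            continue
        if all(evaluate(c, case) for c in rule["conditions"]):
            for action in rule["actions"]:
                apply_action(action, case)  # hypothetical action dispatcher
            fired.append(rule)
            if scope != "global":
                taken.add(scope)
    return fired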
10. The Learning Mode: Teaching the Agent
The Learning tab in the UI exposes a full feedback cycle for a Subject Matter Expert (SME):
┌────────────────────────────────────────────────────────────────┐
│                         LEARNING LOOP                          │
│                                                                │
│  1. SME browses processed cases                                │
│     GET /api/learning/cases                                    │
│        │                                                       │
│        ▼                                                       │
│  2. SME finds a case where the agent was wrong                 │
│     "This invoice should have been ACCEPTED"                   │
│        │                                                       │
│        ▼                                                       │
│  3. SME describes the fix in plain English                     │
│     POST /api/learning/rules/discover                          │
│        │                                                       │
│        ▼                                                       │
│  4. SafeRule Orchestrator (LLM-powered)                        │
│     a. LLM drafts an ALF rule from the description             │
│     b. Validates the rule against the ALF schema               │
│     c. Runs impact assessment on a sample of historical cases  │
│     d. If collateral effects detected → auto-tighten           │
│        conditions (up to 3 attempts)                           │
│        │                                                       │
│        ▼                                                       │
│  5. SME reviews the proposed rule and approves                 │
│     POST /api/learning/rules (persists to rule_base.json)      │
│        │                                                       │
│        ▼                                                       │
│  6. Next pipeline run picks up the new rule — no restart       │
└────────────────────────────────────────────────────────────────┘
The impact assessment step is crucial for safety. Before showing the SME a proposed rule, the system runs that rule against a random sample of historical cases and checks whether it would have changed any decisions that the SME did not flag as wrong. If it finds collateral damage, it asks the LLM to tighten the rule's conditions and retries up to 3 times.
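In pseudocode terms, the assessment is a dry-run diff over sampled history (load_historical_cases and simulate_rule are hypothetical helpers, and the sample size is arbitrary):

import random

def assess_impact(candidate_rule: dict, flagged_ids: set[str], sample_size: int = 25) -> list[str]:
    """Replay a candidate rule over sampled historical cases; report any
    decision flips on cases the SME did not flag as wrong."""
    cases = load_historical_cases("agent_output/")  # hypothetical loader
    sample = random.sample(cases, min(sample_size, len(cases)))
    collateral = []
    for case in sample:
        before = case["final_decision"]
        after = simulate_rule(candidate_rule, case)  # hypothetical dry run
        if after != before and case["id"] not in flagged_ids:
            collateral.append(case["id"])
    return collateral  # non-empty: tighten the rule's conditions and retry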
11. Streaming the Pipeline in Real Time
The React frontend shows a live timeline of the pipeline as it runs. This is implemented with Server-Sent Events (SSE) — a single persistent HTTP connection from the browser to the server over which the backend pushes events as each stage completes.
Browser                                FastAPI
   │                                      │
   │── GET /api/inference/stream/{id} ──▶│
   │                                      │
   │◀── event: pipeline_start ───────────│
   │                                      │
   │◀── event: stage_start               │   ← Stage 1 begins
   │     name: "classification"          │
   │                                      │
   │◀── event: stage_complete            │   ← Stage 1 done
   │     name: "classification"          │
   │     summary: {...}                  │
   │                                      │
   │     ... (stages 2–9 similarly) ...  │
   │                                      │
   │◀── event: pipeline_complete         │   ← Full result
   │     final: InferenceResponse        │
   │                                      │
   │◀── event: end                       │   ← Close signal
The backend uses an async generator that yields events as data: {...}\n\n formatted strings. The frontend uses the browser's native EventSource API — no WebSocket library needed.
Each stage event carries enough data to update the timeline in place: the stage name, status, timing, and a brief summary. The full structured result only arrives in the final pipeline_complete event.
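A condensed sketch of what such an endpoint can look like in FastAPI (the two-stage loop stands in for the real pipeline):

import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def sse(event: str, data: dict) -> str:
    """Format a single Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

@app.get("/api/inference/stream/{run_id}")
async def stream(run_id: str) -> StreamingResponse:
    async def events():
        yield sse("pipeline_start", {"run_id": run_id})
        for name in ("classification", "extraction"):  # real pipeline: stages 1-9
            yield sse("stage_start", {"name": name})
            await asyncio.sleep(0)  # stand-in for the actual stage work
            yield sse("stage_complete", {"name": name, "summary": {}})
        yield sse("pipeline_complete", {"final": {}})
        yield sse("end", {})
    return StreamingResponse(events(), media_type="text/event-stream")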
12. Provider Agnosticism: One Interface, Any Model
The entire LLM integration is behind a single complete() function in app/llm/client.py. The function signature is the same regardless of whether the model is GPT-4o, Claude, Gemini, or a local Ollama model.
Environment variables:
SMART_MODEL=gpt-4o → used for: extraction, investigation, ALF LLM actions
FAST_MODEL=gpt-4o-mini → used for: classification, transformation
To switch to Claude:
SMART_MODEL=claude-3-5-sonnet-20241022
FAST_MODEL=claude-3-5-haiku-20241022
To switch to Gemini:
SMART_MODEL=gemini/gemini-2.0-flash
FAST_MODEL=gemini/gemini-1.5-flash
LiteLLM translates the model string to the correct SDK and API. The rest of the codebase never imports openai, anthropic, or google.generativeai directly. This also means I can run a local Ollama model for development and switch to a cloud model for production — without touching a single line of pipeline code.
app/llm/client.py (50 lines)
        │
        ▼
LiteLLM (provider router)
        │
        ├──► openai API
        ├──► anthropic API
        ├──► gemini API
        └──► ollama (local)
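The environment-variable plumbing behind the model tiers is plain pydantic-settings. A minimal sketch (the module path and defaults are assumptions; the field names mirror the settings.smart_model reference in the client code):

# app/config.py (assumed location)
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    smart_model: str = "gpt-4o"      # read from SMART_MODEL (matching is case-insensitive)
    fast_model: str = "gpt-4o-mini"  # read from FAST_MODEL

settings = Settings()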
13. Why Not LangChain, LlamaIndex, or RAG?
This is a question I get often, so let me explain the decision in detail.
LangChain
LangChain's value is in tool-calling agents (ReAct loops, function calling), LCEL chains, and abstractions over retrieval. None of those are needed here.
The pipeline is a sequential async function — 9 await calls, one after another. That is cleaner and more debuggable written as plain Python than it would be as an LCEL graph. There is no dynamic routing, no tool selection, no planning loop.
LlamaIndex
LlamaIndex is primarily a RAG framework — it excels at ingesting documents, splitting them into chunks, embedding them, and retrieving relevant ones. There is no such requirement here.
The rules book is a single Markdown file that fits entirely in an LLM's context window (~15k tokens). There is no benefit to chunking it, embedding it, or retrieving parts of it — just pass the whole thing to the LLM.
RAG
RAG adds retrieval when you have too much knowledge to fit in context. In this system:
- The rules book is small → no retrieval needed
- The invoice PDF is processed inline → no retrieval needed
- Historical cases are read from the filesystem for impact assessment → no vector similarity needed (we sample randomly, not by semantic relevance)
If the system later needed to search across thousands of historical cases to find semantically similar past decisions, I would add a vector store at the impact_assessor service level. The rest of the system would be unchanged.
14. Data Flow: End to End
Invoice PDF + WAF PDF
          │
          ▼
pdfplumber extracts text
          │
          ▼
┌──────────────────────────────────────────────────────┐
│                    ACTING PIPELINE                   │
│                                                      │
│  classify → extract → validate×4 → transform →       │
│  decide → audit_log                                  │
│                                                      │
│  Writes: 01_classification.json ... 09_audit_log.json│
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│            INVESTIGATION LAYER (optional)            │
│                                                      │
│  Layer 1: phase assessment (deterministic)           │
│  Layer 2: rule discovery (LLM + SHA cache)           │
│  Layer 3: per-group validation (LLM)                 │
│                                                      │
│  Writes: 10_investigation.json                       │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────┐
│                 ALF ENGINE (optional)                │
│                                                      │
│  Load rule_base.json                                 │
│  Evaluate conditions (deterministic)                 │
│  Execute actions (deterministic or LLM)              │
│                                                      │
│  Writes: 11_alf.json, Postprocessing_Data.json       │
└──────────────────┬───────────────────────────────────┘
                   │
                   ▼
            InferenceResponse
  (JSON to client + SSE events during run)
Every intermediate artifact is a versioned JSON file on disk. This means:
- The Learning mode can replay any historical case
- The investigation layer can audit without re-running the pipeline
- ALF rules can be developed and tested against real historical outputs
15. Key Design Decisions
1. Pure Python functions, not agent frameworks
Every pipeline stage is an async def that takes a Pydantic model and returns a Pydantic model. There are no base classes to inherit from, no @tool decorators, no framework-specific primitives. This makes stages trivially unit-testable — pass in a model, assert the output.
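To make the testability claim concrete, here is the shape of a stage and its test (names and fields are illustrative, not the real module):

import asyncio
from pydantic import BaseModel

class Phase1Input(BaseModel):
    vendor_name: str = ""
    invoice_number: str = ""

class Phase1Result(BaseModel):
    passed: bool
    missing_fields: list[str] = []

async def validate_phase1(data: Phase1Input) -> Phase1Result:
    """A stage is just an async function: Pydantic model in, Pydantic model out."""
    missing = [f for f in ("vendor_name", "invoice_number") if not getattr(data, f)]
    return Phase1Result(passed=not missing, missing_fields=missing)

# The unit test: construct a model, run the stage, assert on the output.
result = asyncio.run(validate_phase1(Phase1Input(invoice_number="INV-001")))
assert not result.passed and result.missing_fields == ["vendor_name"]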
2. Instructor for zero-tolerance LLM output
LLM responses are not parsed manually. Instructor enforces Pydantic schemas with automatic re-prompting. If the LLM returns garbage JSON, the system retries up to 2 times before raising. This eliminates an entire class of production bugs that would otherwise require defensive parsing code everywhere.
3. Two LLM tiers, not one
Using a fast (cheaper, lower-latency) model for classification and transformation, and a smart (more capable) model for extraction and investigation, cuts cost and latency significantly. The 9-stage acting path makes only 3–4 LLM calls, and most of the wall-clock time is spent waiting on provider responses rather than on local compute.
4. Correction rules instead of fine-tuning
ALF rules are committed to a JSON file, version-controlled, reviewable, and immediately effective. Fine-tuning would require data preparation, training time, evaluation, and deployment. For the types of systematic errors an invoice agent makes (wrong taxonomy, edge-case validation logic), a precise correction rule is almost always the right fix.
5. Every stage writes an artifact
Every stage's output is persisted to disk as a JSON file with a consistent naming convention (01_classification.json … 11_alf.json). This is not just for debugging — the Learning mode reads these artifacts to show SMEs exactly what the agent saw and decided at each step. Transparency is a product feature.
6. FastAPI + React, not a monolith
The original system was a CLI-based Google ADK agent. Converting it to a FastAPI backend with a React SPA separates concerns cleanly: the backend owns all AI logic, the frontend owns all presentation. The HTTP boundary also makes it possible to run the acting pipeline independently via curl — useful for batch processing and CI integration.
The full source code including the FastAPI backend, React frontend, and all pipeline stages is part of a private AI engineering project portfolio. The architecture described here represents the complete working system.