Enterprise AI has a new center of gravity. In 2025, retrieval-augmented generation (RAG) graduated from proof-of-concept to production workload (LangChain, “LangChain Documentation,” 2025). In 2026, the pattern that matters most is what happens before the chatbot answers: extracting structured knowledge from messy documents and letting an agent act on it. This article maps the full stack—from layout-aware parsing to tool-using chatbot agents—for engineers building or evaluating these systems today.
Key Takeaways
- Document extraction + chatbot agent is a single end-to-end pipeline: ingest → parse → structure → retrieve → reason → act.
- Multimodal models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5) changed the game by handling images, tables, and text in one pass (OpenAI, “GPT-4o System Card,” 2024).
- Open-source tooling matured rapidly: Docling (~52.9k GitHub stars), MarkItDown (~87k stars), and Unstructured (~14k stars) each address different parts of the extraction problem (GitHub, Feb 2026).
- Agent frameworks (LangGraph, CrewAI, OpenAI Agents SDK) now provide first-class support for tool use, guardrails, and observability (LangGraph Docs, 2025; OpenAI Agents SDK Docs, 2025).
- Evaluation remains the hardest unsolved problem: field-level F1 for extraction and groundedness metrics for QA both require dedicated tooling (DeepEval Docs, 2025; RAGAS Docs, 2025).
Definition + Why Now (2025→2026)
What It Means in Practice
A Document Extraction + Chatbot Agent system converts unstructured documents (PDFs, scans, Office files, emails) into structured data, indexes that data for retrieval, and exposes it through a conversational agent that can answer questions, execute workflows, and cite its sources. The end-to-end pipeline looks like:
Documents (PDF, DOCX, images, email)
→ Ingestion (watch folder / API / event bus)
→ Extraction (OCR, layout parsing, table detection, entity extraction)
→ Structuring (JSON schema validation, field normalization)
→ Indexing (vector embeddings + metadata in hybrid search store)
→ Agent (retrieval → reasoning → tool calls → response with citations)
Why It Surged
Several forces converged between 2025 and 2026:
- RAG maturity. By mid-2025, LangChain (~126.6k stars) and LlamaIndex (~47k stars) had stabilized their retrieval APIs and integrations, making RAG a commodity (GitHub star counts, Feb 2026).
- Multimodal models. GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro can natively process images of document pages, reducing dependence on OCR for many use cases (OpenAI, “GPT-4o System Card,” 2024; Anthropic, “Claude 3.5 Sonnet,” 2024).
- Agent frameworks. LangGraph (~24.7k stars), CrewAI (~44k stars), and OpenAI Agents SDK (~18.9k stars) gave teams production-grade patterns for tool use, handoffs, and guardrails (GitHub, Feb 2026).
- Compliance pressure. Regulations in financial services, healthcare, and government increasingly require auditability of AI-generated answers—pushing teams toward cite-your-source architectures (AWS Prescriptive Guidance, “Agentic AI Patterns,” 2025).
- Automation ROI. Document-heavy workflows (invoicing, claims, onboarding) represent some of the highest-value automation targets; enterprises see measurable cost savings when extraction + agent pipelines replace manual review.
Architecture Patterns
Pattern A: OCR/Parse → Chunking → Embeddings → RAG Chatbot
This is the classic RAG pipeline, now well-understood and widely deployed.
Text description of diagram:
Documents (PDF, DOCX)
→ OCR / Parse (Docling, Unstructured)
→ Chunking (recursive, semantic)
→ Embeddings (OpenAI, Cohere)
→ Vector DB (Qdrant, Chroma)

Query → Query Embedding → Retrieved Chunks → LLM (GPT-4o, Claude) → Response (with citations)
Strengths: simple, well-supported, works for text-heavy documents. Weaknesses: loses table structure, struggles with scanned forms, no structured extraction.
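The chunking step in this pattern can be sketched as a minimal recursive splitter. This is an illustrative implementation, not a specific library's API; production splitters (e.g. LangChain's recursive character splitter) add chunk overlap and token-aware length functions:

```python
def recursive_chunk(text: str, max_len: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split text on the coarsest separator that keeps every chunk under max_len."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks: list[str] = []
            current = ""
            for part in parts:
                candidate = (current + sep + part) if current else part
                if len(candidate) <= max_len:
                    current = candidate
                else:
                    # Flush the accumulated chunk (recursing if it is still too long)
                    chunks.extend(recursive_chunk(current, max_len, separators))
                    current = part
            chunks.extend(recursive_chunk(current, max_len, separators))
            return chunks
    # No separator present: hard-split as a last resort
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Paragraph boundaries (`\n\n`) are tried first so semantically related sentences stay together; only oversized paragraphs fall through to finer separators.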
Pattern B: Multimodal Document Understanding → Structured Extraction → Tool-Using Agent
This pattern leverages multimodal LLMs to understand document layout natively, then produces structured JSON that an agent can act on.
Text description of diagram:
Documents (scans, photos, native PDF)
→ Multimodal LLM (GPT-4o vision, Claude 3.5 vision + layout prompt)
→ Structured JSON (schema-validated Pydantic model)
→ Agent Layer (function calling, tool use, guardrails)

Structured JSON → Metadata DB (system of record)
Agent Layer → External Tools/APIs (ERP, CRM)
Strengths: handles tables, forms, and mixed layouts; outputs structured data directly; supports downstream automation. Weaknesses: higher cost per document; requires schema design; accuracy depends heavily on prompt engineering.
Pattern C: Event-Driven Ingestion + Continuous Re-indexing + Eval Harness
For production systems processing documents continuously, Pattern C adds event-driven ingestion, automatic re-indexing on schema changes, and a built-in evaluation harness using golden documents.
This pattern follows the architecture described in AWS Prescriptive Guidance for agentic systems: EventBridge triggers Step Functions workflows, extraction outputs are validated against held-out test sets, and confidence scores route low-quality extractions to human review. The eval harness runs nightly regression on canary documents to detect model drift.
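The confidence-based routing in this pattern reduces to a small decision function. The threshold and route names below are illustrative:

```python
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    document_id: str
    fields: dict[str, str]
    confidence: float  # model- or heuristic-derived, 0.0 to 1.0


def route(result: ExtractionResult, threshold: float = 0.85) -> str:
    """Send low-confidence extractions to human review; pass the rest through."""
    if result.confidence >= threshold:
        return "auto_approve"   # write directly to the system of record
    return "human_review"       # enqueue for a reviewer
```

In practice the threshold is tuned per document type against the eval harness's golden set rather than fixed globally.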
Core Components
Ingestion
| Source | Format | Challenge |
|---|---|---|
| Scanned documents | TIFF, JPEG, PDF (image-only) | OCR quality, skew correction |
| Native PDFs | PDF 1.7+ | Layout detection, embedded fonts |
| Office documents | DOCX, XLSX, PPTX | Preserving table/list structure |
| Emails | EML, MSG | Attachments, threading, HTML body |
Tools: Docling (~52.9k stars) handles PDF, DOCX, PPTX, and XLSX with layout-aware parsing (Auer et al., “Docling Technical Report,” arXiv:2408.09869, 2024). MarkItDown (~87k stars) converts Office docs and PDFs to Markdown optimized for LLM consumption (Microsoft, “MarkItDown README,” 2024). Unstructured (~14k stars) provides a partitioning API that detects titles, tables, and narrative text across dozens of formats (Unstructured Docs, 2025).
Extraction
Extraction goes beyond OCR to produce structured output:
- Layout parsing: Detecting headings, paragraphs, tables, figures, and their reading order.
- Table understanding: Recognizing row/column spans, headers, and merged cells.
- Key-value extraction: Pulling fields like “Invoice Number,” “Date,” “Total Amount” from forms.
- Entity recognition: Identifying people, organizations, dates, and monetary amounts.
- Citation to source: Mapping each extracted field back to its page number and bounding box.
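One way to make citation-to-source concrete is to carry provenance on every extracted field from the moment it is produced. A minimal sketch (class and field names are illustrative):

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str       # e.g. "invoice_number"
    value: str
    page: int       # 1-based page number in the source document
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates

    def citation(self) -> str:
        """Human-readable provenance string for audit trails and UI links."""
        return f"{self.name}={self.value!r} (p.{self.page}, bbox={self.bbox})"
```

Keeping page and bounding box on each field is what later lets the chatbot render a citation that jumps to the exact region of the source PDF.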
Knowledge Store: Vector DB + Metadata Store; Hybrid Search
Modern retrieval combines vector similarity with keyword search (BM25) for hybrid queries. Leading vector databases:
- Qdrant (~28.8k stars): Rust-based, supports filtering with payload indexes, built-in hybrid search (Qdrant Docs, 2025).
- Chroma (~26.1k stars): Python/JS-friendly, simple API, serverless cloud option (Chroma Docs, 2025).
- Weaviate (~15.6k stars): Go-based, supports native vectorization with integrated model providers (Weaviate Docs, 2025).
Hybrid search pattern: query both a BM25 index and a vector index, then fuse results using reciprocal rank fusion (RRF) before passing to the LLM (Weaviate, “Hybrid Search,” 2025).
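RRF itself is a few lines of code: each document's fused score is the sum of 1/(k + rank) over every ranking it appears in, with k = 60 as the conventional constant:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example: fuse a BM25 ranking with a vector-similarity ranking
bm25 = ["doc3", "doc1", "doc7"]
vector = ["doc1", "doc5", "doc3"]
fused = rrf_fuse([bm25, vector])
```

Because RRF only uses ranks, it needs no score normalization between the BM25 and vector sides, which is why the hybrid-search implementations cited above favor it.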
Agent Layer
The agent layer adds reasoning, tool use, and workflow orchestration on top of retrieval:
- Tool use: Agents call functions to query databases, update records, or trigger workflows.
- Retrieval as a tool: The RAG pipeline itself becomes a tool the agent invokes when it needs document context.
- Function calling: OpenAI’s function calling API and Anthropic’s tool use API allow structured tool invocations with validated inputs/outputs (OpenAI, “Function Calling,” 2025; Anthropic, “Tool Use,” 2025).
- Guardrails: Input/output validation, PII detection, and prompt injection defenses (OpenAI Agents SDK, “Guardrails,” 2025).
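A minimal input guardrail is just a pattern check that runs before the agent sees the message; frameworks layer richer checks (PII detection, LLM-based classifiers) onto the same shape. The patterns below are illustrative, not exhaustive:

```python
import re

# Known prompt-injection phrasings (illustrative; real blocklists are larger)
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"you are now",
    r"system prompt",
]


def input_guardrail(user_message: str) -> tuple[bool, str]:
    """Return (allowed, reason); block messages matching known injection phrasings."""
    lowered = user_message.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            return False, f"blocked: matched {pattern!r}"
    return True, "ok"
```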
Observability
- Evaluation frameworks: DeepEval (~13.6k stars) and RAGAS (~12.6k stars) provide metrics for faithfulness, answer relevancy, and context precision (DeepEval GitHub, Feb 2026; RAGAS GitHub, Feb 2026).
- Tracing: LangSmith, OpenAI’s built-in tracing for Agents SDK, and open-source alternatives like Phoenix by Arize track every LLM call, retrieval, and tool invocation.
- Hallucination detection: Faithfulness metrics compare the agent’s answer against retrieved source chunks.
- Human-in-the-loop: Confidence thresholds route low-certainty extractions/answers to human reviewers.
Key Techniques and What Changed in 2025–2026
Multimodal Document QA
Instead of OCR → text → LLM, multimodal models accept page images directly. This eliminates OCR error propagation and preserves visual layout cues like bolding, color coding, and spatial relationships. GPT-4o and Claude 3.5 Sonnet both demonstrated strong performance on document QA benchmarks involving tables and forms (OpenAI, “GPT-4o System Card,” 2024).
Structured Outputs and Function Calling
OpenAI’s structured outputs feature (released 2024, widely adopted in 2025) forces the model to return JSON matching a provided schema, eliminating parsing failures (OpenAI, “Structured Outputs,” 2024). Combined with function calling, agents can now reliably invoke tools with validated arguments.
“Cite-Your-Source” UX Patterns
Production systems now routinely require the LLM to cite the specific document, page, and section for every claim. Implementation involves: (1) including source metadata in each retrieved chunk, (2) instructing the model to reference specific chunk IDs, and (3) rendering inline citations in the UI with links to the source document.
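Steps (1) and (2) amount to rendering each retrieved chunk with an ID and its source metadata before it reaches the model. A minimal sketch, assuming chunks arrive as dicts with `source_file`, `page`, and `text` keys (the key names are illustrative):

```python
def build_context(chunks: list[dict]) -> str:
    """Render retrieved chunks with numbered IDs and source metadata for the prompt."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[{i}] (source: {chunk['source_file']}, p.{chunk['page']})"
        blocks.append(f"{header}\n{chunk['text']}")
    return "\n\n".join(blocks)


SYSTEM_PROMPT = (
    "Answer using only the numbered context blocks below. "
    "After each claim, cite the supporting block as [N]."
)

chunks = [
    {"source_file": "msa_acme.pdf", "page": 4, "text": "Payment terms are net 30."},
]
context = build_context(chunks)
```

The UI then maps each `[N]` back to the chunk's `source_file` and `page` to render a clickable citation.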
Security and Compliance
- PII redaction: Applied at ingestion (before indexing) and at response time (before display).
- Access control: Document-level and field-level permissions enforced at the retrieval layer.
- Audit trails: Every query, retrieval, and LLM response logged for compliance review.
- Prompt injection via documents: A 2025-era threat where adversarial text embedded in ingested documents attempts to hijack the agent’s behavior. Mitigations include input sanitization and guardrails (OWASP, “LLM Top 10,” 2025).
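Ingestion-time PII redaction can start with plain pattern matching; real deployments typically add NER-based detection on top. The patterns below are illustrative and deliberately incomplete:

```python
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace matched PII with typed placeholders before indexing."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Typed placeholders (rather than blanks) preserve enough context for retrieval and let response-time filters spot any PII that slipped through.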
Code Snippets
Snippet 1: Extraction → Schema Validation
```python
from pydantic import BaseModel, Field
from docling.document_converter import DocumentConverter
from openai import OpenAI


class Invoice(BaseModel):
    invoice_number: str = Field(..., description="Unique invoice ID")
    vendor_name: str
    total_amount: float
    currency: str = "USD"
    line_items: list[dict] = Field(default_factory=list)


# 1. Extract document structure
converter = DocumentConverter()
result = converter.convert("invoice_scan.pdf")
markdown_text = result.document.export_to_markdown()

# 2. Use an LLM with structured output to extract fields
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract invoice fields from the document text. Return valid JSON."},
        {"role": "user", "content": markdown_text},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "Invoice",
            "schema": Invoice.model_json_schema(),
        },
    },
)

# 3. Validate against the schema
invoice = Invoice.model_validate_json(response.choices[0].message.content)
print(f"Invoice {invoice.invoice_number}: {invoice.currency} {invoice.total_amount}")
```
Snippet 2: Retrieval + Citations + Agent Tool Call
```python
from agents import Agent, Runner, function_tool
from openai import OpenAI
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")
openai_client = OpenAI()


def get_embedding(text: str) -> list[float]:
    """Generate an embedding vector for a query string."""
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding


@function_tool
def search_documents(query: str, top_k: int = 5) -> str:
    """Search indexed documents and return relevant passages with citations."""
    query_vector = get_embedding(query)
    # query_points returns a QueryResponse whose .points holds the scored hits
    results = qdrant.query_points(
        collection_name="company_docs",
        query=query_vector,
        limit=top_k,
    )
    passages = []
    for point in results.points:
        source = point.payload.get("source_file", "unknown")
        page = point.payload.get("page_number", "?")
        text = point.payload.get("text", "")
        passages.append(f"[Source: {source}, p.{page}] {text}")
    return "\n\n".join(passages)


@function_tool
def submit_for_review(document_id: str, reason: str) -> str:
    """Flag a document for human review."""
    # In production: enqueue to a review system
    return f"Document {document_id} flagged for review: {reason}"


agent = Agent(
    name="DocAgent",
    instructions="""You answer questions about company documents.
Always cite your sources using [Source: filename, p.N] format.
If confidence is low, use submit_for_review to flag the document.""",
    tools=[search_documents, submit_for_review],
)

result = Runner.run_sync(agent, "What are our payment terms for Vendor Acme Corp?")
print(result.final_output)
```
Use Cases (Ranked by Adoption)
1. Invoice and Purchase Order Automation
Value: Eliminate manual data entry for accounts payable; auto-match POs to invoices. Data challenges: Inconsistent vendor formats, multi-page invoices, handwritten annotations. Failure modes: Table drift (line items misaligned), currency/amount confusion, duplicate detection misses.
2. Insurance Claims Processing
Value: Accelerate claims adjudication by extracting structured data from claim forms, medical records, and police reports. Data challenges: Scanned documents with poor quality, medical terminology, multi-document claims. Failure modes: Misidentified claimant details, incorrect date parsing, missed pre-existing condition references.
3. KYC / Customer Onboarding
Value: Automate identity verification by extracting data from IDs, utility bills, and financial statements. Data challenges: International document formats, low-resolution photos, anti-fraud document checks. Failure modes: Name transliteration errors, expired document detection, false positive fraud flags.
4. Legal Discovery (eDiscovery)
Value: Search and classify large document corpora for relevance, privilege, and key facts. Data challenges: Massive volume, mixed formats, attorney-client privilege detection. Failure modes: Missed privileged documents (high risk), incorrect relevance classifications, inconsistent redaction.
5. HR Policy Q&A / Support Knowledge Copilots
Value: Let employees self-serve answers about benefits, policies, and procedures via a chatbot. Data challenges: Outdated policy documents, version control, multi-language support. Failure modes: Answering from stale policies, hallucinating benefits that don’t exist, inability to handle policy edge cases.
Evaluation + Failure Modes
Extraction Metrics
| Metric | What It Measures | Target |
|---|---|---|
| Field-level F1 | Precision/recall per extracted field | > 0.90 for structured forms |
| Table correctness | Row/column alignment, cell content accuracy | > 0.85 for well-formatted tables |
| Layout fidelity | Reading order, heading hierarchy preservation | Manual spot-check |
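Field-level F1 is straightforward to compute once predictions and gold labels are dicts of field → value. A minimal micro-averaged version, counting a field as correct only on exact match after trivial normalization:

```python
def field_f1(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Micro F1 over extracted fields against a gold-labeled document."""
    def norm(v: str) -> str:
        return v.strip().lower()

    # True positives: fields present in both with matching (normalized) values
    tp = sum(1 for k, v in predicted.items() if k in gold and norm(v) == norm(gold[k]))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Real harnesses relax exact match per field type (date parsing, amount rounding); that normalization is where most of the evaluation effort goes.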
QA Metrics
| Metric | What It Measures | Framework |
|---|---|---|
| Groundedness / Faithfulness | Does the answer follow from retrieved context? | DeepEval, RAGAS |
| Citation correctness | Does the cited source actually contain the claim? | Custom eval |
| Answer helpfulness | Is the answer relevant and complete? | LLM-as-a-judge (G-Eval) |
Common Failures
- Table drift: Misaligned rows when tables span pages or have merged cells.
- Scanned document quality: Low DPI, skew, or handwriting degrades extraction accuracy.
- Ambiguous fields: “Date” could be invoice date, due date, or ship date without layout context.
- Outdated policies: The vector index contains old versions of documents; retrieval returns stale information.
- Prompt injection via documents: Malicious text embedded in PDFs can influence agent behavior if not sanitized (OWASP, “LLM Top 10,” 2025).
Build vs Buy Guidance (2026 Reality Check)
When to Use a Platform
- Your team has < 2 ML engineers and needs results in < 3 months.
- Your documents follow a small set of known templates.
- You need SOC 2 / HIPAA compliance out of the box.
- Vendors like Unstructured (Platform), AWS Textract + Bedrock, Google Document AI, and Azure AI Document Intelligence offer managed services with SLAs.
When to Build (DIY)
- You have unique document formats that no vendor handles well.
- You need fine-grained control over extraction schemas, retrieval strategies, or agent behavior.
- You already have an ML platform team and evaluation infrastructure.
Team Roles and Complexity
| Component | Estimated Effort | Key Skills |
|---|---|---|
| Ingestion pipeline | 2–4 weeks | Data engineering, cloud storage |
| Extraction + parsing | 4–8 weeks | ML engineering, prompt engineering |
| Vector store + hybrid search | 2–3 weeks | Backend engineering, DB ops |
| Agent layer | 3–6 weeks | LLM engineering, API design |
| Evaluation harness | 3–4 weeks | ML evaluation, test engineering |
| Observability + compliance | 2–4 weeks | Platform engineering, security |
Total for a production system: 4–6 months with a team of 3–5 engineers. Maintenance costs include: model API fees, vector DB hosting, re-indexing pipelines, and ongoing eval dataset curation.
Top Repositories and Projects (2025–2026 Focus)
Document Parsing / Extraction
- Docling — Layout-aware document conversion supporting PDF, DOCX, PPTX, XLSX. ~52.9k GitHub stars as of Feb 2026; accepted into the LF AI & Data Foundation (Auer et al., arXiv:2408.09869, 2024; GitHub).
- MarkItDown — Microsoft’s lightweight utility converting Office docs and PDFs to Markdown for LLM pipelines. ~87k GitHub stars; built by the AutoGen team (GitHub).
- Unstructured — Open-source ETL for documents: partitioning, chunking, and cleaning for RAG. ~14k GitHub stars (GitHub).
Agent Frameworks / Orchestration
- LangGraph — Graph-based agent orchestration from LangChain. ~24.7k GitHub stars; supports cycles, persistence, and human-in-the-loop (GitHub).
- CrewAI — Multi-agent framework with role-based collaboration and Flows for event-driven control. ~44k GitHub stars; 100k+ developers certified via CrewAI courses (GitHub; crewai.com).
- OpenAI Agents SDK — Lightweight multi-agent framework with handoffs, guardrails, and built-in tracing. ~18.9k GitHub stars; released March 2025 (GitHub).
- AutoGen — Microsoft’s framework for multi-agent conversations. ~54.5k GitHub stars (GitHub).
RAG Frameworks
- LangChain — The foundational LLM application framework. ~126.6k GitHub stars; extensive retriever and vector store integrations (GitHub).
- LlamaIndex — Data framework for LLM applications with advanced indexing strategies. ~47k GitHub stars (GitHub).
Vector Databases / Hybrid Search
- Qdrant — High-performance vector search engine written in Rust. ~28.8k GitHub stars; supports hybrid search with payload filtering (GitHub).
- Chroma — Open-source AI search database. ~26.1k GitHub stars; simple 4-function API (GitHub).
- Weaviate — Cloud-native vector database with integrated model providers and hybrid search. ~15.6k GitHub stars (GitHub).
Evaluation / Observability
- DeepEval — LLM evaluation framework with RAG-specific metrics (faithfulness, answer relevancy, context precision). ~13.6k GitHub stars (GitHub).
- RAGAS — Evaluation toolkit for RAG pipelines with automated test generation. ~12.6k GitHub stars (GitHub).
Companies Building This Stack
Document Extraction Platforms
- Unstructured — Open-source library + enterprise Platform for document ETL. Provides partitioning, chunking, embedding, and connectors for 20+ data sources (unstructured.io; Unstructured Docs).
- AWS (Textract + Bedrock) — Textract provides OCR, forms, tables, and lending-specific extraction; Bedrock adds LLM-powered post-processing and agent orchestration (AWS Textract Docs, 2025; AWS Bedrock Docs, 2025).
- Google Cloud (Document AI) — Pre-trained processors for invoices, receipts, contracts, and custom extractors with human-in-the-loop review (Google Cloud Document AI Docs, 2025).
- Microsoft (Azure AI Document Intelligence) — Pre-built and custom models for document extraction, supporting 299 languages for print and handwriting (Azure AI Document Intelligence Docs, 2025).
RAG / Agent Platforms
- LangChain / LangSmith — Open-source framework + commercial observability platform for building and monitoring RAG and agent applications (langchain.com; docs.langchain.com).
- LlamaIndex / LlamaCloud — Data framework + managed ingestion/retrieval service for enterprise LLM applications (llamaindex.ai).
- OpenAI — Provides the Agents SDK, function calling, structured outputs, and built-in tracing for agent workflows (platform.openai.com).
Document AI + Workflow Automation
- Docling (IBM / LF AI & Data) — Open-source document conversion accepted into the Linux Foundation AI & Data ecosystem; integrates with LangChain and LlamaIndex (docling-project.github.io; LF AI & Data).
- Reducto — API-first document extraction with layout-aware parsing and structured output; focused on developer experience (reducto.ai).
- Sensible — No-code document extraction platform allowing teams to define extraction schemas via a visual editor and deploy to production APIs (sensible.so).
References
- Auer, C., Dolfi, M., Lysak, M., Nassar, A., Livathinos, N., Staar, P. (2024). “Docling Technical Report.” arXiv:2408.09869. https://arxiv.org/abs/2408.09869
- OpenAI (2024). “GPT-4o System Card.” https://openai.com/index/gpt-4o-system-card/
- Anthropic (2024). “Claude 3.5 Sonnet.” https://www.anthropic.com/news/claude-3-5-sonnet
- OpenAI (2024). “Structured Outputs.” https://platform.openai.com/docs/guides/structured-outputs
- OpenAI (2025). “Function Calling.” https://platform.openai.com/docs/guides/function-calling
- Anthropic (2025). “Tool Use (Claude).” https://docs.anthropic.com/en/docs/build-with-claude/tool-use
- OpenAI (2025). “Agents SDK Documentation.” https://openai.github.io/openai-agents-python/
- LangChain (2025). “LangChain Documentation.” https://docs.langchain.com
- LangChain (2025). “LangGraph Documentation.” https://docs.langchain.com/oss/python/langgraph/overview
- AWS (2025). “Prescriptive Guidance: Agentic AI Patterns.” https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-patterns/introduction.html
- AWS (2025). “Amazon Textract Documentation.” https://docs.aws.amazon.com/textract/
- AWS (2025). “Amazon Bedrock Documentation.” https://docs.aws.amazon.com/bedrock/
- Google Cloud (2025). “Document AI Documentation.” https://cloud.google.com/document-ai/docs
- Microsoft (2025). “Azure AI Document Intelligence Documentation.” https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/
- Qdrant (2025). “Qdrant Documentation.” https://qdrant.tech/documentation/
- Weaviate (2025). “Weaviate Documentation.” https://docs.weaviate.io
- Chroma (2025). “Chroma Documentation.” https://docs.trychroma.com/
- DeepEval (2025). “DeepEval Documentation.” https://deepeval.com
- RAGAS (2025). “RAGAS Documentation.” https://docs.ragas.io
- Unstructured (2025). “Unstructured Documentation.” https://docs.unstructured.io
- OWASP (2025). “Top 10 for Large Language Model Applications.” https://owasp.org/www-project-top-10-for-large-language-model-applications/
- Microsoft (2024). “MarkItDown.” https://github.com/microsoft/markitdown
- CrewAI (2025). “CrewAI Documentation.” https://docs.crewai.com
- LF AI & Data Foundation (2025). “Projects.” https://lfaidata.foundation/projects/