yisusvii Blog

Top GitHub Repositories for AI Agentic Frameworks and Document Extraction in 2026


The AI landscape in 2026 is defined by two converging trends: agentic architectures that let LLMs reason, plan, and use tools autonomously, and document extraction pipelines that turn messy real-world files into structured data those agents can act on. If you are evaluating frameworks for a new project—or just trying to keep up—this post maps the most popular open-source repositories across three categories: core agentic frameworks, enterprise-grade agentic AI platforms, and document extraction tools, ranked by GitHub stars and forks as of February 2026.

Why This List Matters

Stars and forks are imperfect proxies, but they signal community adoption, ecosystem breadth, and long-term viability. A framework with 50k+ stars typically has battle-tested integrations, active maintainers, and a deep pool of community knowledge. For engineering teams making architectural bets, these numbers matter.


Part 1 — AI Agentic Core Frameworks

These repositories provide the building blocks for creating autonomous agents: tool use, multi-agent orchestration, memory, planning, and guardrails.

Tier 1: The Giants (50k+ ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| langchain-ai/langchain | ~127.7k | ~21k | The platform for reliable agents. LangChain has evolved from a simple chain-of-prompts library into a full agent orchestration ecosystem. Its strength is breadth: hundreds of integrations with LLM providers, vector stores, and tools. |
| FoundationAgents/MetaGPT | ~64.6k | ~8.1k | The multi-agent framework that simulates a software company. MetaGPT assigns roles (Product Manager, Architect, Engineer) to different agents that collaborate on complex tasks, pioneering the “AI software company” pattern. |
| microsoft/autogen | ~54.9k | ~8.3k | Microsoft’s programming framework for agentic AI. AutoGen pioneered the conversational agent pattern where multiple agents debate and refine solutions. It now supports a full ecosystem of extensions and studio tools. |

Tier 2: Established Leaders (20k–50k ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| run-llama/llama_index | ~47.3k | ~6.9k | The leading data framework for LLM applications. LlamaIndex excels at connecting custom data sources to LLMs with sophisticated indexing, retrieval, and query engine abstractions. |
| crewAIInc/crewAI | ~44.8k | ~6k | Framework for orchestrating role-playing, autonomous AI agents. CrewAI makes it simple to define agent roles, goals, and backstories, then let them collaborate on tasks—ideal for business workflow automation. |
| agno-agi/agno | ~38.3k | ~5.1k | Build, run, and manage agentic software at scale. Agno (formerly Phidata) focuses on production-ready agent deployment with built-in memory, knowledge bases, and tool integrations. |
| CopilotKit/CopilotKit | ~29.1k | ~3.8k | The frontend for agents and generative UI. CopilotKit bridges the gap between backend agent logic and React/Angular UIs with in-app AI copilots, chat interfaces, and agentic actions. |
| huggingface/smolagents | ~25.7k | ~2.3k | A barebones library for agents that think in code. From Hugging Face, smolagents is refreshingly minimal—agents write and execute Python code as their reasoning medium instead of JSON tool calls. |
| langchain-ai/langgraph | ~25.2k | ~4.4k | Build resilient language agents as graphs. LangGraph models agent workflows as state machines with cycles, branching, and human-in-the-loop checkpoints—the go-to choice for complex, stateful agent pipelines. |
| deepset-ai/haystack | ~24.3k | ~2.6k | Open-source AI orchestration framework for production-ready LLM applications. Haystack’s modular pipeline design supports retrieval, routing, memory, and generation with explicit control at every step. |
| mastra-ai/mastra | ~21.5k | ~1.6k | From the team behind Gatsby—a TypeScript-first framework for building AI agents. Mastra is the top choice for JavaScript/TypeScript teams wanting first-class agent support with workflows, evals, and MCP integration. |
| openai/swarm | ~21k | ~2.2k | Educational framework exploring lightweight multi-agent orchestration. OpenAI’s Swarm demonstrates ergonomic patterns for agent handoffs and routines—minimal but influential in shaping how people think about agent coordination. |
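The handoff pattern that Swarm popularized is simple enough to sketch in plain Python. The `Agent` class and `run_turn` helper below are hypothetical names, not Swarm’s actual API; the point is only the core idea that a tool call may return another agent, transferring control:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    """A minimal agent: a name, instructions, and callable tools."""
    name: str
    instructions: str
    tools: dict[str, Callable] = field(default_factory=dict)

def run_turn(agent: Agent, tool_name: str, *args) -> tuple[Agent, object]:
    """Execute one tool call; if the tool returns an Agent, hand off to it."""
    result = agent.tools[tool_name](*args)
    if isinstance(result, Agent):
        # Control transfers: the returned agent handles subsequent turns.
        return result, f"handoff: {agent.name} -> {result.name}"
    return agent, result

# Two specialists; the triage agent hands billing questions off.
billing = Agent("billing", "Resolve billing issues.")
triage = Agent("triage", "Route the user to the right specialist.",
               tools={"transfer_to_billing": lambda: billing})

active, output = run_turn(triage, "transfer_to_billing")
print(active.name, "|", output)  # billing | handoff: triage -> billing
```

The real SDKs wrap this loop with LLM-driven tool selection, but the handoff mechanics are essentially this.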

Tier 3: Rising Stars (10k–20k ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| openai/openai-agents-python | ~19.2k | ~3.2k | OpenAI’s official lightweight framework for multi-agent workflows. Provides primitives for agents, handoffs, guardrails, and tracing—designed to be the successor to Swarm for production use. |
| eosphoros-ai/DB-GPT | ~18.2k | ~2.6k | AI Native Data App Development framework. DB-GPT combines an Agentic Workflow Expression Language (AWEL) with RAG, private LLM support, and database interaction for data-centric AI applications. |
| google/adk-python | ~18k | ~3k | Google’s Agent Development Kit. A code-first Python toolkit for building, evaluating, and deploying AI agents with deep Gemini integration and support for multi-agent collaboration. |
| TransformerOptimus/SuperAGI | ~17.2k | ~2.2k | A dev-first open-source autonomous AI agent framework. SuperAGI provides infrastructure to build, manage, and run autonomous agents with a marketplace of tools and agent templates. |
| pydantic/pydantic-ai | ~15.1k | ~1.7k | GenAI Agent Framework, the Pydantic way. Built by the creators of Pydantic, it brings type-safe, validated, structured outputs to agent development with dependency injection and Logfire observability. |
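The “type-safe, validated outputs” idea behind Pydantic AI can be illustrated without the library itself. This stdlib-only sketch (the `InvoiceSummary` schema and `parse_validated` helper are made up for illustration; Pydantic AI uses real Pydantic models) shows why validating model output at the boundary matters:

```python
from dataclasses import dataclass
import json

@dataclass
class InvoiceSummary:
    """Target schema for the model's structured output."""
    vendor: str
    total: float

def parse_validated(raw: str) -> InvoiceSummary:
    """Parse model output as JSON and enforce the schema's types."""
    data = json.loads(raw)
    vendor, total = data["vendor"], data["total"]
    if not isinstance(vendor, str) or not isinstance(total, (int, float)):
        raise TypeError("model output does not match InvoiceSummary schema")
    return InvoiceSummary(vendor=vendor, total=float(total))

# A well-formed model response parses cleanly; a malformed one fails
# loudly here, at the boundary, instead of corrupting downstream steps.
summary = parse_validated('{"vendor": "Acme Corp", "total": 1280.5}')
print(summary)  # InvoiceSummary(vendor='Acme Corp', total=1280.5)
```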

How to Choose an Agentic Framework

The choice depends on your team’s language, complexity needs, and production requirements. A rough mapping based on the strengths above:

- Maximum integrations and ecosystem breadth: LangChain (plus LlamaIndex for data-heavy retrieval).
- Complex, stateful, human-in-the-loop workflows: LangGraph.
- Role-based multi-agent collaboration: CrewAI or MetaGPT.
- Minimal, code-first agents: smolagents or the OpenAI Agents SDK.
- TypeScript/JavaScript teams: Mastra, with CopilotKit for the frontend.
- Type-safe, validated structured outputs: Pydantic AI.

Part 2 — Agentic AI Architecture Frameworks for Enterprises

These repositories are built with enterprise production requirements in mind: security, scalability, multi-model flexibility, governance, and integration with existing corporate tooling.

Tier 1: Enterprise Platforms (50k+ ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| n8n-io/n8n | ~100k | ~28k | Workflow automation for technical teams. n8n enables enterprise teams to wire AI agents into business processes—CRM updates, Slack notifications, database writes—with 400+ integrations and a self-hosted deployment model that satisfies enterprise data-residency requirements. |
| langgenius/dify | ~84k | ~12k | The leading open-source LLM application development platform. Dify provides a complete enterprise stack: visual workflow builder, RAG pipeline, agent orchestration, model management, observability, and fine-tuning—all in one deployable package. Used by thousands of enterprises for internal AI apps. |

Tier 2: Enterprise SDKs and Gateways (10k–50k ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| FlowiseAI/Flowise | ~34k | ~18k | Drag-and-drop UI for building LLM flows and agents. Flowise lowers the bar for enterprise teams to prototype and deploy agent workflows without deep Python expertise, with a self-hosted option and API exposure. |
| microsoft/semantic-kernel | ~24.7k | ~3.7k | Microsoft’s enterprise SDK for integrating LLMs into applications. Semantic Kernel provides enterprise patterns—plugin architecture, memory management, planner, process framework—with first-class support for Azure OpenAI, C#, Python, and Java. |
| BerriAI/litellm | ~21k | ~2.5k | Call all LLM APIs using the OpenAI format. LiteLLM acts as an enterprise LLM gateway: unified API across 100+ providers, cost tracking, rate limiting, spend controls, fallback routing, and centralized logging—critical for enterprises managing multiple LLM contracts. |
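The fallback routing that makes an LLM gateway valuable reduces to a small loop. This plain-Python sketch (hypothetical provider functions; LiteLLM’s actual API is different and far richer) shows the pattern:

```python
from typing import Callable

def route_with_fallback(providers: list[tuple[str, Callable[[str], str]]],
                        prompt: str) -> tuple[str, str]:
    """Try providers in priority order; fall back on any failure.

    Returns (provider_name, response). Raises if every provider fails.
    A real gateway would also track spend, rate limits, and logs here.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Hypothetical providers: the primary is down, the fallback answers.
def primary(prompt: str) -> str:
    raise TimeoutError("rate limited")

def fallback(prompt: str) -> str:
    return f"echo: {prompt}"

used, reply = route_with_fallback(
    [("primary", primary), ("fallback", fallback)], "hello")
print(used, reply)  # fallback echo: hello
```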

How to Choose an Enterprise Agentic Framework

Enterprise selection hinges on your existing stack and governance needs. A rough mapping based on the strengths above:

- All-in-one platform with visual builder, RAG, and observability: Dify.
- Microsoft-centric shops (Azure OpenAI, C#, Java): Semantic Kernel.
- Low-code prototyping without deep Python expertise: Flowise.
- Centralized multi-provider gateway with cost controls: LiteLLM.
- Wiring agents into existing business processes: n8n.

Part 3 — AI Document Extraction

These repositories solve the upstream problem: getting structured, LLM-ready data out of PDFs, Office documents, images, and scans.

Tier 1: The Dominant Tools (50k+ ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| microsoft/markitdown | ~87.8k | ~5.1k | Python tool for converting files and Office documents to Markdown. MarkItDown is stunningly popular because it solves a universal need: turn anything (PDF, DOCX, PPTX, XLSX, images, audio, HTML) into clean Markdown for LLM consumption. Integrates natively with AutoGen and LangChain. |
| opendatalab/MinerU | ~55.1k | ~4.6k | Transforms complex documents into LLM-ready Markdown/JSON for agentic workflows. MinerU excels at layout analysis, table detection, and formula recognition—purpose-built for RAG and agent pipelines. |
| docling-project/docling | ~54.4k | ~3.7k | IBM’s document parser for gen AI. Docling handles PDF, DOCX, PPTX, XLSX, images, and HTML with deep understanding of document structure: headers, tables, figures, and reading order. A cornerstone of many enterprise RAG systems. |

Tier 2: Essential Utilities (10k–50k ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| Unstructured-IO/unstructured | ~14.1k | ~1.2k | Open-source ETL for transforming complex documents into clean, structured formats. Unstructured provides a unified API across dozens of file types with built-in OCR, chunking, and embedding support—the “Swiss Army knife” of document pre-processing. |

Tier 3: Emerging & Specialized (< 10k ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| shcherbak-ai/contextgem | ~1.8k | ~146 | Effortless LLM extraction from documents. ContextGem focuses on schema-driven extraction—define what you want, and the LLM extracts it with validation. Great for contract analysis and legal tech. |
| NanoNets/docstrange | ~1.4k | ~124 | Extract and convert data from any document into Markdown, JSON, CSV, or HTML. DocStrange adds intelligent structured data extraction with advanced OCR on top of standard conversion. |
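The “define what you want, extract it with validation” idea behind schema-driven tools can be illustrated with a tiny stdlib sketch. This is not ContextGem’s API—these tools use an LLM, not regexes—but it shows why failing fast on missing fields beats silently passing gaps downstream:

```python
import re

def extract_fields(markdown: str, patterns: dict[str, str]) -> dict[str, str]:
    """Pull named fields out of extracted Markdown, one regex per field.

    A missing field raises immediately, so gaps are caught before the
    data enters an agent pipeline instead of surfacing as bad answers.
    """
    extracted = {}
    for field_name, pattern in patterns.items():
        match = re.search(pattern, markdown)
        if match is None:
            raise ValueError(f"field {field_name!r} not found in document")
        extracted[field_name] = match.group(1)
    return extracted

# Markdown as an extraction layer might emit it for a simple invoice.
doc = "# Invoice\n\nVendor: Acme Corp\nTotal: $1,280.50\n"
fields = extract_fields(doc, {
    "vendor": r"Vendor:\s*(.+)",
    "total": r"Total:\s*\$([\d,.]+)",
})
print(fields)  # {'vendor': 'Acme Corp', 'total': '1,280.50'}
```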

How to Choose a Document Extraction Tool

A rough mapping based on the strengths above:

- Fast, general-purpose conversion to Markdown: MarkItDown.
- Complex layouts, tables, and formulas for RAG pipelines: MinerU or Docling.
- Unified ETL across dozens of file types with OCR and chunking: Unstructured.
- Schema-driven extraction for contracts and structured records: ContextGem or DocStrange.

The Full Stack: Extraction + Agents

The real power emerges when you combine both layers. Here is a reference architecture that stitches them together:

┌────────────────────────────────────────────────────────────┐
│                     Document Sources                       │
│     (PDFs, DOCX, images, scans, emails, web pages)         │
└────────────────────┬───────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────┐
│                    Extraction Layer                        │
│  MarkItDown / Docling / MinerU / Unstructured              │
│  → Markdown/JSON with tables, metadata, structure          │
└────────────────────┬───────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────┐
│                   Indexing & Storage                       │
│  Chunking → Embeddings → Vector Store (pgvector/Pinecone)  │
│  + Metadata in relational DB (Postgres)                    │
└────────────────────┬───────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────┐
│                      Agent Layer                           │
│  LangGraph / CrewAI / OpenAI Agents SDK / Agno             │
│  → Retrieval → Reasoning → Tool Calls → Citations          │
└────────────────────┬───────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────┐
│                     User Interface                         │
│  CopilotKit / Mastra / Custom React App                    │
│  → Chat, dashboards, workflow triggers                     │
└────────────────────────────────────────────────────────────┘
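The chunking step in the Indexing & Storage layer is worth making concrete. A minimal sliding-window sketch (the `chunk_text` helper is illustrative; production pipelines usually split on document structure such as headings and paragraphs first, then fall back to a window like this):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with overlap, so context that
    straddles a boundary appears in both neighboring chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping the overlap
    return chunks

# A 2000-char document with 800-char chunks and 100-char overlap
# yields windows starting at 0, 700, and 1400.
doc = "x" * 2000
chunks = chunk_text(doc, chunk_size=800, overlap=100)
print(len(chunks), [len(c) for c in chunks])  # 3 [800, 800, 600]
```

The overlap size trades storage and embedding cost against the risk of splitting an answer across two chunks.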

A Minimal Example: Docling + LangGraph

Here is a simplified example of how you might wire Docling’s extraction into a LangGraph agent:

from typing import TypedDict

from docling.document_converter import DocumentConverter
from langgraph.graph import StateGraph, END

# Assumes `vector_store` and `llm` are already configured elsewhere
# (e.g. a LangChain vector store and chat model).

# Step 1: Extract document content with Docling
converter = DocumentConverter()
result = converter.convert("invoice.pdf")
markdown_content = result.document.export_to_markdown()

# Step 2: Define agent state
class AgentState(TypedDict):
    documents: list[str]
    question: str
    answer: str

# Step 3: Build the agent graph
def retrieve(state: AgentState) -> AgentState:
    # Embed and search the extracted content
    relevant_chunks = vector_store.similarity_search(state["question"])
    return {**state, "documents": relevant_chunks}

def reason(state: AgentState) -> AgentState:
    # The LLM reasons over the retrieved chunks
    answer = llm.invoke(
        f"Based on these documents: {state['documents']}\n"
        f"Answer: {state['question']}"
    )
    return {**state, "answer": answer}

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve)
graph.add_node("reason", reason)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "reason")
graph.add_edge("reason", END)
agent = graph.compile()

# Run it:
# result = agent.invoke({"question": "What is the invoice total?"})

Trends to Watch

  1. MCP (Model Context Protocol) everywhere. Anthropic’s MCP standard is rapidly becoming the universal interface between agents and external tools/data. Frameworks like Mastra, LangGraph, and Google ADK already support it natively.

  2. Multi-agent is going mainstream. CrewAI, AutoGen, and OpenAI Agents SDK all default to multi-agent patterns. Single-agent RAG chatbots are being replaced by specialized agent teams.

  3. Extraction quality is the new bottleneck. With agent frameworks maturing, the quality of upstream document extraction is now the primary differentiator. MinerU, Docling, and MarkItDown are investing heavily in table, formula, and layout accuracy.

  4. TypeScript is catching up. Mastra, CopilotKit, and LangGraphJS are giving JavaScript/TypeScript teams first-class agent tooling, closing the gap with the Python ecosystem.

  5. Observability is non-negotiable. Every major framework now ships with tracing, logging, and evaluation hooks. Production agent systems require the same observability rigor as traditional microservices.
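The tracing hooks mentioned in the last trend reduce to a simple idea: record the name, latency, and outcome of every tool call. A toy stand-in (the `traced` decorator and `TRACE` list are invented for illustration; real frameworks emit spans to backends like Logfire or OpenTelemetry collectors):

```python
import functools
import time

# In a real system this would be a span exporter, not a list.
TRACE: list[dict] = []

def traced(fn):
    """Record name, latency, and success/failure of every call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "error"
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        finally:
            TRACE.append({
                "tool": fn.__name__,
                "status": status,
                "latency_ms": (time.perf_counter() - start) * 1000,
            })
    return wrapper

@traced
def lookup_invoice(invoice_id: str) -> str:
    return f"invoice {invoice_id}: paid"

lookup_invoice("INV-42")
print(TRACE[0]["tool"], TRACE[0]["status"])  # lookup_invoice ok
```

Wrapping every tool this way gives you per-call latency and error rates for free, which is exactly what framework-native tracing provides at scale.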

Conclusion

The open-source ecosystem for AI agents and document extraction has exploded in both depth and quality. Whether you are building an invoice processing pipeline, a compliance chatbot, or a full autonomous research assistant, the tools listed here represent the best starting points in 2026. Pick your extraction layer, choose your agent framework, wire them together—and ship.


Star counts reflect approximate values as of February 28, 2026. Follow my blog for more insights on AI architecture and agentic systems.

