yisusvii Blog

Top GitHub Repositories for AI Agentic Frameworks and Document Extraction in 2026


The AI landscape in 2026 is defined by two converging trends: agentic architectures that let LLMs reason, plan, and use tools autonomously, and document extraction pipelines that turn messy real-world files into structured data those agents can act on. If you are evaluating frameworks for a new project—or just trying to keep up—this post maps the most popular open-source repositories across three categories: core agentic frameworks, enterprise-grade agentic AI platforms, and document extraction tools, ranked by GitHub stars and forks as of February 2026.

Why This List Matters

Stars and forks are imperfect proxies, but they signal community adoption, ecosystem breadth, and long-term viability. A framework with 50k+ stars typically has battle-tested integrations, active maintainers, and a deep pool of community knowledge. For engineering teams making architectural bets, these numbers matter.


Part 1 — AI Agentic Core Frameworks

These repositories provide the building blocks for creating autonomous agents: tool use, multi-agent orchestration, memory, planning, and guardrails.

Tier 1: The Giants (50k+ ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| langchain-ai/langchain | ~127.7k | ~21k | The platform for reliable agents. LangChain has evolved from a simple chain-of-prompts library into a full agent orchestration ecosystem. Its strength is breadth: hundreds of integrations with LLM providers, vector stores, and tools. |
| FoundationAgents/MetaGPT | ~64.6k | ~8.1k | The multi-agent framework that simulates a software company. MetaGPT assigns roles (Product Manager, Architect, Engineer) to different agents that collaborate on complex tasks, pioneering the “AI software company” pattern. |
| microsoft/autogen | ~54.9k | ~8.3k | Microsoft’s programming framework for agentic AI. AutoGen pioneered the conversational agent pattern where multiple agents debate and refine solutions. It now supports a full ecosystem of extensions and studio tools. |

Tier 2: Established Leaders (20k–50k ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| run-llama/llama_index | ~47.3k | ~6.9k | The leading data framework for LLM applications. LlamaIndex excels at connecting custom data sources to LLMs with sophisticated indexing, retrieval, and query engine abstractions. |
| crewAIInc/crewAI | ~44.8k | ~6k | Framework for orchestrating role-playing, autonomous AI agents. CrewAI makes it simple to define agent roles, goals, and backstories, then let them collaborate on tasks—ideal for business workflow automation. |
| agno-agi/agno | ~38.3k | ~5.1k | Build, run, and manage agentic software at scale. Agno (formerly Phidata) focuses on production-ready agent deployment with built-in memory, knowledge bases, and tool integrations. |
| CopilotKit/CopilotKit | ~29.1k | ~3.8k | The frontend for agents and generative UI. CopilotKit bridges the gap between backend agent logic and React/Angular UIs with in-app AI copilots, chat interfaces, and agentic actions. |
| huggingface/smolagents | ~25.7k | ~2.3k | A barebones library for agents that think in code. From Hugging Face, smolagents is refreshingly minimal—agents write and execute Python code as their reasoning medium instead of JSON tool calls. |
| langchain-ai/langgraph | ~25.2k | ~4.4k | Build resilient language agents as graphs. LangGraph models agent workflows as state machines with cycles, branching, and human-in-the-loop checkpoints—the go-to choice for complex, stateful agent pipelines. |
| deepset-ai/haystack | ~24.3k | ~2.6k | Open-source AI orchestration framework for production-ready LLM applications. Haystack’s modular pipeline design supports retrieval, routing, memory, and generation with explicit control at every step. |
| mastra-ai/mastra | ~21.5k | ~1.6k | From the team behind Gatsby—a TypeScript-first framework for building AI agents. Mastra is the top choice for JavaScript/TypeScript teams wanting first-class agent support with workflows, evals, and MCP integration. |
| openai/swarm | ~21k | ~2.2k | Educational framework exploring lightweight multi-agent orchestration. OpenAI’s Swarm demonstrates ergonomic patterns for agent handoffs and routines—minimal but influential in shaping how people think about agent coordination. |
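The handoff pattern that Swarm popularized is simple enough to sketch in plain Python. The `Agent` class and `run_turn` helper below are hypothetical names, not Swarm’s actual API; the point is only the core idea that a tool call may return another agent, transferring control:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    """A minimal agent: a name, instructions, and callable tools."""
    name: str
    instructions: str
    tools: dict[str, Callable] = field(default_factory=dict)

def run_turn(agent: Agent, tool_name: str, *args) -> tuple[Agent, object]:
    """Execute one tool call; if the tool returns an Agent, hand off to it."""
    result = agent.tools[tool_name](*args)
    if isinstance(result, Agent):
        # Control transfers: the returned agent handles subsequent turns.
        return result, f"handoff: {agent.name} -> {result.name}"
    return agent, result

# Two specialists; the triage agent hands billing questions off.
billing = Agent("billing", "Resolve billing issues.")
triage = Agent("triage", "Route the user to the right specialist.",
               tools={"transfer_to_billing": lambda: billing})

active, output = run_turn(triage, "transfer_to_billing")
print(active.name, "|", output)  # billing | handoff: triage -> billing
```

The real SDKs wrap this loop with LLM-driven tool selection, but the handoff mechanics are essentially this.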

Tier 3: Rising Stars (10k–20k ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| openai/openai-agents-python | ~19.2k | ~3.2k | OpenAI’s official lightweight framework for multi-agent workflows. Provides primitives for agents, handoffs, guardrails, and tracing—designed to be the successor to Swarm for production use. |
| eosphoros-ai/DB-GPT | ~18.2k | ~2.6k | AI Native Data App Development framework. DB-GPT combines an Agentic Workflow Expression Language (AWEL) with RAG, private LLM support, and database interaction for data-centric AI applications. |
| google/adk-python | ~18k | ~3k | Google’s Agent Development Kit. A code-first Python toolkit for building, evaluating, and deploying AI agents with deep Gemini integration and support for multi-agent collaboration. |
| TransformerOptimus/SuperAGI | ~17.2k | ~2.2k | A dev-first open-source autonomous AI agent framework. SuperAGI provides infrastructure to build, manage, and run autonomous agents with a marketplace of tools and agent templates. |
| pydantic/pydantic-ai | ~15.1k | ~1.7k | GenAI Agent Framework, the Pydantic way. Built by the creators of Pydantic, it brings type-safe, validated, structured outputs to agent development with dependency injection and Logfire observability. |
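The “type-safe, validated outputs” idea behind Pydantic AI can be illustrated without the library itself. This stdlib-only sketch (the `InvoiceSummary` schema and `parse_validated` helper are made up for illustration; Pydantic AI uses real Pydantic models) shows why validating model output at the boundary matters:

```python
from dataclasses import dataclass
import json

@dataclass
class InvoiceSummary:
    """Target schema for the model's structured output."""
    vendor: str
    total: float

def parse_validated(raw: str) -> InvoiceSummary:
    """Parse model output as JSON and enforce the schema's types."""
    data = json.loads(raw)
    vendor, total = data["vendor"], data["total"]
    if not isinstance(vendor, str) or not isinstance(total, (int, float)):
        raise TypeError("model output does not match InvoiceSummary schema")
    return InvoiceSummary(vendor=vendor, total=float(total))

# A well-formed model response parses cleanly; a malformed one fails
# loudly here, at the boundary, instead of corrupting downstream steps.
summary = parse_validated('{"vendor": "Acme Corp", "total": 1280.5}')
print(summary)  # InvoiceSummary(vendor='Acme Corp', total=1280.5)
```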

How to Choose an Agentic Framework

The choice depends on your team’s language, complexity needs, and production requirements. A rough mapping based on the strengths above:

- Maximum integrations and ecosystem breadth: LangChain (plus LlamaIndex for data-heavy retrieval).
- Complex, stateful, human-in-the-loop workflows: LangGraph.
- Role-based multi-agent collaboration: CrewAI or MetaGPT.
- Minimal, code-first agents: smolagents or the OpenAI Agents SDK.
- TypeScript/JavaScript teams: Mastra, with CopilotKit for the frontend.
- Type-safe, validated structured outputs: Pydantic AI.

Part 2 — Agentic AI Architecture Frameworks for Enterprises

These repositories are built with enterprise production requirements in mind: security, scalability, multi-model flexibility, governance, and integration with existing corporate tooling.

Tier 1: Enterprise Platforms (50k+ ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| n8n-io/n8n | ~100k | ~28k | Workflow automation for technical teams. n8n enables enterprise teams to wire AI agents into business processes—CRM updates, Slack notifications, database writes—with 400+ integrations and a self-hosted deployment model that satisfies enterprise data-residency requirements. |
| langgenius/dify | ~84k | ~12k | The leading open-source LLM application development platform. Dify provides a complete enterprise stack: visual workflow builder, RAG pipeline, agent orchestration, model management, observability, and fine-tuning—all in one deployable package. Used by thousands of enterprises for internal AI apps. |

Tier 2: Enterprise SDKs and Gateways (10k–50k ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| FlowiseAI/Flowise | ~34k | ~18k | Drag-and-drop UI for building LLM flows and agents. Flowise lowers the bar for enterprise teams to prototype and deploy agent workflows without deep Python expertise, with a self-hosted option and API exposure. |
| microsoft/semantic-kernel | ~24.7k | ~3.7k | Microsoft’s enterprise SDK for integrating LLMs into applications. Semantic Kernel provides enterprise patterns—plugin architecture, memory management, planner, process framework—with first-class support for Azure OpenAI, C#, Python, and Java. |
| BerriAI/litellm | ~21k | ~2.5k | Call all LLM APIs using the OpenAI format. LiteLLM acts as an enterprise LLM gateway: unified API across 100+ providers, cost tracking, rate limiting, spend controls, fallback routing, and centralized logging—critical for enterprises managing multiple LLM contracts. |
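The fallback routing that makes an LLM gateway valuable reduces to a small loop. This plain-Python sketch (hypothetical provider functions; LiteLLM’s actual API is different and far richer) shows the pattern:

```python
from typing import Callable

def route_with_fallback(providers: list[tuple[str, Callable[[str], str]]],
                        prompt: str) -> tuple[str, str]:
    """Try providers in priority order; fall back on any failure.

    Returns (provider_name, response). Raises if every provider fails.
    A real gateway would also track spend, rate limits, and logs here.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Hypothetical providers: the primary is down, the fallback answers.
def primary(prompt: str) -> str:
    raise TimeoutError("rate limited")

def fallback(prompt: str) -> str:
    return f"echo: {prompt}"

used, reply = route_with_fallback(
    [("primary", primary), ("fallback", fallback)], "hello")
print(used, reply)  # fallback echo: hello
```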

How to Choose an Enterprise Agentic Framework

Enterprise selection hinges on your existing stack and governance needs. A rough mapping based on the strengths above:

- All-in-one platform with visual builder, RAG, and observability: Dify.
- Microsoft-centric shops (Azure OpenAI, C#, Java): Semantic Kernel.
- Low-code prototyping without deep Python expertise: Flowise.
- Centralized multi-provider gateway with cost controls: LiteLLM.
- Wiring agents into existing business processes: n8n.

Part 3 — AI Document Extraction

These repositories solve the upstream problem: getting structured, LLM-ready data out of PDFs, Office documents, images, and scans.

Tier 1: The Dominant Tools (50k+ ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| microsoft/markitdown | ~87.8k | ~5.1k | Python tool for converting files and Office documents to Markdown. MarkItDown is stunningly popular because it solves a universal need: turn anything (PDF, DOCX, PPTX, XLSX, images, audio, HTML) into clean Markdown for LLM consumption. Integrates natively with AutoGen and LangChain. |
| opendatalab/MinerU | ~55.1k | ~4.6k | Transforms complex documents into LLM-ready Markdown/JSON for agentic workflows. MinerU excels at layout analysis, table detection, and formula recognition—purpose-built for RAG and agent pipelines. |
| docling-project/docling | ~54.4k | ~3.7k | IBM’s document parser for gen AI. Docling handles PDF, DOCX, PPTX, XLSX, images, and HTML with deep understanding of document structure: headers, tables, figures, and reading order. A cornerstone of many enterprise RAG systems. |

Tier 2: Essential Utilities (10k–50k ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| Unstructured-IO/unstructured | ~14.1k | ~1.2k | Open-source ETL for transforming complex documents into clean, structured formats. Unstructured provides a unified API across dozens of file types with built-in OCR, chunking, and embedding support—the “Swiss Army knife” of document pre-processing. |

Tier 3: Emerging & Specialized (< 10k ⭐)

| Repository | Stars | Forks | Description |
| --- | --- | --- | --- |
| shcherbak-ai/contextgem | ~1.8k | ~146 | Effortless LLM extraction from documents. ContextGem focuses on schema-driven extraction—define what you want, and the LLM extracts it with validation. Great for contract analysis and legal tech. |
| NanoNets/docstrange | ~1.4k | ~124 | Extract and convert data from any document into Markdown, JSON, CSV, or HTML. DocStrange adds intelligent structured data extraction with advanced OCR on top of standard conversion. |
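The “define what you want, extract it with validation” idea behind schema-driven tools can be illustrated with a tiny stdlib sketch. This is not ContextGem’s API—these tools use an LLM, not regexes—but it shows why failing fast on missing fields beats silently passing gaps downstream:

```python
import re

def extract_fields(markdown: str, patterns: dict[str, str]) -> dict[str, str]:
    """Pull named fields out of extracted Markdown, one regex per field.

    A missing field raises immediately, so gaps are caught before the
    data enters an agent pipeline instead of surfacing as bad answers.
    """
    extracted = {}
    for field_name, pattern in patterns.items():
        match = re.search(pattern, markdown)
        if match is None:
            raise ValueError(f"field {field_name!r} not found in document")
        extracted[field_name] = match.group(1)
    return extracted

# Markdown as an extraction layer might emit it for a simple invoice.
doc = "# Invoice\n\nVendor: Acme Corp\nTotal: $1,280.50\n"
fields = extract_fields(doc, {
    "vendor": r"Vendor:\s*(.+)",
    "total": r"Total:\s*\$([\d,.]+)",
})
print(fields)  # {'vendor': 'Acme Corp', 'total': '1,280.50'}
```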

How to Choose a Document Extraction Tool

A rough mapping based on the strengths above:

- Fast, general-purpose conversion to Markdown: MarkItDown.
- Complex layouts, tables, and formulas for RAG pipelines: MinerU or Docling.
- Unified ETL across dozens of file types with OCR and chunking: Unstructured.
- Schema-driven extraction for contracts and structured records: ContextGem or DocStrange.

The Full Stack: Extraction + Agents

The real power emerges when you combine both layers. Here is a reference architecture that stitches them together:

┌────────────────────────────────────────────────────────────┐
│                     Document Sources                       │
│     (PDFs, DOCX, images, scans, emails, web pages)         │
└────────────────────┬───────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────┐
│                    Extraction Layer                        │
│  MarkItDown / Docling / MinerU / Unstructured              │
│  → Markdown/JSON with tables, metadata, structure          │
└────────────────────┬───────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────┐
│                   Indexing & Storage                       │
│  Chunking → Embeddings → Vector Store (pgvector/Pinecone)  │
│  + Metadata in relational DB (Postgres)                    │
└────────────────────┬───────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────┐
│                      Agent Layer                           │
│  LangGraph / CrewAI / OpenAI Agents SDK / Agno             │
│  → Retrieval → Reasoning → Tool Calls → Citations          │
└────────────────────┬───────────────────────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────────────────────┐
│                     User Interface                         │
│  CopilotKit / Mastra / Custom React App                    │
│  → Chat, dashboards, workflow triggers                     │
└────────────────────────────────────────────────────────────┘
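The chunking step in the Indexing & Storage layer is worth making concrete. A minimal sliding-window sketch (the `chunk_text` helper is illustrative; production pipelines usually split on document structure such as headings and paragraphs first, then fall back to a window like this):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with overlap, so context that
    straddles a boundary appears in both neighboring chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping the overlap
    return chunks

# A 2000-char document with 800-char chunks and 100-char overlap
# yields windows starting at 0, 700, and 1400.
doc = "x" * 2000
chunks = chunk_text(doc, chunk_size=800, overlap=100)
print(len(chunks), [len(c) for c in chunks])  # 3 [800, 800, 600]
```

The overlap size trades storage and embedding cost against the risk of splitting an answer across two chunks.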

A Minimal Example: Docling + LangGraph

Here is a simplified example of how you might wire Docling’s extraction into a LangGraph agent:

from typing import TypedDict

from docling.document_converter import DocumentConverter
from langgraph.graph import StateGraph, END

# Assumes `vector_store` and `llm` are already configured elsewhere
# (e.g. a LangChain vector store and chat model).

# Step 1: Extract document content with Docling
converter = DocumentConverter()
result = converter.convert("invoice.pdf")
markdown_content = result.document.export_to_markdown()

# Step 2: Define agent state
class AgentState(TypedDict):
    documents: list[str]
    question: str
    answer: str

# Step 3: Build the agent graph
def retrieve(state: AgentState) -> AgentState:
    # Embed and search the extracted content
    relevant_chunks = vector_store.similarity_search(state["question"])
    return {**state, "documents": relevant_chunks}

def reason(state: AgentState) -> AgentState:
    # The LLM reasons over the retrieved chunks
    answer = llm.invoke(
        f"Based on these documents: {state['documents']}\n"
        f"Answer: {state['question']}"
    )
    return {**state, "answer": answer}

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve)
graph.add_node("reason", reason)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "reason")
graph.add_edge("reason", END)
agent = graph.compile()

# Run it:
# result = agent.invoke({"question": "What is the invoice total?"})

Trends to Watch

  1. MCP (Model Context Protocol) everywhere. Anthropic’s MCP standard is rapidly becoming the universal interface between agents and external tools/data. Frameworks like Mastra, LangGraph, and Google ADK already support it natively.

  2. Multi-agent is going mainstream. CrewAI, AutoGen, and OpenAI Agents SDK all default to multi-agent patterns. Single-agent RAG chatbots are being replaced by specialized agent teams.

  3. Extraction quality is the new bottleneck. With agent frameworks maturing, the quality of upstream document extraction is now the primary differentiator. MinerU, Docling, and MarkItDown are investing heavily in table, formula, and layout accuracy.

  4. TypeScript is catching up. Mastra, CopilotKit, and LangGraphJS are giving JavaScript/TypeScript teams first-class agent tooling, closing the gap with the Python ecosystem.

  5. Observability is non-negotiable. Every major framework now ships with tracing, logging, and evaluation hooks. Production agent systems require the same observability rigor as traditional microservices.
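The tracing hooks mentioned in the last trend reduce to a simple idea: record the name, latency, and outcome of every tool call. A toy stand-in (the `traced` decorator and `TRACE` list are invented for illustration; real frameworks emit spans to backends like Logfire or OpenTelemetry collectors):

```python
import functools
import time

# In a real system this would be a span exporter, not a list.
TRACE: list[dict] = []

def traced(fn):
    """Record name, latency, and success/failure of every call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "error"
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        finally:
            TRACE.append({
                "tool": fn.__name__,
                "status": status,
                "latency_ms": (time.perf_counter() - start) * 1000,
            })
    return wrapper

@traced
def lookup_invoice(invoice_id: str) -> str:
    return f"invoice {invoice_id}: paid"

lookup_invoice("INV-42")
print(TRACE[0]["tool"], TRACE[0]["status"])  # lookup_invoice ok
```

Wrapping every tool this way gives you per-call latency and error rates for free, which is exactly what framework-native tracing provides at scale.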

Conclusion

The open-source ecosystem for AI agents and document extraction has exploded in both depth and quality. Whether you are building an invoice processing pipeline, a compliance chatbot, or a full autonomous research assistant, the tools listed here represent the best starting points in 2026. Pick your extraction layer, choose your agent framework, wire them together—and ship.


Star counts reflect approximate values as of February 28, 2026. Follow my blog for more insights on AI architecture and agentic systems.

