Datus — Open-Source Data Engineering Agent



What is Datus?

Datus is an open-source data engineering agent that builds evolvable context for your data system — turning natural language into accurate SQL through domain-aware reasoning, semantic search, and continuous learning.

Data engineering is shifting from "building tables and pipelines" to "delivering scoped, domain-aware agents for analysts and business users." Datus makes that shift concrete.

Datus Architecture

Key Features

Build Evolvable Context, Not Static Pipelines

Traditional data engineering ends at data delivery. Datus goes further — it builds a living knowledge base that captures schema metadata, reference SQL, semantic models, metrics, and domain knowledge into a unified context layer. This context is what makes LLM-generated SQL accurate and trustworthy, and it improves with every interaction through a continuous learning loop. → Contextual Data Engineering

From Exploration to Domain-Specific Agents

Datus provides a complete journey for data engineers: start with a Claude-Code-like CLI to explore your data interactively, use Plan Mode to review before executing, and build up context over time. When a domain matures, package it into a Subagent — a scoped chatbot with curated context, tools, and business rules — and deliver it to analysts via web, API, or MCP. → Subagent docs

Metrics and Semantic Layer

Go beyond raw SQL with pluggable semantic adapters. Define business metrics in YAML via MetricFlow integration, and let Datus generate SQL from metric queries — bridging the gap between business language and database dialect. Use Dashboard Copilot to turn existing BI dashboards into conversational analytics. → Semantic Adapters docs
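
For illustration, a MetricFlow-style metric definition might look like the YAML below. This is a sketch: the table, measure, and metric names are hypothetical, and the exact schema depends on your MetricFlow version and Datus's adapter configuration.

```yaml
semantic_models:
  - name: orders                # hypothetical source model
    model: ref('orders')
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total
        agg: sum
        expr: amount

metrics:
  - name: revenue               # queryable as a metric, compiled to SQL
    type: simple
    type_params:
      measure: order_total
```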

Measure and Improve

A built-in evaluation framework supports the BIRD and Spider 2.0-Snow datasets. Benchmark your agent's SQL accuracy, compare configurations, and track improvements as context evolves. → Benchmark docs

Open Platform

  • 10+ LLM providers (OpenAI, Claude, Gemini, DeepSeek, Qwen, Kimi, OpenRouter, and more) with per-node model assignment — mix models within a single workflow
  • 11 databases — Built-in SQLite & DuckDB, plus pluggable adapters for PostgreSQL, MySQL, Snowflake, StarRocks, ClickHouse, and more
  • MCP Protocol — Both an MCP server (exposing Datus tools to Claude Desktop, Cursor, etc.) and an MCP client (consuming external tools via .mcp in the CLI). → MCP docs
  • Skills — Extend Datus with agentskills.io-style packaged tools, configurable permissions, and marketplace support. → Skills docs

Getting Started

Install

Requirements: Python >= 3.12

```bash
pip install datus-agent
datus-agent init
```

`datus-agent init` walks you through configuring your LLM provider, database connection, and knowledge base. For detailed guidance, see the Quickstart Guide.

Four Ways to Use Datus

| Interface | Command | Use Case |
| --- | --- | --- |
| CLI (Interactive REPL) | `datus-cli --namespace demo` | Data engineers exploring data, building context, creating subagents |
| Web Chatbot (Streamlit) | `datus-cli --web --namespace demo` | Analysts chatting with subagents via browser (http://localhost:8501) |
| API Server (FastAPI) | `datus-api --namespace demo` | Applications consuming data services via REST (http://localhost:8000) |
| MCP Server | `datus-mcp --namespace demo` | MCP-compatible clients (Claude Desktop, Cursor, etc.) |

Tip: Use `datus-cli --print --namespace demo` for JSON streaming to stdout — useful for piping into other tools.

Architecture

Workflow Engine

Datus uses a configurable node-based workflow engine. Each workflow is a plan of nodes executed in sequence, parallel, or as sub-workflows:

```yaml
workflow:
  plan: planA
  planA:
    - schema_linking     # Find relevant tables
    - parallel:          # Run in parallel
      - generate_sql     # SQL generation
      - reasoning        # Chain-of-thought reasoning
    - selection          # Pick the best result
    - execute_sql        # Run the query
    - output             # Format and return
```
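
To make the plan semantics concrete, here is a toy interpreter for such a plan. It is an illustrative sketch, not Datus's actual engine: the node functions and context keys are invented placeholders, and real nodes call LLMs, search the knowledge base, and run SQL.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy nodes: each takes a context dict and returns an updated version of it.
def schema_linking(ctx): ctx["tables"] = ["bank_failures"]; return ctx
def generate_sql(ctx):   ctx["sql"] = "SELECT 1"; return ctx
def reasoning(ctx):      ctx["thoughts"] = "rank banks by assets"; return ctx
def selection(ctx):      ctx["chosen"] = ctx["sql"]; return ctx
def execute_sql(ctx):    ctx["rows"] = [(1,)]; return ctx
def output(ctx):         return ctx

NODES = {f.__name__: f for f in
         [schema_linking, generate_sql, reasoning, selection, execute_sql, output]}

def run_plan(plan, ctx):
    for step in plan:
        if isinstance(step, dict) and "parallel" in step:
            # Each parallel branch works on its own copy of the context;
            # the branches' updates are merged before the next step.
            with ThreadPoolExecutor() as pool:
                branches = pool.map(lambda name: NODES[name](dict(ctx)),
                                    step["parallel"])
            for result in branches:
                ctx.update(result)
        else:
            ctx = NODES[step](ctx)
    return ctx

# Mirrors the YAML plan above.
plan = ["schema_linking", {"parallel": ["generate_sql", "reasoning"]},
        "selection", "execute_sql", "output"]
result = run_plan(plan, {})
```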

Node Types

| Category | Nodes |
| --- | --- |
| Core | `schema_linking`, `generate_sql`, `execute_sql`, `reasoning`, `reflect`, `output` |
| Agentic | `chat`, `explore`, `gen_semantic_model`, `gen_metrics`, `gen_ext_knowledge`, `gen_sql_summary`, `gen_skill`, `gen_table`, `compare` |
| Control Flow | `parallel`, `selection`, `subworkflow` |
| Utility | `date_parser`, `doc_search`, `fix` |

RAG Knowledge Base

The knowledge base is powered by LanceDB and organizes context into multiple layers:

  • Schema Metadata — Table and column descriptions, relationships
  • Reference SQL — Curated query examples with summaries
  • Reference Templates — Parameterized Jinja2 SQL templates for stable, reusable queries
  • Semantic Models — Business logic and metric definitions
  • Metrics — Executable business metrics via semantic layer integration
  • External Knowledge — Domain rules and concepts beyond raw schema
  • Platform Docs — Ingested from GitHub repos, websites, or local files
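
At query time, pulling the right items out of these layers comes down to embedding similarity search. The sketch below uses plain cosine similarity over hand-made vectors to show the idea; the entries and vectors are invented, and real Datus uses LanceDB with a configured embedding model.

```python
import math

# Toy context store: one item per knowledge-base layer, each with a
# pre-computed embedding vector.
CONTEXT = [
    {"layer": "schema",        "text": "table bank_failures(bank, assets, year)",
     "vec": [0.9, 0.1, 0.0]},
    {"layer": "reference_sql", "text": "SELECT bank FROM bank_failures ORDER BY assets DESC",
     "vec": [0.7, 0.6, 0.1]},
    {"layer": "ext_knowledge", "text": "'assets lost' means assets at failure time",
     "vec": [0.2, 0.9, 0.3]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, k=2):
    # Rank every context item by similarity to the query embedding
    # and return the top k as grounding for SQL generation.
    ranked = sorted(CONTEXT, key=lambda e: cosine(query_vec, e["vec"]),
                    reverse=True)
    return ranked[:k]
```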

Build the knowledge base with:

```bash
datus-agent bootstrap-kb --namespace demo --components metadata,reference_sql,ext_knowledge
```

Configuration

Datus is configured via `agent.yml`. Run `datus-agent init` to generate a starter config, or see `conf/agent.yml.example` for all options.

| Section | Purpose |
| --- | --- |
| `agent.models` | LLM provider definitions (API keys, model IDs, base URLs) |
| `agent.nodes` | Per-node model assignment and tuning parameters |
| `agent.namespace` | Database connections (SQLite, DuckDB, Snowflake, etc.) |
| `agent.storage` | Embedding models, vector DB, and RAG configuration |
| `agent.workflow` | Execution plans with sequential, parallel, and sub-workflow steps |
| `agent.agentic_nodes` | Configuration for agentic nodes (semantic model gen, metrics gen) |
| `agent.document` | Platform documentation sources (GitHub repos, websites, local files) |

API keys are injected via environment variables using ${ENV_VAR} syntax.
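
The `${ENV_VAR}` expansion can be pictured as a substitution pass over config values. This sketch is illustrative and may not match Datus's loader exactly:

```python
import os
import re

def expand_env(value):
    """Replace ${VAR} placeholders with environment values,
    as a config loader might do (missing vars become empty strings)."""
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: os.environ.get(m.group(1), ""),
                  value)

os.environ["OPENAI_API_KEY"] = "sk-test"          # stand-in value for the demo
print(expand_env("api_key: ${OPENAI_API_KEY}"))   # api_key: sk-test
```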

Supported LLM Providers

| Provider | Type | Notes |
| --- | --- | --- |
| OpenAI | `openai` | GPT-4o, GPT-4, etc. |
| Anthropic Claude | `claude` | Direct API |
| Google Gemini | `gemini` | Gemini 2.0+ |
| DeepSeek | `deepseek` | DeepSeek-Chat, DeepSeek-Coder |
| Alibaba Qwen | `qwen` | Qwen series |
| Moonshot Kimi | `kimi` | Kimi models |
| MiniMax | `minimax` | MiniMax models |
| GLM (Zhipu) | `glm` | GLM-4 series |
| OpenAI Codex | `codex` | OAuth-based Codex models (gpt-5.3-codex, o3-codex) |
| OpenRouter | `openrouter` | 300+ models via a single API key |

Embedding models: OpenAI, Sentence-Transformers, FastEmbed, Hugging Face.

Per-node model assignment lets you use different providers for different workflow steps (e.g., a cheaper model for schema linking, a stronger model for SQL generation).
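
As a sketch, per-node assignment in `agent.yml` might look like the fragment below. The keys, aliases, and model IDs here are illustrative placeholders; consult `conf/agent.yml.example` for the authoritative schema.

```yaml
agent:
  models:
    fast:                       # alias for a cheap model
      type: deepseek
      model: deepseek-chat
      api_key: ${DEEPSEEK_API_KEY}
    strong:                     # alias for a stronger model
      type: openai
      model: gpt-4o
      api_key: ${OPENAI_API_KEY}
  nodes:
    schema_linking:
      model: fast               # retrieval-style work tolerates a cheap model
    generate_sql:
      model: strong             # accuracy matters most here
```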

Supported Databases

| Database | Type | Package |
| --- | --- | --- |
| SQLite | `sqlite` | Built-in |
| DuckDB | `duckdb` | Built-in |
| PostgreSQL | `postgresql` | `datus-postgresql` |
| MySQL | `mysql` | `datus-mysql` |
| Snowflake | `snowflake` | `datus-snowflake` |
| StarRocks | `starrocks` | `datus-starrocks` |
| ClickHouse | `clickhouse` | `datus-clickhouse` |
| ClickZetta | `clickzetta` | `datus-clickzetta` |
| Hive | `hive` | `datus-hive` |
| Spark | `spark` | `datus-spark` |
| Trino | `trino` | `datus-trino` |

See Database Adapters documentation for details.

How It Works

Explore — Chat with your database, test queries, and ground prompts with @table or @file references.

```
datus-cli --namespace demo
/Check the top 10 banks by assets lost @table duckdb-demo.main.bank_failures
```

Build Context — Generate semantic models, import SQL history, define metrics. Each piece becomes reusable context for future queries.

```
/gen_semantic_model xxx        # Generate semantic model from tables
/gen_sql_summary               # Index SQL history for retrieval
```

Create a Subagent — Package mature context into a scoped, domain-aware chatbot with curated tools and business rules.

```
.subagent add mychatbot        # Create a new subagent
```

Deliver — Serve the subagent to analysts via web (localhost:8501/?subagent=mychatbot), REST API, or MCP — with feedback collection (upvotes, issue reports) built in.

Measure — Run benchmarks against BIRD or Spider 2.0-Snow to track SQL accuracy as context evolves.

Iterate — Analyst feedback loops back: engineers fix SQL, add rules, refine semantic models, and extend with Skills or MCP tools. The agent gets more accurate over time.

End-to-end tutorial · CLI docs · Knowledge Base docs · Subagent docs

Development

```bash
uv sync                                           # Install dependencies
uv run pytest tests/unit_tests/ -q                # Run CI tests (no external deps)
uv run ruff format . && uv run ruff check --fix . # Lint & format
```

Enable `--save_llm_trace` on CLI commands or set `save_llm_trace: true` per model in `agent.yml` to persist LLM inputs/outputs for debugging. → LLM Trace docs

See CLAUDE.md for full development conventions, architecture patterns, and testing rules.

License

Apache 2.0
