Deep research tools answer questions. This builds the tools.
Create domain-specialized deep-research engines as standalone Claude Code plugins. This meta-plugin interviews you about your research domain, then generates a complete Claude Code plugin with a custom multi-agent research pipeline, domain-specific source hierarchies, quality frameworks, and output templates. You describe what you research; it builds the tooling.
I've used deep research features across ChatGPT, Claude, and Gemini extensively. They're impressive -- and they all hit the same ceiling. They don't know which sources matter in your field. They can't tell the difference between a tier-1 regulatory filing and a blog post summarizing it. They apply the same generic quality bar to patent analysis, AML compliance, and biotech pipeline reviews as if those were the same discipline. They're not.
The phrase "all-purpose deep research engine" is, to me, an oxymoron. Real research depth requires domain knowledge -- knowing which databases to check first, what evidence thresholds apply, which citation standards your audience expects, and how to weigh conflicting sources against each other. A tool that researches everything "deeply" is really researching everything at the same shallow-expert level.
So I built this. Not a single domain engine, but a factory that generates them. You describe your research domain -- the sources you trust, the agents you need, the quality standards your work demands, the report structure your audience expects -- and it produces a fully self-contained research plugin tailored to that domain. A patent attorney gets patent-grade research tooling. A compliance officer gets AML-grade screening. An investigative journalist gets evidence-chain documentation. Each engine knows its own field because you taught it yours.
The general-purpose deep research tools opened the door. This walks through it.
| Capability | Perplexity Deep Research | ChatGPT Deep Research | Gemini Deep Research | Claude Research | Engine Creator |
|---|---|---|---|---|---|
| Domain-specific source hierarchy | 3 domain filters max | Site-restricted search | Google Search + Drive/Gmail | Web + Google Workspace | 5-tier credibility hierarchy, unlimited domains per tier, fully customizable |
| Multi-agent architecture | Single pipeline | Single pipeline | Single pipeline | Single pipeline | Configurable multi-agent teams with per-agent specialization, model, and tools |
| Post-report verification (VVC) | Citations only (37% failure rate -- Tow Center) | Citations only (89% of incorrect citations stated with confidence) | Citations only (~22% misattribution rate -- PIES) | Citations only (no post-draft re-verification) | Claim verification: extracts every factual claim, re-fetches the cited source, checks credibility AND accurate representation, auto-corrects errors. Citations can hallucinate. Verified claims can't. |
| Quality framework | Generic | Generic | Generic | Generic | Configurable confidence scoring, evidence thresholds, validation rules, citation standards |
| Report structure | Fixed format | Fixed (with export to MD/PDF/Word) | Fixed (with Canvas/Audio) | Fixed | Fully customizable sections, deliverables, and naming per domain |
| Reproducibility | Ephemeral | Ephemeral | Ephemeral | Ephemeral | Versionable engine-config.json -- same config = same pipeline |
| Transparency | Closed source | Closed source | Closed source | Closed source | Fully open source -- every prompt, rule, and template is readable and editable |
| Domain presets | -- | -- | -- | -- | 21 presets (Legal, OSINT, CRE, AML, AI/Agentic, etc.) + custom from scratch |
| Ownership | SaaS | SaaS | SaaS | SaaS | Self-contained plugins you own, version-control, and share |
| Setup required | None | None | None | None | Claude Code + plugin install + wizard |
| Built-in search index | Proprietary crawler | Bing | Web search | -- | Relies on Claude Code tools |
| Free tier | 3/day | Limited | Limited | Limited | Requires Claude Code subscription |
The tradeoff is intentional. The SaaS tools optimize for zero-setup convenience. This plugin optimizes for domain depth, verification rigor, and professional control. If you need a quick answer to a general question, use Perplexity. If you need research where every factual claim is checked for source credibility and accurate representation -- not just decorated with a URL -- build an engine.
Why verification matters more than citations: A Columbia University study found AI search engines fail to produce accurate citations more than 60% of the time. The citation looks real -- a clickable URL to a real page -- but the claim attached to it may not appear in that source at all. Researchers call this "hallucination laundering." Specific findings: Perplexity failed 37% of citation tasks (best in class). ChatGPT presented incorrect citations with confidence 89% of the time. Gemini misattributed ~22% of explicit claims. Claude hallucinated metadata (author, title) even when the cited URL was real. None of these platforms re-fetch sources after drafting to verify accuracy. VVC does: it extracts every factual claim, re-fetches the cited source, and answers two questions: (1) Is this source credible for this claim? (2) Was the source accurately represented? Claims that fail are auto-corrected or flagged. Citations create confidence. Verification earns it.
Sources: Tow Center / Columbia Journalism Review (2025) · GPTZero: Second-Hand Hallucinations · PIES Hallucination Taxonomy (arXiv) · Deakin University Citation Study · TechCrunch: Claude Legal Citation Incident
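The VVC loop is easier to see as code than prose. A minimal sketch of the idea in Python -- hypothetical helper names and a toy in-memory fetcher, not the plugin's actual implementation:

```python
# Sketch of a post-draft verification pass (illustrative, not the plugin's code).
# For each claim: re-fetch the cited source, then check (1) source credibility
# and (2) whether the claim is actually supported by the fetched text.

TIER_CREDIBLE = {1, 2, 3}  # tiers 1-3 accepted; 4-5 flagged (assumed policy)

def verify_claim(claim: dict, fetch) -> str:
    source_text = fetch(claim["url"])       # re-fetch AFTER drafting
    if claim["tier"] not in TIER_CREDIBLE:
        return "FLAG: low-credibility source"
    if claim["quote"] not in source_text:   # naive support check
        return "CORRECT: claim not found in source"
    return "VERIFIED"

# Toy run with a canned page standing in for a real web fetch
page = "Revenue grew 12% year over year, per the 10-K filing."
result = verify_claim(
    {"url": "https://example.com/10k", "tier": 1, "quote": "Revenue grew 12%"},
    fetch=lambda url: page,
)
print(result)  # VERIFIED
```

A real pass would use fuzzy entailment rather than exact substring matching, but the two-question structure (credible? accurately represented?) is the point.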
A growing number of benchmarks now evaluate deep research agents. The results paint a consistent picture -- even the best systems have significant room to improve:
| Benchmark | Creator | Key Finding |
|---|---|---|
| DRACO | Perplexity | Best system scores 70.5%. Strongest in Law (90.2%), weakest in general knowledge |
| DeepResearch Bench II | Independent (academic) | Even the strongest models satisfy fewer than 50% of expert rubrics across 9,430 criteria |
| DeepScholar-Bench | Stanford-adjacent | No system exceeds 31% geometric mean across synthesis, retrieval, and verifiability |
| ReportBench | ByteDance | Citation hallucination (fabricated references) identified as a distinct and persistent failure type |
| DeepSearchQA | Google DeepMind | 900-prompt benchmark; best system achieves 66.1% on multi-source collation tasks |
| DeepHalluBench | Academic | First trajectory-level hallucination evaluation; finds errors cascade from early research stages |
What these benchmarks don't measure -- and what generated engines address by default:
- Post-draft self-verification. Every benchmark evaluates the final report as-is. No benchmark tests whether the system checks its own work. VVC (enabled by default) adds this layer.
- Source credibility ranking. No benchmark evaluates whether the system distinguishes tier-1 regulatory filings from tier-5 blog posts. The 5-tier credibility hierarchy (configured per engine) addresses this.
- Self-correction capability. No benchmark measures whether a system can fix its own citation errors when detected. VVC Phase 6 auto-corrects or flags failed claims.
An honest caveat: This is an engine factory, not a single engine. A generated engine's quality depends on how it's configured -- source hierarchies, verification scope, agent specialization. The default configuration is designed to address the gaps these benchmarks expose: VVC enabled, 100% verification across all confidence levels (HIGH, MEDIUM, LOW, and SPECULATIVE), 5-tier source hierarchies, and multi-agent parallel research. But a user who disables VVC and configures a minimal pipeline will get minimal results. The tool provides the structural advantage. The configuration determines whether you use it.
Add the marketplace and install:
```
/plugin marketplace add reggiechan74/deep-research-engine-creator
/plugin install deep-research-engine-creator
```

Or clone and load manually (for development/testing):

```
git clone https://github.com/reggiechan74/deep-research-engine-creator.git
claude --plugin-dir ./deep-research-engine-creator
```

```
# 1. Run the wizard (with an optional preset)
/create-engine --preset market

# 2. Answer the wizard questions -- domain, sources, agents, quality, output
#    The wizard walks you through 9 sections with smart defaults.

# 3. Validate the generated engine
/test-engine ./generated-engines/your-engine/

# 4. Install the generated engine as a local plugin
/install-local-plugin ./generated-engines/your-engine/

# 5. Restart Claude Code, then use the engine
/your-engine:research "your topic"
```

| Command | Description |
|---|---|
| `/create-engine [--preset name]` | Main wizard -- interviews you and generates a complete engine plugin |
| `/update-engine <path>` | Re-configure specific sections of an existing engine |
| `/test-engine <path> [topic]` | Validate structure and run a smoke test on a generated engine |
| `/preview-engine <path>` | Preview what an engine config would generate (read-only) |
| `/list-engines [dir]` | Scan a directory for generated engines and display a summary table |
| `/install-local-plugin <path>` | Register a generated engine as an installed Claude Code plugin |
After generating an engine, you have two options for loading it:
Option A: Permanent install (recommended)

```
/install-local-plugin ./generated-engines/your-engine/
# Restart Claude Code
/your-engine:research "your topic"
```

This creates a temporary marketplace, registers it with Claude Code's plugin system, and installs the engine permanently. The plugin persists across sessions, projects, and restarts -- no flags needed.

Option B: Quick test (ephemeral)

```
claude --plugin-dir ./generated-engines/your-engine/
```

Loads the engine for the current session only. Useful for quick testing before committing to a permanent install.
The `/create-engine` wizard automatically copies the install command to your project's `.claude/commands/` directory after generation, so it's available immediately.
Presets pre-fill the wizard with domain-specific defaults for source hierarchies, agent configurations, quality rules, and output structure. You can accept them as-is or customize any section.
| Preset Flag | Domain | Key Features |
|---|---|---|
| `--preset academic` | Academic Research | Peer-reviewed sources, systematic review methodology |
| `--preset ai` | AI & Agentic Engineering | LLM benchmarks, agent frameworks, MCP/tool use, model landscape, RAG |
| `--preset aml` | AML & Regulatory Compliance | FATF, FinCEN, sanctions screening, PEP, beneficial ownership |
| `--preset biotech` | Biotechnology & Life Sciences | Drug pipelines, GenBank, clinical stages, PTRS assessment |
| `--preset cre` | Real Estate & CRE | Property valuations, cap rates, zoning, municipal records |
| `--preset cyber` | Cybersecurity & Threat Intel | CVE/NVD, MITRE ATT&CK, IOC tracking, attack surface mapping |
| `--preset defense` | Aerospace & Defense | DSCA, SAM.gov procurement, TRL, ITAR/EAR compliance |
| `--preset energy` | Energy & Utilities | EIA, FERC, IESO, ISO/RTO markets, LCOE, rate cases |
| `--preset esg` | ESG & Climate Risk | CDP, TCFD/ISSB, carbon accounting, governance scoring |
| `--preset findd` | Financial Due Diligence | SEC filings, beneficial ownership, litigation, M&A analysis |
| `--preset geopolit` | Geopolitical & Political Risk | Country risk, sanctions, conflict monitoring, scenario planning |
| `--preset infra` | Infrastructure & Development | EA registries, municipal planning, corridor analysis, P3 |
| `--preset insurance` | Insurance & Actuarial | NAIC filings, AM Best, catastrophe models, loss ratios |
| `--preset investigate` | Investigative Journalism | Public records, corporate registries, FOI, evidence chains |
| `--preset legal` | Legal Research | Bluebook citations, case law databases, statutory analysis |
| `--preset market` | Market Intelligence | SEC filings, competitive analysis, market sizing |
| `--preset medical` | Healthcare & Medical | PubMed, clinical trials, FDA/EMA, GRADE evidence grading |
| `--preset osint` | OSINT Investigation | Multi-source correlation, digital footprint analysis |
| `--preset policy` | Government & Public Policy | Legislative tracking, lobbying disclosure, policy impact |
| `--preset supply` | Supply Chain & Logistics | WTO/Comtrade, tariff analysis, supplier risk, freight data |
| `--preset techdd` | Technical Due Diligence | Patent databases, standards compliance, IP landscape |
Or choose Custom during the wizard to build a configuration from scratch.
A generated engine is a fully self-contained Claude Code plugin:
```
your-engine-name/
├── .claude-plugin/plugin.json    # Plugin manifest
├── engine-config.json            # Full configuration (editable, re-processable)
├── commands/
│   ├── research.md               # /research <topic> [--quick|--deep|--comprehensive] [--extend] [--no-approve]
│   └── sources.md                # /sources -- view configured source hierarchy
├── agents/
│   ├── agent-1.md                # Domain-specialized research agent
│   ├── agent-2.md                # Domain-specialized analysis agent
│   └── agent-3.md                # Domain-specialized mapping agent
├── skills/
│   └── your-domain/
│       └── SKILL.md              # Complete research engine (~500-700 lines)
└── README.md                     # Auto-generated documentation
```
Every generated engine includes:
- A tiered research command (`/research`) with quick, standard, deep, and comprehensive modes
- A sources command (`/sources`) for inspecting the configured credibility hierarchy
- Domain-specialized agents with their own search strategies and citation prefixes
- A quality framework with confidence scoring, evidence thresholds, and validation rules
- Claim verification (VVC) -- not just citations. Every factual claim is extracted, the cited source is re-fetched, and both source credibility and accurate representation are verified. Citations can still hallucinate. Verified claims can't.
- Context isolation -- engines default to standalone mode, scoping research strictly to the user's topic. Project context (CLAUDE.md, prior research files, observation history) is ignored unless the user explicitly passes `--extend`. An approval gate (default ON for Standard+ tiers) lets users review the outline before research agents execute.
- Structured report output with configurable sections
| Mode | Description | When to Use |
|---|---|---|
| Self-contained | Full 5-phase research pipeline embedded. No dependencies. | Sharing, portability, standalone use |
| Extension | Overlays customizations on the base `/deep-research` skill. Lighter weight. | When you already have `/deep-research` installed |
Self-contained engines include the complete research orchestration logic (planning, research, synthesis, reporting) in their SKILL.md. Extension engines inherit the base pipeline and only override domain-specific configuration (sources, agents, quality rules, output structure).
The wizard walks you through a structured interview to build your engine configuration. Everything is customizable -- presets are starting points, not constraints. You can accept defaults, modify individual fields, or build entirely from scratch.
Presets pre-fill sections 4-9 with domain-expert configurations. At every step, the wizard asks: "Accept, Add, Remove, Modify, or Customize?" -- you always have full control. Choosing a preset saves time; it doesn't limit what you can build.
You can also skip presets entirely and choose Custom to configure every field from a blank slate.
Section 1: Domain Identity -- Name your engine, describe the research domain, identify the target audience, choose self-contained or extension mode, and optionally select a preset.
Section 2: Research Scope -- Define the types of questions your engine handles (e.g., "landscape analysis", "competitive assessment"), geographic scope, temporal focus, and primary deliverable format.
Section 3: Sample Questions -- Provide 3-5 example research queries. The wizard uses these to auto-suggest source types, agent specializations, and report sections in later steps.
Section 4: Source Strategy -- Configure the 5-tier credibility hierarchy that governs how your engine evaluates sources:
- Review and customize each tier's name and source list (add, remove, reorder)
- Define preferred sites to prioritize in searches
- Define excluded sites to always skip
- Configure search templates with domain-specific query patterns (e.g., `"{patent_number}" patent claims site:{preferred_site}`)
- Set language and geographic filters
Section 5: Agent Pipeline -- Design the multi-agent research team. Two modes:
- Basic: Review the recommended 3-agent structure (researcher, analyst, synthesizer). Accept as-is, add more agents, remove agents, or modify any agent's configuration.
- Advanced: Configure each agent individually -- ID, display name, role description, sub-agent type (`general-purpose`, `expert-instructor`, or `intelligence-analyst`), model (`sonnet`, `opus`, or `haiku`), detailed specialization instructions, and tool access.
- Assign agents to research tiers (quick/standard/deep/comprehensive) and configure follow-up rounds.
There is no limit on the number of agents -- add as many specialized roles as your domain requires.
Section 6: Quality Framework -- Define evidence standards:
- Customize confidence level definitions (HIGH/MEDIUM/LOW/SPECULATIVE) for your domain
- Set minimum evidence thresholds (e.g., "all valuations require 3+ comparable transactions")
- Add, remove, or modify validation rules (e.g., "cross-check against official records")
- Choose citation standard (APA 7th, Bluebook, Chicago, or custom)
- Configure source verification mode (spot-check, comprehensive, or none)
- Set dead link handling, freshness thresholds, and verification reporting
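For orientation, a quality section of `engine-config.json` might look roughly like this -- the key names are illustrative, not the authoritative schema:

```json
{
  "quality": {
    "confidence_levels": ["HIGH", "MEDIUM", "LOW", "SPECULATIVE"],
    "evidence_thresholds": { "valuation": "3+ comparable transactions" },
    "validation_rules": ["cross-check against official records"],
    "citation_standard": "APA 7th",
    "verification_mode": "comprehensive",
    "dead_link_handling": "flag",
    "freshness_days": 365
  }
}
```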
Section 7: Output Structure -- Design report format:
- Choose where research reports are saved (default: `./research-reports`, customizable to any project path)
- Define, reorder, add, or remove report sections
- Set file naming templates with variables (`{date}`, `{topic_slug}`)
- Specify special deliverables (competitive matrices, risk heatmaps, timelines, etc.)
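File naming templates are simple string substitution. A sketch with the two variables above (Python's `str.format` stands in for whatever the engine actually does; the date and slug are example values):

```python
# Expanding a file naming template -- illustrative, not the engine's code.
from datetime import date

template = "{date}_{topic_slug}_report.md"
filename = template.format(
    date=date(2025, 6, 1).isoformat(),       # ISO date, e.g. 2025-06-01
    topic_slug="patent-landscape-lidar",     # slugified research topic
)
print(filename)  # 2025-06-01_patent-landscape-lidar_report.md
```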
Section 8: Advanced Configuration -- Optional power-user settings:
- Max research iterations per question (1-5)
- Exploration depth for recursive web traversal (1-10)
- Token budgets per phase (planning, research, synthesis, reporting)
- Custom hooks and MCP server integrations
Section 9: Custom Prompts -- Write the instructions your agents follow:
- Global preamble -- sets the overall research standard and audience expectations
- Per-agent overrides -- specific instructions for each agent (e.g., "always apply Porter's Five Forces")
- Synthesis instructions -- how findings should be combined across agents
- Reporting tone -- voice and style for the final report
After all 9 sections, the wizard shows a complete preview of the engine configuration. Nothing is generated until you explicitly confirm. You can go back and modify any section before proceeding.
A complete reference implementation is included:
See examples/patent-intelligence-engine/ for a working patent research engine.
This engine demonstrates the full capabilities of a generated plugin:
- 3 specialized agents:
  - `patent-search-specialist` -- searches USPTO, EPO, WIPO, and other patent offices
  - `prior-art-analyst` -- applies TSM/KSR frameworks for novelty and obviousness analysis
  - `ip-landscape-mapper` -- builds competitive portfolio matrices and identifies whitespace
- 5-tier source hierarchy from official patent databases (Tier 1) down to unverified claims (Tier 5)
- Patent-specific quality rules including patent number verification, assignee currency checks, and claim-element FTO requirements
- 12-section report structure covering landscape overview, claims analysis, FTO assessment, IP risk matrix, and more
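Conceptually, a tier hierarchy is a domain-to-tier lookup with unknown sources defaulting to the bottom tier. A sketch (the domain mapping is illustrative, not the shipped configuration):

```python
# Conceptual tier lookup for a patent engine's source hierarchy.
# The domain-to-tier mapping below is illustrative only.
TIERS = {
    "uspto.gov": 1, "epo.org": 1, "wipo.int": 1,  # official patent offices
    "patents.google.com": 2,                       # aggregators
    "example-ip-blog.com": 5,                      # unverified commentary
}

def source_tier(domain: str) -> int:
    # Unknown sources are treated as the least credible tier.
    return TIERS.get(domain, 5)

print(source_tier("uspto.gov"), source_tier("random-blog.net"))  # 1 5
```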
The system has two halves: the Engine Creator (a factory that builds engines) and the Generated Engine (a standalone plugin that runs research). The config file is the pivot point between them.
```mermaid
flowchart LR
    subgraph WIZARD["Wizard Interview (9 Sections)"]
        direction TB
        S1["1. Domain Identity"]
        S2["2. Research Scope"]
        S3["3. Sample Questions"]
        S4["4. Source Strategy"]
        S5["5. Agent Pipeline"]
        S6["6. Quality Framework<br/>+ VVC Config"]
        S7["7. Output Structure"]
        S8["8. Advanced Config"]
        S9["9. Custom Prompts"]
        S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> S7 --> S8 --> S9
    end
    PRESET["Domain Preset<br/>(21 available)"]
    CONFIG["engine-config.json<br/>(source of truth)"]
    subgraph GEN["Plugin Generation"]
        direction TB
        G1["plugin.json"]
        G2["SKILL.md"]
        G3["commands/"]
        G4["agents/"]
        G5["README.md"]
    end
    TEMPLATES["Templates<br/>(10 .tmpl files)"]
    ENGINE["Standalone<br/>Claude Code Plugin"]
    PRESET -.->|"pre-fills<br/>sections 4-9"| WIZARD
    WIZARD --> CONFIG
    CONFIG --> GEN
    TEMPLATES -.->|"+ placeholders"| GEN
    GEN --> ENGINE
    classDef wizard fill:#e9d8fd,stroke:#805ad5,color:#553c9a
    classDef config fill:#fef3c7,stroke:#d69e2e,color:#744210
    classDef gen fill:#c6f6d5,stroke:#38a169,color:#276749
    classDef output fill:#dbeafe,stroke:#2b6cb0,color:#1a365d
    classDef preset fill:#feebc8,stroke:#dd6b20,color:#7b341e
    class S1,S2,S3,S4,S5,S6,S7,S8,S9 wizard
    class CONFIG config
    class G1,G2,G3,G4,G5 gen
    class ENGINE output
    class PRESET,TEMPLATES preset
    style WIZARD fill:#f5f0ff,stroke:#805ad5,color:#553c9a
    style GEN fill:#f0fff4,stroke:#38a169,color:#276749
```
When a user runs `/research [topic]` on a generated engine, this pipeline executes:
```mermaid
flowchart TD
    TOPIC["User runs<br/>/research topic --deep"]
    subgraph P0["Phase 0: Tier Detection"]
        PARSE["Parse flags and<br/>configure depth"]
    end
    subgraph P1["Phase 1: Research Planning"]
        PLAN["Strategic framework<br/>and agent task design"]
    end
    GATE{"User approves<br/>outline?<br/>(Standard+ default)"}
    subgraph P2["Phase 2: Parallel Research"]
        direction LR
        A1["Agent 1<br/>Domain Specialist"]
        A2["Agent 2<br/>Analyst"]
        A3["Agent 3<br/>Mapper"]
    end
    subgraph P3["Phase 3: Synthesis"]
        SYN["Multi-source integration<br/>Contradiction resolution<br/>Gap analysis"]
    end
    subgraph P4["Phase 4: Draft Reporting"]
        DRAFT["Draft report with<br/>claim tagging<br/>[VC] [PO] [IE]"]
    end
    subgraph P5["Phase 5: VVC-Verify"]
        VER["Extract claims<br/>Re-fetch sources<br/>Classify alignment"]
    end
    subgraph P6["Phase 6: VVC-Correct"]
        COR["Auto-correct errors<br/>Produce final report<br/>+ correction log"]
    end
    SOURCES["5-Tier Source<br/>Hierarchy"]
    QUALITY["Quality Framework<br/>Confidence Scoring"]
    REPORT["Comprehensive Report<br/>+ Verification Report<br/>+ Correction Log"]
    TOPIC --> P0
    P0 --> P1
    P1 --> GATE
    GATE -->|"yes"| P2
    GATE -->|"revise"| P1
    SOURCES -.->|"governs<br/>credibility"| P2
    P2 --> P3
    P3 --> P4
    QUALITY -.->|"evidence<br/>thresholds"| P4
    P4 --> P5
    P5 --> P6
    P6 --> REPORT
    classDef phase fill:#dbeafe,stroke:#2b6cb0,color:#1a365d
    classDef agent fill:#e9d8fd,stroke:#805ad5,color:#553c9a
    classDef vvc fill:#fed7d7,stroke:#e53e3e,color:#9b2c2c
    classDef input fill:#c6f6d5,stroke:#38a169,color:#276749
    classDef support fill:#feebc8,stroke:#dd6b20,color:#7b341e
    classDef output fill:#fef3c7,stroke:#d69e2e,color:#744210
    classDef gate fill:#fefcbf,stroke:#d69e2e,color:#744210
    class PARSE,PLAN,SYN,DRAFT phase
    class A1,A2,A3 agent
    class VER,COR vvc
    class TOPIC input
    class SOURCES,QUALITY support
    class REPORT output
    class GATE gate
    style P0 fill:#f0f9ff,stroke:#2b6cb0,color:#1a365d
    style P1 fill:#f0f9ff,stroke:#2b6cb0,color:#1a365d
    style P2 fill:#f5f0ff,stroke:#805ad5,color:#553c9a
    style P3 fill:#f0f9ff,stroke:#2b6cb0,color:#1a365d
    style P4 fill:#f0f9ff,stroke:#2b6cb0,color:#1a365d
    style P5 fill:#fff5f5,stroke:#e53e3e,color:#9b2c2c
    style P6 fill:#fff5f5,stroke:#e53e3e,color:#9b2c2c
```
The config file is the source of truth. You can:
- Edit it directly and re-run generation
- Use `/update-engine` to re-interview specific sections
- Use `/preview-engine` to inspect what a config would produce without writing files
- Version-control it alongside your generated engine
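For orientation, the overall shape of an `engine-config.json` looks roughly like this -- field names are illustrative, not the authoritative schema:

```json
{
  "name": "patent-intelligence-engine",
  "version": "1.0.0",
  "mode": "self-contained",
  "sources": { "tiers": ["official", "aggregator", "press", "analyst", "unverified"] },
  "agents": [ { "id": "patent-search-specialist", "model": "sonnet" } ],
  "quality": { "citation_standard": "Bluebook" },
  "vvc": { "enabled": true, "coverage": "100%" },
  "output": { "directory": "./research-reports" }
}
```

Because the whole pipeline is derived from this one file, diffing two configs tells you exactly how two engines differ.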
Push a generated engine to a marketplace repository:
```
./scripts/publish-engine.sh ./generated-engines/your-engine/ https://github.com/user/marketplace-repo
```

The script validates the engine structure, reads the name and version from engine-config.json, clones the marketplace repo, copies the engine, commits, and pushes.
Several open-source benchmarks exist for evaluating deep research agents. You can use them to validate a newly generated engine against published baselines.
| Benchmark | Best For | Dataset |
|---|---|---|
| DRACO | General quality (100 tasks, 10 domains, ~40 rubrics per task) | huggingface-cli download perplexity-ai/draco |
| ReportBench | Citation accuracy (precision, recall, hallucination rate) | Clone repo |
| DeepResearch Bench | PhD-level research quality (RACE + FACT metrics) | Clone repo |
| DeepScholar-Bench | Academic synthesis (retrieval + verifiability) | Clone repo |
| DeepSearchQA | Multi-source collation (900 prompts, 17 fields) | huggingface-cli download google/deepsearchqa |
1. Generate an engine and load it:

```
# With the plugin installed, run the wizard
/create-engine --preset legal
# Complete the wizard to generate the engine

# Load the generated engine for testing
claude --plugin-dir ./generated-engines/legal-research-engine
```

2. Download a benchmark dataset:

```
# DRACO (recommended starting point)
pip install huggingface-hub
huggingface-cli download perplexity-ai/draco --repo-type dataset --local-dir ./benchmarks/draco
```

3. Run benchmark queries through the engine:

```
# For each benchmark task, invoke the engine's research command
claude -p "/research [benchmark query] --deep" --plugin-dir ./generated-engines/legal-research-engine
```

Collect the output reports from the engine's output directory. Each run produces a Comprehensive Report, Bibliography, and (if VVC enabled) a VVC Verification Report and Correction Log.
4. Score the outputs:
Each benchmark provides its own evaluation protocol:
- DRACO: LLM-as-judge scoring against per-task rubrics (~40 criteria each)
- ReportBench: Citation precision/recall against arXiv survey gold standards, plus statement and citation hallucination rates
- DeepResearch Bench: RACE (report quality) + FACT (citation accuracy) composite scoring
The most valuable test you can run is comparing the same engine with and without verification:
| Run | How | What It Shows |
|---|---|---|
| A: VVC enabled | Default config (VVC on, 100% all levels) | Baseline with verification |
| B: VVC disabled | Edit engine-config.json, set `vvc.enabled: false` | Baseline without verification |
| A - B | Compare scores | The measurable value VVC adds to citation accuracy and factual correctness |
This delta is data no other platform can produce -- none of them have a toggle for post-draft verification.
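The toggle itself is a one-line JSON edit. A sketch that flips it programmatically, assuming the `vvc.enabled` key named in the table (a `jq` one-liner would do the same job):

```python
# Flip VVC off for run B while preserving the rest of the config.
import json

def set_vvc(config_text: str, enabled: bool) -> str:
    cfg = json.loads(config_text)
    cfg.setdefault("vvc", {})["enabled"] = enabled
    return json.dumps(cfg, indent=2)

original = '{"vvc": {"enabled": true, "coverage": "100%"}}'
print(set_vvc(original, False))
```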
The strongest validation of the factory model: test whether domain-specialized engines outperform general-purpose tools on domain-specific tasks.
- Generate a domain engine (e.g., `--preset legal`)
- Run it on that domain's benchmark subset (e.g., DRACO's Law category, where Perplexity scores 90.2%)
- Compare to published general-purpose baselines
If a specialized engine outperforms general-purpose tools on its home domain -- even if it underperforms elsewhere -- that validates the core thesis: specialization beats generalization for professional research.
| System | DRACO Score | Notes |
|---|---|---|
| Perplexity Deep Research | 70.5% | Best overall; 90.2% on Law |
| Gemini Deep Research | 59.0% | -- |
| OpenAI o3 Deep Research | 52.1% | Slowest (avg 1,808s per task) |
Source: DRACO benchmark paper (arXiv 2602.11685)
- Claude Code with plugin support
- For extension mode only: the base `/deep-research` skill must be installed as a separate plugin
```
deep-research-engine-creator/
├── .claude-plugin/marketplace.json    # Marketplace registry (plugin discovery)
├── .gitignore
├── CHANGELOG.md
├── LICENSE
├── README.md                          # This file
└── plugin/                            # The plugin itself
    ├── .claude-plugin/plugin.json     # Plugin manifest
    ├── commands/
    │   ├── create-engine.md           # /create-engine wizard
    │   ├── update-engine.md           # /update-engine reconfiguration
    │   ├── test-engine.md             # /test-engine validation suite
    │   ├── preview-engine.md          # /preview-engine read-only preview
    │   ├── list-engines.md            # /list-engines directory scanner
    │   └── install-local-plugin.md    # /install-local-plugin local registration
    ├── skills/engine-creator/
    │   ├── SKILL.md                   # Core wizard + generation logic
    │   ├── domain-presets/            # 21 domain presets
    │   └── templates/                 # 10 generation templates
    ├── scripts/
    │   └── publish-engine.sh          # Marketplace publishing script
    ├── examples/
    │   └── patent-intelligence-engine/  # Complete reference implementation
    └── docs/                          # Design and planning documents
```
MIT