Research Mode¶
Today "an agent does research" is a curl in a sandbox. Research mode replaces that with a real pipeline: a research brief becomes a synthesised, citation-backed report whose every claim resolves to a retrievable source, produced through a recorded, replayable run.
Pipeline¶
ResearchService.run(brief, *, run_id, created_by) drives six stages:
- Query planning (
QueryPlanner): decompose the brief into source-targeted sub-queries. DefaultLlmQueryPlanner. - Multi-source retrieval (
RetrievalSource), fanned out concurrently viaasyncio.TaskGroup: the internal knowledge substrate plus web, academic, and code search. A single source failing is logged and skipped; the run continues with the remaining candidates. - Source-credibility triage (
CredibilityTriage): score each candidate and drop those below the brief's threshold. DefaultHybridCredibilityTriage(deterministic heuristic prefilter, then LLM triage on the survivors). - Deduplication (
Deduplicator): collapse near-duplicate findings. DefaultLexicalDeduplicator(content-hash plus canonical-URL plus token-shingle Jaccard; deterministic). - Synthesis (
Synthesizer): the LLM writes a report citing sources by stable reference id;CitationBindervalidates every cited id resolves to a retained item. An unsourced claim raisesResearchSynthesisErrorrather than emitting an unverifiable report. - Recording: the run is persisted as a
ResearchRun.
Every step is pluggable through a protocol, a default strategy, the
build_research_service factory, and a ResearchConfig discriminator
(settings/definitions/research.py). Safe defaults ship; web, academic, and
code retrieval use vendor-agnostic provider protocols with no bundled
implementation (mirroring WebSearchProvider), so a family fans out only
once a provider is injected.
Data model¶
Frozen Pydantic v2 models (research/models.py), all extra="forbid":
ResearchBrief-- the input: question, project scope, source toggles, credibility floor, and cost / wall-clock / sub-query limits.ResearchQueryPlan/SubQuery-- the planner's decomposition.RetrievedItem-- one candidate, carrying a stableref_id, snippet, content hash, relevance, and aResearchCitation.ResearchCitation-- resolves a claim to a source: forknowledgeit embeds the reused knowledge-substrateCitation; forweb/academic/codeit carries a typed locator.SourceCredibility-- a triage verdict.ResearchClaim-- an assertion backed by at least one citation.ResearchReport-- the deliverable: summary plus cited claims plus methodology counts.ResearchRun-- the persisted, replayable record: an immutable snapshot of the brief plus the plan, retrieved items, credibility verdicts, and report.
Identifiers are required fields, never random defaults: the agent tool and
MCP handler derive (brief_id, run_id) deterministically from the request,
so an identical request reproduces the same run id.
Recording and replay¶
The run record is the single source of truth for retrieval. Two layers compose to make a whole run deterministically replayable:
- LLM calls (planning, triage, synthesis) replay through the existing
CassetteCompletionProvider. - Retrieval results are persisted on the run as
retrieved_items; replay swaps eachRetrievalSourcefor aReplayRetrievalSourcethat serves the recorded items by sub-query index. Since the plan comes from the cassetted planner, the same plan reproduces the same routing.
Triage's heuristic component, lexical dedup, and citation binding are deterministic, so given the cassette plus the run record the report is byte-stable. The default path needs no embedder.
Persistence¶
A single research_runs table stores each run as a JSON blob with
denormalised brief_id / project_id / status / created_at columns for
filtering and ordering. ResearchRunRepository composes the
IdKeyedRepository and FilteredQueryRepository generics; SQLite and
Postgres implementations are conformance-tested in lockstep.
Surfaces¶
- Agent tool
research(research/tool.py): runs a brief and returns the cited report. Built per task byResearchToolFactory. - MCP domain
research:run/research:get/research:list(meta/mcp/domains/research.py, handlers route throughResearchService), 503-ing when the service is not wired.
A REST controller and dashboard surface for operator-driven research are a follow-up; the agent tool, MCP surface, and eval lane cover the #1989 acceptance.
Security (SEC-1)¶
All retrieved external content is untrusted. Snippets are wrapped via
wrap_untrusted(TAG_RESEARCH_SOURCE, ...) only where they enter a prompt
(planning prompt for the brief, triage and synthesis prompts for sources),
never at storage. The synthesiser and triage system prompts carry the
untrusted-content directive. The research action is classified research:run
at the MEDIUM risk tier; the underlying egress is separately gated at HIGH
via external_data:request.
Evaluation¶
A kind="research" eval brief carries a ResearchBriefSpec (question,
expected claims, credibility floor, judged rubric). grade_research_run
(evals/scoring/research.py) scores a run deterministically on claim
coverage, citation resolution (every claim citation resolves to a retrieved
source -- the acceptance criterion), and cited-source credibility. The lane
records a run, replays it, asserts the report is byte-identical, and grades
it.
Acceptance¶
Given a research brief, the org produces a synthesised, citation-backed report whose claims resolve to retrievable sources, and the run is replayable. Validated by the research eval lane and the service-level replay test.