Knowledge and Provenance Substrate¶
Designed behaviour; runtime in active development
This page is the source of truth for the designed behaviour of this subsystem. The components are built and unit-tested; the ingestion and retrieval pipeline runs inside a live agent, which is in active development (see the Roadmap).
SynthOrg separates three storage concerns:
- Agent memory (Mem0): what an agent remembers about a run. See Memory and Persistence.
- Living documentation: the org documenting itself as a dual-purpose wiki plus RAG namespace. See Living Documentation.
- Knowledge substrate (this page): heavy-duty document/knowledge RAG over an ingested external corpus (specs, codebases, web pages, tickets) where every retrieved chunk carries a traceable citation.
The distinction matters: a product studio must answer "what does the 500-page spec say about retries?" grounded with a citation that resolves to the exact page and region, not paraphrased from an agent's fallible memory. Grounded, citable output is the trust backbone for the red-team grounding check, the benchmark, research mode, and deliverable receipts.
What is reused, what is new¶
The substrate reuses the proven memory retrieval stack rather than reinventing vector infrastructure:
- The
MemoryBackendprotocol is the pluggable vector store. Knowledge chunks are stored as memory entries in a dedicated namespace under a syntheticSYSTEM_KNOWLEDGE_AGENT_ID, mirroring how living documentation usesSYSTEM_DOCS_AGENT_ID. - Hybrid retrieval (dense + BM25 sparse + Reciprocal Rank Fusion + optional LLM
rerank) comes for free via the per-agent
backend.retrievepath (memory/ranking.py::fuse_ranked_lists,memory/sparse.py::BM25Tokenizer). - Persistence, manifest wiring, and the system-agent indexing idiom follow the living-documentation engine's patterns.
What is new is everything the living-documentation chunker cannot express:
multi-source loading, structure-aware chunking (AST for code, sections for
documents, page/region for PDFs), and a provenance model precise enough for
citations. The living-documentation DocChunker is structurally bound to
LivingDocument/DocBlock and cannot represent a PDF page, a source file, or
a web page without discarding the page, AST, and offset data that citations
need. The substrate therefore ships a parallel src/synthorg/knowledge/
package, not an extension of docs_engine.
Pipeline¶
flowchart LR
Src[Source: PDF / web / repo / ticket] --> Loader[SourceLoader]
Loader -->|RawDocument| Chunker[StructureAwareChunker]
Chunker -->|KnowledgeChunk + ProvenanceLocator| Indexer[KnowledgeIndexer]
Indexer -->|MemoryBackend.store| Mem[(KNOWLEDGE namespace)]
Indexer --> SrcRepo[(KnowledgeSourceRepository)]
Indexer --> ProvRepo[(ChunkProvenanceRepository)]
Query[Agent query] --> Retriever[KnowledgeRetriever]
Retriever -->|hybrid dense+BM25+RRF| Mem
Retriever -->|resolve citation| ProvRepo
Retriever --> Hit[KnowledgeHit + Citation]
Surface¶
src/synthorg/knowledge/
models.py KnowledgeSource, KnowledgeChunk, ProvenanceLocator union,
Citation, KnowledgeHit, RawDocument / RawUnit
config.py KnowledgeConfig + loader / chunker discriminators
constants.py namespace, system agent id, tag prefixes, chunk budgets
errors.py KnowledgeError family (DomainError subclasses)
loaders/
protocol.py SourceLoader
pdf.py PdfLoader (pdfplumber, per-page; thread-offloaded)
web.py WebLoader (injected HtmlFetcher + HTMLParseGuard sanitise)
repo.py RepoLoader (deterministic local tree walk)
ticket.py TicketLoader (staged on the #1991 governed connection)
factory.py build_source_loader
chunking/
protocol.py StructureAwareChunker + ChunkPiece
code.py tree-sitter AST chunker (function / class / method units)
document.py OffsetChunker: char-offset chunker shared by documents,
ticket threads, and PDF pages (paragraph-packed)
factory.py build_chunker + chunk_raw_document orchestration
indexer.py KnowledgeIndexer
freshness.py content-hash dedup and invalidation
retrieval.py KnowledgeRetriever
service.py KnowledgeService
factory.py build_knowledge_service -> KnowledgeRuntime
tool_factory.py KnowledgeToolFactory (per-task agent tools)
Data model¶
All models are frozen Pydantic v2 with extra="forbid".
Enums¶
| Enum | Values | Purpose |
|---|---|---|
SourceType |
PDF, WEB, REPO, TICKET, DESIGN_DOC |
Origin of an ingested source. |
ContentKind |
CODE, DOCUMENT, PDF_PAGE, TICKET_THREAD |
Drives chunker selection. |
SourceStatus |
PENDING, INDEXED, STALE, FAILED |
Ingestion lifecycle state. |
A new top-level MemoryCategory.KNOWLEDGE is added (alongside PROJECT_DOC)
so knowledge entries are routed and filtered distinctly from agent memory.
KnowledgeSource¶
A registered corpus source. Identified by source_id; scoped to a project or
global (project_id = None).
| Field | Type | Notes |
|---|---|---|
source_id |
NotBlankStr |
Primary key. |
source_type |
SourceType |
|
project_id |
NotBlankStr \| None |
None means global (shared across projects). |
uri |
NotBlankStr |
File path, URL, repo@ref, or ticket reference. |
title |
NotBlankStr |
Human label. |
content_hash |
NotBlankStr |
Hash of source bytes; short-circuits re-ingest. |
status |
SourceStatus |
|
chunk_count |
int |
|
created_at / updated_at |
AwareDatetime |
|
last_indexed_at |
AwareDatetime \| None |
|
last_error |
NotBlankStr \| None |
Safe error description on failure. |
ProvenanceLocator (the citation precision model)¶
A discriminated union on locator_kind. Each variant captures exactly enough
to resolve a chunk back to its source region.
| Variant | Fields |
|---|---|
PdfLocator |
page: int, bbox: tuple[float, float, float, float] \| None, char_start: int, char_end: int |
WebLocator |
url: NotBlankStr, css_path: str \| None, char_start: int, char_end: int |
CodeLocator |
path: NotBlankStr, line_start: int, line_end: int, symbol: str \| None, ast_path: str \| None |
TicketLocator |
ticket_id: NotBlankStr, comment_id: str \| None, char_start: int, char_end: int |
KnowledgeChunk, Citation, KnowledgeHit¶
KnowledgeChunk:chunk_id,source_id,content_kind,chunk_index,text,content_hash,locator: ProvenanceLocator,tags.Citation: the resolvable handle returned with every hit:source_id,chunk_id,source_type,title,uri,locator,content_hash.KnowledgeHit:chunk_text,relevance_score,citation.
RawDocument / RawUnit are the loader output (unit text plus raw locator
fields) handed to the chunker. External dict ingestion at the loader boundary
goes through parse_typed().
Ingestion¶
Loaders¶
Every loader satisfies the SourceLoader protocol
(load(source: KnowledgeSource) -> RawDocument). Selection is factory-based on
SourceType, so a new source type is a new strategy plus a registry entry.
| Loader | Source | Notes |
|---|---|---|
PdfLoader |
PDF, DESIGN_DOC |
pdfplumber per page (parsing offloaded to a worker thread); one RawUnit per page with a PdfLocator(page, char offsets). Citations resolve to the page; word-level bbox refinement is a planned follow-up (the field exists, unset today). |
WebLoader |
WEB |
Fetches via an injected HtmlFetcher (the factory wires one on the governed HTTP path: network policy, SSRF, DNS pinning), sanitises with HTMLParseGuard to strip scripts and hidden-injection vectors, and emits one DOCUMENT unit with a WebLocator. |
RepoLoader |
REPO |
Walks the local repo tree deterministically, skips VCS-internal / vendored / binary / oversized files, and emits one CODE unit per text file with a CodeLocator (repo-relative path + line span). |
TicketLoader |
TICKET |
Live fetch routes through the merged governed external-API access tool (#1991). Transport wiring is staged after the MVP corpus, so the loader currently raises KnowledgeSourceUnavailableError rather than degrade silently. |
PDF support is pdfplumber (MIT). pymupdf is deliberately excluded: its AGPL licence is incompatible with the project's BUSL-to-Apache model.
Structure-aware chunking¶
Every chunker satisfies StructureAwareChunker
(chunk_unit(unit: RawUnit) -> tuple[ChunkPiece, ...]). Selection is
factory-based on ContentKind; chunk_raw_document dispatches each unit and
assigns deterministic positional chunk ids. Naive fixed-window chunking is
never the primary strategy; it is only an explicit last-resort fallback.
| Chunker | ContentKind | Strategy |
|---|---|---|
CodeChunker (code.py) |
CODE |
tree-sitter parse via the standard Parser + get_language; split at function / class / method boundaries; CodeLocator with line span, symbol, and AST path. Language is chosen from the file extension; an unknown extension or absent grammar degrades to a deterministic line-window split. |
OffsetChunker (document.py) |
DOCUMENT, TICKET_THREAD, PDF_PAGE |
Paragraph-packed split under a token budget; refines the unit's locator (WebLocator / TicketLocator / PdfLocator) with each chunk's char offsets, so PDF chunks keep their page (and bbox). |
Retrieval¶
KnowledgeRetriever.search(query, *, project_id, limit) builds a MemoryQuery
(text plus the KNOWLEDGE namespace plus scope tags) and calls
backend.retrieve(SYSTEM_KNOWLEDGE_AGENT_ID, query), which already runs the
dense + BM25 RRF hybrid and optional rerank. Each MemoryEntry becomes a
KnowledgeHit; the retriever resolves the full Citation by batched lookup
(ChunkProvenanceRepository.get_many). The scope filter matches
project:<id> OR scope:global.
Two retrieval paths, both first-class:
- Transparent (
ProjectAwareMemoryFacade): the facade fans out viaasyncio.TaskGroupto the agent's own memories, project living-docs, AND the knowledge namespace, merging by descending relevance. An agent gets cited corpus hits without calling any special tool. - Explicit (
SearchKnowledgeTool, MCPknowledge:search): an agent or operator runs a corpus-only query.
Untrusted-content boundary (SEC-1)¶
Every KnowledgeHit.chunk_text is wrapped via wrap_untrusted(...) at the
retrieval boundary, on both the explicit tool result and the facade fan-out.
This applies to all source types, not only HTML: a PDF, source file, or
ticket comment can carry injected instructions just as a web page can.
Wrapping happens at retrieval, never on storage.
Freshness and invalidation¶
KnowledgeSource.content_hash is the hash of the source bytes;
KnowledgeChunk.content_hash is per chunk (both via
synthorg.versioning.hashing.compute_content_hash).
- A re-ingest whose top-level source hash is unchanged short-circuits with no work.
- Otherwise the source is re-loaded and re-chunked, new chunk hashes are
compared against the stored provenance rows, and only changed or new chunks
are re-embedded. Removed chunks are deleted by
chunk:<id>/source:<id>tag (idempotent, mirroringDocIndexer._delete_prior). Editing one line of a 500-page spec re-embeds one chunk, not the whole document. reindex(source_id)forces a reload;list(stale_only=True)surfaces sources whose source hash has drifted. Status transitions log at INFO after the persistence write.
Persistence¶
Two repository protocols in persistence/knowledge_protocol.py, both
composing the generic categories from _generics.py:
KnowledgeSourceRepository(IdKeyedRepository, FilteredQueryRepository):save/get/delete/list_items/query/countoverKnowledgeSource, filtered by project, scope, type, status, and a stale-only flag.ChunkProvenanceRepository(IdKeyedRepository, FilteredQueryRepository):ChunkProvenanceRowkeyed bychunk_id(a row is replaced on re-index, so it is keyed rather than an immutable event log). Two bespoke methods are added under ADR-0001 D7:get_many(chunk_ids)(performance: resolve citations for a whole hit page in one round trip, avoiding N+1 reads) anddelete_by_source(source_id)(domain invariant: re-index must purge a source's provenance atomically; callers must not bypass it with per-row deletes).
Concrete SQLite and Postgres implementations live under
persistence/{sqlite,postgres}/ and are exposed on PersistenceBackend.
Schema ships as new yoyo revisions for both backends (never edit an existing
revision). Dual-backend conformance tests are required.
Namespace and tags¶
| Constant | Value |
|---|---|
KNOWLEDGE_MEMORY_NAMESPACE |
knowledge |
SYSTEM_KNOWLEDGE_AGENT_ID |
_system:knowledge |
| Tag prefix | Purpose |
|---|---|
source:<id> |
Identifies the source doc; used for idempotent re-index delete. |
chunk:<id> |
Identifies the chunk; used for citation resolution and targeted delete. |
project:<id> |
Project scope filter. |
scope:global |
Marks a global (cross-project) source. |
kind:<content_kind> |
Lets a search hit expose its content kind without a repository lookup. |
API surface¶
REST (read-only, dashboard):
| Method | Path | Returns |
|---|---|---|
GET |
/projects/{project_id}/knowledge |
Paginated KnowledgeSource[] |
GET |
/knowledge |
Paginated global KnowledgeSource[] |
GET |
/projects/{project_id}/knowledge/search?q=... |
KnowledgeHit[] ordered by relevance |
Agent tools (in-process, per-task binding):
SearchKnowledgeTool(memory:readaction type)IngestKnowledgeTool(knowledge:ingestaction type, admin via TrustService)
MCP handlers (operator-driven, meta/mcp/domains/knowledge.py):
knowledge:search(read capability)knowledge:ingest,knowledge:reindex(admin capability, guardrail triple)
Configuration¶
KnowledgeConfig (frozen) defaults to enabled=False until setup wires it.
It carries the pdf_loader and code_chunker discriminators (defaults
pdfplumber / tree_sitter) and a reranker_enabled flag. Chunk budgets and
namespace/tag constants live in knowledge/constants.py as module-level
Final values because they are part of the on-disk plus RAG-index contract: a
runtime change would silently invalidate previously indexed chunks (the same
rationale and gate allow-list as the living-documentation engine). Optional
dependencies (pdfplumber, tree-sitter, tree-sitter-language-pack) ship
under a knowledge extras group and import lazily; their absence raises
KnowledgeDependencyError with install guidance so the base install stays
lean.
Acceptance¶
Ingest a mixed corpus (a repo, a PDF, and several web pages); an agent answers
a question with citations that resolve to the exact source chunk (PDF page and
region, code line span, or web URL and offset); changing a source invalidates
and re-indexes only the changed chunks. Validated end-to-end under the
simulation harness by tests/integration/knowledge/test_knowledge_round_trip.py,
plus the per-component unit suite under tests/unit/knowledge/ and dual-backend
persistence conformance under tests/conformance/persistence/.