Model Tier Policy
A model pin records a design tier, not a vendor model. SynthOrg is
provider-agnostic: no canonical vendor model is privileged, so a prompt
class pins one of the vendor-agnostic archetype tiers
(example-large-001, example-medium-001, example-small-001) that
heuristic_tier (in synthorg.budget.model_tier) resolves. This page
documents which tier each system prompt class is pinned to and the
reasoning behind it.
The policy lives in synthorg.llm.model_tier_policy. It maps every
PromptPurposeId in the prompt-purpose registry
(synthorg.llm.prompt_purpose) to a tier, with an import-time guard that
rejects any purpose missing an entry. The pin-validation benchmark
consumes the policy to validate each prompt class against its pinned
tier, and the per-class ModelPinMetadata rollout assigns its tiers from it.
Cognitive-load taxonomy
The tier judgement is grounded in what the prompt asks the model to do, not in which subsystem the prompt lives in. Each purpose is assigned a kind, and the kind determines the tier:
| Kind | Tier | What the prompt does |
|---|---|---|
| classify_route_triage | small | Bounded-output classification, routing, triage, and connection probes. The answer space is small and the cost of a cheap model is low. |
| judge_grade_verify | medium | Evaluative judgements, grading, verification, consolidation, and run-time intervention proposals. Needs reliable reasoning but not open-ended generation. |
| synthesise_generate_author | large | Open-ended synthesis, generation, authoring, code modification, and planning. Quality scales with capability, so the strongest tier is justified. |
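The two-step lookup (purpose to kind, kind to tier) can be sketched as follows. The Kind and Tier enums, the PURPOSE_KIND table, and tier_for are hypothetical names for illustration; only the kind names, tier names, and the three sample assignments come from this page.

```python
from enum import Enum


class Tier(str, Enum):
    SMALL = "example-small-001"
    MEDIUM = "example-medium-001"
    LARGE = "example-large-001"


class Kind(str, Enum):
    CLASSIFY_ROUTE_TRIAGE = "classify_route_triage"
    JUDGE_GRADE_VERIFY = "judge_grade_verify"
    SYNTHESISE_GENERATE_AUTHOR = "synthesise_generate_author"


# The kind alone determines the tier.
KIND_TO_TIER: dict[Kind, Tier] = {
    Kind.CLASSIFY_ROUTE_TRIAGE: Tier.SMALL,
    Kind.JUDGE_GRADE_VERIFY: Tier.MEDIUM,
    Kind.SYNTHESISE_GENERATE_AUTHOR: Tier.LARGE,
}

# Hypothetical purpose -> kind table with three entries from this page.
PURPOSE_KIND: dict[str, Kind] = {
    "system:memory:rerank": Kind.CLASSIFY_ROUTE_TRIAGE,
    "system:verification": Kind.JUDGE_GRADE_VERIFY,
    "system:toolsmith:author": Kind.SYNTHESISE_GENERATE_AUTHOR,
}


def tier_for(purpose: str) -> Tier:
    """Resolve a purpose to its tier via its kind, never via its subsystem."""
    return KIND_TO_TIER[PURPOSE_KIND[purpose]]
```

Keeping the kind as an explicit intermediate step means a tier change is argued by reclassifying what the prompt does, not by editing a tier in isolation.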
Pinned tiers
Tiers per registered prompt purpose, grouped by tier.
Small (example-small-001)
| Prompt class | Purpose |
|---|---|
| system:security:safety_classifier | Classify whether content is safe before an agent acts on it. |
| system:security:uncertainty | Estimate model uncertainty for a security decision. |
| system:memory:rerank | Rerank retrieved memories for query relevance. |
| system:memory:retrieval_route | Route a retrieval query across the memory hierarchy. |
| system:memory:retrieval_retry | Reformulate and retry a failed memory retrieval. |
| system:memory:fine_tune_query | Generate a fine-tuning query for the embedding model. |
| system:research:triage | Triage a research brief into actionable directions. |
| system:cos:routing | Route a chief-of-staff request to a capability. |
| system:intake | Clarify an incoming request during intake. |
| system:hr:calibration | Sample calibration judgements for performance scoring. |
| system:providers:test_connection | Probe a provider connection with a minimal completion. |
Medium (example-medium-001)
| Prompt class | Purpose |
|---|---|
| system:security:llm_evaluator | Evaluate a security policy question with an LLM judge. |
| system:vision_verify | Verify a review artefact with a vision model. |
| system:red_team:grounding | Ground red-team probes against the target substrate. |
| system:memory:consolidate | Consolidate raw memories into durable entries. |
| system:memory:compress | Compress memory artefacts to reclaim context budget. |
| system:procedural:success_proposer | Propose procedural memories from successful runs. |
| system:procedural:propose | Propose a procedural memory from a task trace. |
| system:cos:chat | Answer an operator question about the organisation. |
| system:cos:narrative | Narrate organisational state for the operator. |
| system:steering:propose | Propose a steering intervention for a running task. |
| system:evolution:propose | Propose an evolution to an agent's behaviour. |
| system:workspace | Answer a semantic query over a task workspace. |
| system:verification | Grade a deliverable against quality criteria. |
Large (example-large-001)
| Prompt class | Purpose |
|---|---|
| system:memory:abstractive | Produce an abstractive summary of a memory set. |
| system:knowledge:synthesis | Synthesise a knowledge entry from source material. |
| system:research:synthesis | Synthesise research findings into a brief answer. |
| system:research:planning | Plan the steps to answer a research brief. |
| system:cos:propose | Propose an organisational change to the operator. |
| system:charter:interview | Interview the operator to draft an org charter. |
| system:toolsmith:author | Author a new tool definition for the toolsmith. |
| system:meta:code_modification | Modify code as part of a self-improvement strategy. |
| system:client:requirement_generator | Generate client requirements for a synthetic project. |
| system:hr:training_curation | Curate training examples from agent transcripts. |
Pin-validation benchmark
The policy is not advisory: the model-pin-validation ExternalBenchmark
exercises it on every eval cycle. For each prompt class it builds the
canonical pin (the policy tier plus the deterministic sampling
parameters), runs a canonical probe against the pinned tier through a
deterministic provider, and grades drift by comparing a live
fingerprint, sha256(model_id | temperature | top_p | max_tokens | output),
against a committed golden snapshot (pin_golden.json). The sampling
floats are serialised by their exact float.hex() representation in the
digest, so every distinct sampling value hashes differently and the
digest stays bit-reproducible across runs and platforms.
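A fingerprint of this shape, and its comparison against a golden entry, can be sketched as below. This is an illustration under stated assumptions, not the benchmark's code: the "|" separator, the UTF-8 encoding, the field order, the flat prompt-class-to-digest golden layout, and the function names pin_fingerprint and is_drift_free are all hypothetical. Only the use of sha256 over those five fields and float.hex() for the sampling floats comes from the description above.

```python
import hashlib


def pin_fingerprint(
    model_id: str,
    temperature: float,
    top_p: float,
    max_tokens: int,
    output: str,
) -> str:
    """Digest the pin and the probe output into a hex sha256 string."""
    fields = [
        model_id,
        # float.hex() is an exact, locale-independent rendering of the
        # IEEE 754 value, so two distinct floats never share a string.
        float(temperature).hex(),
        float(top_p).hex(),
        str(max_tokens),
        output,
    ]
    # Assumed serialisation: "|"-joined fields, UTF-8 encoded.
    payload = "|".join(fields).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_drift_free(golden: dict[str, str], prompt_class: str, live: str) -> bool:
    """True when the live digest equals the committed golden digest."""
    # A class absent from the golden counts as drift: .get() returns None.
    return golden.get(prompt_class) == live


# Usage: build a golden entry once, then compare a later run against it.
golden = {
    "system:intake": pin_fingerprint("example-small-001", 0.0, 1.0, 64, "ok"),
}
same = pin_fingerprint("example-small-001", 0.0, 1.0, 64, "ok")
moved = pin_fingerprint("example-medium-001", 0.0, 1.0, 64, "ok")
```

Here is_drift_free(golden, "system:intake", same) holds, while the digest computed with a different tier does not match. A decimal rendering such as a fixed number of digits could collapse nearby floats into one string; the hexadecimal form avoids that.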
A mismatch (a tier reassignment, a sampling change, or a probe-pipeline
change) fails the grade until the golden is deliberately regenerated with
scripts/refresh_model_pin_golden.py. Because the golden is an
independent snapshot, the check is a genuine regression gate, not a
"pin checks the pin" tautology.
On a clean grade the benchmark stamps validated_at for the prompt class
through the ModelPinValidationLedger (a one-row-per-class
ModelPinValidationRepository record). That validated_at is the durable
"last validated against its tier" timestamp the audit dashboard reads, the
live counterpart to a prompt class's static
ModelPinMetadata.model_version_pinned_at. The stamp is best-effort: a
persistence failure is logged but never flips a clean drift verdict.
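The best-effort behaviour can be sketched like this. InMemoryLedger, BrokenLedger, and record_grade are hypothetical stand-ins, and the stamp(prompt_class, when) signature is an assumption, not the ModelPinValidationLedger interface.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("pin_validation_sketch")


class InMemoryLedger:
    """Hypothetical stand-in: one validated_at per prompt class."""

    def __init__(self) -> None:
        self.rows: dict[str, datetime] = {}

    def stamp(self, prompt_class: str, when: datetime) -> None:
        # Overwrites any earlier row, keeping one row per class.
        self.rows[prompt_class] = when


class BrokenLedger:
    """Simulates a persistence outage."""

    def stamp(self, prompt_class: str, when: datetime) -> None:
        raise OSError("ledger unavailable")


def record_grade(prompt_class: str, clean: bool, ledger) -> bool:
    """Stamp validated_at on a clean grade; return the verdict unchanged."""
    if clean:
        try:
            ledger.stamp(prompt_class, datetime.now(timezone.utc))
        except Exception:
            # Best-effort: log the failure and keep the clean verdict.
            logger.exception("could not stamp validated_at for %s", prompt_class)
    return clean
```

With this shape, record_grade("system:intake", True, BrokenLedger()) still returns True, and a drifted grade never writes a stamp.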
Changing a pin
Reassigning a tier is a deliberate act:
- Edit the entry in synthorg.llm.model_tier_policy.
- Run uv run python scripts/refresh_model_pin_golden.py to regenerate the golden snapshot (the benchmark fails until you do).
- Commit both changes together. The next eval cycle re-validates the pin and refreshes its validated_at.