Providers¶

The provider layer is how SynthOrg reaches every LLM -- cloud APIs, OpenRouter, Ollama, LM Studio, vLLM, or any custom endpoint -- through a single unified interface. It handles authentication, model discovery, cost metering, health probing, and runtime hot-reload without restarting the engine.

Provider Abstraction¶

The framework provides a unified interface for all LLM interactions. The provider layer abstracts away vendor differences, exposing a single completion() method regardless of whether the backend is a cloud API, OpenRouter, Ollama, or a custom endpoint.

Unified Model Interface: completion(messages, tools, config) -> resp

Adapter	Method
Cloud API Adapter	Direct API call
OpenRouter Adapter	400+ LLMs via OpenRouter
Ollama Adapter	Local LLMs, self-hosted
Custom Adapter	Any API

Provider Configuration¶

Provider Configuration (YAML)

Model IDs, pricing, and provider examples below are illustrative. Actual models, costs, and provider availability are determined during implementation and loaded dynamically from provider APIs where possible.

providers:
  example-provider:
    litellm_provider: "anthropic"  # LiteLLM routing identifier (optional, defaults to provider name)
    family: "example-family"       # cross-validation grouping (optional)
    auth_type: api_key             # api_key | oauth | custom_header | subscription | none
    connection_name: "provider-example-provider"  # catalog connection holding the secret (api_key / custom_header auth)
    # subscription_token: "..."    # subscription token (subscription auth only; passed to LiteLLM as api_key; sensitive -- use env vars or secret management)
    # tos_accepted_at: "..."       # timestamp when subscription ToS was accepted
    models:                        # example entries -- real list loaded from provider
      - id: "example-large-001"
        alias: "large"
        cost_per_1k_input: 0.015   # illustrative, verify at implementation time
        cost_per_1k_output: 0.075
        max_context: 200000
        estimated_latency_ms: 1500 # optional, used by fastest strategy
      - id: "example-medium-001"
        alias: "medium"
        cost_per_1k_input: 0.003
        cost_per_1k_output: 0.015
        max_context: 200000
        estimated_latency_ms: 500
      - id: "example-small-001"
        alias: "small"
        cost_per_1k_input: 0.0008
        cost_per_1k_output: 0.004
        max_context: 200000
        estimated_latency_ms: 200
      - id: "example-image-001"
        alias: "image"
        cost_per_image: 0.04       # per-image billing for image-output models
        max_context: 1             # nominal; image models are not token-metered
        metadata:
          supports_image_generation: true

  openrouter:
    auth_type: api_key           # api_key | oauth | custom_header | subscription | none
    connection_name: "provider-openrouter"  # catalog connection holding the secret
    base_url: "https://openrouter.ai/api/v1"
    models:                        # example entries
      - id: "vendor-a/model-medium"
        alias: "or-medium"
      - id: "vendor-b/model-pro"
        alias: "or-pro"
      - id: "vendor-c/model-reasoning"
        alias: "or-reasoning"

  ollama:
    auth_type: none
    base_url: "http://localhost:11434"
    keep_alive: "5m"               # ollama-only: how long to keep a model
                                   # loaded after a request ("0" = unload
                                   # now, "-1" = keep forever; omit to use
                                   # ollama's own OLLAMA_KEEP_ALIVE default)
    models:                        # example entries
      - id: "llama3.3:70b"
        alias: "local-llama"
        cost_per_1k_input: 0.0    # free, local
        cost_per_1k_output: 0.0
      - id: "qwen2.5-coder:32b"
        alias: "local-coder"
        cost_per_1k_input: 0.0
        cost_per_1k_output: 0.0

Catalog-only credentials. ProviderConfig no longer carries an embedded api_key:. Secrets for the api_key and custom_header auth types live in the connection catalog (Fernet-encrypted at rest); the provider config references the catalog entry by connection_name, and the resolver reads the secret from there. A config that sets an api_key / custom_header auth type without a connection_name is rejected at validation time.

Operator migration. Installs that previously persisted an embedded api_key are upgraded automatically: a one-time, idempotent boot hook runs after persistence connects (before the normal provider parse), reads each stored config through a transitional schema that tolerates the old api_key, mints a catalog connection (provider-<name>) for the secret, and re-persists the config on connection_name. The boot hook never logs the key. No operator action is required; the upgrade is transparent on the first start after the change.

Cost Recording¶

Every successful scoped provider.complete() call attributes a CostRecord to the agent and task that originated the work. Attribution flows through a ContextVar middleware rather than through per-call kwargs, which keeps the provider interface uniform across cloud APIs, OpenRouter, Ollama, and custom adapters. Calls made outside any cost_recording_scope -- infrastructure probes, model discovery, the engine turn loop, tests -- read None for the active context and are intentionally not attributed: the engine's post-execution recorder owns engine turns, and probe / discovery traffic is not user spend.

Scope contract: callers wrap a provider.complete() invocation in cost_recording_scope(cost_tracker, agent_id, task_id, project_id, call_category, currency) from synthorg.providers.cost_recording. The scope is an @asynccontextmanager that captures the current ContextVar value, sets the new context, yields, and restores the captured value on exit. It restores by plain set(previous) rather than Token.reset on purpose: a streaming or SSE body can drive the enter and the exit in different asyncio contexts, and Token.reset raises ValueError when the token is reset in a context other than the one that created it, whereas a plain set is always context-safe. Nested scopes shadow the outer one and are restored on exit; concurrent tasks see independent scopes.
Chokepoint: BaseCompletionProvider.complete() reads the scope's context after a successful response, builds a CostRecord from result.usage + result.provider_metadata (_synthorg_latency_ms, _synthorg_cache_hit, _synthorg_retry_count, _synthorg_retry_reason) + result.finish_reason, and submits it via cost_tracker.record(record). Calls outside any scope (probes, model discovery, tests) are no-ops.
Skip rule: usage with both zero tokens and zero cost is skipped (matches the engine post-execution recorder). Free-tier providers with non-zero tokens still record.
Failure isolation: any exception from cost_tracker.record(...) other than MemoryError / RecursionError is logged at WARNING (PROVIDER_COST_FAILED) and swallowed -- the user-visible provider response never depends on recording success.
Engine path: the engine loop deliberately does NOT open a scope around its turn-level provider.complete() call. The post-execution record_execution_costs(...) recorder remains authoritative for engine turns because it accumulates per-turn metadata (turn number, retry counts, tool-response tokens for PTE) that the chokepoint cannot see synchronously. The chokepoint reads None and is a no-op for engine calls -- no double-counting.
Streaming: provider.stream() also records cost, via a lazy pass-through wrapper (BaseCompletionProvider._cost_recording_stream, helper record_stream_cost_if_in_scope). The wrapper forwards every chunk unchanged and captures the terminal StreamEventType.USAGE chunk; once the stream is fully drained it emits a synthetic CostRecord through the same record_cost_if_in_scope chokepoint complete() uses. Because attribution happens on drain, an early aclose() or break that abandons the iterator before the USAGE chunk skips the record (the partial stream was not fully consumed).
AST gate: scripts/check_provider_complete_chokepoint.py (pre-push + CI) walks src/synthorg/ for Await(Call(Attribute(_, "complete"))) nodes on BaseCompletionProvider instances and asserts each call site is either in an explicit allowlist (chokepoint itself, engine loop helpers, connection probes, health prober, registry docstring example) or has a cost_recording_scope opened in the same function.

This pattern mirrors synthorg.observability.correlation.correlation_scope, which is the established codebase precedent for cross-cutting per-call context bindings (request_id / task_id / agent_id).

Model pricing (real cost, not $0.00)¶

A CostRecord is only meaningful when the model carries real per-token pricing. Live-discovered models otherwise keep cost_per_1k_* at the 0.0 default forever, recording $0.00 for every call. Two back-fills close that gap; a non-zero operator-supplied cost always takes precedence:

litellm back-fill: when enrichment finds a model with zero operator cost, it reads input_cost_per_token / output_cost_per_token from litellm's model info (extract_model_pricing, converted to per-1k) and sets cost_per_1k_input / cost_per_1k_output. A non-zero operator cost is never overwritten.
register unmapped ids: at registry build, register_operator_model_pricing syncs each litellm-driver provider's operator-supplied costs into litellm.model_cost via litellm.register_model, so get_model_info resolves ids litellm does not ship (e.g. a gateway's own chat model) and downstream cost math is consistent. It runs once per build, not per request.

prompt_class_id is legitimately None on a raw agent-execution turn: that path opens no cost_recording_scope and has no registered system-prompt purpose (the engine post-execution recorder owns it, by design). The $0.00 symptom is fixed by real pricing, not by fabricating a purpose. Calls that do carry a registered purpose (the tier-classifier LLM call, judging, etc.) attribute prompt_class_id normally.

Cassette Record / Replay¶

Recorded-LLM cassettes make a company run deterministic and free to re-execute: record the exact provider responses of a run keyed by request, then replay them for byte-identical re-execution with zero real LLM calls. Like cost recording, this is a provider-layer concern, not per-driver.

Seam: CassetteCompletionProvider (src/synthorg/providers/cassette/) wraps an inner driver and overrides the public complete() / stream() / get_model_capabilities() / batch_get_capabilities(). It deliberately overrides the public methods, not the _do_* hooks: BaseCompletionProvider.complete merges fresh _synthorg_latency_ms / _synthorg_retry_count into provider_metadata after _do_complete, so replaying through _do_complete would clobber the recorded metadata and break byte-identical replay. The three _do_* hooks are unreachable guards raising CassetteInternalError.
Decoration chokepoint: ProviderRegistry.from_config(..., cassette=...) wraps every driver in one shared CassetteSession before the registry is frozen, so no consumer (engine, coordinator, judge, runtime builder) can bypass record/replay. In replay the inner driver is not built at all (no factory call), so a pure replay run constructs no real provider.
Keying: SHA-256 over the canonical request (method, provider, model, messages, tools, config) via synthorg.versioning.hashing.compute_content_hash. Repeated identical requests within a run are disambiguated by a per-task FIFO lane: each distinct asyncio task is assigned a stable monotonic lane on its first provider call. Replay matching is (request_hash, lane, seq). This is stable across record and replay iff the first-call order of distinct tasks is identical, which the deterministic simulation harness provides; a cassette miss / sequence exhaustion fails loudly (CassetteReplayMissError / CassetteReplayExhaustedError) and never falls through to a real provider.
Storage: a single canonical JSON document (filesystem, no DB / no yoyo revision: this is test infrastructure). The session auto-persists after every recorded interaction (crash-safe), written atomically (temp file + rename). cassette_format_version gates incompatible formats with CassetteFormatError.
Redaction boundary (SEC-1): the replay key is hashed on the raw request, and the response / stream / capabilities outcome is stored verbatim because it is the byte-identical replay artefact. Redaction (pluggable CassetteRedactor; default PatternRedactor scrubs bearer tokens, sk- keys, AWS keys, PEM blocks, labelled secrets) applies only to the human-readable request_repr, which is never consulted for replay. Provider credentials never reach complete() (they live in driver config); the residual exposure is a model echoing a prompt secret into its own output, which is accepted and documented (cassettes are dev/test artefacts; default cassette runs use scripted/seeded providers).
Configuration: providers.cassette_mode (off / record / replay) + providers.cassette_path, resolved once at the boot site via the Cat-2 bootstrap resolver (env > code default, read_only_post_init, restart_required); off is a structural no-op.
Scope: the record/replay seam is complete and independently validated under the live engine harness (a recorded multi-turn agent run replays byte-identically with zero real provider calls). Wiring the cassette into the golden-company benchmark suite is owned by the benchmark child issue, not this seam.
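The keying scheme can be sketched with the standard library. This is an approximation for illustration: the real key comes from synthorg.versioning.hashing.compute_content_hash, whose canonicalisation is not shown in this document, so the JSON canonical form, request_hash, and SequenceCounter below are assumptions.

```python
import hashlib
import json
from collections import defaultdict


def request_hash(method, provider, model, messages, tools=None, config=None) -> str:
    """SHA-256 over a canonical JSON form of the request (assumed canonicalisation)."""
    canonical = json.dumps(
        {
            "method": method,
            "provider": provider,
            "model": model,
            "messages": messages,
            "tools": tools,
            "config": config,
        },
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace variation
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SequenceCounter:
    """Hands out (request_hash, lane, seq) keys for repeated identical requests."""

    def __init__(self) -> None:
        self._next: defaultdict[tuple[str, int], int] = defaultdict(int)

    def key_for(self, req_hash: str, lane: int) -> tuple[str, int, int]:
        seq = self._next[(req_hash, lane)]
        self._next[(req_hash, lane)] += 1
        return (req_hash, lane, seq)
```

Two identical requests on the same lane get seq 0 and 1, so replay can return the first and second recorded responses in order; the same request on another lane starts again at 0.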

LiteLLM Integration¶

The framework uses LiteLLM as the provider abstraction layer:

Unified API across 95+ providers
Built-in cost tracking
Automatic retries and fallbacks
Load balancing across providers
Chat completions-compatible interface (all providers normalised)
Model database: litellm.model_cost provides pricing and context-window data for all known models. It is used at provider creation to populate model lists dynamically with up-to-date metadata. At discovery, each model is enriched with a ModelMetadata record (capability flags -- tools / vision / reasoning / embeddings / prompt caching, max_output_tokens, and a parsed family + sortable generation), which is persisted on ProviderModelConfig so the capability-aware matcher works offline afterwards. Ollama bypasses this database entirely: litellm.model_cost has no entry for locally pulled models, and consulting it would overwrite the real /api/show probe capabilities with all-False guesses, so build_capabilities (in providers/drivers/litellm_capabilities.py) forces info = {} for the ollama routing key and resolves capabilities from the persisted probe metadata instead. Provider-specific version filters (MODEL_VERSION_FILTERS, keyed by LiteLLM provider) exclude older generations; family/generation parsing is driven by MODEL_FAMILY_RULES with a generic fallback. Discovery deduplicates dated model variants (e.g. it prefers example-large-002 over example-large-002-20260205) and falls back to the preset's default_models when no models are found in the database.
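The dated-variant deduplication can be sketched as below. The exact rule is not spelled out in this document, so this assumes a dated variant is an id ending in a -YYYYMMDD suffix and that it is dropped only when its undated base id is also listed; dedupe_dated_variants is a hypothetical name.

```python
import re

# Assumption: dated variants end in an eight-digit date suffix such as -20260205.
_DATED_SUFFIX = re.compile(r"-\d{8}$")


def dedupe_dated_variants(model_ids: list[str]) -> list[str]:
    """Drop dated ids whose undated base is present; keep input order."""
    present = set(model_ids)
    kept: list[str] = []
    seen: set[str] = set()
    for model_id in model_ids:
        base = _DATED_SUFFIX.sub("", model_id)
        # A dated id is redundant only when the undated alias is also listed.
        if base != model_id and base in present:
            continue
        if model_id not in seen:
            seen.add(model_id)
            kept.append(model_id)
    return kept
```

A dated id with no undated sibling is kept, since dropping it would remove the only way to address that model.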

Completion controls (reasoning, caching, streaming)¶

Three model-behaviour controls tune the LiteLLM call, each gated on a capability so a model that does not support the feature is left untouched. Two are CompletionConfig fields the driver maps onto the call (reasoning_effort and the prompt_caching flag); streaming is a loop-level behaviour driven by a setting plus the model's streaming capability, not a CompletionConfig field:

reasoning_effort (ReasoningEffort enum: minimal / low / medium / high): mapped 1:1 to LiteLLM's reasoning_effort kwarg, emitted only when the resolved model advertises supports_reasoning. Stakes routing drives it through a per-stakes StakesReasoning policy (sibling to StakesTierRequirement): the routing decision's effort is folded into the run's CompletionConfig while the agent's temperature / max_tokens are preserved. The policy is validated non-decreasing across the stakes ladder, so low-stakes work never requests deeper reasoning than high-stakes work.
Prompt caching (providers.prompt_caching_enabled, default on): when the model advertises supports_prompt_caching, drivers/litellm_cache.py rewrites the stable prefix (system block, tools block, and a rolling breakpoint before the live tail) into the content-block form carrying cache_control: {type: ephemeral} before the call, so a multi-turn run stops re-billing the unchanged prefix at full input-token cost. Non-caching models (Ollama, unknown) default the flag false and are never rewritten.
Streaming work loop (engine.work_loop_streaming_enabled, default on): when the model advertises supports_streaming the loops consume provider.stream() through one run_provider_turn() dispatcher, reassembling a CompletionResponse faithful to complete() (content, tool-call deltas, usage, and a finish_reason carried on the terminal DONE chunk) while polling cancellation and steering between chunks. See Mid-Flight Steering for the mid-turn cancel / steer-interrupt semantics. The retry / rate-limit / cost chokepoints stay in BaseCompletionProvider; the loop falls back to complete() when streaming is off or unsupported.

Provider Management¶

Providers can be managed at runtime through the API without restarting:

CRUD: POST /api/v1/providers (create), PUT /api/v1/providers/{name} (update), DELETE /api/v1/providers/{name} (delete)
Connection test: POST /api/v1/providers/{name}/test -- sends a minimal probe and reports latency
Model discovery: POST /api/v1/providers/{name}/discover-models
Queries the provider endpoint for available models (Ollama /api/tags, standard /models) and updates the provider config.
Accepts an optional preset_hint query parameter (?preset_hint={preset_name}) that guides endpoint selection (Ollama vs standard API path). The preset_hint is no longer used for SSRF trust decisions.
Auto-triggered on preset creation for no-auth providers with empty model lists.
SSRF trust is determined by a dynamic host:port allowlist (ProviderDiscoveryPolicy), seeded from preset candidate_urls at startup and auto-updated on provider create/update/delete. Trusted URLs bypass SSRF validation; untrusted URLs go through full private-IP/DNS-rebinding checks. Bypasses are logged at WARNING level (PROVIDER_DISCOVERY_SSRF_BYPASSED).
Discovery allowlist: GET /api/v1/providers/discovery-policy (read), POST /api/v1/providers/discovery-policy/entries (add entry), POST /api/v1/providers/discovery-policy/remove-entry (remove entry). These endpoints manage the dynamic SSRF allowlist of trusted host:port pairs for provider discovery, which is persisted in the settings system (DB > env > code).
Presets: GET /api/v1/providers/presets lists built-in cloud and local provider templates as a discriminated union (kind: "cloud" | "local"). Presets ship in two tiers, distinguished by an is_featured: bool field on the base shape:
Featured (hand-curated, branded): a curated set of cloud and local entries, each carrying a logo, vetted description, and -- where useful -- a default_models fallback list used when litellm.model_cost returns no entries. Listed first in the response and rendered in the wizard's primary grid. The current featured roster lives in _FEATURED_PRESETS in src/synthorg/providers/presets.py.
- Cloud (CloudPreset): hosted LLM APIs. Carries supported_auth_types (e.g. ["api_key"], ["api_key", "subscription"]) and a fallback default_models list. No candidate_urls (cloud endpoints are known statically; nothing to probe). An OpenAI-compatible gateway whose live /v1/models is the source of truth sets prefer_live_discovery: true (with auth_type=api_key, enforced by a model validator): from-preset skips the static litellm.model_cost table (which would surface the wrong catalogue for a gateway) and runs an authenticated live discovery to populate the full catalogue. The Bearer key is sent only when the base URL still matches the preset's canonical default_base_url; a user-overridden host is never handed the key. A gateway that ships a curated default_models seed degrades to that seed when discovery fails (a transient blip need not fail the save); a seedless gateway has no fallback, so a failed discovery (after a bounded transient retry that honours Retry-After) surfaces the specific reason (bad key / rate limit / unreachable host) rather than persisting a provider with zero models. Ollama Cloud (https://ollama.com/v1, seeded) and Mammouth (https://api.mammouth.ai/v1, seedless) both use this path.
- Local (LocalPreset): self-hosted servers (LM Studio, Ollama, vLLM). Carries candidate_urls for auto-detection and the local-management capability flags supports_model_pull / supports_model_delete / supports_model_config used by the UI to gate model lifecycle controls. Local presets may declare candidate_urls=() to opt out of auto-detection (vLLM uses this to dodge a port-8000 collision with the SynthOrg backend).
Soft (auto-derived from litellm.model_cost): one CloudPreset per chat-capable LiteLLM namespace not already covered by a featured preset and not denied by _LITELLM_NAMESPACE_DENYLIST / _LITELLM_NAMESPACE_DENY_PREFIXES. Soft presets default to auth_type=api_key, no logo (Lucide Server fallback in the picker), and a generic description. They surface every chat-capable LiteLLM provider out of the box without requiring a code change per release. Rendered in a collapsible "More providers via LiteLLM" section below the featured grid.
The requires_base_url flag is on both kinds (true for Azure on the cloud side; true for every local preset).
POST /api/v1/providers/from-preset creates a provider from any preset (featured or soft).
See docs/guides/adding-a-provider.md for the full add-a-provider workflow.
Preset auto-probe (batch): POST /api/v1/providers/probe-local -- probes every LocalPreset with non-empty candidate_urls in parallel (server-side asyncio.TaskGroup) using a 5-second timeout per URL and one rate-limit slot per call. Returns { results: { <preset_name>: ProbePresetResponse }, errors: { <preset_name>: <message> } }. Used by the setup wizard and the Settings → Providers page on mount and on user-triggered re-scan. Per-preset failures land in errors without aborting the batch (cloud presets and vLLM are excluded by construction). SSRF validation is intentionally skipped because only hardcoded preset URLs are probed, never user input. The legacy single-preset POST /api/v1/providers/probe-preset endpoint has been removed; no replacement is offered for one-off single probes (the batch endpoint covers every wizard / settings call site).
Hot-reload: On mutation, ProviderManagementService rebuilds ProviderRegistry + ModelRouter and atomically swaps both into AppState in a single field-level slice update -- no downtime, no partial swap. The persist-then-swap sequence is itself atomic with the DB write: a swap failure rolls the persisted providers.configs blob back to its prior value (re-serialised from the parsed snapshot, since the sensitive setting's stored blob is unrecoverable through the masked entry) and raises ProviderPersistenceError with an ERROR alert, so the database and the running registry never diverge. The validate / serialise / persist / swap stages each raise a distinct error (ProviderValidationError / ProviderSerializationError / ProviderPersistenceError) so the failing stage is unambiguous.
Auth types: api_key (default), subscription (token-based auth for provider subscription plans, passed to LiteLLM as api_key, requires ToS acceptance), oauth (stores credentials; the MVP uses a pre-fetched token), custom_header, none (local providers).
Routing key: The optional litellm_provider field decouples the provider display name from LiteLLM routing (e.g. a provider named "my-claude" can route to anthropic via litellm_provider: anthropic). It falls back to the provider name when unset.
Credential safety: Secrets are Fernet-encrypted at rest via the providers.configs sensitive setting; API responses use a ProviderResponse DTO that strips all secrets and exposes has_api_key/has_oauth_credentials/has_custom_header/has_subscription_token boolean indicators instead.
Persisted-config envelope: the providers.configs JSON value is wrapped in a versioned ProvidersConfigEnvelope ({ "schema_version", "providers" }). On read, the resolver validates the envelope and its schema_version; a wrong container shape, a validation failure, or an unknown version falls back to code-default providers with a structured WARNING (distinct reason) rather than silently mis-parsing the blob. A one-time boot migration upgrades a pre-envelope bare provider dict into envelope form on the same pass that moves any embedded api_key into the connection catalog.
Health: GET /api/v1/providers/{name}/health -- returns health status (up/degraded/down/unknown derived from 24h call count and error rate; unknown when no calls recorded), average response time, error rate percentage, call count, total tokens, and total cost. Tracking is in-memory via ProviderHealthTracker (concurrency-safe, append-only with periodic pruning). Token/cost totals are enriched from CostTracker at query time.
Health probing: The ProviderHealthProber background service pings providers that have a base_url (local/self-hosted) every 30 minutes using lightweight HTTP requests (no model loading). For Ollama it pings the root URL; for standard providers it issues GET /models. It skips providers with recent real API traffic. Results are recorded in ProviderHealthTracker. Cloud providers without a base_url rely on real call outcomes for health status.
Model capabilities: GET /api/v1/providers/{name}/models returns ProviderModelResponse DTOs enriched with runtime capability flags (supports_tools, supports_vision, supports_streaming, supports_embeddings, supports_reasoning) from the driver layer's ModelCapabilities. Embedding models are surfaced (so the UI tags them) and are excluded from chat-agent matching, since they produce vectors, not chat completions. Falls back to defaults when driver is unavailable. Each model also carries a metadata_source provenance flag (litellm / preset / probe / unknown) recording where its capability metadata came from; when it is unknown and no capability flags are set, the dashboard renders a muted "capabilities unverified" pill rather than implying the model has none. A provider-supplied context window (max_input_tokens from a live /models listing) is carried through as max_context when plausible, and dropped in favour of the safe default above a sanity ceiling (an untrusted gateway cannot inflate the window to skew model selection). The controller issues a single call per provider via CompletionProvider.batch_get_capabilities(models) -- one controller-side dispatch instead of one per model. The default BaseCompletionProvider.batch_get_capabilities implementation still fans out per model under the hood via asyncio.TaskGroup with per-model exception suppression (failures degrade to None entries via PROVIDER_BATCH_CAPABILITIES_PARTIAL warnings; MemoryError/RecursionError propagate); only specific driver overrides can collapse upstream I/O. The LiteLLMDriver overrides with a tight in-process loop over the static preset catalog, so every list-models request incurs zero network I/O regardless of catalog size.
Local model management: Providers with supports_model_pull/supports_model_delete/supports_model_config capability flags expose model lifecycle operations. POST /api/v1/providers/{name}/models/pull streams download progress via SSE (Ollama /api/pull). DELETE /api/v1/providers/{name}/models/{model_id} removes models. PUT /api/v1/providers/{name}/models/{model_id}/config sets per-model launch parameters (LocalModelParams: num_ctx, num_gpu_layers, num_threads, num_batch, repeat_penalty). Currently implemented for Ollama; LM Studio support deferred (unstable API).
Manual model add: POST /api/v1/providers/{name}/models adds a single ModelSpec to the persisted config. Bypasses provider discovery for cases where the model isn't in litellm.model_cost. Rejects duplicates within the provider with HTTP 409. Audited.
Bulk model sync: POST /api/v1/providers/{name}/models/sync re-runs discovery + pricing + metadata enrichment and (when replace_existing=true) replaces the persisted model list. Returns SyncModelsResponse with added / removed / updated model id lists plus the post-sync model set. After persistence a best-effort model-presence probe (StaticPresenceProbe, pluggable via the ModelPresenceProbe protocol) compares each persisted/baked id against the offline LiteLLM catalogue and logs PROVIDER_MODEL_ABSENT for any id no longer advertised (foundation for the staleness/refresh work); a probe failure never fails the already-persisted sync. Audited.
Rate-limit overrides: GET /api/v1/providers/{name}/rate-limits returns the effective RateLimiterConfig; PATCH /api/v1/providers/{name}/rate-limits applies a partial update (any subset of requests_per_minute, concurrent_requests). Mutations hot-reload via ProviderManagementService and write an audit row. Empty patches are rejected. Tokens-per-minute and requests-per-hour are not yet exposed by the DTOs; the underlying RateLimiterConfig carries those fields but the PATCH surface intentionally narrows to the two operator-actionable knobs.
Credential rotation: POST /api/v1/providers/{name}/credentials/rotate accepts a discriminated-union payload over auth_type (api_key / subscription / custom_header / oauth) and replaces the encrypted secret in providers.configs without downtime. Validates that the request's auth_type matches the provider's configured auth type. Audit payload carries only the masked credential (first 4 + last 4 chars; secrets of length 8 or shorter are masked entirely, since at exactly 8 chars the prefix and suffix windows already cover every byte) plus the actor; plaintext is never logged or persisted. Requires provider_admin guard.
Preset overrides: GET /api/v1/providers/presets/{preset_name}/override returns the persisted override for one preset (or 404 if absent); PATCH /api/v1/providers/presets/{preset_name}/override upserts an override; DELETE /api/v1/providers/presets/{preset_name}/override removes it. Overrides apply globally; subsequent from-preset creations see the merged preset. Validation rejects infeasible combinations (e.g. base_url on a local preset, candidate_urls on a cloud preset). Audited.
Audit log: GET /api/v1/providers/{name}/audit?cursor=...&limit=... returns the mutation history for one provider, newest first, keyset-paginated on the integer id column. Append-only; the only mutating operation is the retention sweeper purge_before_id. Every provider mutation (create / update / delete / model add / model remove / model config edit / bulk model sync / credential rotate / rate-limit edit / preset override edit) writes one row through ProviderAuditService.record(...); audit failures never propagate out of a mutation (the persisted change is already committed by the time we reach the audit write).
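The masking rule used in the credential-rotation audit payload (first 4 + last 4 characters, fully masked at 8 characters or fewer) can be sketched as below. mask_credential is a hypothetical name, and the fixed "****" filler is an assumption; the document does not specify what the masked middle looks like.

```python
_FILLER = "****"  # assumed filler; fixed width so the mask does not reveal length


def mask_credential(secret: str) -> str:
    """Return an audit-safe rendering of a secret (illustrative sketch)."""
    # At 8 chars or fewer the 4-char prefix and suffix would expose every byte,
    # so the whole value is masked.
    if len(secret) <= 8:
        return _FILLER
    return f"{secret[:4]}{_FILLER}{secret[-4:]}"
```

The boundary matters: a 9-character secret is the shortest one that shows any of its characters, and even then one character stays hidden.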

Model Refresh¶

The periodic model-refresh subsystem keeps the persisted model catalogue aligned with what each provider actually advertises, and surfaces upgrade recommendations when a newer in-family model appears. It is off by default; a normal boot skips it entirely. Wiring (wire_model_refresh) is gated on providers.model_refresh_mode != off, a built provider-management service, and a connected persistence backend.

Modes (RefreshMode, the config discriminator):

Mode	Behaviour
`off`	Disabled (safe default). Nothing scheduled.
`manual_only`	No cadence; only the explicit `POST /refresh` endpoint runs a cycle.
`detect_only`	Periodically probe providers and flag removed models stale; never persists new models or emits recommendations.
`reconcile_recommend`	Probe, persist refreshed metadata, flag removed models stale, and feed upgrade recommendations.

Settings (namespace providers, DB > env > code): model_refresh_mode, model_refresh_interval_seconds (default daily, clamped to 60s-7d), and model_refresh_auto_apply_within_family (when set, strictly in-family upgrades are auto-applied instead of parked for human approval). The scheduler re-reads the live mode + auto-apply flag every tick and fails safe to off on any read error, so an operator can change mode without a restart and a settings-backend hiccup never silently runs a refresh.
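The interval clamp and the fail-safe mode read can be sketched as follows. The bounds and mode names come from the text above; clamp_interval and resolve_mode are hypothetical names, and treating an unrecognised mode value as off is an assumption of this sketch.

```python
from collections.abc import Callable

MIN_INTERVAL_SECONDS = 60                 # 60 seconds
MAX_INTERVAL_SECONDS = 7 * 24 * 60 * 60   # 7 days
VALID_MODES = frozenset({"off", "manual_only", "detect_only", "reconcile_recommend"})


def clamp_interval(seconds: int) -> int:
    """Clamp a configured refresh interval into the 60s-7d window."""
    return max(MIN_INTERVAL_SECONDS, min(MAX_INTERVAL_SECONDS, seconds))


def resolve_mode(read_setting: Callable[[], str]) -> str:
    """Read the live mode each tick; any read error fails safe to off."""
    try:
        mode = read_setting()
    except Exception:
        return "off"
    # Assumption: an unknown value is also treated as off.
    return mode if mode in VALID_MODES else "off"
```

Because resolve_mode is called on every tick, changing the setting takes effect on the next cycle without a restart, and a failing settings backend results in no refresh rather than an unintended one.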

API (/api/v1/providers/model-refresh, require_write_access):

GET /recommendations -- list upgrade recommendations (filter by status).
POST /recommendations/{id}/approve -- approve and reassign pinned agents.
POST /recommendations/{id}/reject -- reject (no reassignment).
POST /refresh -- run one reconcile+recommend cycle on demand (CEO/manager).
GET /status -- current refresh mode, cadence, and auto-apply flag.

The recommendation store, scheduler, and service form a both-or-neither paired invariant on ModelRefreshStateSlice; the controllers return 503 when the store is unwired. Recommendations only PROPOSE; human approval still gates apply unless a strictly in-family upgrade matches the auto-apply flag.

In-family selection (UpgradeRecommender): models are grouped by (metadata.family, metadata.supports_embeddings), not by family alone. A family label can span two incompatible classes -- an embedding model (vector output) and a chat model are not drop-in replacements -- so grouping on the embedding flag prevents a newer-generation chat model from being recommended as the upgrade for an embedding model (or vice versa). Within a group, every model older than the newest generation is a candidate; the recommendation targets the newest-generation sibling with no capability regression (it must not drop a tool / vision / reasoning capability the current model has). When several newest-generation candidates qualify, the strongest is chosen by upgrade score (capability fit + context headroom + generation delta, from the registered matcher weights), with model id as a deterministic tie-break, so a larger / more capable variant is preferred over an arbitrary alphabetical pick.
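
The grouping and selection rules above can be sketched as follows. This is a simplified illustration under stated assumptions: the Model dataclass, the integer generation field, and the single precomputed score (standing in for the weighted capability fit + context headroom + generation delta) are hypothetical, and the real UpgradeRecommender may compute the score per (current, candidate) pair.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    id: str
    family: str
    supports_embeddings: bool
    generation: int                 # hypothetical ordering key
    capabilities: frozenset[str]    # e.g. {"tools", "vision", "reasoning"}
    score: float                    # stand-in for the weighted upgrade score


def recommend_upgrades(models: list[Model]) -> dict[str, str]:
    """Map each outdated model id to its recommended in-family upgrade id."""
    groups: dict[tuple[str, bool], list[Model]] = {}
    for m in models:
        # Group on (family, embedding flag), not on family alone.
        groups.setdefault((m.family, m.supports_embeddings), []).append(m)

    recommendations: dict[str, str] = {}
    for members in groups.values():
        newest = max(m.generation for m in members)
        targets = [m for m in members if m.generation == newest]
        for current in members:
            if current.generation == newest:
                continue
            # No capability regression: the target keeps every capability.
            ok = [t for t in targets if current.capabilities <= t.capabilities]
            if ok:
                # Highest score wins; model id is the deterministic tie-break.
                best = min(ok, key=lambda t: (-t.score, t.id))
                recommendations[current.id] = best.id
    return recommendations
```

An embedding model whose group contains no newer embedding sibling gets no recommendation at all, even when the same family has a newer chat model.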

Setup Model Assignment (cost + locality aware)¶

At org provisioning the template matcher (templates/model_matcher.py) assigns each agent a concrete model across all configured providers. Selection is driven by the demand a role declares (priority + requires_* mapped to a cost tier), then domination pruning and family spread. Two provider-aware guards keep the result sensible on a mixed local + cloud setup:

Prefer local when adequate (engine.matcher_prefer_local, default on): when a locally-hosted model (loopback / private / localhost base URL) already sits in the adequate band for a role, it is chosen over a paid remote model of equal fit before family spread applies, so a free local model wins even against a nominally stronger remote model in the same adequate band. A role that a free local model can serve therefore never silently runs on a paid cloud model.
Cloud capability floor (engine.matcher_min_cloud_tier, default 2): a remote provider is never auto-assigned a model whose known cost tier is below the floor, so a paid provider does not fill a role with a bottom-tier model when a stronger one exists. Local providers are exempt (free to run at any tier), and a remote model with no resolvable tier passes (optimistic); the floor relaxes if it would otherwise leave an agent unassigned.

Both are hot-reloadable (a change triggers a runtime-services rebuild via the settings subscriber, no restart), so the defaults give a sensible allocation with no operator input while remaining tunable per deployment.
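
The two guards can be sketched as below. This is a minimal illustration that assumes the candidates passed in are already inside the adequate band for the role, that cost tiers are integers, and that ties are broken by cost and then id; Candidate, is_local_base_url, passes_cloud_floor, and choose are hypothetical names, not the matcher's actual API.

```python
from __future__ import annotations

import ipaddress
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Candidate:
    model_id: str
    base_url: str
    cost_tier: int | None    # None = tier could not be resolved
    cost_per_1k: float


def is_local_base_url(base_url: str) -> bool:
    """True for localhost, loopback, or private-range hosts."""
    host = urlparse(base_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # any other DNS name is treated as remote (assumption)
    return addr.is_loopback or addr.is_private


def passes_cloud_floor(c: Candidate, min_cloud_tier: int = 2) -> bool:
    if is_local_base_url(c.base_url):
        return True              # local providers are exempt
    if c.cost_tier is None:
        return True              # unresolvable tier passes (optimistic)
    return c.cost_tier >= min_cloud_tier


def choose(
    adequate: list[Candidate],
    prefer_local: bool = True,
    min_cloud_tier: int = 2,
) -> Candidate:
    """Pick from a non-empty list of adequate candidates."""
    floored = [c for c in adequate if passes_cloud_floor(c, min_cloud_tier)]
    # The floor relaxes rather than leaving the agent unassigned.
    pool = floored or list(adequate)
    if prefer_local:
        local = [c for c in pool if is_local_base_url(c.base_url)]
        pool = local or pool
    return min(pool, key=lambda c: (c.cost_per_1k, c.model_id))
```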

Agent-eligible providers. A provider carries agent_eligible (default true). An agent_eligible=false provider stays fully usable for explicitly-configured feature calls (the chat / judge / charter / narrative models an operator sets), but contributes no models to the seeding pool and is excluded from stakes routing, so no agent is ever newly seeded onto it or routed to it. It does not immediately cut off existing traffic: an agent already pinned to the provider keeps running on it because resolve_for_pair honours the explicit (provider, model) binding, until that agent is reassigned. This lets an operator stop new agents sourcing from a gateway (added deliberately, e.g. for a specific feature model) without disrupting agents already bound to it. The flag is a per-provider field on ProviderConfig, editable through provider CRUD.

Model Routing Strategy¶

Model routing determines which LLM handles a given request. Five strategies are available, selectable via configuration:

Strategy	Behaviour
`manual`	Resolve an explicit model override; fails if not set
`role_based`	Match the agent's role to routing rules, then catalog default
`cost_aware`	Match task-type rules, then pick cheapest model within budget
`fastest`	Match task-type rules, then pick fastest model (by `estimated_latency_ms`) within budget; falls back to cheapest when no latency data is available
`smart`	Priority cascade: override > task-type > role > cheapest > fallback chain
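
The priority cascade of the smart strategy can be sketched as a first-match walk over ordered sources. The sketch assumes each stage has already been reduced to either a model alias or None; Request, TASK_RULES, ROLE_RULES, and resolve_smart are hypothetical names, and the rule contents are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    override: str | None = None
    task_type: str | None = None
    role: str | None = None


# Illustrative rule tables; real rules come from the routing configuration.
TASK_RULES = {"architecture": "large", "documentation": "small"}
ROLE_RULES = {"reviewer": "medium"}


def resolve_smart(
    req: Request,
    cheapest_within_budget: str | None,
    fallback: str | None,
) -> tuple[str, str]:
    """Return (source, model) for the first stage that yields a model."""
    steps = [
        ("override", req.override),
        ("task_type", TASK_RULES.get(req.task_type)),
        ("role", ROLE_RULES.get(req.role)),
        ("cheapest", cheapest_within_budget),
        ("fallback_chain", fallback),
    ]
    for source, model in steps:
        if model is not None:
            return source, model
    raise LookupError("no model could be resolved")
```

Returning the winning source alongside the model makes it easy to log why a request landed on a particular model.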

routing:
  strategy: "smart"              # smart, fastest, role_based, cost_aware, manual
  rules:
    - task_type: "architecture"
      preferred_model: "large"
      fallback: "medium"
    - task_type: "development"
      preferred_model: "medium"
      fallback: "small"
    - task_type: "code_review"
      preferred_model: "medium"
    - task_type: "documentation"
      preferred_model: "small"
  fallback_chain:
    - "example-provider"
    - "openrouter"
    - "ollama"

Stakes-aware routing (orthogonal layer)¶

Model routing above selects which provider/model serves a request. Stakes-aware routing is a separate, pluggable layer that re-tiers that selection based on how consequential the work is. Each task (and subtask) carries a stakes level (low / normal / high / critical), assessed by the StakesAssessor.

Routing maps stakes to a required model tier (StakesTierRequirement: low -> small, normal -> medium, high/critical -> large, validated non-decreasing), not to a benchmark quality floor. The StakesAwareStrategy computes the required tier, bumps it one tier when coordination metrics are unhealthy, and holds high/critical work at or above the agent's own tier for the red-team gate. It then scans every agent-eligible model at or above that tier, cheapest first (models on agent_eligible=false providers are excluded), keeps only the tool-capable ones (is_tool_capable: supports_tools true, or verified, and never a model whose tool_calls_verified is explicitly False), and picks the cheapest survivor.

When no configured model satisfies the required tier and tool-calling, routing never silently downgrades: it raises StakesModelUnavailableError (ErrorCode.STAKES_MODEL_UNAVAILABLE, 503). The engine escalates then fails: if an ApprovalGate is wired, the task is parked (action stakes:model_unavailable, risk HIGH) so an operator can add a qualifying provider or approve; otherwise it terminates FAILED with the typed error. A high-stakes task is therefore never run on a sub-tier model.
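
The tier requirement, tool-capability filter, and no-silent-downgrade rule can be sketched together. This is a simplified illustration: the Model dataclass and route function are hypothetical, the exception class is a local stand-in for the typed error, the red-team hold is omitted, and capping the coordination bump at the top tier is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

TIER_ORDER = ["small", "medium", "large"]
STAKES_TO_TIER = {
    "low": "small",
    "normal": "medium",
    "high": "large",
    "critical": "large",
}


class StakesModelUnavailableError(Exception):
    """Local stand-in for the typed error named in the text."""


@dataclass(frozen=True)
class Model:
    id: str
    tier: str
    cost_per_1k: float
    supports_tools: bool
    tool_calls_verified: bool | None = None   # None = never probed
    agent_eligible: bool = True


def is_tool_capable(m: Model) -> bool:
    if m.tool_calls_verified is False:
        return False              # an explicit failed probe always excludes
    return m.supports_tools or m.tool_calls_verified is True


def route(
    stakes: str,
    models: list[Model],
    coordination_unhealthy: bool = False,
) -> Model:
    rank = TIER_ORDER.index(STAKES_TO_TIER[stakes])
    if coordination_unhealthy:
        rank = min(rank + 1, len(TIER_ORDER) - 1)   # bump one tier, capped
    survivors = [
        m
        for m in models
        if m.agent_eligible
        and TIER_ORDER.index(m.tier) >= rank
        and is_tool_capable(m)
    ]
    if not survivors:
        # Never fall back to a lower tier; surface the typed error instead.
        raise StakesModelUnavailableError(stakes)
    return min(survivors, key=lambda m: (m.cost_per_1k, m.id))
```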

The layer is config-selectable via stakes_routing.strategy (stakes_aware default, flat to opt out) and applied in the engine before the budget auto-downgrade, so a hard budget ceiling still wins over a stakes upgrade. See Pluggable Subsystems.

Model tier classification. A model's routing tier is derived, not hardcoded per vendor. The deterministic HeuristicTierClassifier (providers/tier_assignment/) classifies each configured model from its capability metadata, in priority order: archetype id, then cost_tier, then parameter_count bands, then a cost proxy, falling back to medium at low confidence (routing must always resolve a tier or escalate, never return None). The effective tier map is the heuristic overlaid by persisted operator or LLM-accepted overrides (settings blob providers.tier_assignment_overrides; no new table). Operators inspect and adjust the map through the Model Tier Assignment panel (Settings > Providers) backed by GET/PUT /api/v1/providers/tier-assignments. An opt-in LLM recommender (LlmTierRecommender, purpose system:providers:tier_classification) offers per-model and bulk tier suggestions; it runs on the operator-selected providers.tier_classifier_model and returns a typed unset state until one is picked.

Per-task multi-provider routing (v1). The stakes router resolves a tier over all agent-eligible configured providers with a deterministic CheapestSelector (models on agent_eligible=false providers are excluded from candidacy), so a tier can resolve to the cheapest model serving it across the eligible providers rather than being pinned to the boot default. After routing, the engine swaps the dispatched client to the routed model's provider (AgentEngine._resolve_provider_instance), so the API actually called and the CostRecord.provider name are always the same provider (attribution parity). If the routed provider cannot be resolved from the registry, the engine keeps the pre-routing provider + identity together so a routing miss is never a mis-attribution. System / infra services that carry no dedicated per-feature model (decomposition, evolution, compaction, red-team, vision, the conflict judge, the security evaluators, the work pipeline) dispatch on the explicit operator-set providers.default_provider, resolved through ProviderRegistry.default_provider(): a sole registered provider is that default automatically, but with two or more providers the operator must name one and there is NO alphabetical / first-registered fallback (an ambiguous default leaves those services unwired rather than silently routing to whichever provider sorts first). Enforced by check_no_provider_auto_pick.py.
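
The default-provider rule for system services can be sketched as below. resolve_default_provider is a hypothetical helper, not the real ProviderRegistry.default_provider(), and returning None for a configured default that is not registered is an assumption.

```python
from __future__ import annotations


def resolve_default_provider(
    registered: list[str],
    configured_default: str | None,
) -> str | None:
    """Return the provider system services dispatch on, or None if ambiguous."""
    if configured_default is not None:
        # An explicit operator choice wins, provided it is registered.
        return configured_default if configured_default in registered else None
    if len(registered) == 1:
        return registered[0]     # a sole provider is the default automatically
    # Two or more providers and no explicit choice: deliberately no
    # alphabetical or first-registered fallback.
    return None
```

A None result corresponds to leaving the dependent services unwired, which is visible to an operator, instead of routing them to whichever provider happens to sort first.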

Multi-Provider Model Resolution¶

An agent binds an exclusive (provider, model) pair: ModelConfig requires both a provider and a model_id, and the agent's own model always resolves to that provider, never re-derived across providers. Two gateways speaking the same wire protocol can legitimately advertise an overlapping model id (each live-discovers its own /v1/models), so a bare id can map to more than one provider; the resolver keeps all variants as a candidate tuple rather than raising a collision error, and the binding decides which one an agent uses.

Provider-scoped resolution. ModelResolver.resolve_for_pair(provider, ref) resolves a ref within one provider. Every caller that holds an agent's identity.model.provider (the budget downgrade enforcer, the CFO downgrade / routing optimiser) resolves through it, so an overlapping id never silently moves the agent onto a different provider. The run-time client is resolved from identity.model.provider directly (AgentEngine._dispatch_client_for), so the API called and the CostRecord.provider always match the agent's binding.
No bare-ref auto-resolution. There is no "resolve this model id against whichever provider happens to serve it" path. A model assignment always names its provider: a MODEL_REF setting rejects an unbound (provider-less) value at write-time, and feature builders resolve the ref's explicit provider (or the explicit default system provider), never a first-registered pick. The provider-agnostic tier archetype (example-<tier>-001) a pin records is still vendor-neutral; it is the provider that must be explicit, resolved once at dispatch, never auto-selected across gateways.
Eligibility-first selection. When the config-selected routing strategies run over their explicit provider set, they prefer agent_eligible candidates: a provider kept out of agent work wins only when it is the sole provider for the ref. Stakes routing (models_at_or_above_tier) and agent seeding exclude ineligible providers outright.
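
Provider-scoped resolution over overlapping model ids can be sketched as follows. ModelResolverSketch is a hypothetical stand-in that stores only (provider, model_id) pairs; the real ModelResolver resolves richer model objects, and raising LookupError for an unknown pair is an assumption.

```python
from __future__ import annotations

from collections import defaultdict


class ModelResolverSketch:
    def __init__(self, entries: list[tuple[str, str]]) -> None:
        by_id: dict[str, list[str]] = defaultdict(list)
        for provider, model_id in entries:
            by_id[model_id].append(provider)
        # Overlapping ids are kept as a candidate tuple, not rejected.
        self._providers_by_id = {mid: tuple(ps) for mid, ps in by_id.items()}

    def candidates(self, ref: str) -> tuple[str, ...]:
        """All providers advertising this model id (may be more than one)."""
        return self._providers_by_id.get(ref, ())

    def resolve_for_pair(self, provider: str, ref: str) -> tuple[str, str]:
        """Resolve a ref strictly within one provider."""
        if provider not in self.candidates(ref):
            raise LookupError(f"{provider!r} does not serve {ref!r}")
        return provider, ref
```

Because resolve_for_pair never looks outside the named provider, an agent bound to one gateway cannot be moved to another gateway that happens to advertise the same id.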

Two built-in selectors are provided:

Selector	Behaviour
`QuotaAwareSelector` (default)	Filter to providers with available quota first; within that pool (or all candidates when none have quota), prefer agent-eligible providers, then cheapest
`CheapestSelector`	Prefer agent-eligible providers, then pick the cheapest candidate by total cost per 1k tokens, ignoring quota state

The selector is injected into ModelResolver (and transitively into ModelRouter) at construction time. QuotaAwareSelector is constructed with a snapshot from QuotaTracker.peek_quota_available(), which returns a synchronous dict[str, bool] of per-provider quota availability.
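
The two selectors can be sketched as plain functions over a candidate list. Several details here are assumptions made for the example: Candidate and both function names are hypothetical, total cost is taken as input plus output cost per 1k tokens, ties break on provider name, and a provider missing from the quota snapshot is treated as available.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    provider: str
    model_id: str
    cost_per_1k_input: float
    cost_per_1k_output: float
    agent_eligible: bool = True

    @property
    def total_cost_per_1k(self) -> float:
        return self.cost_per_1k_input + self.cost_per_1k_output


def cheapest_select(candidates: list[Candidate]) -> Candidate:
    """Prefer agent-eligible providers, then the cheapest; ignores quota."""
    return min(
        candidates,
        key=lambda c: (not c.agent_eligible, c.total_cost_per_1k, c.provider),
    )


def quota_aware_select(
    candidates: list[Candidate],
    quota_available: dict[str, bool],
) -> Candidate:
    """Filter to providers with quota first, then apply the same preference.

    quota_available mirrors the dict[str, bool] snapshot described above.
    """
    with_quota = [c for c in candidates if quota_available.get(c.provider, True)]
    # When no candidate has quota, fall back to the full candidate list.
    return cheapest_select(with_quota or candidates)
```

Working from a synchronous snapshot keeps selection free of I/O, at the cost of the quota view being slightly stale within one resolution.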

All routing strategies (smart, cost_aware, fastest, etc.) and the fallback chain automatically use the injected selector when resolving model references, so multi-provider selection is transparent to the strategy layer.