How it works

Understand Prism's request pipeline, plugin architecture, virtual keys, multi-tenancy, and configuration hierarchy.

About

Every request flows through a pipeline of plugins in a fixed order: authentication, caching, budget checks, guardrails, rate limiting, then the provider call, followed by cost tracking and logging. Cache hits skip the provider entirely. Per-org configuration keeps tenants isolated.

The request pipeline

Prism is a proxy that sits between your application and your LLM providers. Every request passes through a chain of plugins before reaching the provider, and the response passes through another chain on the way back.

The plugins run in a fixed priority order. Lower numbers run first:

Pre-request plugins (run before the provider call)

| Priority | Plugin | What it does |
|---|---|---|
| 10 | IP ACL | Blocks requests from denied IP addresses or CIDR ranges |
| 20 | Auth | Validates the virtual API key, identifies the organization |
| 30 | RBAC | Checks role-based permissions (can this key call this model?) |
| 35 | Cache | Checks for an exact or semantic cache match. On a hit, skips everything below and returns instantly. |
| 40 | Budget | Checks org/key/user spend against configured limits |
| 50 | Guardrails | Runs safety checks on the incoming request (PII, injection, blocklist, etc.) |
| 60 | Tool policy | Filters or rejects tool/function calls based on allow/deny lists |
| 70 | Validation | Validates the model name against the model database |
| 80 | Rate limit | Enforces RPM/TPM limits per org, key, user, or model |
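
The priority ordering and short-circuit behaviour can be sketched in a few lines. This is a minimal illustration, not Prism's implementation: the `Request` type, the `make_plugin` helper, and the convention that a plugin returns `None` to continue are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Request:
    api_key: str
    prompt: str
    trace: list = field(default_factory=list)  # names of plugins that ran


# Hypothetical convention: a plugin returns None to continue,
# or a response string to short-circuit the rest of the chain.
Plugin = Callable[[Request], Optional[str]]


def make_plugin(name: str, result: Optional[str] = None) -> Plugin:
    def run(req: Request) -> Optional[str]:
        req.trace.append(name)
        return result
    return run


def run_pre_request(plugins: list, req: Request) -> Optional[str]:
    # Lower priority numbers run first, regardless of registration order.
    for _, plugin in sorted(plugins, key=lambda p: p[0]):
        response = plugin(req)
        if response is not None:
            return response  # short-circuit: later plugins never run
    return None


plugins = [
    (80, make_plugin("rate_limit")),
    (20, make_plugin("auth")),
    (35, make_plugin("cache", result="cached response")),  # simulated hit
    (40, make_plugin("budget")),
    (10, make_plugin("ip_acl")),
]
req = Request(api_key="sk-prism-example", prompt="hello")
result = run_pre_request(plugins, req)
```

Here the simulated cache hit at priority 35 stops the chain, so `req.trace` records only the IP ACL, auth, and cache plugins.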

Provider call

After all pre-request plugins pass, Prism forwards the request to the selected LLM provider. The routing layer picks the provider based on your configured strategy (round-robin, weighted, least-latency, etc.) and handles failover if the primary provider is down.
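
As a rough sketch of failover, the function below tries providers in round-robin order and returns the first successful response. The provider names, the `ProviderDown` exception, and the `send` callback are hypothetical; Prism's actual routing strategies and error handling may differ.

```python
class ProviderDown(Exception):
    """Hypothetical error raised when a provider cannot serve the request."""


def call_with_failover(providers, send, start=0):
    # Walk the provider list once, beginning at `start`, wrapping around.
    errors = []
    for offset in range(len(providers)):
        name = providers[(start + offset) % len(providers)]
        try:
            return name, send(name)
        except ProviderDown as exc:
            errors.append((name, str(exc)))  # remember why and try the next one
    raise RuntimeError(f"all providers failed: {errors}")


def fake_send(name):
    # Stand-in for a real provider call; provider-a is simulated as down.
    if name == "provider-a":
        raise ProviderDown("timeout")
    return f"response from {name}"


used, response = call_with_failover(["provider-a", "provider-b"], fake_send)
```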

Post-response plugins (run after the provider responds)

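Some post-response plugins run sequentially because later ones depend on earlier results (for example, the credits plugin deducts the cost that the cost plugin calculates). The rest run in parallel for performance.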

Sequential (order matters):

| Priority | Plugin | What it does |
|---|---|---|
| 35 | Cache | Writes the fresh response to cache for future requests |
| 40 | Budget | Updates spend counters |
| 80 | Rate limit | Updates rate counters |
| 500 | Cost | Calculates the request cost from token usage and model pricing |
| 510 | Credits | Deducts cost from the key’s credit balance (managed keys only) |

Parallel (independent observers, run concurrently):

| Priority | Plugin | What it does |
|---|---|---|
| 900 | Logging | Buffers the request trace for the control plane |
| 900 | Audit | Emits structured audit events to configured sinks |
| 997 | Alerting | Checks alert rule conditions (error rate, cost, latency) |
| 998 | Prometheus | Increments counters and histograms |
| 999 | OpenTelemetry | Exports a span to your OTLP endpoint |

Note

Post-plugin failures are non-fatal. If logging or metrics fail, the response has already been sent to your application. Errors are logged as warnings but never block the response.
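
One way to picture the two phases is a sequential loop followed by a thread pool, with every failure downgraded to a warning. This is an illustrative sketch only: the plugin functions, the `ctx` dictionary, and the use of integer micro-dollars are assumptions, not Prism's internals.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("post_plugins")


def run_post_response(sequential, parallel, ctx):
    failures = []
    # Phase 1: ordered plugins, each may read what the previous one wrote.
    for name, plugin in sequential:
        try:
            plugin(ctx)
        except Exception as exc:
            log.warning("post-plugin %s failed: %s", name, exc)
            failures.append(name)
    # Phase 2: independent observers run concurrently.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(plugin, ctx) for name, plugin in parallel}
    # Leaving the `with` block waits for all futures to finish.
    for name, future in futures.items():
        exc = future.exception()
        if exc is not None:
            log.warning("post-plugin %s failed: %s", name, exc)
            failures.append(name)
    return failures  # failures are reported, never raised


def cost(ctx):
    ctx["cost_micro_usd"] = ctx["tokens"] * ctx["micro_usd_per_token"]


def credits(ctx):
    ctx["balance_micro_usd"] -= ctx["cost_micro_usd"]  # needs cost to run first


def broken_logging(ctx):
    raise IOError("log sink unavailable")


def metrics(ctx):
    ctx["requests_counted"] = 1


ctx = {"tokens": 1000, "micro_usd_per_token": 2, "balance_micro_usd": 10000}
failures = run_post_response(
    sequential=[("cost", cost), ("credits", credits)],
    parallel=[("logging", broken_logging), ("prometheus", metrics)],
    ctx=ctx,
)
```

The simulated logging failure is collected in `failures`, while cost, credits, and metrics still complete.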


Cache hits and short-circuiting

When the cache plugin finds an exact match at priority 35, it short-circuits the pipeline. The provider is never called, and the cached response is returned immediately.

On an exact cache hit:

  • Budget, guardrails, tool policy, validation, and rate limiting are all skipped
  • Cost and credits are skipped (no tokens were consumed)
  • Logging, audit, metrics, and alerting still run (so cache hits appear in your dashboards)

Semantic cache hits (similar but not identical requests) also short-circuit the provider call. Cost and credits plugins still run on semantic hits, unlike exact hits where they’re skipped entirely.
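
The difference between a miss, an exact hit, and a semantic hit can be summarised as a small decision function. The enum and the dictionary keys below are hypothetical names chosen for this sketch; it encodes only the behaviour described above.

```python
from enum import Enum


class CacheResult(Enum):
    MISS = "miss"
    EXACT_HIT = "exact_hit"
    SEMANTIC_HIT = "semantic_hit"


def post_cache_plan(result: CacheResult) -> dict:
    hit = result is not CacheResult.MISS
    return {
        # Any kind of hit skips the provider call.
        "call_provider": not hit,
        # Only exact hits skip cost and credits.
        "run_cost_and_credits": result is not CacheResult.EXACT_HIT,
        # Logging, audit, metrics, and alerting always run.
        "run_observers": True,
    }


exact = post_cache_plan(CacheResult.EXACT_HIT)
semantic = post_cache_plan(CacheResult.SEMANTIC_HIT)
```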


Virtual API keys

Prism uses virtual keys (prefixed sk-prism-) to authenticate requests. These are not your provider API keys - they’re Prism-specific keys that map to an organization and its configuration.

When a request arrives with a virtual key, Prism:

  1. Validates the key and checks it hasn’t expired or been revoked
  2. Identifies which organization the key belongs to
  3. Loads that organization’s providers, guardrails, routing rules, rate limits, and budgets
  4. Routes the request using the org’s stored provider credentials

Your application never sees or stores raw provider API keys. Rotate a provider key in Prism and every application using that org’s virtual keys picks up the change automatically.
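
The first three steps above might look roughly like this. It is a simplified sketch under stated assumptions: the `VirtualKey` record, the in-memory `keys` and `org_configs` dictionaries, and the `AuthError` exception are hypothetical, and a real gateway would not store raw keys in a plain dictionary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


class AuthError(Exception):
    """Hypothetical error for rejected virtual keys."""


@dataclass
class VirtualKey:
    org_id: str
    expires_at: Optional[datetime] = None
    revoked: bool = False


def resolve_org_config(token, keys, org_configs, now):
    if not token.startswith("sk-prism-"):
        raise AuthError("not a Prism virtual key")
    key = keys.get(token)
    if key is None:
        raise AuthError("unknown key")
    if key.revoked:
        raise AuthError("key revoked")
    if key.expires_at is not None and key.expires_at <= now:
        raise AuthError("key expired")
    # The key maps to an organization, whose config drives the request.
    return org_configs[key.org_id]


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
keys = {
    "sk-prism-live": VirtualKey("org-1"),
    "sk-prism-old": VirtualKey("org-1", expires_at=now - timedelta(days=1)),
}
org_configs = {"org-1": {"providers": ["provider-a"], "rpm_limit": 600}}
config = resolve_org_config("sk-prism-live", keys, org_configs, now)
```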

Each virtual key can have its own restrictions:

  • Model restrictions - limit which models this key can call
  • Provider restrictions - limit which providers this key can use
  • RPM/TPM limits - per-key rate limits (independent of org limits)
  • Expiration date - auto-expires the key
  • Allowed IPs - restrict which IPs can use this key
  • Tool allow/deny lists - control which function calls are permitted
  • Guardrail overrides - change enforcement mode per key
  • BYOK (Bring Your Own Key) - let the caller supply their own provider key
  • Credit balance - managed keys with a USD budget that auto-deducts per request
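
To make two of these restrictions concrete, here is a sketch that checks a model allow list and an IP allow list. The `KeyRestrictions` type and its field names are hypothetical, and treating allowed IPs as CIDR strings is an assumption for the example.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class KeyRestrictions:
    allowed_models: set = field(default_factory=set)  # empty means unrestricted
    allowed_ips: list = field(default_factory=list)   # CIDR strings; empty means unrestricted


def violations(restrictions, model, client_ip):
    problems = []
    if restrictions.allowed_models and model not in restrictions.allowed_models:
        problems.append(f"model {model!r} not allowed for this key")
    if restrictions.allowed_ips:
        addr = ipaddress.ip_address(client_ip)
        networks = [ipaddress.ip_network(cidr) for cidr in restrictions.allowed_ips]
        if not any(addr in net for net in networks):
            problems.append(f"IP {client_ip} not allowed for this key")
    return problems


restrictions = KeyRestrictions(allowed_models={"model-small"}, allowed_ips=["10.0.0.0/8"])
ok = violations(restrictions, "model-small", "10.1.2.3")
blocked = violations(restrictions, "model-large", "192.168.1.5")
```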

Multi-tenancy

Multiple organizations share the same gateway but are completely isolated. Each organization has its own:

  • Providers and their encrypted API keys
  • Guardrails and safety policies
  • Routing rules and strategies
  • Rate limits and budgets
  • Cache namespace
  • Tool policies
  • MCP tool server registrations
  • Audit and alerting configuration

One organization’s configuration never affects another’s.
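
Cache isolation is one place where this is easy to illustrate: if the organization ID is part of every cache key, identical requests from different tenants can never collide. The key format below is an assumption made for this sketch, not Prism's actual scheme.

```python
import hashlib
import json


def cache_key(org_id: str, request: dict) -> str:
    # Canonical JSON so that key order in the request does not matter.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{org_id}:{digest}"  # hypothetical namespace prefix


request = {"model": "model-small", "prompt": "hello"}
key_a = cache_key("org-a", request)
key_b = cache_key("org-b", request)
```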

Common use cases:

  • SaaS products - each customer gets an isolated gateway environment
  • Team separation - track spend and enforce policies per team
  • Staging vs production - different configs on the same gateway
  • Resellers - provision isolated environments for downstream customers

Configuration hierarchy

When a setting is defined in multiple places, the most specific one wins:

Request headers > API key config > Organization config > Global config

For example, if the org sets cache TTL to 5 minutes but a request sends x-prism-cache-ttl: 60, that request uses a 60-second TTL. If a key has a guardrail override that sets PII detection to “log only,” it overrides the org’s “enforce” setting for requests using that key.

This lets you set sensible defaults at the org level and override them for specific keys or individual requests without changing the org config.
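
The precedence rule amounts to checking each layer from most to least specific and taking the first value found. In this sketch the layers are plain dictionaries and the setting name `cache_ttl` is a hypothetical, already-parsed form of the `x-prism-cache-ttl` header.

```python
def resolve_setting(name, request_headers, key_config, org_config, global_config):
    # Most specific layer first: request > key > org > global.
    for layer in (request_headers, key_config, org_config, global_config):
        if name in layer:
            return layer[name]
    raise KeyError(name)


global_config = {"cache_ttl": 600, "pii_mode": "enforce"}
org_config = {"cache_ttl": 300}
key_config = {"pii_mode": "log_only"}
request_headers = {"cache_ttl": 60}

ttl = resolve_setting("cache_ttl", request_headers, key_config, org_config, global_config)
pii = resolve_setting("pii_mode", request_headers, key_config, org_config, global_config)
ttl_without_header = resolve_setting("cache_ttl", {}, key_config, org_config, global_config)
```

With the header present the request uses a 60-second TTL; without it, the org's 300-second value applies.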


Hot-reload and sync

Configuration changes take effect without restarting the gateway.

Control plane sync: Every 15 seconds (configurable), the gateway pulls the latest org configs and API keys from the control plane. Only orgs whose config actually changed (detected via SHA-256 hash comparison) trigger updates. Unchanged orgs are skipped.
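
Change detection by hash comparison can be sketched as follows. Serializing configs as sorted JSON before hashing is an assumption for this example; the function and variable names are hypothetical.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    # Sorted keys give a stable serialization, so equal configs hash equally.
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_orgs(previous_hashes: dict, latest_configs: dict) -> list:
    changed = []
    for org_id, config in latest_configs.items():
        digest = config_hash(config)
        if previous_hashes.get(org_id) != digest:
            changed.append(org_id)
            previous_hashes[org_id] = digest  # remember for the next sync
    return changed


hashes = {}
first = changed_orgs(hashes, {"org-a": {"rpm": 100}, "org-b": {"rpm": 50}})
second = changed_orgs(hashes, {"org-a": {"rpm": 200}, "org-b": {"rpm": 50}})
```

On the second sync only `org-a` is reported, because `org-b` produced the same hash as before.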

What happens on a config change:

  • Provider clients are rebuilt with new credentials
  • Dynamic guardrail configs are refreshed
  • Budget counters are recalculated
  • Cache namespaces are isolated per org, so a change to one org’s cache settings doesn’t affect others

Key revocation: When a key is revoked via the admin API, the revocation is broadcast to all gateway replicas via Redis pub/sub immediately - no waiting for the next 15-second sync.

Model database: The model pricing and capability database is replaced with an atomic pointer swap. Readers never take a lock, and there is no downtime.


Sessions and metadata

Sessions: Group related requests using the x-prism-session-id header. Sessions are for grouping and analytics only. Prism does not maintain conversation state between requests.

Custom metadata: Attach arbitrary key-value pairs using the x-prism-metadata header. Metadata appears in logs and analytics for cost attribution and tracking by team, feature, user, or any custom dimension.
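
A client might build these headers as shown below. The header names come from this page, but encoding the metadata as a JSON object is an assumption made for the sketch; check the header reference for the exact format Prism expects.

```python
import json


def tracking_headers(session_id=None, metadata=None) -> dict:
    headers = {}
    if session_id is not None:
        headers["x-prism-session-id"] = session_id
    if metadata:
        # Assumption: metadata is sent as a compact JSON object.
        headers["x-prism-metadata"] = json.dumps(metadata, separators=(",", ":"))
    return headers


headers = tracking_headers(
    session_id="checkout-42",
    metadata={"team": "payments", "feature": "summaries"},
)
```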


Streaming

For streaming requests, pre-request plugins run normally before the stream starts. The stream then flows directly to your application chunk by chunk. Post-response plugins run after the final chunk, once the full response (including token usage) is available.

Streaming requests bypass the cache entirely - both on read and write. This is because streaming responses arrive in chunks and caching partial streams creates consistency problems.
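
The ordering of chunks and post-response plugins can be modelled with a generator that forwards each chunk immediately and runs a completion callback only after the last one. The `stream_response` function and the `on_complete` callback are hypothetical names for this sketch.

```python
def stream_response(chunks, on_complete):
    collected = []
    for chunk in chunks:
        collected.append(chunk)
        yield chunk  # forwarded to the application right away
    # Runs only after the final chunk has been delivered.
    on_complete("".join(collected))


events = []


def post_plugins(full_text):
    events.append(f"post:{full_text}")


for chunk in stream_response(iter(["Hel", "lo"]), post_plugins):
    events.append(f"chunk:{chunk}")
```

After the loop, `events` shows both chunks delivered before the post-response step ran on the full text.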

