How it works

Understand Prism's request pipeline, plugin architecture, virtual keys, multi-tenancy, and configuration hierarchy.

About

Every request flows through a pipeline of plugins in a fixed order: authentication, caching, budget checks, guardrails, rate limiting, then the provider call, followed by cost tracking and logging. Cache hits skip the provider entirely. Per-org configuration keeps tenants isolated.

The request pipeline

Prism is a proxy that sits between your application and your LLM providers. Every request passes through a chain of plugins before reaching the provider, and the response passes through another chain on the way back.

The plugins run in a fixed priority order. Lower numbers run first:

Pre-request plugins (run before the provider call)

Priority	Plugin	What it does
10	IP ACL	Blocks requests from denied IP addresses or CIDR ranges
20	Auth	Validates the virtual API key, identifies the organization
30	RBAC	Checks role-based permissions (can this key call this model?)
35	Cache	Checks for an exact or semantic cache match. On a hit, skips the remaining pre-request plugins and the provider call, and returns the cached response.
40	Budget	Checks org/key/user spend against configured limits
50	Guardrails	Runs safety checks on the incoming request (PII, injection, blocklist, etc.)
60	Tool policy	Filters or rejects tool/function calls based on allow/deny lists
70	Validation	Validates the model name against the model database
80	Rate limit	Enforces RPM/TPM limits per org, key, user, or model
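
The ordering and short-circuit behavior in the table above can be sketched roughly as follows. This is a minimal illustration, not Prism's implementation: the `Context` and `Plugin` types, the `run_pre_request` function, and the toy cache rule are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Context:
    """Hypothetical per-request state shared by plugins."""
    request: dict
    response: Optional[dict] = None  # a plugin sets this to short-circuit
    trace: list = field(default_factory=list)


@dataclass(order=True)
class Plugin:
    priority: int  # the only field used for ordering
    name: str = field(compare=False)
    run: Callable[[Context], None] = field(compare=False)


def run_pre_request(plugins, ctx):
    """Run plugins in ascending priority; stop once one sets a response."""
    for plugin in sorted(plugins):
        ctx.trace.append(plugin.name)
        plugin.run(ctx)
        if ctx.response is not None:
            break
    return ctx


def noop(ctx):
    pass


def cache_lookup(ctx):
    # Toy rule standing in for a real cache lookup.
    if ctx.request.get("prompt") == "hello":
        ctx.response = {"cached": True, "text": "hi"}


plugins = [
    Plugin(80, "rate_limit", noop),
    Plugin(35, "cache", cache_lookup),
    Plugin(20, "auth", noop),
    Plugin(40, "budget", noop),
    Plugin(10, "ip_acl", noop),
]

hit = run_pre_request(plugins, Context({"prompt": "hello"}))
miss = run_pre_request(plugins, Context({"prompt": "other"}))
```

In this sketch, `hit.trace` stops at the cache plugin, while `miss.trace` continues through budget and rate limiting.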

Provider call

After all pre-request plugins pass, Prism forwards the request to the selected LLM provider. The routing layer picks the provider based on your configured strategy (round-robin, weighted, least-latency, etc.) and handles failover if the primary provider is down.
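
As an illustration of one of the strategies named above, here is a small round-robin router with failover. It is a sketch under assumptions: the provider names, the `ProviderDown` exception, and the `call` callback are hypothetical and do not reflect how Prism's routing layer is actually written.

```python
import itertools


class ProviderDown(Exception):
    """Hypothetical error raised when a provider cannot serve a request."""


def round_robin_with_failover(providers, call):
    """Return a send() function that rotates through providers and falls
    back to the remaining ones, in rotation order, when one fails."""
    counter = itertools.count()

    def send(request):
        start = next(counter) % len(providers)
        # Try the chosen provider first, then the others in rotation order.
        order = providers[start:] + providers[:start]
        last_error = None
        for name in order:
            try:
                return name, call(name, request)
            except ProviderDown as err:
                last_error = err
        raise last_error

    return send


def fake_call(name, request):
    # Simulate an outage of the first provider.
    if name == "provider-a":
        raise ProviderDown(name)
    return f"{name} handled {request}"


send = round_robin_with_failover(["provider-a", "provider-b"], fake_call)
first = send("req-1")   # provider-a is down, so the call fails over
second = send("req-2")  # rotation selects provider-b directly
```

Both calls end up on `provider-b`: the first through failover, the second through normal rotation.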

Post-response plugins (run after the provider responds)

Some post-plugins run sequentially because they depend on each other. The rest run in parallel for performance.

Sequential (order matters):

Priority	Plugin	What it does
35	Cache	Writes the fresh response to cache for future requests
40	Budget	Updates spend counters
80	Rate limit	Updates rate counters
500	Cost	Calculates the request cost from token usage and model pricing
510	Credits	Deducts cost from the key’s credit balance (managed keys only)

Parallel (independent observers, run concurrently):

Priority	Plugin	What it does
900	Logging	Buffers the request trace for the control plane
900	Audit	Emits structured audit events to configured sinks
997	Alerting	Checks alert rule conditions (error rate, cost, latency)
998	Prometheus	Increments counters and histograms
999	OpenTelemetry	Exports a span to your OTLP endpoint

Note

Post-plugin failures are non-fatal. If logging or metrics fail, the response has already been sent to your application. Errors are logged as warnings but never block the response.
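
The two phases and the non-fatal failure handling described above might look something like this in outline. The function and plugin names are hypothetical, the context is a plain dict, and costs are kept in integer micro-USD purely to keep the arithmetic exact; none of this is taken from Prism's source.

```python
import asyncio
import logging

log = logging.getLogger("post_plugins")


async def run_post_response(sequential, parallel, ctx):
    """Run ordered plugins one at a time, then observers concurrently.
    Failures are logged as warnings and never raised to the caller."""
    for _, name, plugin in sorted(sequential, key=lambda p: p[0]):
        try:
            await plugin(ctx)
        except Exception as err:
            log.warning("post-plugin %s failed: %s", name, err)
    # return_exceptions=True keeps one failing observer from cancelling the rest.
    results = await asyncio.gather(
        *(plugin(ctx) for _, _, plugin in parallel),
        return_exceptions=True,
    )
    for (_, name, _), result in zip(parallel, results):
        if isinstance(result, Exception):
            log.warning("post-plugin %s failed: %s", name, result)
    return ctx


async def cost(ctx):
    ctx["cost_micro_usd"] = ctx["tokens"] * ctx["micro_usd_per_token"]


async def credits(ctx):
    # Depends on the cost plugin having run first.
    ctx["credits_micro_usd"] -= ctx["cost_micro_usd"]


async def logging_plugin(ctx):
    ctx["logged"] = True


async def broken_metrics(ctx):
    raise RuntimeError("metrics backend unreachable")


ctx = {"tokens": 1000, "micro_usd_per_token": 2, "credits_micro_usd": 10000}
asyncio.run(run_post_response(
    sequential=[(510, "credits", credits), (500, "cost", cost)],
    parallel=[(900, "logging", logging_plugin), (998, "prometheus", broken_metrics)],
    ctx=ctx,
))
```

The failing metrics plugin produces a warning, but the cost, credits, and logging plugins still complete.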

Cache hits and short-circuiting

When the cache plugin finds an exact match at priority 35, it short-circuits the pipeline. The provider is never called, and the cached response is returned immediately.

On an exact cache hit:

Budget, guardrails, tool policy, validation, and rate limiting are all skipped
Cost and credits are skipped (no tokens were consumed)
Logging, audit, metrics, and alerting still run (so cache hits appear in your dashboards)

Semantic cache hits (similar but not identical requests) also short-circuit the provider call. Cost and credits plugins still run on semantic hits, unlike exact hits where they’re skipped entirely.
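
The difference between the three cache outcomes can be summarized in a few lines of code. This sketch models only the billing and observer plugins discussed in this section; the enum, set names, and helper functions are hypothetical.

```python
from enum import Enum


class CacheResult(Enum):
    MISS = "miss"
    EXACT_HIT = "exact"
    SEMANTIC_HIT = "semantic"


# Only the plugin groups this section talks about are modeled here.
OBSERVERS = {"logging", "audit", "metrics", "alerting"}
BILLING = {"cost", "credits"}


def billing_and_observer_plugins(result):
    """Return which billing and observer plugins run for a cache outcome."""
    if result is CacheResult.EXACT_HIT:
        return set(OBSERVERS)  # no tokens consumed, so billing is skipped
    return OBSERVERS | BILLING  # misses and semantic hits are still billed


def calls_provider(result):
    """Only a cache miss reaches the provider."""
    return result is CacheResult.MISS
```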

Virtual API keys

Prism uses virtual keys (prefixed sk-prism-) to authenticate requests. These are not your provider API keys - they’re Prism-specific keys that map to an organization and its configuration.

When a request arrives with a virtual key, Prism:

Validates the key and checks it hasn’t expired or been revoked
Identifies which organization the key belongs to
Loads that organization’s providers, guardrails, routing rules, rate limits, and budgets
Routes the request using the org’s stored provider credentials

Your application never sees or stores raw provider API keys. Rotate a provider key in Prism and every application using that org’s virtual keys picks up the change automatically.
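
The lookup steps above could be sketched like this. The data model is an assumption made for illustration: `VirtualKey`, `AuthError`, `resolve`, and the in-memory dictionaries are hypothetical, and a real gateway would likely store hashed keys rather than raw strings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class VirtualKey:
    """Hypothetical stored record for one virtual key."""
    org_id: str
    expires_at: Optional[datetime] = None
    revoked: bool = False


class AuthError(Exception):
    pass


def resolve(key, keys, org_configs, now=None):
    """Validate a virtual key and return (org_id, org_config)."""
    now = now or datetime.now(timezone.utc)
    if not key.startswith("sk-prism-"):
        raise AuthError("not a Prism virtual key")
    record = keys.get(key)
    if record is None or record.revoked:
        raise AuthError("unknown or revoked key")
    if record.expires_at is not None and record.expires_at <= now:
        raise AuthError("key expired")
    # The org config carries the provider credentials; the caller never does.
    return record.org_id, org_configs[record.org_id]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
keys = {
    "sk-prism-live": VirtualKey("org-1", expires_at=now + timedelta(days=30)),
    "sk-prism-old": VirtualKey("org-1", expires_at=now - timedelta(days=1)),
    "sk-prism-revoked": VirtualKey("org-2", revoked=True),
}
org_configs = {"org-1": {"providers": ["provider-a"]}, "org-2": {"providers": []}}

org_id, config = resolve("sk-prism-live", keys, org_configs, now=now)
```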

Each virtual key can have its own restrictions:

Model restrictions - limit which models this key can call
Provider restrictions - limit which providers this key can use
RPM/TPM limits - per-key rate limits (independent of org limits)
Expiration date - auto-expires the key
Allowed IPs - restrict which IPs can use this key
Tool allow/deny lists - control which function calls are permitted
Guardrail overrides - change enforcement mode per key
BYOK (Bring Your Own Key) - let the caller supply their own provider key
Credit balance - managed keys with a USD budget that auto-deducts per request

Multi-tenancy

Multiple organizations share the same gateway but are completely isolated. Each organization has its own:

Providers and their encrypted API keys
Guardrails and safety policies
Routing rules and strategies
Rate limits and budgets
Cache namespace
Tool policies
MCP tool server registrations
Audit and alerting configuration

One organization’s configuration never affects another’s.
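
One common way to get this kind of isolation for a cache is to prefix every cache key with the organization ID. The sketch below assumes that approach; the key format and the `cache_key` helper are hypothetical, and Prism's actual namespacing scheme is not documented here.

```python
import hashlib
import json


def cache_key(org_id, request):
    """Build an exact-match cache key scoped to one organization."""
    # Canonical JSON so that key order in the request does not matter.
    body = json.dumps(request, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return f"{org_id}:{digest}"


request = {"model": "example-model", "prompt": "hello"}
reordered = {"prompt": "hello", "model": "example-model"}

key_a = cache_key("org-a", request)
key_a_again = cache_key("org-a", reordered)
key_b = cache_key("org-b", request)
```

Identical requests from two organizations produce different keys, so one tenant can never read another tenant's cached response.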

Common use cases:

SaaS products - each customer gets an isolated gateway environment
Team separation - track spend and enforce policies per team
Staging vs production - different configs on the same gateway
Resellers - provision isolated environments for downstream customers

Configuration hierarchy

When a setting is defined in multiple places, the most specific one wins:

Request headers > API key config > Organization config > Global config

For example, if the org sets cache TTL to 5 minutes but a request sends x-prism-cache-ttl: 60, that request uses a 60-second TTL. If a key has a guardrail override that sets PII detection to “log only,” it overrides the org’s “enforce” setting for requests using that key.

This lets you set sensible defaults at the org level and override them for specific keys or individual requests without changing the org config.
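
A "most specific wins" lookup maps naturally onto Python's `collections.ChainMap`, which returns the value from the first mapping that contains a key. The sketch below is illustrative only: the setting names `cache_ttl_seconds` and `pii_mode` are hypothetical, and only the `x-prism-cache-ttl` header comes from this page.

```python
from collections import ChainMap


def effective_config(headers, key_config, org_config, global_config):
    """Layer settings so that earlier (more specific) layers win."""
    request_layer = {}
    if "x-prism-cache-ttl" in headers:
        request_layer["cache_ttl_seconds"] = int(headers["x-prism-cache-ttl"])
    return ChainMap(request_layer, key_config, org_config, global_config)


global_config = {"cache_ttl_seconds": 600, "pii_mode": "enforce"}
org_config = {"cache_ttl_seconds": 300}  # org default: 5 minutes
key_config = {"pii_mode": "log_only"}    # per-key guardrail override

with_header = effective_config(
    {"x-prism-cache-ttl": "60"}, key_config, org_config, global_config
)
without_header = effective_config({}, key_config, org_config, global_config)
```

With the header present the TTL resolves to 60 seconds; without it, the org's 300-second default applies. In both cases the key-level `pii_mode` override takes precedence over the global value.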

Hot-reload and sync

Configuration changes take effect without restarting the gateway.

Control plane sync: Every 15 seconds (configurable), the gateway pulls the latest org configs and API keys from the control plane. Only orgs whose config actually changed (detected via SHA-256 hash comparison) trigger updates. Unchanged orgs are skipped.
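
Hash-based change detection of this kind can be sketched as follows. The canonical-JSON serialization and the function names are assumptions for illustration; the page states only that SHA-256 hashes are compared.

```python
import hashlib
import json


def config_hash(config):
    """SHA-256 over a canonical JSON form of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_orgs(latest, seen_hashes):
    """Return org IDs whose config hash differs from the last sync,
    updating seen_hashes in place."""
    changed = []
    for org_id, config in latest.items():
        digest = config_hash(config)
        if seen_hashes.get(org_id) != digest:
            seen_hashes[org_id] = digest
            changed.append(org_id)
    return changed


seen = {}
first_sync = changed_orgs({"org-a": {"rpm": 100}, "org-b": {"rpm": 50}}, seen)
second_sync = changed_orgs({"org-a": {"rpm": 100}, "org-b": {"rpm": 75}}, seen)
third_sync = changed_orgs({"org-a": {"rpm": 100}, "org-b": {"rpm": 75}}, seen)
```

The first sync reports both orgs, the second reports only `org-b`, and the third reports nothing because no hash changed.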

What happens on a config change:

Provider clients are rebuilt with new credentials
Dynamic guardrail configs are refreshed
Budget counters are recalculated
Cache namespaces are isolated per org, so one org’s cache change doesn’t affect others

Key revocation: When a key is revoked via the admin API, the revocation is broadcast to all gateway replicas via Redis pub/sub immediately - no waiting for the next 15-second sync.
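
The fan-out pattern behind this can be illustrated with an in-process stand-in for the pub/sub channel. This sketch does not use Redis and is not Prism's code; `RevocationBus` and `GatewayReplica` are hypothetical classes that only show why every replica learns about a revocation without waiting for a sync.

```python
class RevocationBus:
    """In-process stand-in for a pub/sub channel."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, key_id):
        # Deliver the message to every subscriber.
        for callback in self._subscribers:
            callback(key_id)


class GatewayReplica:
    """Hypothetical replica that keeps a local set of revoked key IDs."""

    def __init__(self, name, bus):
        self.name = name
        self.revoked = set()
        bus.subscribe(self.revoked.add)

    def is_allowed(self, key_id):
        return key_id not in self.revoked


bus = RevocationBus()
replicas = [GatewayReplica(f"replica-{i}", bus) for i in range(3)]

allowed_before = [r.is_allowed("key-123") for r in replicas]
bus.publish("key-123")  # the admin API revokes the key
allowed_after = [r.is_allowed("key-123") for r in replicas]
```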

Model database: The model pricing and capability database is replaced by swapping an atomic pointer to the new version. Readers never take a lock, and there is no downtime.

Sessions and metadata

Sessions: Group related requests using the x-prism-session-id header. Sessions are for grouping and analytics only. Prism does not maintain conversation state between requests.

Custom metadata: Attach arbitrary key-value pairs using the x-prism-metadata header. Metadata appears in logs and analytics for cost attribution and tracking by team, feature, user, or any custom dimension.
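
A client might build these two headers as shown below. The header names come from this page, but the encoding of the metadata value is an assumption: this sketch serializes it as compact JSON, which may not match the format Prism expects. The `prism_tracking_headers` helper is hypothetical.

```python
import json


def prism_tracking_headers(session_id=None, metadata=None):
    """Build optional session and metadata headers for a request."""
    headers = {}
    if session_id is not None:
        headers["x-prism-session-id"] = session_id
    if metadata:
        # Assumption: metadata is sent as a compact JSON object.
        headers["x-prism-metadata"] = json.dumps(
            metadata, sort_keys=True, separators=(",", ":")
        )
    return headers


headers = prism_tracking_headers(
    session_id="checkout-flow-42",
    metadata={"team": "payments", "feature": "refund-bot"},
)
```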

Streaming

For streaming requests, pre-request plugins run normally before the stream starts. The stream then flows directly to your application chunk by chunk. Post-response plugins run after the final chunk, once the full response (including token usage) is available.
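
The pass-through behavior can be sketched with a generator that forwards each chunk as it arrives and fires a completion callback only after the last one. The chunk shape (`text` and `usage` fields) and the `stream_through` function are hypothetical and chosen for illustration.

```python
def stream_through(chunks, on_complete):
    """Yield provider chunks as they arrive, then hand the assembled
    response to on_complete (standing in for the post-response plugins)."""
    collected = []
    usage = None
    for chunk in chunks:
        collected.append(chunk.get("text", ""))
        usage = chunk.get("usage", usage)  # usage typically arrives last
        yield chunk
    on_complete({"text": "".join(collected), "usage": usage})


provider_chunks = [
    {"text": "Hel"},
    {"text": "lo"},
    {"text": "", "usage": {"total_tokens": 7}},
]

completed = []
received = list(stream_through(iter(provider_chunks), completed.append))
```

The application receives every chunk unchanged, and the callback sees the full text and token usage exactly once, after the stream ends.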

Streaming requests bypass the cache entirely, for both reads and writes. Streaming responses arrive in chunks, and caching partial streams creates consistency problems.

Next Steps

Quickstart

Configuration

Supported providers

Guardrails
