# Rate Limiting
Control request throughput to the Prism AI Gateway with configurable rate limits.
## About
Rate limiting protects your gateway from traffic spikes and controls API consumption. Prism enforces rate limits at the gateway level, returning 429 responses when thresholds are exceeded, and includes `X-Ratelimit-*` headers on responses so clients know how much quota remains and when to retry.
## Configuration

Rate limiting is configured in your `config.yaml` file at the gateway level:
rate_limiting:
  enabled: true
  global_rpm: 1000  # Maximum requests per minute (0 = unlimited)
The global_rpm setting applies to all incoming requests through this gateway instance. Set it to 0 for unlimited requests, or specify a positive integer to enforce a per-minute ceiling.
> **Note**: Rate limiting is enforced globally across all requests to the gateway. Per-user, per-key, or per-model rate limiting is not currently supported.
## Response headers
When rate limiting is enabled, every response includes headers that tell you about your current quota:
| Header | Description |
|---|---|
| `X-Ratelimit-Limit-Requests` | The maximum number of requests allowed per minute |
| `X-Ratelimit-Remaining-Requests` | How many requests you have left in the current minute |
| `X-Ratelimit-Reset-Requests` | Unix timestamp (seconds) when the rate limit window resets |
Use these headers to monitor your usage and back off before hitting the limit.
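As a sketch of that back-off strategy, the helper below assumes only the header names documented above; the `threshold` parameter (how few remaining requests triggers a pause) is our own illustrative addition. It works with any HTTP client that exposes response headers as a dict.

```python
import time

def backoff_seconds(headers: dict, threshold: int = 1) -> float:
    """Given X-Ratelimit-* response headers, return how long to sleep
    before the next request (0.0 while quota remains above `threshold`)."""
    remaining = int(headers.get("X-Ratelimit-Remaining-Requests", 1))
    reset_at = int(headers.get("X-Ratelimit-Reset-Requests", 0))
    if remaining > threshold:
        return 0.0                                # plenty of quota left
    return max(0.0, reset_at - time.time())       # wait for the window reset

# Example: quota exhausted, window resets 30 seconds from now
headers = {
    "X-Ratelimit-Remaining-Requests": "0",
    "X-Ratelimit-Reset-Requests": str(int(time.time()) + 30),
}
time_to_wait = backoff_seconds(headers)
```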
## Error response (429)
When you exceed the rate limit, the gateway returns a 429 status code with this error body:
{
  "error": {
    "type": "rate_limit_exceeded",
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Please retry after the window resets."
  }
}
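A client can detect this case defensively by checking both the status code and the documented `code` field. This minimal sketch assumes only the 429 status and error body shown above:

```python
import json

def is_rate_limit_error(status: int, body: str) -> bool:
    """Return True when a response matches the documented 429 error shape."""
    if status != 429:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False                  # not the documented JSON error body
    return payload.get("error", {}).get("code") == "rate_limit_exceeded"

body = '{"error": {"type": "rate_limit_exceeded", "code": "rate_limit_exceeded", "message": "Rate limit exceeded. Please retry after the window resets."}}'
print(is_rate_limit_error(429, body))  # True
```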
## Reading rate limit headers
Here’s how to inspect rate limit headers using cURL:
curl -i -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
The `-i` flag includes response headers in the output. Look for the `X-Ratelimit-*` headers in the response.
## Handling rate limit errors
When you receive a 429 response, wait until the `X-Ratelimit-Reset-Requests` timestamp before retrying. Here's how to implement retry logic in your application, first in Python and then in TypeScript:
import time

from prism import Prism
from prism._exceptions import PrismError

client = Prism(
    api_key="sk-prism-...",
    base_url="https://gateway.futureagi.com",
)

def call_with_retry(max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": "Hello"}],
            )
        except PrismError as e:
            if "rate_limit" in str(e).lower() and attempt < max_retries - 1:
                # Exponential backoff: 1s, 2s, 4s
                time.sleep(2 ** attempt)
                continue
            raise

result = call_with_retry()
print(result)

import { Prism, RateLimitError } from "@futureagi/prism";
const client = new Prism({
  apiKey: "sk-prism-...",
  baseUrl: "https://gateway.futureagi.com",
});

async function callWithRetry(maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.chat.completions.create({
        model: "gpt-4o",
        messages: [{ role: "user", content: "Hello" }],
      });
    } catch (error) {
      if (error instanceof RateLimitError && attempt < maxRetries - 1) {
        // Exponential backoff: 1s, 2s, 4s
        await new Promise((resolve) =>
          setTimeout(resolve, Math.pow(2, attempt) * 1000)
        );
        continue;
      }
      throw error;
    }
  }
}

const result = await callWithRetry();
console.log(result);

> **Tip**: Use exponential backoff when retrying. Start with a short delay (1 second) and double it with each retry. This prevents overwhelming the gateway when multiple clients hit the limit simultaneously.
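The doubling schedule from the examples above can be sketched as a standalone helper. The random jitter term is our own addition (not a gateway requirement); it spreads out retries from many clients that were all rate limited at the same moment.

```python
import random

def backoff_delays(max_retries: int = 3, base: float = 1.0, jitter: float = 0.1):
    """Delays per the tip: start at `base` seconds, double each retry,
    plus a small random jitter to de-synchronize concurrent clients."""
    return [
        base * (2 ** attempt) + random.uniform(0, jitter)
        for attempt in range(max_retries)
    ]

# With jitter disabled, this reproduces the 1s, 2s, 4s schedule above
print(backoff_delays(jitter=0.0))  # [1.0, 2.0, 4.0]
```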