Rate Limiting

Control request throughput to the Prism AI Gateway with configurable rate limits.

About

Rate limiting protects your gateway from traffic spikes and controls API consumption. Prism enforces rate limits at the gateway level, returning 429 responses when thresholds are exceeded. While rate limiting is enabled, every response includes X-Ratelimit-* headers so clients can track their remaining quota and know when to retry.

Configuration

Rate limiting is configured in your config.yaml file at the gateway level:

rate_limiting:
  enabled: true
  global_rpm: 1000    # Maximum requests per minute (0 = unlimited)

The global_rpm setting applies to all incoming requests through this gateway instance. Set it to 0 for unlimited requests, or specify a positive integer to enforce a per-minute ceiling.
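For example, an explicitly unlimited configuration looks like this (equivalent to no ceiling, per the 0-means-unlimited rule above):

```yaml
rate_limiting:
  enabled: true
  global_rpm: 0    # 0 disables the ceiling; requests are never rejected with 429
```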

Note

Rate limiting is enforced globally across all requests to the gateway. Per-user, per-key, or per-model rate limiting is not currently supported.

Response headers

When rate limiting is enabled, every response includes headers that tell you about your current quota:

Header | Description
X-Ratelimit-Limit-Requests | The maximum number of requests allowed per minute
X-Ratelimit-Remaining-Requests | The number of requests remaining in the current minute
X-Ratelimit-Reset-Requests | Unix timestamp (seconds) at which the rate limit window resets

Use these headers to monitor your usage and back off before hitting the limit.
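Translating those headers into a wait time can be done with a small helper. Here is a sketch (the header names are the ones documented above; the pacing logic itself is an assumption about how you might want to throttle clients, not part of the gateway):

```python
import time

def seconds_until_reset(headers, now=None):
    """Return how long to pause before sending the next request.

    Returns 0.0 while requests remain in the current window; otherwise
    returns the seconds left until X-Ratelimit-Reset-Requests.
    """
    remaining = int(headers.get("X-Ratelimit-Remaining-Requests", "1"))
    reset = int(headers.get("X-Ratelimit-Reset-Requests", "0"))
    if now is None:
        now = time.time()
    if remaining > 0:
        return 0.0
    return max(0.0, reset - now)
```

Call this after each response and sleep for the returned duration before the next request.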

Error response (429)

When you exceed the rate limit, the gateway returns a 429 status code with this error body:

{
  "error": {
    "type": "rate_limit_exceeded",
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Please retry after the window resets."
  }
}
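If your HTTP client surfaces the raw error body, you can classify it against the shape above. A minimal sketch (the field names match the body shown; the helper itself is illustrative, not part of the Prism SDK):

```python
import json

def is_rate_limit_error(body):
    """Return True if an error body matches Prism's rate_limit_exceeded shape."""
    try:
        error = json.loads(body).get("error", {})
    except (ValueError, AttributeError):
        return False
    return error.get("code") == "rate_limit_exceeded"
```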

Reading rate limit headers

Here’s how to inspect rate limit headers using cURL:

curl -i -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

The -i flag includes response headers in the output. Look for the X-Ratelimit-* headers in the response.
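If you're scripting around cURL, the X-Ratelimit-* headers can be pulled out of the raw header block programmatically. A small sketch (the header names are as documented above; the sample values in the example are made up):

```python
def parse_ratelimit_headers(raw_headers):
    """Extract X-Ratelimit-* headers from a raw HTTP header block into a dict."""
    limits = {}
    for line in raw_headers.splitlines():
        if line.lower().startswith("x-ratelimit-"):
            name, _, value = line.partition(":")
            limits[name.strip()] = value.strip()
    return limits
```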

Handling rate limit errors

When you receive a 429 response, wait until the X-Ratelimit-Reset-Requests timestamp before retrying. Here’s how to implement retry logic in your application:

import time
from prism import Prism
from prism._exceptions import PrismError

client = Prism(
    api_key="sk-prism-...",
    base_url="https://gateway.futureagi.com"
)

def call_with_retry(max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": "Hello"}],
            )
        except PrismError as e:
            if "rate_limit" in str(e).lower() and attempt < max_retries - 1:
                # Exponential backoff: 1s, 2s, 4s
                time.sleep(2 ** attempt)
                continue
            raise

result = call_with_retry()
print(result)

The same retry pattern in TypeScript:

import { Prism, RateLimitError } from "@futureagi/prism";

const client = new Prism({
  apiKey: "sk-prism-...",
  baseUrl: "https://gateway.futureagi.com",
});

async function callWithRetry(maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.chat.completions.create({
        model: "gpt-4o",
        messages: [{ role: "user", content: "Hello" }],
      });
    } catch (error) {
      if (error instanceof RateLimitError && attempt < maxRetries - 1) {
        // Exponential backoff: 1s, 2s, 4s
        await new Promise((resolve) =>
          setTimeout(resolve, Math.pow(2, attempt) * 1000)
        );
        continue;
      }
      throw error;
    }
  }
}

const result = await callWithRetry();
console.log(result);

Tip

Use exponential backoff when retrying. Start with a short delay (1 second) and double it with each retry. This prevents overwhelming the gateway when multiple clients hit the limit simultaneously.
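The retry loops above use plain doubling; adding jitter spreads simultaneous retries out further. A sketch of full-jitter backoff (base and cap are illustrative parameters here, not Prism settings):

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Full-jitter exponential backoff: uniform delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Replace the fixed `time.sleep(2 ** attempt)` call with `time.sleep(backoff_delay(attempt))` to get the same exponential growth with randomized spacing.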
