Chat completions
The primary endpoint for generating text with LLMs through Prism. Supports streaming, function calling, vision, and structured outputs.
About
POST /v1/chat/completions is the main endpoint. It works exactly like the OpenAI API — same request body, same response format. Prism adds routing, caching, guardrails, and cost tracking transparently, and supports streaming via SSE.
Basic usage
Python (Prism SDK):

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)
```

Python (OpenAI SDK):

```python
from openai import OpenAI

# Same OpenAI SDK, just swap base_url and api_key
client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)
```

Python (LiteLLM):

```python
import litellm

response = litellm.completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
)
print(response.choices[0].message.content)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Request body
All standard OpenAI chat completion parameters are supported:
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Required. The model to use (e.g., `gpt-4o`, `claude-sonnet-4-6`). |
| `messages` | array | Required. The conversation messages. See Message format below. |
| `temperature` | number | Sampling temperature (0-2). |
| `top_p` | number | Nucleus sampling (0-1). |
| `n` | integer | Number of completions to generate. |
| `stream` | boolean | Enable SSE streaming. See Streaming. |
| `stream_options` | object | `{"include_usage": true}` to get token counts in the final chunk. |
| `stop` | string or array | Stop sequences. |
| `max_tokens` | integer | Maximum tokens to generate. |
| `max_completion_tokens` | integer | Max tokens for o1/o3-style models. |
| `presence_penalty` | number | Penalize repeated topics (-2 to 2). |
| `frequency_penalty` | number | Penalize repeated tokens (-2 to 2). |
| `logit_bias` | object | Token ID to bias value mapping. |
| `logprobs` | boolean | Return log probabilities. |
| `top_logprobs` | integer | Number of top log probs per token (0-20). |
| `user` | string | End-user ID for tracking and rate limiting. |
| `seed` | integer | Seed for reproducible outputs. |
| `tools` | array | Function definitions for tool/function calling. |
| `tool_choice` | string or object | `"auto"`, `"none"`, `"required"`, or a specific tool. |
| `response_format` | object | `{"type": "json_object"}` or `{"type": "json_schema", "json_schema": {...}}`. |
| `modalities` | array | Output modalities, e.g., `["text", "audio"]`. |
| `audio` | object | Audio output config: `{"voice": "alloy", "format": "wav"}`. |
Tip
Prism passes through unknown fields to the provider. Provider-specific parameters (like Anthropic’s thinking or any vendor extension) work without Prism needing to know about them.
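Because unknown fields pass through untouched, a provider extension can be placed directly in the request body with no Prism-specific syntax. The sketch below serializes a request carrying Anthropic's extended-thinking parameter; the exact field values are illustrative, not a guaranteed contract.

```python
import json

# Prism forwards fields it does not recognize, so a vendor extension can sit
# alongside the standard parameters. "thinking" follows Anthropic's
# extended-thinking shape; treat the values as illustrative.
payload = {
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "max_tokens": 2048,
    "thinking": {"type": "enabled", "budget_tokens": 1024},  # vendor-specific field
}

body = json.dumps(payload)
# The extension survives serialization verbatim and reaches the provider as-is.
print("thinking" in body)  # True
```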
Response body
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711000000,
  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}
```
| Field | Description |
|---|---|
| `choices[].finish_reason` | `"stop"` (natural end), `"length"` (hit max tokens), `"tool_calls"` (model wants to call a function), `"content_filter"` (blocked by provider) |
| `usage` | Token counts. Always present on non-streaming responses. |
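The `finish_reason` value usually drives the client's next step. A minimal dispatcher (the action names are illustrative, not part of the API) might look like:

```python
def next_action(finish_reason: str) -> str:
    """Map an OpenAI-style finish_reason to what the client should do next."""
    if finish_reason == "stop":
        return "done"        # natural end: read message.content
    if finish_reason == "length":
        return "truncated"   # raise max_tokens or continue the generation
    if finish_reason == "tool_calls":
        return "run_tools"   # execute message.tool_calls, then call the API again
    if finish_reason == "content_filter":
        return "blocked"     # provider refused; surface an error to the user
    return "unknown"

print(next_action("tool_calls"))  # run_tools
```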
Streaming
Set stream: true to receive the response as Server-Sent Events (SSE). Each chunk arrives as a data: line:
```text
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}
...
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":25,"completion_tokens":8,"total_tokens":33}}
data: [DONE]
```
The final chunk before [DONE] includes usage with token counts. Prism forces stream_options.include_usage = true on every streaming request so that cost tracking and credit deduction work correctly.
Python (Prism SDK):

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Python (OpenAI SDK):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Python (LiteLLM):

```python
import litellm

response = litellm.completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Write a haiku about coding"}],
    "stream": true
  }'
```

Streaming behavior
- Pre-request plugins (guardrails, rate limiting, etc.) run before the stream starts. If a guardrail blocks the request, you get a JSON error response, not a stream.
- Post-response plugins (cost, logging, metrics) run after the final chunk, once token usage is known.
- Cache: Streaming requests bypass the cache entirely, both on read and write.
- Failover: Not supported mid-stream. If the provider fails after streaming starts, the error appears as an SSE data event.
- Client disconnect: Post-plugins still run even if you disconnect early, so cost tracking stays accurate.
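If you consume the raw SSE stream without an SDK, the framing above takes only a few lines of parsing. This sketch processes a captured transcript rather than a live connection, so the chunk payloads here are abbreviated samples:

```python
import json

def parse_sse(lines):
    """Yield parsed chunk objects from 'data: ...' SSE lines, stopping at [DONE]."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Abbreviated sample transcript; a real stream carries full chunk objects.
transcript = [
    'data: {"choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"total_tokens":33}}',
    "data: [DONE]",
]

text = "".join(
    chunk["choices"][0]["delta"].get("content", "") for chunk in parse_sse(transcript)
)
print(text)  # The capital
```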
Function calling
Define tools in the request, and the model can choose to call them. The response will have finish_reason: "tool_calls" with the function name and arguments.
Python (Prism SDK):

```python
import json
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# First call: model decides to call a tool
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

if response.choices[0].finish_reason == "tool_calls":
    # Add the assistant's tool call to the conversation
    messages.append(response.choices[0].message)
    # Execute each tool call and add the result
    for tool_call in response.choices[0].message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = {"temperature": "22°C", "condition": "Sunny"}  # your function here
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Second call: model uses the tool result to respond
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)
```

Python (OpenAI SDK):

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# First call: model decides to call a tool
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

if response.choices[0].finish_reason == "tool_calls":
    messages.append(response.choices[0].message)
    for tool_call in response.choices[0].message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = {"temperature": "22°C", "condition": "Sunny"}  # your function here
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Second call: model uses the tool result to respond
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)
```

Python (LiteLLM):

```python
import json
import litellm

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

response = litellm.completion(
    model="openai/gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
)

if response.choices[0].finish_reason == "tool_calls":
    messages.append(response.choices[0].message)
    for tool_call in response.choices[0].message.tool_calls:
        result = {"temperature": "22°C", "condition": "Sunny"}
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    final = litellm.completion(
        model="openai/gpt-4o",
        messages=messages,
        tools=tools,
        api_key="sk-prism-your-key",
        base_url="https://gateway.futureagi.com/v1",
    )
    print(final.choices[0].message.content)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What'\''s the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```

Prism passes tools through to the provider without modification. All providers that support function calling (OpenAI, Anthropic, Gemini, etc.) work with the same tool definitions.
Vision (multimodal inputs)
Send images alongside text by using the content array format:
Python (Prism SDK):

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Python (LiteLLM):

```python
import litellm

response = litellm.completion(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
)
print(response.choices[0].message.content)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
      ]
    }]
  }'
```

Note
Not all models support vision. Use a model with image understanding capabilities (gpt-4o, claude-sonnet-4-6, gemini-2.0-flash, etc.).
Both HTTPS URLs and base64 data URIs (data:image/png;base64,...) are supported. Prism translates the content format to each provider’s native representation (Anthropic base64 blocks, Gemini inline parts, Bedrock image blocks).
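For local files, the base64 data URI is built by encoding the raw bytes and prefixing the MIME type. A minimal helper, sketched with placeholder bytes in place of a real image file:

```python
import base64

def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URI usable in an image_url content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# In practice the bytes come from open("photo.png", "rb").read();
# a short placeholder (the PNG magic number) stands in here.
uri = to_data_uri(b"\x89PNG\r\n\x1a\n")
content_part = {"type": "image_url", "image_url": {"url": uri}}
print(uri.startswith("data:image/png;base64,"))  # True
```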
Structured outputs
Force the model to return valid JSON matching a schema:
```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "List 3 European capitals"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capitals",
            "schema": {
                "type": "object",
                "properties": {
                    "capitals": {
                        "type": "array",
                        "items": {"type": "string"},
                    }
                },
                "required": ["capitals"],
            },
        },
    },
)
```
Prism forwards response_format to the provider as-is. The provider handles constrained decoding. Use "type": "json_object" for simpler JSON without a schema.
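The message content comes back as a JSON string that should match the schema; parsing and a quick sanity check are still worthwhile on the client side. A sample string stands in for a live response here:

```python
import json

# message.content from a json_schema response is a JSON string;
# this sample mirrors what the request above would return.
content = '{"capitals": ["Paris", "Berlin", "Madrid"]}'

data = json.loads(content)
assert isinstance(data["capitals"], list)
assert all(isinstance(c, str) for c in data["capitals"])
print(data["capitals"][0])  # Paris
```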
Message format
Each message in the messages array has:
| Field | Type | Description |
|---|---|---|
| `role` | string | `"system"`, `"user"`, `"assistant"`, or `"tool"` |
| `content` | string or array | Text string, or array of content parts for multimodal inputs |
| `name` | string | Optional sender name |
| `tool_calls` | array | Tool calls made by the assistant (on assistant messages) |
| `tool_call_id` | string | ID of the tool call this message responds to (on tool messages) |
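Put together, one complete tool-calling exchange exercises every role in the table. The IDs and function name below are illustrative:

```python
# A full conversation showing each message role and its role-specific fields.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {
        "role": "assistant",
        "content": None,  # content is null when the assistant only calls tools
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'},
        }],
    },
    # The tool message answers a specific tool call via tool_call_id.
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temperature": "22C"}'},
]

print([m["role"] for m in conversation])  # ['system', 'user', 'assistant', 'tool']
```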
Response headers
Prism adds these headers to every response (streaming and non-streaming):
| Header | Description |
|---|---|
x-prism-request-id | Unique request ID for log correlation |
x-prism-provider | Which provider handled the request (e.g., openai) |
x-prism-latency-ms | Total latency in milliseconds |
x-prism-model-used | Actual model returned by the provider |
x-prism-cost | Estimated cost in USD |
x-prism-cache | hit or miss |
x-prism-guardrail-triggered | true if a guardrail fired |
x-prism-fallback-used | true if a fallback provider or model was used |
x-prism-routing-strategy | Which routing strategy was applied |
x-prism-credits-remaining | Remaining credit balance (managed keys) |
x-ratelimit-limit-requests | Rate limit ceiling |
x-ratelimit-remaining-requests | Remaining requests in current window |
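These headers are readable from any HTTP client's response object. The helper below collects them from a plain headers mapping, shown with sample values rather than a live response:

```python
def prism_metadata(headers: dict) -> dict:
    """Collect Prism's x-prism-* diagnostic headers into one dict."""
    return {
        k.removeprefix("x-prism-"): v
        for k, v in headers.items()
        if k.lower().startswith("x-prism-")
    }

# Sample headers; real ones come from e.g. response.headers in requests/httpx.
sample = {
    "content-type": "application/json",
    "x-prism-request-id": "req_123",
    "x-prism-provider": "openai",
    "x-prism-cost": "0.0031",
}
print(prism_metadata(sample)["provider"])  # openai
```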
Switching providers
Change the model name to route to a different provider. The request format stays identical:
```python
# OpenAI
response = client.chat.completions.create(model="gpt-4o", messages=messages)

# Anthropic
response = client.chat.completions.create(model="claude-sonnet-4-6", messages=messages)

# Gemini
response = client.chat.completions.create(model="gemini-2.0-flash", messages=messages)
```
Prism translates the request to each provider’s native format. Your code doesn’t change.