Chat completions

The primary endpoint for generating text with LLMs through Prism. Supports streaming, function calling, vision, and structured outputs.

About

POST /v1/chat/completions is the main endpoint. It works exactly like the OpenAI API — same request body, same response format. Prism adds routing, caching, guardrails, and cost tracking transparently, and supports streaming via SSE.

Basic usage

Prism SDK:

from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)

print(response.choices[0].message.content)

OpenAI SDK:

from openai import OpenAI

# Same OpenAI SDK, just swap base_url and api_key
client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)

print(response.choices[0].message.content)

LiteLLM:

import litellm

response = litellm.completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
)

print(response.choices[0].message.content)

cURL:

curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

Request body

All standard OpenAI chat completion parameters are supported:

Parameter | Type | Description
--- | --- | ---
model | string | Required. The model to use (e.g., gpt-4o, claude-sonnet-4-6).
messages | array | Required. The conversation messages. See Message format below.
temperature | number | Sampling temperature (0-2).
top_p | number | Nucleus sampling (0-1).
n | integer | Number of completions to generate.
stream | boolean | Enable SSE streaming. See Streaming.
stream_options | object | {include_usage: true} to get token counts in the final chunk.
stop | string or array | Stop sequences.
max_tokens | integer | Maximum tokens to generate.
max_completion_tokens | integer | Max tokens for o1/o3-style models.
presence_penalty | number | Penalize repeated topics (-2 to 2).
frequency_penalty | number | Penalize repeated tokens (-2 to 2).
logit_bias | object | Token ID to bias value mapping.
logprobs | boolean | Return log probabilities.
top_logprobs | integer | Number of top log probs per token (0-20).
user | string | End-user ID for tracking and rate limiting.
seed | integer | Seed for reproducible outputs.
tools | array | Function definitions for tool/function calling.
tool_choice | string or object | "auto", "none", "required", or a specific tool.
response_format | object | {type: "json_object"} or {type: "json_schema", json_schema: {...}}.
modalities | array | Output modalities, e.g., ["text", "audio"].
audio | object | Audio output config: {voice: "alloy", format: "wav"}.

Tip

Prism passes through unknown fields to the provider. Provider-specific parameters (like Anthropic’s thinking or any vendor extension) work without Prism needing to know about them.
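For example, a raw request body can carry a vendor extension alongside the standard fields. A minimal sketch; the thinking field below follows Anthropic's extended-thinking parameter and is shown purely as an illustration of a pass-through field:

```python
import json

# Standard OpenAI-style fields plus a vendor extension. The gateway forwards
# the unknown "thinking" field to the provider unchanged.
body = {
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Think carefully: is 91 prime?"}],
    "thinking": {"type": "enabled", "budget_tokens": 2048},
}

# Serialize exactly as it would be sent in the POST body.
payload = json.dumps(body)
```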


Response body

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711000000,
  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}
Field | Description
--- | ---
choices[].finish_reason | "stop" (natural end), "length" (hit max tokens), "tool_calls" (model wants to call a function), "content_filter" (blocked by provider)
usage | Token counts. Always present on non-streaming responses.
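The response shape above can be consumed without an SDK; a minimal sketch that parses the example body and branches on finish_reason:

```python
import json

# The non-streaming response body from the example above.
raw = '''{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "The capital of France is Paris."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33}
}'''

resp = json.loads(raw)
choice = resp["choices"][0]

if choice["finish_reason"] == "length":
    print("Output truncated; raise max_tokens.")
elif choice["finish_reason"] == "tool_calls":
    print("Model requested a tool call.")
else:
    print(choice["message"]["content"])

total = resp["usage"]["total_tokens"]
```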

Streaming

Set stream: true to receive the response as Server-Sent Events (SSE). Each chunk arrives as a data: line:

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}

...

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":25,"completion_tokens":8,"total_tokens":33}}

data: [DONE]

The final chunk before [DONE] includes usage with token counts. Prism forces stream_options.include_usage = true on every streaming request so that cost tracking and credit deduction work correctly.

Prism SDK:

from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

LiteLLM:

import litellm

response = litellm.completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

cURL:

curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Write a haiku about coding"}],
    "stream": true
  }'

Streaming behavior

  • Pre-request plugins (guardrails, rate limiting, etc.) run before the stream starts. If a guardrail blocks the request, you get a JSON error response, not a stream.
  • Post-response plugins (cost, logging, metrics) run after the final chunk, once token usage is known.
  • Cache: Streaming requests bypass the cache entirely, both on read and write.
  • Failover: Not supported mid-stream. If the provider fails after streaming starts, the error appears as an SSE data event.
  • Client disconnect: Post-plugins still run even if you disconnect early, so cost tracking stays accurate.
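If you consume the SSE stream without an SDK, each data: line parses independently; a minimal sketch over chunks abbreviated from the sample stream shown earlier:

```python
import json

# Raw "data:" lines as they arrive on the wire (abbreviated from the example above).
lines = [
    'data: {"choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":25,"completion_tokens":8,"total_tokens":33}}',
    "data: [DONE]",
]

text = ""
usage = None
for line in lines:
    payload = line.removeprefix("data: ")
    if payload == "[DONE]":
        break  # sentinel: the stream is complete
    chunk = json.loads(payload)
    if chunk.get("usage"):
        usage = chunk["usage"]  # only present on the final content chunk
    for choice in chunk["choices"]:
        text += choice["delta"].get("content") or ""

print(text)  # accumulated assistant text so far
```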

Function calling

Define tools in the request, and the model can choose to call them. The response will have finish_reason: "tool_calls" with the function name and arguments.

Prism SDK:

import json
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# First call: model decides to call a tool
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

if response.choices[0].finish_reason == "tool_calls":
    # Add the assistant's tool call to the conversation
    messages.append(response.choices[0].message)

    # Execute each tool call and add the result
    for tool_call in response.choices[0].message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        # Replace with your real implementation; args["location"] holds the city
        result = {"temperature": "22°C", "condition": "Sunny"}

        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Second call: model uses the tool result to respond
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)

OpenAI SDK:

import json
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# First call: model decides to call a tool
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

if response.choices[0].finish_reason == "tool_calls":
    messages.append(response.choices[0].message)

    for tool_call in response.choices[0].message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        # Replace with your real implementation; args["location"] holds the city
        result = {"temperature": "22°C", "condition": "Sunny"}

        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Second call: model uses the tool result to respond
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)

LiteLLM:

import json
import litellm

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

response = litellm.completion(
    model="openai/gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
)

if response.choices[0].finish_reason == "tool_calls":
    messages.append(response.choices[0].message)

    for tool_call in response.choices[0].message.tool_calls:
        result = {"temperature": "22°C", "condition": "Sunny"}
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    final = litellm.completion(
        model="openai/gpt-4o",
        messages=messages,
        tools=tools,
        api_key="sk-prism-your-key",
        base_url="https://gateway.futureagi.com/v1",
    )
    print(final.choices[0].message.content)

cURL:

curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What'\''s the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'

Prism passes tools through to the provider without modification. All providers that support function calling (OpenAI, Anthropic, Gemini, etc.) work with the same tool definitions.
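A common client-side pattern is dispatching tool calls by name through a local registry; a sketch (the registry and the get_weather stub are illustrative, not part of Prism):

```python
import json

# Hypothetical local implementation of the tool declared in the request.
def get_weather(location: str) -> dict:
    return {"location": location, "temperature": "22°C", "condition": "Sunny"}

# Registry mapping tool names (as declared in `tools`) to Python callables.
REGISTRY = {"get_weather": get_weather}

def run_tool_call(name: str, arguments: str) -> str:
    """Execute one tool call and return a JSON string for the "tool" message."""
    args = json.loads(arguments)      # arguments arrive as a JSON string
    result = REGISTRY[name](**args)   # KeyError here means an undeclared tool
    return json.dumps(result)

content = run_tool_call("get_weather", '{"location": "Tokyo"}')
```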


Vision (multimodal inputs)

Send images alongside text by using the content array format:

Prism SDK:

from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)

LiteLLM:

import litellm

response = litellm.completion(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
)

print(response.choices[0].message.content)

cURL:

curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
      ]
    }]
  }'

Note

Not all models support vision. Use a model with image understanding capabilities (gpt-4o, claude-sonnet-4-6, gemini-2.0-flash, etc.).

Both HTTPS URLs and base64 data URIs (data:image/png;base64,...) are supported. Prism translates the content format to each provider’s native representation (Anthropic base64 blocks, Gemini inline parts, Bedrock image blocks).
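To send a local image, a small helper can build the base64 data URI form; a sketch (the placeholder bytes stand in for real image data):

```python
import base64

def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URI usable in image_url.url."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# An image content part built from local bytes (placeholder bytes shown).
part = {"type": "image_url", "image_url": {"url": to_data_uri(b"\x89PNG\r\n")}}
```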


Structured outputs

Force the model to return valid JSON matching a schema:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "List 3 European capitals"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capitals",
            "schema": {
                "type": "object",
                "properties": {
                    "capitals": {
                        "type": "array",
                        "items": {"type": "string"},
                    }
                },
                "required": ["capitals"],
            },
        },
    },
)

Prism forwards response_format to the provider as-is. The provider handles constrained decoding. Use "type": "json_object" for simpler JSON without a schema.
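Because the provider enforces the schema, the returned message content parses directly; a sketch using a hypothetical model output matching the capitals schema above:

```python
import json

# A hypothetical assistant message content for the "capitals" schema.
content = '{"capitals": ["Paris", "Berlin", "Madrid"]}'

data = json.loads(content)  # valid JSON is guaranteed under json_schema mode
capitals = data["capitals"]
assert all(isinstance(c, str) for c in capitals)
```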


Message format

Each message in the messages array has:

Field | Type | Description
--- | --- | ---
role | string | "system", "user", "assistant", or "tool"
content | string or array | Text string, or array of content parts for multimodal inputs
name | string | Optional sender name
tool_calls | array | Tool calls made by the assistant (on assistant messages)
tool_call_id | string | ID of the tool call this message responds to (on tool messages)
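A conversation touching all four roles looks like this (the tool call id and values are illustrative):

```python
import json

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Tokyo?"},
    # Assistant turn that requested a tool instead of answering directly.
    {"role": "assistant", "content": None, "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'},
    }]},
    # Tool result, linked back to the request via tool_call_id.
    {"role": "tool", "tool_call_id": "call_1",
     "content": json.dumps({"temperature": "22°C"})},
]

roles = [m["role"] for m in messages]
```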

Response headers

Prism adds these headers to every response (streaming and non-streaming):

Header | Description
--- | ---
x-prism-request-id | Unique request ID for log correlation
x-prism-provider | Which provider handled the request (e.g., openai)
x-prism-latency-ms | Total latency in milliseconds
x-prism-model-used | Actual model returned by the provider
x-prism-cost | Estimated cost in USD
x-prism-cache | hit or miss
x-prism-guardrail-triggered | true if a guardrail fired
x-prism-fallback-used | true if a fallback provider or model was used
x-prism-routing-strategy | Which routing strategy was applied
x-prism-credits-remaining | Remaining credit balance (managed keys)
x-ratelimit-limit-requests | Rate limit ceiling
x-ratelimit-remaining-requests | Remaining requests in current window
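Header values arrive as plain strings, so observability data needs only light parsing; a sketch over illustrative values:

```python
# Illustrative header values as a client library would expose them.
headers = {
    "x-prism-request-id": "req_abc123",
    "x-prism-provider": "openai",
    "x-prism-latency-ms": "412",
    "x-prism-cost": "0.00042",
    "x-prism-cache": "miss",
}

latency_ms = int(headers["x-prism-latency-ms"])
cost_usd = float(headers["x-prism-cost"])
cache_hit = headers["x-prism-cache"] == "hit"
```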

Switching providers

Change the model name to route to a different provider. The request format stays identical:

# OpenAI
response = client.chat.completions.create(model="gpt-4o", messages=messages)

# Anthropic
response = client.chat.completions.create(model="claude-sonnet-4-6", messages=messages)

# Gemini
response = client.chat.completions.create(model="gemini-2.0-flash", messages=messages)

Prism translates the request to each provider’s native format. Your code doesn’t change.

