Chat completions
The primary endpoint for generating text with LLMs through Prism. Supports streaming, function calling, vision, and structured outputs.
About
POST /v1/chat/completions is the main endpoint. It works exactly like the OpenAI API — same request body, same response format. Prism adds routing, caching, guardrails, and cost tracking transparently, and supports streaming via SSE.
Basic usage
Python (Prism SDK):

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)
```

Python (OpenAI SDK):

```python
from openai import OpenAI

# Same OpenAI SDK, just swap base_url and api_key
client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)
```

Python (LiteLLM):

```python
import litellm

response = litellm.completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
)
print(response.choices[0].message.content)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Request body
All standard OpenAI chat completion parameters are supported:
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Required. The model to use (e.g., `gpt-4o`, `claude-sonnet-4-6`). |
| `messages` | array | Required. The conversation messages. See Message format below. |
| `temperature` | number | Sampling temperature (0-2). |
| `top_p` | number | Nucleus sampling (0-1). |
| `n` | integer | Number of completions to generate. |
| `stream` | boolean | Enable SSE streaming. See Streaming. |
| `stream_options` | object | `{"include_usage": true}` to get token counts in the final chunk. |
| `stop` | string or array | Stop sequences. |
| `max_tokens` | integer | Maximum tokens to generate. |
| `max_completion_tokens` | integer | Max tokens for o1/o3-style models. |
| `presence_penalty` | number | Penalize repeated topics (-2 to 2). |
| `frequency_penalty` | number | Penalize repeated tokens (-2 to 2). |
| `logit_bias` | object | Token ID to bias value mapping. |
| `logprobs` | boolean | Return log probabilities. |
| `top_logprobs` | integer | Number of top log probs per token (0-20). |
| `user` | string | End-user ID for tracking and rate limiting. |
| `seed` | integer | Seed for reproducible outputs. |
| `tools` | array | Function definitions for tool/function calling. |
| `tool_choice` | string or object | `"auto"`, `"none"`, `"required"`, or a specific tool. |
| `response_format` | object | `{"type": "json_object"}` or `{"type": "json_schema", "json_schema": {...}}`. |
| `modalities` | array | Output modalities, e.g., `["text", "audio"]`. |
| `audio` | object | Audio output config: `{"voice": "alloy", "format": "wav"}`. |
Tip
Prism passes through unknown fields to the provider. Provider-specific parameters (like Anthropic’s thinking or any vendor extension) work without Prism needing to know about them.
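Because unknown fields pass through untouched, a provider extension can be placed directly in the request body with no Prism-specific syntax. The sketch below serializes a request carrying Anthropic's extended-thinking parameter; the exact field values are illustrative, not a guaranteed contract.

```python
import json

# Prism forwards fields it does not recognize, so a vendor extension can sit
# alongside the standard parameters. "thinking" follows Anthropic's
# extended-thinking shape; treat the values as illustrative.
payload = {
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "max_tokens": 2048,
    "thinking": {"type": "enabled", "budget_tokens": 1024},  # vendor-specific field
}

body = json.dumps(payload)
# The extension survives serialization verbatim and reaches the provider as-is.
print("thinking" in body)  # True
```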
Response body
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711000000,
  "model": "gpt-4o-2024-08-06",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}
```
| Field | Description |
|---|---|
| `choices[].finish_reason` | `"stop"` (natural end), `"length"` (hit max tokens), `"tool_calls"` (model wants to call a function), `"content_filter"` (blocked by provider) |
| `usage` | Token counts. Always present on non-streaming responses. |
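The `finish_reason` value usually drives the client's next step. A minimal dispatcher (the action names are illustrative, not part of the API) might look like:

```python
def next_action(finish_reason: str) -> str:
    """Map an OpenAI-style finish_reason to what the client should do next."""
    if finish_reason == "stop":
        return "done"        # natural end: read message.content
    if finish_reason == "length":
        return "truncated"   # raise max_tokens or continue the generation
    if finish_reason == "tool_calls":
        return "run_tools"   # execute message.tool_calls, then call the API again
    if finish_reason == "content_filter":
        return "blocked"     # provider refused; surface an error to the user
    return "unknown"

print(next_action("tool_calls"))  # run_tools
```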
Streaming
Set stream: true to receive the response as Server-Sent Events (SSE). Each chunk arrives as a data: line:
```text
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}
...
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":25,"completion_tokens":8,"total_tokens":33}}
data: [DONE]
```
The final chunk before [DONE] includes usage with token counts. Prism forces stream_options.include_usage = true on every streaming request so that cost tracking and credit deduction work correctly.
Python (Prism SDK):

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Python (OpenAI SDK):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Python (LiteLLM):

```python
import litellm

response = litellm.completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Write a haiku about coding"}],
    "stream": true
  }'
```

Streaming behavior
- Pre-request plugins (guardrails, rate limiting, etc.) run before the stream starts. If a guardrail blocks the request, you get a JSON error response, not a stream.
- Post-response plugins (cost, logging, metrics) run after the final chunk, once token usage is known.
- Cache: Streaming requests bypass the cache entirely, both on read and write.
- Failover: Not supported mid-stream. If the provider fails after streaming starts, the error appears as an SSE data event.
- Client disconnect: Post-plugins still run even if you disconnect early, so cost tracking stays accurate.
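If you consume the raw SSE stream without an SDK, the framing above takes only a few lines of parsing. This sketch processes a captured transcript rather than a live connection, so the chunk payloads here are abbreviated samples:

```python
import json

def parse_sse(lines):
    """Yield parsed chunk objects from 'data: ...' SSE lines, stopping at [DONE]."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Abbreviated sample transcript; a real stream carries full chunk objects.
transcript = [
    'data: {"choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"total_tokens":33}}',
    "data: [DONE]",
]

text = "".join(
    chunk["choices"][0]["delta"].get("content", "") for chunk in parse_sse(transcript)
)
print(text)  # The capital
```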
Function calling
Define tools in the request, and the model can choose to call them. The response will have finish_reason: "tool_calls" with the function name and arguments.
Python (Prism SDK):

```python
import json
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# First call: model decides to call a tool
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

if response.choices[0].finish_reason == "tool_calls":
    # Add the assistant's tool call to the conversation
    messages.append(response.choices[0].message)
    # Execute each tool call and add the result
    for tool_call in response.choices[0].message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = {"temperature": "22°C", "condition": "Sunny"}  # your function here
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Second call: model uses the tool result to respond
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)
```

Python (OpenAI SDK):

```python
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# First call: model decides to call a tool
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

if response.choices[0].finish_reason == "tool_calls":
    messages.append(response.choices[0].message)
    for tool_call in response.choices[0].message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = {"temperature": "22°C", "condition": "Sunny"}  # your function here
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Second call: model uses the tool result to respond
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)
```

Python (LiteLLM):

```python
import json
import litellm

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

response = litellm.completion(
    model="openai/gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
)

if response.choices[0].finish_reason == "tool_calls":
    messages.append(response.choices[0].message)
    for tool_call in response.choices[0].message.tool_calls:
        result = {"temperature": "22°C", "condition": "Sunny"}
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    final = litellm.completion(
        model="openai/gpt-4o",
        messages=messages,
        tools=tools,
        api_key="sk-prism-your-key",
        base_url="https://gateway.futureagi.com/v1",
    )
    print(final.choices[0].message.content)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What'\''s the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```

Prism passes tools through to the provider without modification. All providers that support function calling (OpenAI, Anthropic, Gemini, etc.) work with the same tool definitions.
Vision (multimodal inputs)
Send images alongside text by using the content array format:
Python (Prism SDK):

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Python (LiteLLM):

```python
import litellm

response = litellm.completion(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com/v1",
)
print(response.choices[0].message.content)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
      ]
    }]
  }'
```

Note
Not all models support vision. Use a model with image understanding capabilities (gpt-4o, claude-sonnet-4-6, gemini-2.0-flash, etc.).
Both HTTPS URLs and base64 data URIs (data:image/png;base64,...) are supported. Prism translates the content format to each provider’s native representation (Anthropic base64 blocks, Gemini inline parts, Bedrock image blocks).
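For local files, the base64 data URI is built by encoding the raw bytes and prefixing the MIME type. A minimal helper, sketched with placeholder bytes in place of a real image file:

```python
import base64

def to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URI usable in an image_url content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# In practice the bytes come from open("photo.png", "rb").read();
# a short placeholder (the PNG magic number) stands in here.
uri = to_data_uri(b"\x89PNG\r\n\x1a\n")
content_part = {"type": "image_url", "image_url": {"url": uri}}
print(uri.startswith("data:image/png;base64,"))  # True
```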
Structured outputs
Force the model to return valid JSON matching a schema:
```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "List 3 European capitals"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "capitals",
            "schema": {
                "type": "object",
                "properties": {
                    "capitals": {
                        "type": "array",
                        "items": {"type": "string"},
                    }
                },
                "required": ["capitals"],
            },
        },
    },
)
```
Prism forwards response_format to the provider as-is. The provider handles constrained decoding. Use "type": "json_object" for simpler JSON without a schema.
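The message content comes back as a JSON string that should match the schema; parsing and a quick sanity check are still worthwhile on the client side. A sample string stands in for a live response here:

```python
import json

# message.content from a json_schema response is a JSON string;
# this sample mirrors what the request above would return.
content = '{"capitals": ["Paris", "Berlin", "Madrid"]}'

data = json.loads(content)
assert isinstance(data["capitals"], list)
assert all(isinstance(c, str) for c in data["capitals"])
print(data["capitals"][0])  # Paris
```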
Message format
Each message in the messages array has:
| Field | Type | Description |
|---|---|---|
| `role` | string | `"system"`, `"user"`, `"assistant"`, or `"tool"` |
| `content` | string or array | Text string, or array of content parts for multimodal inputs |
| `name` | string | Optional sender name |
| `tool_calls` | array | Tool calls made by the assistant (on assistant messages) |
| `tool_call_id` | string | ID of the tool call this message responds to (on tool messages) |
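Put together, one complete tool-calling exchange exercises every role in the table. The IDs and function name below are illustrative:

```python
# A full conversation showing each message role and its role-specific fields.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {
        "role": "assistant",
        "content": None,  # content is null when the assistant only calls tools
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'},
        }],
    },
    # The tool message answers a specific tool call via tool_call_id.
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temperature": "22C"}'},
]

print([m["role"] for m in conversation])  # ['system', 'user', 'assistant', 'tool']
```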
Response headers
Prism adds these headers to every response (streaming and non-streaming):
| Header | Description |
|---|---|
x-prism-request-id | Unique request ID for log correlation |
x-prism-provider | Which provider handled the request (e.g., openai) |
x-prism-latency-ms | Total latency in milliseconds |
x-prism-model-used | Actual model returned by the provider |
x-prism-cost | Estimated cost in USD |
x-prism-cache | hit or miss |
x-prism-guardrail-triggered | true if a guardrail fired |
x-prism-fallback-used | true if a fallback provider or model was used |
x-prism-routing-strategy | Which routing strategy was applied |
x-prism-credits-remaining | Remaining credit balance (managed keys) |
x-ratelimit-limit-requests | Rate limit ceiling |
x-ratelimit-remaining-requests | Remaining requests in current window |
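These headers are readable from any HTTP client's response object. The helper below collects them from a plain headers mapping, shown with sample values rather than a live response:

```python
def prism_metadata(headers: dict) -> dict:
    """Collect Prism's x-prism-* diagnostic headers into one dict."""
    return {
        k.removeprefix("x-prism-"): v
        for k, v in headers.items()
        if k.lower().startswith("x-prism-")
    }

# Sample headers; real ones come from e.g. response.headers in requests/httpx.
sample = {
    "content-type": "application/json",
    "x-prism-request-id": "req_123",
    "x-prism-provider": "openai",
    "x-prism-cost": "0.0031",
}
print(prism_metadata(sample)["provider"])  # openai
```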
Switching providers
Change the model name to route to a different provider. The request format stays identical:
```python
# OpenAI
response = client.chat.completions.create(model="gpt-4o", messages=messages)

# Anthropic
response = client.chat.completions.create(model="claude-sonnet-4-6", messages=messages)

# Gemini
response = client.chat.completions.create(model="gemini-2.0-flash", messages=messages)
```
Prism translates the request to each provider’s native format. Your code doesn’t change.