# Media endpoints
Text-to-speech, speech-to-text, audio translation, and image generation through the Prism Gateway.
## About
Prism proxies audio and image requests to any configured provider. The API follows the OpenAI format. All gateway features (caching, rate limiting, cost tracking, failover) apply to these endpoints.
## Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/audio/speech | Text-to-speech |
| POST | /v1/audio/speech/stream | Streaming text-to-speech |
| POST | /v1/audio/transcriptions | Speech-to-text |
| POST | /v1/audio/translations | Translate audio to English |
| POST | /v1/images/generations | Generate images from prompts |
## Text-to-speech
Convert text to spoken audio. The response is raw audio bytes in the requested format.
### Basic usage

Prism SDK:

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

audio_bytes = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a test of text-to-speech through Prism.",
)

with open("output.mp3", "wb") as f:
    f.write(audio_bytes)
```

OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a test of text-to-speech through Prism.",
)

response.stream_to_file("output.mp3")
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/audio/speech \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "alloy",
    "input": "Hello! This is a test of text-to-speech through Prism."
  }' \
  --output output.mp3
```

### Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | - | TTS model (tts-1, tts-1-hd, gpt-4o-mini-tts) |
| input | string | Yes | - | Text to convert to speech (max 4096 characters) |
| voice | string | Yes | - | Voice to use (alloy, echo, fable, onyx, nova, shimmer) |
| response_format | string | No | mp3 | Output format: mp3, opus, aac, flac, wav, pcm |
| speed | float | No | 1.0 | Speed multiplier (0.25 to 4.0) |
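Because `input` is capped at 4096 characters, longer documents must be split client-side before synthesis. A minimal sketch of one approach, preferring sentence boundaries (the `chunk_text` helper is illustrative, not part of either SDK):

```python
def chunk_text(text: str, limit: int = 4096) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    breaking at sentence boundaries where possible."""
    chunks = []
    while len(text) > limit:
        # Find the last sentence end within the limit.
        cut = text.rfind(". ", 0, limit)
        if cut == -1:
            cut = limit  # no sentence boundary found; hard split
        else:
            cut += 1  # keep the period with its chunk
        chunks.append(text[:cut].strip())
        text = text[cut:]
    if text.strip():
        chunks.append(text.strip())
    return chunks
```

Each chunk can then be passed through `audio.speech.create` separately; note that stitching the resulting segments back together is format-dependent (MP3 frames concatenate more forgivingly than WAV, which carries a header).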
### HD quality

Use tts-1-hd for higher quality audio at the cost of higher latency:

```python
audio_bytes = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="High quality audio output.",
    response_format="flac",
)
```
## Speech-to-text (transcription)
Transcribe audio files to text. Supports mp3, mp4, mpeg, mpga, m4a, wav, and webm formats.
### Basic usage

Prism SDK:

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

with open("recording.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )

print(transcription.text)
```

OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

with open("recording.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )

print(transcription.text)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -F file=@recording.mp3 \
  -F model=whisper-1
```

### Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file to transcribe |
| model | string | Yes | Transcription model (whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe) |
| language | string | No | ISO-639-1 language code (e.g. en, fr, de). Improves accuracy if you know the language. |
| prompt | string | No | Hint text to guide the model’s style or continue a previous segment |
| response_format | string | No | Output format: json, text, srt, verbose_json, vtt |
| temperature | float | No | Sampling temperature (0 to 1). Lower values are more deterministic. |
| timestamp_granularities | string[] | No | word and/or segment level timestamps (requires verbose_json format) |
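Uploads in an unsupported container fail at the provider, so it can be worth checking the extension against the supported list above before sending the file. A small pre-flight check (the helper name and approach are ours, not part of the SDK):

```python
from pathlib import Path

# Formats the transcription endpoint accepts, per the list above.
SUPPORTED_AUDIO = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}

def check_audio_format(path: str) -> str:
    """Raise early, before the upload, if the extension is unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_AUDIO:
        raise ValueError(
            f"Unsupported audio format {suffix!r}; "
            f"expected one of {sorted(SUPPORTED_AUDIO)}"
        )
    return path
```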
### Timestamps

Get word-level or segment-level timestamps with verbose_json:

```python
with open("recording.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )

for word in transcription.words:
    print(f"[{word.start:.2f}s - {word.end:.2f}s] {word.word}")
```
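The gateway can return srt directly via `response_format`, but if you already have verbose_json segments you can also render SRT cues yourself. A sketch, assuming segment dicts with the `start`/`end`/`text` fields of verbose_json (the two helpers are illustrative):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render verbose_json-style segments as an SRT document."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> "
            f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```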
## Audio translation

Translate audio from any supported language to English text. Same API as transcription but always outputs English.

Prism SDK:

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

with open("french_audio.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=f,
    )

print(translation.text)  # English translation
```

OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

with open("french_audio.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=f,
    )

print(translation.text)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/audio/translations \
  -H "Authorization: Bearer sk-prism-your-key" \
  -F file=@french_audio.mp3 \
  -F model=whisper-1
```

## Image generation
Generate images from text prompts.
### Basic usage

Prism SDK:

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.images.generate(
    model="dall-e-3",
    prompt="A serene mountain lake at dawn, photorealistic",
    n=1,
    size="1024x1024",
)

print(response.data[0].url)
```

OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.images.generate(
    model="dall-e-3",
    prompt="A serene mountain lake at dawn, photorealistic",
    n=1,
    size="1024x1024",
)

print(response.data[0].url)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/images/generations \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dall-e-3",
    "prompt": "A serene mountain lake at dawn, photorealistic",
    "n": 1,
    "size": "1024x1024"
  }'
```

### Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| prompt | string | Yes | - | Text description of the image to generate |
| model | string | No | dall-e-3 | Image model (dall-e-2, dall-e-3, gpt-image-1) |
| n | integer | No | 1 | Number of images to generate (1 for DALL-E 3, 1-10 for DALL-E 2) |
| size | string | No | 1024x1024 | Image size. DALL-E 3: 1024x1024, 1792x1024, 1024x1792. DALL-E 2: 256x256, 512x512, 1024x1024. |
| quality | string | No | standard | standard or hd (DALL-E 3 and gpt-image-1) |
| style | string | No | vivid | vivid or natural (DALL-E 3 only) |
| response_format | string | No | url | url (temporary link) or b64_json (base64-encoded image data) |
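Valid `size` values differ per model, and a mismatched combination only fails once the request reaches the provider. A small client-side check built from the table above (the lookup is illustrative; gpt-image-1 sizes are not listed here, so unknown models are passed through unchecked):

```python
# Size constraints from the parameters table above.
VALID_SIZES = {
    "dall-e-3": {"1024x1024", "1792x1024", "1024x1792"},
    "dall-e-2": {"256x256", "512x512", "1024x1024"},
}

def check_image_size(model: str, size: str) -> None:
    """Reject model/size combinations the table above rules out."""
    allowed = VALID_SIZES.get(model)
    if allowed is not None and size not in allowed:
        raise ValueError(
            f"{size!r} is not valid for {model}; use one of {sorted(allowed)}"
        )
```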
### Get base64 data instead of a URL

URLs expire after 1 hour. For persistent storage, request base64 data:

```python
import base64

response = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a cat",
    response_format="b64_json",
)

image_data = base64.b64decode(response.data[0].b64_json)
with open("cat.png", "wb") as f:
    f.write(image_data)
```
### Response format

```json
{
  "created": 1700000000,
  "data": [
    {
      "url": "https://oaidalleapiprodscus.blob.core.windows.net/...",
      "revised_prompt": "A serene mountain lake at dawn..."
    }
  ]
}
```
DALL-E 3 returns a revised_prompt field showing the expanded prompt the model actually used.
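Since gpt-image-1 returns only b64_json while the DALL-E models default to a temporary url, response handling can branch on which field is present. A sketch over a data entry in dict form, e.g. from the raw JSON above (the helper is ours):

```python
import base64

def save_or_link(entry: dict, path: str = "image.png") -> str:
    """Write base64 image data to disk, or return the temporary URL.

    Covers both response shapes: `url` entries (DALL-E default)
    and `b64_json` entries (gpt-image-1, or response_format="b64_json").
    """
    if entry.get("b64_json"):
        with open(path, "wb") as f:
            f.write(base64.b64decode(entry["b64_json"]))
        return path
    return entry["url"]
```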
## Supported models

### Text-to-speech
| Provider | Models | Notes |
|---|---|---|
| OpenAI | tts-1, tts-1-hd | 6 voices, mp3/opus/aac/flac/wav/pcm |
| OpenAI | gpt-4o-mini-tts | Newer model, same voice options |
### Speech-to-text
| Provider | Models | Notes |
|---|---|---|
| OpenAI | whisper-1 | 57 languages, timestamps, translation |
| OpenAI | gpt-4o-transcribe | Newer model with improved accuracy |
| OpenAI | gpt-4o-mini-transcribe | Smaller, faster transcription model |
### Image generation
| Provider | Models | Notes |
|---|---|---|
| OpenAI | dall-e-3 | 1024x1024, 1792x1024, 1024x1792 |
| OpenAI | dall-e-2 | 256x256, 512x512, 1024x1024 |
| OpenAI | gpt-image-1 | Latest model. Returns b64_json only (no URL). |
> **Note:** Available models depend on which providers are configured for your organization. Use `GET /v1/models` to see what’s available on your key.
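Since the gateway follows the OpenAI format, the `/v1/models` response uses the standard list shape, `{"data": [{"id": ...}, ...]}`. A sketch of filtering that listing down to the media models covered on this page (the keyword list is ours and purely illustrative):

```python
def media_models(models_response: dict) -> list[str]:
    """Pick out audio/image model ids from a /v1/models listing."""
    keywords = ("tts", "whisper", "transcribe", "dall-e", "image")
    return sorted(
        m["id"]
        for m in models_response.get("data", [])
        if any(k in m["id"] for k in keywords)
    )
```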