Media endpoints

Text-to-speech, speech-to-text, audio translation, and image generation through the Prism Gateway.

About

Prism proxies audio and image requests to any configured provider. The API follows the OpenAI format. All gateway features (caching, rate limiting, cost tracking, failover) apply to these endpoints.


Endpoints

| Method | Path | Description |
|---|---|---|
| POST | /v1/audio/speech | Text-to-speech |
| POST | /v1/audio/speech/stream | Streaming text-to-speech |
| POST | /v1/audio/transcriptions | Speech-to-text |
| POST | /v1/audio/translations | Translate audio to English |
| POST | /v1/images/generations | Generate images from prompts |

Text-to-speech

Convert text to spoken audio. The response is raw audio bytes in the requested format.

Basic usage

Python (Prism SDK):

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

audio_bytes = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a test of text-to-speech through Prism.",
)

with open("output.mp3", "wb") as f:
    f.write(audio_bytes)
```

Python (OpenAI SDK):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a test of text-to-speech through Prism.",
)

response.stream_to_file("output.mp3")
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/audio/speech \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "alloy",
    "input": "Hello! This is a test of text-to-speech through Prism."
  }' \
  --output output.mp3
```

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | - | TTS model (tts-1, tts-1-hd, gpt-4o-mini-tts) |
| input | string | Yes | - | Text to convert to speech (max 4096 characters) |
| voice | string | Yes | - | Voice to use (alloy, echo, fable, onyx, nova, shimmer) |
| response_format | string | No | mp3 | Output format: mp3, opus, aac, flac, wav, pcm |
| speed | float | No | 1.0 | Speed multiplier (0.25 to 4.0) |
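Because of the 4096-character limit on input, longer documents must be split before synthesis. As a minimal sketch (the chunk_text helper below is illustrative, not part of any SDK), one way is to break at sentence boundaries:

```python
def chunk_text(text: str, limit: int = 4096) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    preferring to break after sentence-ending punctuation."""
    chunks = []
    while len(text) > limit:
        window = text[:limit]
        # Break at the last sentence end inside the window, else at a space.
        cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
        if cut == -1:
            cut = window.rfind(" ")
        if cut == -1:
            cut = limit - 1
        chunks.append(text[:cut + 1].strip())
        text = text[cut + 1:]
    if text.strip():
        chunks.append(text.strip())
    return chunks
```

Each chunk can then be sent as a separate speech request and the resulting audio segments concatenated.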

HD quality

Use tts-1-hd for higher quality audio at the cost of higher latency:

```python
audio_bytes = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="High quality audio output.",
    response_format="flac",
)
```

Speech-to-text (transcription)

Transcribe audio files to text. Supports mp3, mp4, mpeg, mpga, m4a, wav, and webm formats.

Basic usage

Python (Prism SDK):

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

with open("recording.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )

print(transcription.text)
```

Python (OpenAI SDK):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

with open("recording.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )

print(transcription.text)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -F file=@recording.mp3 \
  -F model=whisper-1
```

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file to transcribe |
| model | string | Yes | Transcription model (whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe) |
| language | string | No | ISO-639-1 language code (e.g. en, fr, de). Improves accuracy if you know the language. |
| prompt | string | No | Hint text to guide the model's style or continue a previous segment |
| response_format | string | No | Output format: json, text, srt, verbose_json, vtt |
| temperature | float | No | Sampling temperature (0 to 1). Lower values are more deterministic. |
| timestamp_granularities | string[] | No | word and/or segment level timestamps (requires verbose_json format) |

Timestamps

Get word-level or segment-level timestamps with verbose_json:

```python
with open("recording.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )

for word in transcription.words:
    print(f"[{word.start:.2f}s - {word.end:.2f}s] {word.word}")
```
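For subtitles, response_format srt or vtt returns ready-to-save text directly. If you need to post-process verbose_json segments yourself, a small illustrative helper (not part of any SDK) can render start/end/text segments as SRT:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> '00:00:03,500'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[dict]) -> str:
    """Render segments with 'start', 'end', and 'text' keys as SRT blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

With a verbose_json response, `to_srt(transcription.segments)` would produce the same kind of output the srt format returns natively.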

Audio translation

Translate audio from any supported language to English text. Same API as transcription but always outputs English.

Python (Prism SDK):

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

with open("french_audio.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=f,
    )

print(translation.text)  # English translation
```

Python (OpenAI SDK):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

with open("french_audio.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=f,
    )

print(translation.text)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/audio/translations \
  -H "Authorization: Bearer sk-prism-your-key" \
  -F file=@french_audio.mp3 \
  -F model=whisper-1
```

Image generation

Generate images from text prompts.

Basic usage

Python (Prism SDK):

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.images.generate(
    model="dall-e-3",
    prompt="A serene mountain lake at dawn, photorealistic",
    n=1,
    size="1024x1024",
)

print(response.data[0].url)
```

Python (OpenAI SDK):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.images.generate(
    model="dall-e-3",
    prompt="A serene mountain lake at dawn, photorealistic",
    n=1,
    size="1024x1024",
)

print(response.data[0].url)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/images/generations \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dall-e-3",
    "prompt": "A serene mountain lake at dawn, photorealistic",
    "n": 1,
    "size": "1024x1024"
  }'
```

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| prompt | string | Yes | - | Text description of the image to generate |
| model | string | No | dall-e-3 | Image model (dall-e-2, dall-e-3, gpt-image-1) |
| n | integer | No | 1 | Number of images to generate (1 for DALL-E 3, 1-10 for DALL-E 2) |
| size | string | No | 1024x1024 | Image size. DALL-E 3: 1024x1024, 1792x1024, 1024x1792. DALL-E 2: 256x256, 512x512, 1024x1024. |
| quality | string | No | standard | standard or hd (DALL-E 3 and gpt-image-1) |
| style | string | No | vivid | vivid or natural (DALL-E 3 only) |
| response_format | string | No | url | url (temporary link) or b64_json (base64-encoded image data) |
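Because size support varies by model, an unsupported size only fails once the request reaches the provider. As an illustration, you can check client-side first; the helper and size sets below simply mirror the table above and are not part of any SDK:

```python
# Supported sizes per model, taken from the parameters table above.
SUPPORTED_SIZES = {
    "dall-e-3": {"1024x1024", "1792x1024", "1024x1792"},
    "dall-e-2": {"256x256", "512x512", "1024x1024"},
}

def validate_size(model: str, size: str) -> None:
    """Raise ValueError if the size is known to be unsupported for the model."""
    allowed = SUPPORTED_SIZES.get(model)
    if allowed is not None and size not in allowed:
        raise ValueError(
            f"{model} does not support size {size}; "
            f"choose one of {sorted(allowed)}"
        )
```

For example, `validate_size("dall-e-2", "1792x1024")` raises before any request is sent, while unknown models pass through untouched.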

Get base64 data instead of URL

URLs expire after 1 hour. For persistent storage, request base64 data:

```python
import base64

response = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a cat",
    response_format="b64_json",
)

image_data = base64.b64decode(response.data[0].b64_json)
with open("cat.png", "wb") as f:
    f.write(image_data)
```

Response format

```json
{
  "created": 1700000000,
  "data": [
    {
      "url": "https://oaidalleapiprodscus.blob.core.windows.net/...",
      "revised_prompt": "A serene mountain lake at dawn..."
    }
  ]
}
```

DALL-E 3 returns a revised_prompt field showing the expanded prompt the model actually used.


Supported models

Text-to-speech

| Provider | Models | Notes |
|---|---|---|
| OpenAI | tts-1, tts-1-hd | 6 voices, mp3/opus/aac/flac/wav/pcm |
| OpenAI | gpt-4o-mini-tts | Newer model, same voice options |

Speech-to-text

| Provider | Models | Notes |
|---|---|---|
| OpenAI | whisper-1 | 57 languages, timestamps, translation |
| OpenAI | gpt-4o-transcribe | Newer model with improved accuracy |
| OpenAI | gpt-4o-mini-transcribe | Smaller, faster transcription model |

Image generation

| Provider | Models | Notes |
|---|---|---|
| OpenAI | dall-e-3 | 1024x1024, 1792x1024, 1024x1792 |
| OpenAI | dall-e-2 | 256x256, 512x512, 1024x1024 |
| OpenAI | gpt-image-1 | Latest model. Returns b64_json only (no URL). |

Note

Available models depend on which providers are configured for your organization. Use GET /v1/models to see what’s available on your key.
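Since availability varies per organization, it can be convenient to filter the /v1/models listing down to media models client-side. A sketch, assuming model IDs follow the naming shown in the tables above (the prefix list and helper are illustrative, not part of any SDK):

```python
# Prefixes that match the audio and image model names used in this page.
MEDIA_PREFIXES = (
    "tts-", "whisper-", "dall-e-", "gpt-image-",
    "gpt-4o-mini-tts", "gpt-4o-transcribe", "gpt-4o-mini-transcribe",
)

def media_models(model_ids: list[str]) -> list[str]:
    """Keep only IDs that look like audio or image models."""
    return sorted(m for m in model_ids if m.startswith(MEDIA_PREFIXES))

# With the OpenAI SDK pointed at the gateway:
# ids = [m.id for m in client.models.list()]
# print(media_models(ids))
```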

