# Media endpoints
Text-to-speech, speech-to-text, audio translation, and image generation through the Prism Gateway.
## About
Prism proxies audio and image requests to any configured provider. The API follows the OpenAI format. All gateway features (caching, rate limiting, cost tracking, failover) apply to these endpoints.
## Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/audio/speech | Text-to-speech |
| POST | /v1/audio/speech/stream | Streaming text-to-speech |
| POST | /v1/audio/transcriptions | Speech-to-text |
| POST | /v1/audio/translations | Translate audio to English |
| POST | /v1/images/generations | Generate images from prompts |
## Text-to-speech
Convert text to spoken audio. The response is raw audio bytes in the requested format.
### Basic usage

Prism SDK:

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

audio_bytes = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a test of text-to-speech through Prism.",
)

with open("output.mp3", "wb") as f:
    f.write(audio_bytes)
```

OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a test of text-to-speech through Prism.",
)

response.stream_to_file("output.mp3")
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/audio/speech \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "voice": "alloy",
    "input": "Hello! This is a test of text-to-speech through Prism."
  }' \
  --output output.mp3
```

### Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | - | TTS model (tts-1, tts-1-hd, gpt-4o-mini-tts) |
| input | string | Yes | - | Text to convert to speech (max 4096 characters) |
| voice | string | Yes | - | Voice to use (alloy, echo, fable, onyx, nova, shimmer) |
| response_format | string | No | mp3 | Output format: mp3, opus, aac, flac, wav, pcm |
| speed | float | No | 1.0 | Speed multiplier (0.25 to 4.0) |
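Because `input` is capped at 4096 characters, longer documents must be split client-side before synthesis. A minimal sketch of one approach, preferring sentence boundaries (the `chunk_text` helper is illustrative, not part of either SDK):

```python
def chunk_text(text: str, limit: int = 4096) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    breaking at sentence boundaries where possible."""
    chunks = []
    while len(text) > limit:
        # Find the last sentence end within the limit.
        cut = text.rfind(". ", 0, limit)
        if cut == -1:
            cut = limit  # no sentence boundary found; hard split
        else:
            cut += 1  # keep the period with its chunk
        chunks.append(text[:cut].strip())
        text = text[cut:]
    if text.strip():
        chunks.append(text.strip())
    return chunks
```

Each chunk can then be passed through `audio.speech.create` separately; note that stitching the resulting segments back together is format-dependent (MP3 frames concatenate more forgivingly than WAV, which carries a header).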
### HD quality

Use tts-1-hd for higher quality audio at the cost of higher latency:

```python
audio_bytes = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="High quality audio output.",
    response_format="flac",
)
```
## Speech-to-text (transcription)
Transcribe audio files to text. Supports mp3, mp4, mpeg, mpga, m4a, wav, and webm formats.
### Basic usage

Prism SDK:

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

with open("recording.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )

print(transcription.text)
```

OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

with open("recording.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )

print(transcription.text)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -F file=@recording.mp3 \
  -F model=whisper-1
```

### Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file to transcribe |
| model | string | Yes | Transcription model (whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe) |
| language | string | No | ISO-639-1 language code (e.g. en, fr, de). Improves accuracy if you know the language. |
| prompt | string | No | Hint text to guide the model’s style or continue a previous segment |
| response_format | string | No | Output format: json, text, srt, verbose_json, vtt |
| temperature | float | No | Sampling temperature (0 to 1). Lower values are more deterministic. |
| timestamp_granularities | string[] | No | word and/or segment level timestamps (requires verbose_json format) |
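Uploads in an unsupported container fail at the provider, so it can be worth checking the extension against the supported list above before sending the file. A small pre-flight check (the helper name and approach are ours, not part of the SDK):

```python
from pathlib import Path

# Formats the transcription endpoint accepts, per the list above.
SUPPORTED_AUDIO = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}

def check_audio_format(path: str) -> str:
    """Raise early, before the upload, if the extension is unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_AUDIO:
        raise ValueError(
            f"Unsupported audio format {suffix!r}; "
            f"expected one of {sorted(SUPPORTED_AUDIO)}"
        )
    return path
```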
### Timestamps

Get word-level or segment-level timestamps with verbose_json:

```python
with open("recording.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"],
    )

for word in transcription.words:
    print(f"[{word.start:.2f}s - {word.end:.2f}s] {word.word}")
```
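The gateway can return srt directly via `response_format`, but if you already have verbose_json segments you can also render SRT cues yourself. A sketch, assuming segment dicts with the `start`/`end`/`text` fields of verbose_json (the two helpers are illustrative):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render verbose_json-style segments as an SRT document."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> "
            f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```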
## Audio translation

Translate audio from any supported language to English text. Same API as transcription but always outputs English.

Prism SDK:

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

with open("french_audio.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=f,
    )

print(translation.text)  # English translation
```

OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

with open("french_audio.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=f,
    )

print(translation.text)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/audio/translations \
  -H "Authorization: Bearer sk-prism-your-key" \
  -F file=@french_audio.mp3 \
  -F model=whisper-1
```

## Image generation
Generate images from text prompts.
### Basic usage

Prism SDK:

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.images.generate(
    model="dall-e-3",
    prompt="A serene mountain lake at dawn, photorealistic",
    n=1,
    size="1024x1024",
)

print(response.data[0].url)
```

OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.images.generate(
    model="dall-e-3",
    prompt="A serene mountain lake at dawn, photorealistic",
    n=1,
    size="1024x1024",
)

print(response.data[0].url)
```

cURL:

```bash
curl -X POST https://gateway.futureagi.com/v1/images/generations \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dall-e-3",
    "prompt": "A serene mountain lake at dawn, photorealistic",
    "n": 1,
    "size": "1024x1024"
  }'
```

### Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| prompt | string | Yes | - | Text description of the image to generate |
| model | string | No | dall-e-3 | Image model (dall-e-2, dall-e-3, gpt-image-1) |
| n | integer | No | 1 | Number of images to generate (1 for DALL-E 3, 1-10 for DALL-E 2) |
| size | string | No | 1024x1024 | Image size. DALL-E 3: 1024x1024, 1792x1024, 1024x1792. DALL-E 2: 256x256, 512x512, 1024x1024. |
| quality | string | No | standard | standard or hd (DALL-E 3 and gpt-image-1) |
| style | string | No | vivid | vivid or natural (DALL-E 3 only) |
| response_format | string | No | url | url (temporary link) or b64_json (base64-encoded image data) |
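Valid `size` values differ per model, and a mismatched combination only fails once the request reaches the provider. A small client-side check built from the table above (the lookup is illustrative; gpt-image-1 sizes are not listed here, so unknown models are passed through unchecked):

```python
# Size constraints from the parameters table above.
VALID_SIZES = {
    "dall-e-3": {"1024x1024", "1792x1024", "1024x1792"},
    "dall-e-2": {"256x256", "512x512", "1024x1024"},
}

def check_image_size(model: str, size: str) -> None:
    """Reject model/size combinations the table above rules out."""
    allowed = VALID_SIZES.get(model)
    if allowed is not None and size not in allowed:
        raise ValueError(
            f"{size!r} is not valid for {model}; use one of {sorted(allowed)}"
        )
```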
### Get base64 data instead of a URL

URLs expire after 1 hour. For persistent storage, request base64 data:

```python
import base64

response = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a cat",
    response_format="b64_json",
)

image_data = base64.b64decode(response.data[0].b64_json)
with open("cat.png", "wb") as f:
    f.write(image_data)
```
### Response format

```json
{
  "created": 1700000000,
  "data": [
    {
      "url": "https://oaidalleapiprodscus.blob.core.windows.net/...",
      "revised_prompt": "A serene mountain lake at dawn..."
    }
  ]
}
```
DALL-E 3 returns a revised_prompt field showing the expanded prompt the model actually used.
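Since gpt-image-1 returns only b64_json while the DALL-E models default to a temporary url, response handling can branch on which field is present. A sketch over a data entry in dict form, e.g. from the raw JSON above (the helper is ours):

```python
import base64

def save_or_link(entry: dict, path: str = "image.png") -> str:
    """Write base64 image data to disk, or return the temporary URL.

    Covers both response shapes: `url` entries (DALL-E default)
    and `b64_json` entries (gpt-image-1, or response_format="b64_json").
    """
    if entry.get("b64_json"):
        with open(path, "wb") as f:
            f.write(base64.b64decode(entry["b64_json"]))
        return path
    return entry["url"]
```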
## Supported models

### Text-to-speech
| Provider | Models | Notes |
|---|---|---|
| OpenAI | tts-1, tts-1-hd | 6 voices, mp3/opus/aac/flac/wav/pcm |
| OpenAI | gpt-4o-mini-tts | Newer model, same voice options |
### Speech-to-text
| Provider | Models | Notes |
|---|---|---|
| OpenAI | whisper-1 | 57 languages, timestamps, translation |
| OpenAI | gpt-4o-transcribe | Newer model with improved accuracy |
| OpenAI | gpt-4o-mini-transcribe | Smaller, faster transcription model |
### Image generation
| Provider | Models | Notes |
|---|---|---|
| OpenAI | dall-e-3 | 1024x1024, 1792x1024, 1024x1792 |
| OpenAI | dall-e-2 | 256x256, 512x512, 1024x1024 |
| OpenAI | gpt-image-1 | Latest model. Returns b64_json only (no URL). |
> **Note:** Available models depend on which providers are configured for your organization. Use `GET /v1/models` to see what’s available on your key.
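Since the gateway follows the OpenAI format, the `/v1/models` response uses the standard list shape, `{"data": [{"id": ...}, ...]}`. A sketch of filtering that listing down to the media models covered on this page (the keyword list is ours and purely illustrative):

```python
def media_models(models_response: dict) -> list[str]:
    """Pick out audio/image model ids from a /v1/models listing."""
    keywords = ("tts", "whisper", "transcribe", "dall-e", "image")
    return sorted(
        m["id"]
        for m in models_response.get("data", [])
        if any(k in m["id"] for k in keywords)
    )
```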