Async & batch

Run inference jobs asynchronously or process large batches of requests through the Prism Gateway.

About

Prism supports three modes for deferred processing: async inference (send a single request and poll a job ID for the result), scheduled completions (run a request at a specific time), and batch processing (submit many requests at once for bulk execution at lower cost).

All modes support the same models and parameters as synchronous chat completions.


Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | /v1/async/{job_id} | Get async job status and result |
| DELETE | /v1/async/{job_id} | Cancel an async job |
| POST | /v1/scheduled | Schedule a completion for later |
| GET | /v1/scheduled | List scheduled jobs |
| GET | /v1/scheduled/{job_id} | Get a scheduled job |
| DELETE | /v1/scheduled/{job_id} | Cancel a scheduled job |

Async inference

Send a chat completion request with async mode enabled. The gateway returns immediately with a job ID. Poll the job endpoint to get the result when it’s ready.

Sending an async request

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

# Send async request with x-prism-async header
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a detailed essay about climate change"}],
    extra_headers={"x-prism-async": "true"},
)

# Response contains the job ID
job_id = response.id
print(f"Job ID: {job_id}")
The same request with curl:

curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -H "x-prism-async: true" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Write a detailed essay about climate change"}]
  }'
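
The examples above only rely on the response's id field holding the job ID. If the gateway mirrors a standard completion envelope, the immediate response might look roughly like this (shape assumed, not confirmed by the gateway docs):

```json
{
  "id": "job_abc123",
  "status": "pending",
  "created": 1735689600
}
```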

Polling for results

import time
import requests

headers = {"Authorization": "Bearer sk-prism-your-key"}

while True:
    resp = requests.get(
        f"https://gateway.futureagi.com/v1/async/{job_id}",
        headers=headers,
    )
    data = resp.json()

    if data["status"] == "completed":
        print(data["result"]["choices"][0]["message"]["content"])
        break
    elif data["status"] == "failed":
        print(f"Job failed: {data.get('error')}")
        break
    else:
        time.sleep(2)
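
For production use, it is worth bounding the poll loop and backing off between attempts instead of polling at a fixed 2-second interval. A minimal sketch of the delay schedule (a pure helper with no gateway call; the function name and defaults are illustrative):

```python
def backoff_delays(base=1.0, cap=30.0, max_attempts=10):
    """Yield exponentially increasing sleep intervals, capped at `cap` seconds."""
    for attempt in range(max_attempts):
        yield min(base * (2 ** attempt), cap)

# In the polling loop, call time.sleep(next(delays)) between status checks
# and treat exhaustion of the generator as a timeout.
delays = backoff_delays()
```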

Job statuses

| Status | Description |
|--------|-------------|
| pending | Job is queued |
| running | Job is being processed |
| completed | Result is ready |
| failed | Job failed (check the error field) |
| cancelled | Job was cancelled |
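
When wrapping the poll loop in helper code, it helps to distinguish terminal statuses from in-flight ones. A small sketch based on the table above (helper names are illustrative):

```python
# Statuses from which a job can no longer change state.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def is_terminal(status: str) -> bool:
    """Return True once polling can stop for this job."""
    return status in TERMINAL_STATUSES
```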

Scheduled completions

Schedule a request to run at a specific time. Useful for time-sensitive content generation or deferred workloads.

curl -X POST https://gateway.futureagi.com/v1/scheduled \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "scheduled_at": "2026-04-05T09:00:00Z",
    "request": {
      "model": "gpt-4o",
      "messages": [{"role": "user", "content": "Generate the daily summary report"}]
    }
  }'
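
The same scheduled request can be built from Python. The sketch below only constructs the request body; the endpoint and field names come from the curl example above, and the helper function is illustrative:

```python
from datetime import datetime, timezone

def build_scheduled_payload(run_at: datetime, model: str, prompt: str) -> dict:
    """Build the body for POST /v1/scheduled with an RFC 3339 UTC timestamp."""
    return {
        "scheduled_at": run_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "request": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

payload = build_scheduled_payload(
    datetime(2026, 4, 5, 9, 0, tzinfo=timezone.utc),
    "gpt-4o",
    "Generate the daily summary report",
)
# Then POST it, e.g. with requests:
#   requests.post("https://gateway.futureagi.com/v1/scheduled",
#                 headers={"Authorization": "Bearer sk-prism-your-key"},
#                 json=payload)
```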

Managing scheduled jobs

# List scheduled jobs
curl https://gateway.futureagi.com/v1/scheduled \
  -H "Authorization: Bearer sk-prism-your-key"

# Get a specific job
curl https://gateway.futureagi.com/v1/scheduled/job_123 \
  -H "Authorization: Bearer sk-prism-your-key"

# Cancel a scheduled job
curl -X DELETE https://gateway.futureagi.com/v1/scheduled/job_123 \
  -H "Authorization: Bearer sk-prism-your-key"

Batch processing

For high-volume workloads, the OpenAI Batch API lets you submit a file of requests and retrieve results when processing is complete. Batch requests typically run at lower cost (50% discount with OpenAI).

Creating a batch

from openai import OpenAI
import json

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

# 1. Create a JSONL file with requests
requests_data = [
    {
        "custom_id": "req-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "Summarize: Machine learning is..."}],
        },
    },
    {
        "custom_id": "req-2",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "Summarize: Neural networks are..."}],
        },
    },
]

with open("batch_input.jsonl", "w") as f:
    for req in requests_data:
        f.write(json.dumps(req) + "\n")

# 2. Upload the input file
input_file = client.files.create(
    file=open("batch_input.jsonl", "rb"),
    purpose="batch",
)

# 3. Create the batch
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}, Status: {batch.status}")

Checking batch status

import time

while True:
    batch = client.batches.retrieve(batch.id)
    print(f"Status: {batch.status} ({batch.request_counts.completed}/{batch.request_counts.total})")

    if batch.status == "completed":
        break
    elif batch.status in ("failed", "cancelled", "expired"):
        print(f"Batch ended: {batch.status}")
        break

    time.sleep(30)

Retrieving results

if batch.output_file_id:
    content = client.files.content(batch.output_file_id)
    results = content.text.strip().split("\n")

    for line in results:
        result = json.loads(line)
        print(f"{result['custom_id']}: {result['response']['body']['choices'][0]['message']['content'][:100]}")

When to use each mode

| Mode | Best for | Latency | Cost |
|------|----------|---------|------|
| Synchronous | Interactive apps, real-time responses | Lowest | Standard |
| Async | Long-running requests, fire-and-forget | Medium (poll) | Standard |
| Scheduled | Time-triggered jobs, deferred work | Runs at the scheduled time | Standard |
| Batch | High-volume processing, data pipelines | Hours | Discounted (up to 50% off) |
