Async & batch
Run inference jobs asynchronously or process large batches of requests through the Prism Gateway.
About
Prism supports three modes of deferred processing: async inference accepts a single request and returns a job ID that you poll for the result; scheduled completions run a request at a specified future time; and batch processing submits many requests at once for bulk execution at lower cost.
All modes support the same models and parameters as synchronous chat completions.
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /v1/async/{job_id} | Get async job status and result |
| DELETE | /v1/async/{job_id} | Cancel an async job |
| POST | /v1/scheduled | Schedule a completion for later |
| GET | /v1/scheduled | List scheduled jobs |
| GET | /v1/scheduled/{job_id} | Get a scheduled job |
| DELETE | /v1/scheduled/{job_id} | Cancel a scheduled job |
Async inference
Send a chat completion request with async mode enabled. The gateway returns immediately with a job ID. Poll the job endpoint to get the result when it’s ready.
Sending an async request
from openai import OpenAI
client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)
# Send async request with the x-prism-async header
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a detailed essay about climate change"}],
    extra_headers={"x-prism-async": "true"},
)
# Response contains the job ID
job_id = response.id
print(f"Job ID: {job_id}")
curl -X POST https://gateway.futureagi.com/v1/chat/completions \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -H "x-prism-async: true" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Write a detailed essay about climate change"}]
  }'
Polling for results
import time
import requests
headers = {"Authorization": "Bearer sk-prism-your-key"}
while True:
    resp = requests.get(
        f"https://gateway.futureagi.com/v1/async/{job_id}",
        headers=headers,
    )
    data = resp.json()
    if data["status"] == "completed":
        print(data["result"]["choices"][0]["message"]["content"])
        break
    elif data["status"] == "failed":
        print(f"Job failed: {data.get('error')}")
        break
    else:
        time.sleep(2)
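The loop above polls at a fixed two-second interval. For longer-running jobs, a helper like the following adds exponential backoff and an overall timeout; `poll_job` is a hypothetical sketch, not part of the gateway or any SDK, and the terminal statuses it checks are the ones documented in the table below.

```python
import time

def poll_job(fetch, timeout=300.0, initial_delay=1.0, max_delay=30.0):
    """Poll fetch() until the job reaches a terminal status.

    fetch is any zero-argument callable returning the parsed JSON of
    GET /v1/async/{job_id}; injecting it keeps the helper testable.
    """
    delay = initial_delay
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        data = fetch()
        # completed, failed, and cancelled are terminal; keep waiting otherwise.
        if data["status"] in ("completed", "failed", "cancelled"):
            return data
        time.sleep(delay)
        # Double the delay each round, capped at max_delay.
        delay = min(delay * 2, max_delay)
    raise TimeoutError("job did not reach a terminal status within the timeout")
```

Call it with the request from the polling example, for instance `poll_job(lambda: requests.get(f"https://gateway.futureagi.com/v1/async/{job_id}", headers=headers).json())`.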
Job statuses
| Status | Description |
|---|---|
| pending | Job is queued |
| running | Job is being processed |
| completed | Result is ready |
| failed | Job failed (check the error field) |
| cancelled | Job was cancelled |
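A job that is still pending or running can be cancelled with DELETE /v1/async/{job_id} (see the endpoints table above). A minimal sketch in Python, assuming the endpoint returns the updated job JSON; `cancel_async_job` and the injectable `session` parameter are illustrative, not part of any SDK:

```python
def cancel_async_job(job_id, api_key,
                     base_url="https://gateway.futureagi.com", session=None):
    """Cancel an async job via DELETE /v1/async/{job_id}."""
    if session is None:
        import requests  # deferred so a stub session can be injected in tests
        session = requests
    resp = session.delete(
        f"{base_url}/v1/async/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    resp.raise_for_status()
    return resp.json()
```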
Scheduled completions
Schedule a request to run at a specific time. Useful for time-sensitive content generation or deferred workloads.
curl -X POST https://gateway.futureagi.com/v1/scheduled \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "scheduled_at": "2026-04-05T09:00:00Z",
    "request": {
      "model": "gpt-4o",
      "messages": [{"role": "user", "content": "Generate the daily summary report"}]
    }
  }'
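The same request can be issued from Python. A sketch assuming the payload shape shown above; the `schedule_completion` helper and its injectable `session` parameter are illustrative, not part of any SDK:

```python
from datetime import timezone

def schedule_completion(request_body, run_at, api_key,
                        base_url="https://gateway.futureagi.com", session=None):
    """POST /v1/scheduled with an ISO-8601 UTC scheduled_at timestamp."""
    if session is None:
        import requests  # deferred so a stub session can be injected in tests
        session = requests
    payload = {
        # Normalize the datetime to UTC and format it like the curl example.
        "scheduled_at": run_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "request": request_body,
    }
    resp = session.post(
        f"{base_url}/v1/scheduled",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json=payload,
    )
    resp.raise_for_status()
    return resp.json()
```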
Managing scheduled jobs
# List scheduled jobs
curl https://gateway.futureagi.com/v1/scheduled \
  -H "Authorization: Bearer sk-prism-your-key"
# Get a specific job
curl https://gateway.futureagi.com/v1/scheduled/job_123 \
  -H "Authorization: Bearer sk-prism-your-key"
# Cancel a scheduled job
curl -X DELETE https://gateway.futureagi.com/v1/scheduled/job_123 \
  -H "Authorization: Bearer sk-prism-your-key"
Batch processing
For high-volume workloads, the OpenAI Batch API lets you submit a file of requests and retrieve results when processing is complete. Batch requests typically run at lower cost (50% discount with OpenAI).
Creating a batch
from openai import OpenAI
import json
client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)
# 1. Create a JSONL file with requests
requests_data = [
    {
        "custom_id": "req-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "Summarize: Machine learning is..."}],
        },
    },
    {
        "custom_id": "req-2",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "Summarize: Neural networks are..."}],
        },
    },
]
with open("batch_input.jsonl", "w") as f:
    for req in requests_data:
        f.write(json.dumps(req) + "\n")
# 2. Upload the input file
input_file = client.files.create(
    file=open("batch_input.jsonl", "rb"),
    purpose="batch",
)
# 3. Create the batch
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}, Status: {batch.status}")
Checking batch status
import time
while True:
    batch = client.batches.retrieve(batch.id)
    print(f"Status: {batch.status} ({batch.request_counts.completed}/{batch.request_counts.total})")
    if batch.status == "completed":
        break
    elif batch.status in ("failed", "cancelled", "expired"):
        print(f"Batch ended: {batch.status}")
        break
    time.sleep(30)
Retrieving results
if batch.output_file_id:
    content = client.files.content(batch.output_file_id)
    results = content.text.strip().split("\n")
    for line in results:
        result = json.loads(line)
        print(f"{result['custom_id']}: {result['response']['body']['choices'][0]['message']['content'][:100]}")
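Output lines in the batch result file may not match the order of the input file, so it is safer to join results back to requests by `custom_id`. A small illustrative helper (not part of any SDK) that also keeps per-request errors, assuming each output record carries either a `response` or an `error` field:

```python
import json

def index_batch_results(jsonl_text):
    """Map custom_id -> response body (or error) from a batch output file."""
    results = {}
    for line in jsonl_text.strip().splitlines():
        if not line:
            continue
        record = json.loads(line)
        if record.get("error"):
            # The request itself failed; keep the error under the same key.
            results[record["custom_id"]] = {"error": record["error"]}
        else:
            results[record["custom_id"]] = record["response"]["body"]
    return results
```

With the code above, `by_id = index_batch_results(content.text)` lets you look up `by_id["req-1"]["choices"][0]["message"]["content"]` regardless of output order.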
When to use each mode
| Mode | Best for | Latency | Cost |
|---|---|---|---|
| Synchronous | Interactive apps, real-time responses | Lowest | Standard |
| Async | Long-running requests, fire-and-forget | Medium (poll) | Standard |
| Scheduled | Time-triggered jobs, deferred work | Scheduled | Standard |
| Batch | High-volume processing, data pipelines | Hours | Discounted (up to 50% off) |