Embeddings & reranking
Generate text embeddings and rerank documents through the Prism Gateway.
About
Prism proxies embedding and reranking requests to any configured provider. The API follows the OpenAI format for embeddings and a similar format for reranking. All gateway features (caching, cost tracking, rate limiting, failover) apply to these endpoints the same way they apply to chat completions.
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/embeddings | Generate vector embeddings for text |
| POST | /v1/rerank | Rerank documents by relevance to a query |
Embeddings
Basic usage
Prism SDK

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog",
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")
print(f"Cost: {response.prism.cost}")
```

OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog",
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")
```

LiteLLM

```python
import litellm

response = litellm.embedding(
    model="openai/text-embedding-3-small",
    input=["The quick brown fox jumps over the lazy dog"],
    api_base="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")
```

cURL

```shell
curl -X POST https://gateway.futureagi.com/v1/embeddings \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-3-small",
    "input": "The quick brown fox jumps over the lazy dog"
  }'
```

Batch embeddings
Pass an array to embed multiple texts in a single request. Each item in the response includes an index field matching its position in the input array.
```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "First document about machine learning",
        "Second document about web development",
        "Third document about database design",
    ],
)

for item in response.data:
    print(f"Input {item.index}: {len(item.embedding)} dimensions")
```
Reduced dimensions
Some models support returning shorter vectors. Use the dimensions parameter to reduce the output size. Smaller vectors use less storage and are faster to compare, at the cost of some accuracy.
```python
# Full dimensions (1536 for text-embedding-3-small)
full = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
)
print(f"Full: {len(full.data[0].embedding)} dims")

# Reduced to 512 dimensions
reduced = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
    dimensions=512,
)
print(f"Reduced: {len(reduced.data[0].embedding)} dims")
```
Note
The `dimensions` parameter is supported by OpenAI's text-embedding-3-* models and some Cohere models. Older models like `text-embedding-ada-002` do not support it.
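The gateway passes `dimensions` through to the provider rather than shortening vectors itself. For models trained with Matryoshka-style representations (such as the text-embedding-3-* family), you can approximate the same reduction client-side by truncating the full vector and renormalizing it to unit length — a minimal sketch, not something the older models like ada-002 were trained to support:

```python
import math

def truncate_and_normalize(vec: list[float], dims: int) -> list[float]:
    # Keep the first `dims` components, then rescale to unit length
    # so cosine/dot-product comparisons remain meaningful.
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

v = truncate_and_normalize([3.0, 4.0, 0.1, -0.2], 2)
print(v)  # [0.6, 0.8]
```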
Encoding format
By default, embeddings are returned as arrays of floats. For lower bandwidth, request base64 encoding:
```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
    encoding_format="base64",
)
# response.data[0].embedding is a base64 string
```
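The base64 payload packs the vector as consecutive little-endian float32 values (the layout the OpenAI-compatible API uses). A small round-trip sketch of how a client can decode it:

```python
import base64
import struct

def decode_embedding(b64: str) -> list[float]:
    """Decode a base64-encoded embedding into a list of float32 values."""
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Round-trip a small vector to demonstrate the format
vec = [0.25, -0.5, 1.0]
encoded = base64.b64encode(struct.pack(f"<{len(vec)}f", *vec)).decode()
print(decode_embedding(encoded))  # [0.25, -0.5, 1.0]
```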
Response format
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023, -0.0091, 0.0152, ...]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 9
  }
}
```
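Once you have vectors, relevance is typically scored with cosine similarity. (Many embedding models, including OpenAI's, return unit-length vectors, in which case a plain dot product gives the same ranking.) A dependency-free sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

a = [1.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]
print(round(cosine_similarity(a, b), 4))  # 0.5
```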
Reranking
Reranking takes a query and a list of documents, then returns the documents sorted by relevance. Use it after an initial retrieval step (vector search, BM25) to improve ranking quality before passing results to an LLM.
Basic usage
Prism SDK

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

documents = [
    "Machine learning is a branch of artificial intelligence.",
    "Dogs are popular household pets.",
    "Neural networks learn patterns from data.",
    "The weather in Paris is mild in spring.",
]

response = client.rerank.create(
    model="rerank-v3.5",
    query="What is machine learning?",
    documents=documents,
)

for result in response.results:
    print(f"Index: {result.index}, Score: {result.relevance_score:.4f}")
    print(f"  {documents[result.index]}")
```

cURL

```shell
curl -X POST https://gateway.futureagi.com/v1/rerank \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rerank-v3.5",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a branch of artificial intelligence.",
      "Dogs are popular household pets.",
      "Neural networks learn patterns from data.",
      "The weather in Paris is mild in spring."
    ]
  }'
```

Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Reranking model to use |
| query | string | Yes | The search query to rank against |
| documents | string[] | Yes | List of text documents to rerank |
| top_n | integer | No | Return only the top N results. Defaults to all documents. |
| return_documents | boolean | No | Include the document text in the response. Default: false. |
Limiting results
Use top_n to return only the most relevant documents:
```python
response = client.rerank.create(
    model="rerank-v3.5",
    query="What is machine learning?",
    documents=["doc1...", "doc2...", "doc3...", "doc4..."],
    top_n=2,  # only return the 2 most relevant
)
```
Response format
```json
{
  "results": [
    {
      "index": 0,
      "relevance_score": 0.9875,
      "document": "Machine learning is a branch of artificial intelligence."
    },
    {
      "index": 2,
      "relevance_score": 0.8432,
      "document": "Neural networks learn patterns from data."
    }
  ],
  "model": "rerank-v3.5",
  "usage": {
    "prompt_tokens": 42,
    "total_tokens": 42
  }
}
```
The `document` field is only present when `return_documents=true`.
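Because the response references documents by `index`, keep the original list around to recover the text when `return_documents` is false. The dicts below mirror the JSON shape above (not the SDK's response objects):

```python
documents = [
    "Machine learning is a branch of artificial intelligence.",
    "Dogs are popular household pets.",
    "Neural networks learn patterns from data.",
]

# Results as returned with return_documents=false: index + score only
results = [
    {"index": 0, "relevance_score": 0.9875},
    {"index": 2, "relevance_score": 0.8432},
]

# Map each result back to its original text by index
ranked_texts = [documents[r["index"]] for r in results]
print(ranked_texts[0])  # Machine learning is a branch of artificial intelligence.
```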
Supported models
Embedding models
| Provider | Models | Dimensions |
|---|---|---|
| OpenAI | text-embedding-3-small | 1536 (or custom via dimensions) |
| OpenAI | text-embedding-3-large | 3072 (or custom via dimensions) |
| OpenAI | text-embedding-ada-002 | 1536 |
| Google | gemini-embedding-001 | 768 |
| Cohere | embed-english-v3.0, embed-multilingual-v3.0 | 1024 |
Reranking models
| Provider | Models |
|---|---|
| Cohere | rerank-v3.5, rerank-english-v3.0, rerank-multilingual-v3.0 |
Note
Available models depend on which providers are configured for your organization. Use `GET /v1/models` to see what's available on your key.
RAG pipeline example
A typical retrieval-augmented generation pipeline using embeddings for search and reranking for precision:
```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

# Step 1: Embed the query
query = "How does photosynthesis work?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=query,
).data[0].embedding

# Step 2: Search your vector database (pseudo-code)
# candidates = vector_db.search(query_embedding, top_k=20)

# Step 3: Rerank the candidates for better precision
candidates = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Plants use chlorophyll to absorb sunlight during photosynthesis.",
    "The mitochondria is the powerhouse of the cell.",
    "Carbon dioxide and water are inputs to the photosynthesis process.",
]
reranked = client.rerank.create(
    model="rerank-v3.5",
    query=query,
    documents=candidates,
    top_n=3,
)

# Step 4: Use the top results as context for the LLM
context = "\n".join(
    candidates[r.index] for r in reranked.results
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(response.choices[0].message.content)
```
Caching embeddings
The same input always produces the same vector, so embeddings are a good fit for exact-match caching. With caching enabled, repeated inputs return instantly without calling the provider:
```python
from prism import Prism, GatewayConfig, CacheConfig

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, strategy="exact", ttl=86400),
    ),
)

# First call: cache miss, calls the provider
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
)
print(response.prism.cache_status)  # None or "miss"

# Second call with same input: cache hit, instant response
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
)
print(response.prism.cache_status)  # "hit_exact"
print(response.prism.cost)  # 0 (no provider call)
```