Embeddings & reranking

Generate text embeddings and rerank documents through the Prism Gateway.

About

Prism proxies embedding and reranking requests to any configured provider. The API follows the OpenAI format for embeddings and a similar format for reranking. All gateway features (caching, cost tracking, rate limiting, failover) apply to these endpoints the same way they apply to chat completions.


Endpoints

Method  Path             Description
POST    /v1/embeddings   Generate vector embeddings for text
POST    /v1/rerank       Rerank documents by relevance to a query

Embeddings

Basic usage

Prism SDK

from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog",
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")
print(f"Cost: {response.prism.cost}")
OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog",
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")
LiteLLM

import litellm

response = litellm.embedding(
    model="openai/text-embedding-3-small",
    input=["The quick brown fox jumps over the lazy dog"],
    api_base="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")
cURL

curl -X POST https://gateway.futureagi.com/v1/embeddings \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-3-small",
    "input": "The quick brown fox jumps over the lazy dog"
  }'

Batch embeddings

Pass an array to embed multiple texts in a single request. Each item in the response includes an index field matching its position in the input array.

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "First document about machine learning",
        "Second document about web development",
        "Third document about database design",
    ],
)

for item in response.data:
    print(f"Input {item.index}: {len(item.embedding)} dimensions")
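Once you have a batch of vectors, cosine similarity is the standard way to compare them. A minimal helper in plain Python (no dependency on the gateway; the commented usage assumes the `response` object from the batch example above):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With a batch response, compare the first document against the rest:
# vectors = [item.embedding for item in response.data]
# for i, vec in enumerate(vectors[1:], start=1):
#     print(f"doc 0 vs doc {i}: {cosine_similarity(vectors[0], vec):.4f}")
```

In practice you would use NumPy or your vector database for this, but the math is exactly what is shown here.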

Reduced dimensions

Some models support returning shorter vectors. Use the dimensions parameter to reduce the output size. Smaller vectors use less storage and are faster to compare, at the cost of some accuracy.

# Full dimensions (1536 for text-embedding-3-small)
full = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
)
print(f"Full: {len(full.data[0].embedding)} dims")

# Reduced to 512 dimensions
reduced = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
    dimensions=512,
)
print(f"Reduced: {len(reduced.data[0].embedding)} dims")

Note

The dimensions parameter is supported by OpenAI’s text-embedding-3-* models and some Cohere models. Older models like text-embedding-ada-002 do not support it.
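When the dimensions parameter is available, prefer it. But for matryoshka-trained models such as OpenAI's text-embedding-3-* family, you can also shorten a full vector client-side by truncating it and re-normalizing to unit length. A sketch:

```python
import math

def truncate_and_normalize(embedding: list[float], dims: int) -> list[float]:
    """Truncate a vector to `dims` entries and rescale it to unit length,
    so cosine similarity still behaves correctly on the shorter vector."""
    cut = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    return [x / norm for x in cut]
```

Re-normalizing matters: a truncated vector is no longer unit-length, and dot-product similarity on unnormalized vectors gives skewed scores.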

Encoding format

By default, embeddings are returned as arrays of floats. For lower bandwidth, request base64 encoding:

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
    encoding_format="base64",
)
# response.data[0].embedding is a base64 string
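For OpenAI-compatible responses, the base64 string is the raw little-endian float32 buffer, so decoding it back to floats takes a few lines of standard library code:

```python
import base64
import struct

def decode_embedding(b64: str) -> list[float]:
    """Decode a base64-encoded embedding (little-endian float32) into floats."""
    raw = base64.b64decode(b64)
    count = len(raw) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", raw))
```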

Response format

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023, -0.0091, 0.0152, ...]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 9
  }
}

Reranking

Reranking takes a query and a list of documents, then returns the documents sorted by relevance. Use it after an initial retrieval step (vector search, BM25) to improve ranking quality before passing results to an LLM.

Basic usage

Prism SDK

from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

documents = [
    "Machine learning is a branch of artificial intelligence.",
    "Dogs are popular household pets.",
    "Neural networks learn patterns from data.",
    "The weather in Paris is mild in spring.",
]

response = client.rerank.create(
    model="rerank-v3.5",
    query="What is machine learning?",
    documents=documents,
)

for result in response.results:
    print(f"Index: {result.index}, Score: {result.relevance_score:.4f}")
    print(f"  {documents[result.index]}")
cURL

curl -X POST https://gateway.futureagi.com/v1/rerank \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rerank-v3.5",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a branch of artificial intelligence.",
      "Dogs are popular household pets.",
      "Neural networks learn patterns from data.",
      "The weather in Paris is mild in spring."
    ]
  }'

Parameters

Parameter         Type      Required  Description
model             string    Yes       Reranking model to use
query             string    Yes       The search query to rank against
documents         string[]  Yes       List of text documents to rerank
top_n             integer   No        Return only the top N results. Defaults to all documents.
return_documents  boolean   No        Include the document text in the response. Default: false.

Limiting results

Use top_n to return only the most relevant documents:

response = client.rerank.create(
    model="rerank-v3.5",
    query="What is machine learning?",
    documents=["doc1...", "doc2...", "doc3...", "doc4..."],
    top_n=2,  # only return the 2 most relevant
)

Response format

{
  "results": [
    {
      "index": 0,
      "relevance_score": 0.9875,
      "document": "Machine learning is a branch of artificial intelligence."
    },
    {
      "index": 2,
      "relevance_score": 0.8432,
      "document": "Neural networks learn patterns from data."
    }
  ],
  "model": "rerank-v3.5",
  "usage": {
    "prompt_tokens": 42,
    "total_tokens": 42
  }
}

The document field is only present when return_documents=true.


Supported models

Embedding models

Provider  Models                                       Dimensions
OpenAI    text-embedding-3-small                       1536 (or custom via dimensions)
OpenAI    text-embedding-3-large                       3072 (or custom via dimensions)
OpenAI    text-embedding-ada-002                       1536
Google    gemini-embedding-001                         768
Cohere    embed-english-v3.0, embed-multilingual-v3.0  1024

Reranking models

Provider  Models
Cohere    rerank-v3.5, rerank-english-v3.0, rerank-multilingual-v3.0

Note

Available models depend on which providers are configured for your organization. Use GET /v1/models to see what’s available on your key.


RAG pipeline example

A typical retrieval-augmented generation pipeline using embeddings for search and reranking for precision:

from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

# Step 1: Embed the query
query = "How does photosynthesis work?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=query,
).data[0].embedding

# Step 2: Search your vector database (pseudo-code)
# candidates = vector_db.search(query_embedding, top_k=20)

# Step 3: Rerank the candidates for better precision
candidates = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Plants use chlorophyll to absorb sunlight during photosynthesis.",
    "The mitochondria is the powerhouse of the cell.",
    "Carbon dioxide and water are inputs to the photosynthesis process.",
]

reranked = client.rerank.create(
    model="rerank-v3.5",
    query=query,
    documents=candidates,
    top_n=3,
)

# Step 4: Use the top results as context for the LLM
context = "\n".join(
    candidates[r.index] for r in reranked.results
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": query},
    ],
)

print(response.choices[0].message.content)

Caching embeddings

The same input always produces the same vector, so embeddings are a good fit for exact-match caching. With caching enabled, repeated inputs return instantly without calling the provider:

from prism import Prism, GatewayConfig, CacheConfig

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, strategy="exact", ttl=86400),
    ),
)

# First call: cache miss, calls the provider
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
)
print(response.prism.cache_status)  # None or "miss"

# Second call with same input: cache hit, instant response
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
)
print(response.prism.cache_status)  # "hit_exact"
print(response.prism.cost)          # 0 (no provider call)
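For intuition, an exact-match cache key can be as simple as a hash of the normalized request. The gateway's actual keying is internal and may differ; this is a local sketch of the idea:

```python
import hashlib
import json

def cache_key(model: str, input_text: str) -> str:
    """Deterministic cache key for an embedding request: hash of the
    canonical JSON form of (model, input). Identical requests always
    produce identical keys, so they can be served from cache."""
    payload = json.dumps({"model": model, "input": input_text}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Because embeddings are deterministic per (model, input) pair, a long TTL like the 86400 seconds above is usually safe; the cached vector only goes stale if the provider changes the model behind the name.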
