Embeddings & reranking
Generate text embeddings and rerank documents through the Prism Gateway.
About
Prism proxies embedding and reranking requests to any configured provider. The API follows the OpenAI format for embeddings and a similar format for reranking. All gateway features (caching, cost tracking, rate limiting, failover) apply to these endpoints the same way they apply to chat completions.
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/embeddings | Generate vector embeddings for text |
| POST | /v1/rerank | Rerank documents by relevance to a query |
Embeddings
Basic usage
Prism SDK

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog",
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")
print(f"Cost: {response.prism.cost}")
```

OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog",
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")
```

LiteLLM

```python
import litellm

response = litellm.embedding(
    model="openai/text-embedding-3-small",
    input=["The quick brown fox jumps over the lazy dog"],
    api_base="https://gateway.futureagi.com/v1",
    api_key="sk-prism-your-key",
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")
```

cURL

```shell
curl -X POST https://gateway.futureagi.com/v1/embeddings \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-3-small",
    "input": "The quick brown fox jumps over the lazy dog"
  }'
```

Batch embeddings
Pass an array to embed multiple texts in a single request. Each item in the response includes an index field matching its position in the input array.
```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "First document about machine learning",
        "Second document about web development",
        "Third document about database design",
    ],
)

for item in response.data:
    print(f"Input {item.index}: {len(item.embedding)} dimensions")
```
Reduced dimensions
Some models support returning shorter vectors. Use the dimensions parameter to reduce the output size. Smaller vectors use less storage and are faster to compare, at the cost of some accuracy.
```python
# Full dimensions (1536 for text-embedding-3-small)
full = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
)
print(f"Full: {len(full.data[0].embedding)} dims")

# Reduced to 512 dimensions
reduced = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
    dimensions=512,
)
print(f"Reduced: {len(reduced.data[0].embedding)} dims")
```
Note
The `dimensions` parameter is supported by OpenAI's text-embedding-3-* models and some Cohere models. Older models like `text-embedding-ada-002` do not support it.
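The gateway passes `dimensions` through to the provider rather than shortening vectors itself. For models trained with Matryoshka-style representations (such as the text-embedding-3-* family), you can approximate the same reduction client-side by truncating the full vector and renormalizing it to unit length — a minimal sketch, not something the older models like ada-002 were trained to support:

```python
import math

def truncate_and_normalize(vec: list[float], dims: int) -> list[float]:
    # Keep the first `dims` components, then rescale to unit length
    # so cosine/dot-product comparisons remain meaningful.
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

v = truncate_and_normalize([3.0, 4.0, 0.1, -0.2], 2)
print(v)  # [0.6, 0.8]
```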
Encoding format
By default, embeddings are returned as arrays of floats. For lower bandwidth, request base64 encoding:
```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
    encoding_format="base64",
)
# response.data[0].embedding is a base64 string
```
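The base64 payload packs the vector as consecutive little-endian float32 values (the layout the OpenAI-compatible API uses). A small round-trip sketch of how a client can decode it:

```python
import base64
import struct

def decode_embedding(b64: str) -> list[float]:
    """Decode a base64-encoded embedding into a list of float32 values."""
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Round-trip a small vector to demonstrate the format
vec = [0.25, -0.5, 1.0]
encoded = base64.b64encode(struct.pack(f"<{len(vec)}f", *vec)).decode()
print(decode_embedding(encoded))  # [0.25, -0.5, 1.0]
```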
Response format
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023, -0.0091, 0.0152, ...]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 9
  }
}
```
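Once you have vectors, relevance is typically scored with cosine similarity. (Many embedding models, including OpenAI's, return unit-length vectors, in which case a plain dot product gives the same ranking.) A dependency-free sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

a = [1.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]
print(round(cosine_similarity(a, b), 4))  # 0.5
```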
Reranking
Reranking takes a query and a list of documents, then returns the documents sorted by relevance. Use it after an initial retrieval step (vector search, BM25) to improve ranking quality before passing results to an LLM.
Basic usage
Prism SDK

```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

documents = [
    "Machine learning is a branch of artificial intelligence.",
    "Dogs are popular household pets.",
    "Neural networks learn patterns from data.",
    "The weather in Paris is mild in spring.",
]

response = client.rerank.create(
    model="rerank-v3.5",
    query="What is machine learning?",
    documents=documents,
)

for result in response.results:
    print(f"Index: {result.index}, Score: {result.relevance_score:.4f}")
    print(f"  {documents[result.index]}")
```

cURL

```shell
curl -X POST https://gateway.futureagi.com/v1/rerank \
  -H "Authorization: Bearer sk-prism-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "rerank-v3.5",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a branch of artificial intelligence.",
      "Dogs are popular household pets.",
      "Neural networks learn patterns from data.",
      "The weather in Paris is mild in spring."
    ]
  }'
```

Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Reranking model to use |
| query | string | Yes | The search query to rank against |
| documents | string[] | Yes | List of text documents to rerank |
| top_n | integer | No | Return only the top N results. Defaults to all documents. |
| return_documents | boolean | No | Include the document text in the response. Default: false. |
Limiting results
Use top_n to return only the most relevant documents:
```python
response = client.rerank.create(
    model="rerank-v3.5",
    query="What is machine learning?",
    documents=["doc1...", "doc2...", "doc3...", "doc4..."],
    top_n=2,  # only return the 2 most relevant
)
```
Response format
```json
{
  "results": [
    {
      "index": 0,
      "relevance_score": 0.9875,
      "document": "Machine learning is a branch of artificial intelligence."
    },
    {
      "index": 2,
      "relevance_score": 0.8432,
      "document": "Neural networks learn patterns from data."
    }
  ],
  "model": "rerank-v3.5",
  "usage": {
    "prompt_tokens": 42,
    "total_tokens": 42
  }
}
```
The `document` field is only present when `return_documents=true`.
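Because the response references documents by `index`, keep the original list around to recover the text when `return_documents` is false. The dicts below mirror the JSON shape above (not the SDK's response objects):

```python
documents = [
    "Machine learning is a branch of artificial intelligence.",
    "Dogs are popular household pets.",
    "Neural networks learn patterns from data.",
]

# Results as returned with return_documents=false: index + score only
results = [
    {"index": 0, "relevance_score": 0.9875},
    {"index": 2, "relevance_score": 0.8432},
]

# Map each result back to its original text by index
ranked_texts = [documents[r["index"]] for r in results]
print(ranked_texts[0])  # Machine learning is a branch of artificial intelligence.
```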
Supported models
Embedding models
| Provider | Models | Dimensions |
|---|---|---|
| OpenAI | text-embedding-3-small | 1536 (or custom via dimensions) |
| OpenAI | text-embedding-3-large | 3072 (or custom via dimensions) |
| OpenAI | text-embedding-ada-002 | 1536 |
| Google | gemini-embedding-001 | 768 |
| Cohere | embed-english-v3.0, embed-multilingual-v3.0 | 1024 |
Reranking models
| Provider | Models |
|---|---|
| Cohere | rerank-v3.5, rerank-english-v3.0, rerank-multilingual-v3.0 |
Note
Available models depend on which providers are configured for your organization. Use `GET /v1/models` to see what's available on your key.
RAG pipeline example
A typical retrieval-augmented generation pipeline using embeddings for search and reranking for precision:
```python
from prism import Prism

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
)

# Step 1: Embed the query
query = "How does photosynthesis work?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=query,
).data[0].embedding

# Step 2: Search your vector database (pseudo-code)
# candidates = vector_db.search(query_embedding, top_k=20)

# Step 3: Rerank the candidates for better precision
candidates = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Plants use chlorophyll to absorb sunlight during photosynthesis.",
    "The mitochondria is the powerhouse of the cell.",
    "Carbon dioxide and water are inputs to the photosynthesis process.",
]
reranked = client.rerank.create(
    model="rerank-v3.5",
    query=query,
    documents=candidates,
    top_n=3,
)

# Step 4: Use the top results as context for the LLM
context = "\n".join(
    candidates[r.index] for r in reranked.results
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(response.choices[0].message.content)
```
Caching embeddings
The same input always produces the same vector, so embeddings are a good fit for exact-match caching. With caching enabled, repeated inputs return instantly without calling the provider:
```python
from prism import Prism, GatewayConfig, CacheConfig

client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, strategy="exact", ttl=86400),
    ),
)

# First call: cache miss, calls the provider
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
)
print(response.prism.cache_status)  # None or "miss"

# Second call with same input: cache hit, instant response
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Hello world",
)
print(response.prism.cache_status)  # "hit_exact"
print(response.prism.cost)  # 0 (no provider call)
```