curl --request POST \
  --url https://api.zeroentropy.dev/v1/models/rerank \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "<string>",
    "query": "<string>",
    "documents": [
      "<string>"
    ],
    "top_n": 123,
    "latency": "fast"
  }'

{
  "results": [
    {
      "index": 123,
      "relevance_score": 123
    }
  ],
  "total_bytes": 123,
  "total_tokens": 123,
  "actual_latency_mode": "fast",
  "e2e_latency": 123,
  "inference_latency": 123
}

Reranks the provided documents according to the provided query.
The results are sorted in descending order of relevance. For each document, an index and a score are returned: the index is relative to the documents array that was passed in, and the score is the query-document relevance determined by the reranker model.
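For example, the returned indices can be mapped back to the original strings (a minimal Python sketch; the response values and documents here are illustrative, not real API output):

```python
# Hypothetical parsed response from /v1/models/rerank (illustrative values).
response = {
    "results": [
        {"index": 2, "relevance_score": 0.91},
        {"index": 0, "relevance_score": 0.47},
    ]
}

# The documents array that was sent in the request.
documents = ["alpha", "beta", "gamma"]

# Each result's index points back into the original documents list,
# and results arrive in descending order of relevance.
ranked = [(documents[r["index"]], r["relevance_score"]) for r in response["results"]]
print(ranked)  # [('gamma', 0.91), ('alpha', 0.47)]
```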
By default, organizations have a ratelimit of 500,000 bytes-per-minute (BPM) and 1,000 requests-per-minute (RPM). Ratelimits are refreshed every 15 seconds. If the fast limit is exceeded, requests will be throttled into latency: "slow" mode, which allows up to 5,000,000 bytes-per-minute. If even this is exceeded, you will receive a 429 error.
The "bytes" used by a request is calculated as sum(150 + len(query.encode('utf-8')) + len(d.encode('utf-8')) for d in documents). Note the baseline overhead of 150 bytes per document, and that the query's bytes are counted once per document, since rerankers are cross-encoders that score each query-document pair. The maximum per-request payload size is 5,000,000 bytes.
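The byte-accounting formula above can be written as a small helper for estimating usage before sending a request (a minimal sketch of the stated formula, not an official SDK function):

```python
def request_bytes(query: str, documents: list[str]) -> int:
    """Bytes counted toward the ratelimit: per document, a 150-byte
    baseline plus the UTF-8 lengths of the query and the document
    (the query is counted once per document, since rerankers are
    cross-encoders that score each query-document pair)."""
    q = len(query.encode("utf-8"))
    return sum(150 + q + len(d.encode("utf-8")) for d in documents)

print(request_bytes("hi", ["abc", "de"]))  # (150+2+3) + (150+2+2) = 309
```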
To increase your ratelimits, subscribe to a higher tier on the ZeroEntropy dashboard. Any payments made for subscriptions in a calendar month will be deducted from your usage charges for that month.
To request even higher ratelimits, please contact [email protected] or message us on Discord or Slack!
model
The model ID to use for reranking. Options are: ["zerank-2", "zerank-1", "zerank-1-small"]

query
The query to rerank the documents by.

documents
The list of documents to rerank. Each document is a string.

top_n
If provided, only the top n documents will be returned in the results array. Otherwise, n defaults to the length of the provided documents array.

latency
Whether the call will be inferenced "fast" or "slow". Ratelimits for slow API calls are orders of magnitude higher, but you can expect >10 second latency. Fast inferences are guaranteed subsecond, but ratelimits are lower. If not specified, a "fast" call will be attempted first; if you have exceeded your fast ratelimit, a slow call will be executed instead. If explicitly set to "fast", a 429 will be returned if the request cannot be executed fast.
Available options: fast, slow

Successful Response
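Putting the request fields together, a call to this endpoint can be sketched with Python's standard library (a minimal sketch: the token, query, and documents are placeholders you supply, zerank-2 is one of the model IDs listed above, and handling for the 429 error is left out):

```python
import json
import urllib.request

API_URL = "https://api.zeroentropy.dev/v1/models/rerank"

def rerank(token, query, documents, top_n=None, latency=None):
    """POST a rerank request. Omitting `latency` lets the API try fast
    inference first and fall back to slow if the fast ratelimit is hit;
    passing "fast" explicitly yields a 429 instead of a slow fallback."""
    payload = {"model": "zerank-2", "query": query, "documents": documents}
    if top_n is not None:
        payload["top_n"] = top_n
    if latency is not None:
        payload["latency"] = latency
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": "Bearer " + token,
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # raises HTTPError on 429
        return json.loads(resp.read().decode("utf-8"))
```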
results
The results, in descending order of relevance to the query.
total_bytes
The total number of bytes in the request. This is used for ratelimiting.

total_tokens
The total number of tokens in the request. This is used for billing.

actual_latency_mode
The latency mode actually used for inference. If "auto" is requested (the default when no latency is specified), then "fast" will be used, with "slow" as a fallback if your ratelimit is exceeded. Otherwise, this field will be identical to the requested latency mode. Available options: fast, slow

e2e_latency
The total time, in seconds, between the rerank request being received and the rerank response being returned. Client-side latency should equal e2e_latency plus your ping to ZeroEntropy's API.

inference_latency
The time, in seconds, spent actually inferencing the request. If this is significantly lower than e2e_latency, it is likely due to ratelimiting. Please request a higher ratelimit at [email protected] or message us on Discord or Slack!