curl --request POST \
  --url https://api.zeroentropy.dev/v1/models/rerank \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "<string>",
    "query": "<string>",
    "documents": [
      "<string>"
    ],
    "top_n": 123,
    "latency": "fast"
  }'

{
  "results": [
    {
      "index": 123,
      "relevance_score": 123
    }
  ],
  "total_bytes": 123,
  "total_tokens": 123,
  "actual_latency_mode": "fast",
  "e2e_latency": 123,
  "inference_latency": 123
}

Reranks the provided documents according to the provided query.
The results are sorted in descending order of relevance. For each document, an index and a score are returned: the index is relative to the documents array that was passed in, and the score is the query-document relevance determined by the reranker model.
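For example, the returned indices can be mapped back to the original strings (a minimal Python sketch; the response values and documents here are illustrative, not real API output):

```python
# Hypothetical parsed response from /v1/models/rerank (illustrative values).
response = {
    "results": [
        {"index": 2, "relevance_score": 0.91},
        {"index": 0, "relevance_score": 0.47},
    ]
}

# The documents array that was sent in the request.
documents = ["alpha", "beta", "gamma"]

# Each result's index points back into the original documents list,
# and results arrive in descending order of relevance.
ranked = [(documents[r["index"]], r["relevance_score"]) for r in response["results"]]
print(ranked)  # [('gamma', 0.91), ('alpha', 0.47)]
```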
By default, organizations have a ratelimit of 500,000 bytes-per-minute (BPM) and 1,000 requests-per-minute (RPM). Ratelimits are refreshed every 15 seconds. If the fast limit is exceeded, requests will be throttled into latency: "slow" mode, which allows up to 5,000,000 bytes-per-minute. If even this is exceeded, you will receive a 429 error.
The "bytes" used by a request is calculated as sum(150 + len(query.encode('utf-8')) + len(d.encode('utf-8')) for d in documents). Note the baseline overhead of 150 bytes per document, and that the query's bytes are counted once per document, since rerankers are cross-encoders that score each query-document pair. The maximum per-request payload size is 5,000,000 bytes.
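The byte-accounting formula above can be written as a small helper for estimating usage before sending a request (a minimal sketch of the stated formula, not an official SDK function):

```python
def request_bytes(query: str, documents: list[str]) -> int:
    """Bytes counted toward the ratelimit: per document, a 150-byte
    baseline plus the UTF-8 lengths of the query and the document
    (the query is counted once per document, since rerankers are
    cross-encoders that score each query-document pair)."""
    q = len(query.encode("utf-8"))
    return sum(150 + q + len(d.encode("utf-8")) for d in documents)

print(request_bytes("hi", ["abc", "de"]))  # (150+2+3) + (150+2+2) = 309
```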
To increase your ratelimits, subscribe to a higher tier on the ZeroEntropy dashboard. Any payments made for subscriptions in a calendar month will be deducted from your usage charges for that month.
To request even higher ratelimits, please contact [email protected] or message us on Discord or Slack!
model
The model ID to use for reranking. Options are: ["zerank-2", "zerank-1", "zerank-1-small"]

query
The query to rerank the documents by.

documents
The list of documents to rerank. Each document is a string.

top_n
If provided, only the top n documents will be returned in the results array. Otherwise, n defaults to the length of the provided documents array.

latency
Whether the call will be inferenced "fast" or "slow". Ratelimits for slow API calls are orders of magnitude higher, but you can expect >10 second latency. Fast inferences are guaranteed subsecond, but ratelimits are lower. If not specified, a "fast" call will be attempted first; if you have exceeded your fast ratelimit, a slow call will be executed instead. If explicitly set to "fast", a 429 will be returned if the request cannot be executed fast.
Available options: fast, slow

Successful Response
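Putting the request fields together, a call to this endpoint can be sketched with Python's standard library (a minimal sketch: the token, query, and documents are placeholders you supply, zerank-2 is one of the model IDs listed above, and handling for the 429 error is left out):

```python
import json
import urllib.request

API_URL = "https://api.zeroentropy.dev/v1/models/rerank"

def rerank(token, query, documents, top_n=None, latency=None):
    """POST a rerank request. Omitting `latency` lets the API try fast
    inference first and fall back to slow if the fast ratelimit is hit;
    passing "fast" explicitly yields a 429 instead of a slow fallback."""
    payload = {"model": "zerank-2", "query": query, "documents": documents}
    if top_n is not None:
        payload["top_n"] = top_n
    if latency is not None:
        payload["latency"] = latency
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": "Bearer " + token,
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # raises HTTPError on 429
        return json.loads(resp.read().decode("utf-8"))
```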
results
The results, in descending order of relevance to the query.
total_bytes
The total number of bytes in the request. This is used for ratelimiting.

total_tokens
The total number of tokens in the request. This is used for billing.

actual_latency_mode
The latency mode actually used for inference. If "auto" is requested (the default when no latency is specified), then "fast" will be used, with "slow" as a fallback if your ratelimit is exceeded. Otherwise, this field will be identical to the requested latency mode. Available options: fast, slow

e2e_latency
The total time, in seconds, between the rerank request being received and the rerank response being returned. Client-side latency should equal e2e_latency plus your ping to ZeroEntropy's API.

inference_latency
The time, in seconds, spent actually inferencing the request. If this is significantly lower than e2e_latency, it is likely due to ratelimiting. Please request a higher ratelimit at [email protected] or message us on Discord or Slack!