High-performance retrieval for real enterprise data
Serve multimodal, multi-vector search with built-in reranking, hybrid BM25, graph augmentation, and source-level traceability — optimized for local deployment, high throughput, and grounded enterprise results.
Customer renewal obligations include a 60-day notice period and payment reconciliation under §12.4.
Query in. Grounded results out.
Enterprise retrieval is not just vector search with a chatbot on top.
Enterprise corpora are messy, multimodal, and operationally constrained. Dense ANN alone is not enough. Teams need stronger ranking, hybrid search, real provenance, and an engine that can run on hardware they can actually approve.
Dense-only search leaves quality on the table
Enterprise retrieval needs stronger ranking than approximate dense similarity alone — late interaction stays in the serving path, not in a research notebook.
Real corpora are multimodal and metadata-rich
Search has to work across text, images, pages, fields, and structured context — under one retrieval contract, not three glued-together stacks.
A production engine needs more than benchmark wins
Fast search is not enough without APIs, durable CRUD, WAL recovery, provenance, and deployment flexibility for the environments enterprises can approve.
One retrieval engine for modern enterprise search
Built to serve stronger enterprise retrieval without forcing teams into a distributed search stack.
One serving lane runs late-interaction MaxSim alongside routing, hybrid BM25, optional graph augmentation, and built-in reranking — under a single retrieval contract for CPU and GPU.
Accept text, image, or page-level inputs under one retrieval contract.
Narrow the candidate set with compact routing — no graph build required.
Apply MaxSim as the truth scorer in exact or quantized mode, on CPU or GPU.
Fuse BM25, graph signals, and metadata — only when they improve retrieval.
Refine the top set with built-in reranking before the result leaves the engine.
Emit grounded results with source, page, and lane provenance for downstream trust.
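The six steps above can be sketched end to end. Everything in this sketch — function names, document fields, the trivial stand-in scorers — is illustrative, not the engine's actual API:

```python
# Illustrative sketch of the serving lane: route -> score -> rank -> emit
# grounded results with provenance. The stand-in stages are deliberately
# trivial; the real engine uses compact routing and MaxSim scoring.
def route(query, index):
    # Stand-in routing: cheap lexical overlap narrows the candidate set.
    return [d for d in index if set(query.split()) & set(d["text"].split())]

def truth_score(query, doc):
    # Stand-in for the truth scorer: fraction of query terms covered.
    terms = query.split()
    return sum(t in doc["text"] for t in terms) / len(terms)

def run_query(query, index, k=10):
    candidates = route(query, index)                           # narrow
    scored = [(d, truth_score(query, d)) for d in candidates]  # score
    scored.sort(key=lambda pair: pair[1], reverse=True)        # rank top set
    return [{"text": d["text"], "score": s,                    # grounded output
             "source": d["source"], "page": d["page"]}
            for d, s in scored[:k]]

index = [
    {"text": "renewal obligations and notice period", "source": "contracts", "page": 12},
    {"text": "travel policy for staff", "source": "policy", "page": 3},
]
results = run_query("renewal obligations", index)
print(results[0]["source"], results[0]["page"])  # contracts 12
```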
Built around late interaction as the truth scorer.
Most systems optimize the shortlist and treat late interaction as optional. Latence does the reverse: MaxSim remains the final scorer, while routing, pruning, fusion, and quantization make it practical in production.
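As a concrete reference, MaxSim itself is a few lines: score a document by taking, for each query token, its best-matching document token and summing those maxima. A minimal NumPy sketch, not the engine's implementation:

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction MaxSim.

    query_vecs: (q_tokens, dim), doc_vecs: (d_tokens, dim), both L2-normalized.
    """
    sim = query_vecs @ doc_vecs.T          # (q_tokens, d_tokens) cosine sims
    return float(sim.max(axis=1).sum())    # best doc token per query token, summed

# Toy example: a doc covering both query tokens outranks one covering only one.
q = np.eye(2)                              # two orthogonal "query tokens"
doc_a = np.eye(2)                          # covers both query tokens
doc_b = np.array([[1.0, 0.0]])             # covers only the first
print(maxsim(q, doc_a), maxsim(q, doc_b))  # 2.0 1.0
```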
Route candidates efficiently
Compact routing representations narrow the candidate set quickly without forcing a graph build, keeping cost predictable on every query.
Score with exact or quantized MaxSim
Late interaction stays in the serving path as the ranking truth, with exact or quantized MaxSim across CPU and GPU under one retrieval contract.
Fuse and augment when useful
Combine BM25, optional graph augmentation, and metadata-aware signals — additive, not bolted on, only when they lift retrieval quality.
Return grounded results
Every result carries source and page provenance, score breakdown, and lane attribution for downstream trust and auditability.
More than dense search. Less complexity than a distributed stack.
Generic vector search systems tend to optimize the first stage and stop there. The Retrieval Engine goes further, without forcing teams into a distributed search architecture.
Optimize the first stage. Bolt the rest on. Hope quality holds long enough to ship something that depends on retrieval being right.
Dense-only ranking
Approximate similarity is treated as the final answer instead of a candidate stage, capping quality before reranking can help.
Text-only mindset
Vision, layout, and structured fields get bolted on with separate stacks instead of sharing one retrieval contract.
External ops glue
Durability, recovery, and CRUD are stitched together with side services rather than included in the engine itself.
Approximate-only cascade
There is no exact mode to fall back on when answer quality matters more than another millisecond.
Hosted-by-default bias
Architecture assumes a managed, distributed deployment that procurement and security cannot actually approve.
Late interaction stays in the serving path
MaxSim is the truth scorer, not an optional rerank step that someone forgets to wire up.
Multimodal collections share the same contract as text
Vision, page, and structured inputs use one retrieval surface — no parallel stacks to maintain.
Hybrid BM25 and graph are additive
Lexical fusion and graph augmentation only fire when they lift retrieval — never required, never bolted on.
Exact and quantized MaxSim on the same engine
Routing and quantization improve performance without changing the retrieval contract callers depend on.
CPU and GPU under one API
Single-node local or private deployments behave the same as managed pilots — no rewrite when you move it behind the wall.
A retrieval engine is only as strong as the serving layer behind it.
Latence pairs the Retrieval Engine with a production encoder-serving layer for modern retrieval models — not ad hoc model hosting.
POST /v1/encode
{
"model": "colbert-v2",
"inputs": [
{ "type": "text", "text": "Q4 renewal obligations" }
],
"io": "structured-json"
}

{
"vectors": [{ "tokens": 32, "dim": 128 }],
"latency_ms": 14.6,
"batched": true
}

12 encoder plugins
Parity-tested across the shipped plugin set — ColBERT, ColPali, ColQwen3, GLiNER and more.
Continuous batching
Server-side scheduler keeps GPUs busy across concurrent encoder requests.
Server-side I/O
IOProcessor-based pre/post-processing keeps client code simple and structured.
No vLLM fork
Plugins ship against upstream vLLM — no custom serving runtime to maintain.
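For illustration, the encode call above can be driven from any HTTP client. The sketch below builds the request body shown earlier and posts it with the standard library; the host URL and the lack of error handling are assumptions, not part of the product:

```python
import json
from urllib import request

def build_encode_payload(texts, model="colbert-v2"):
    """Build the request body for POST /v1/encode (shape from the example above)."""
    return {
        "model": model,
        "inputs": [{"type": "text", "text": t} for t in texts],
        "io": "structured-json",
    }

def encode(texts, base_url="http://localhost:8000", model="colbert-v2"):
    # Assumed host; the response shape mirrors the example:
    # {"vectors": [...], "latency_ms": ..., "batched": ...}
    req = request.Request(
        f"{base_url}/v1/encode",
        data=json.dumps(build_encode_payload(texts, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```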
Already real. Publicly benchmarked. Built from working parts.
Numbers come from public READMEs and benchmark tables in voyager-index and vLLM Factory. Nothing on this page is roadmap-only.
GPU search-only latency on public BEIR shards (RTX A5000).
GPU search throughput on the same public production lane.
Production plugins shipped in vLLM Factory with parity checks.
Beta multi-instance throughput uplift on memory-bound encoders.
| Dataset | NDCG@10 | Recall@10 | P95 (ms) | CPU QPS | GPU QPS |
|---|---|---|---|---|---|
| FiQA-2018 | 0.421 | 0.612 | 4.1 | 41.6 | 164.8 |
| SciFact | 0.732 | 0.918 | 2.6 | 271.7 | 346.8 |
| TREC-COVID | 0.801 | 0.248 | 5.0 | 58.9 | 198.4 |
| NFCorpus | 0.378 | 0.301 | 3.2 | 142.3 | 287.6 |
| NQ | 0.579 | 0.871 | 4.8 | 88.4 | 211.2 |
vs vanilla PyTorch on the listed retrieval setup
vs vanilla PyTorch on the listed multimodal setup
parity checks reported across all shipped plugins
Search performance on modest on-prem hardware
Public BEIR shard benchmarks on an RTX A5000 show search-only GPU P95 latencies of 2.6–5.0 ms and GPU throughput of 164.8–346.8 QPS. CPU mode remains viable at 41.6–271.7 QPS on the same lane.
Research-grade techniques in the serving path
Routing, query-time pruning, rotational quantization, context optimization, and graph augmentation are integrated into the shipped engine — not left in research notebooks.
High-throughput encoder serving
vLLM Factory benchmark tables show 4.6x throughput on a ColBERT retrieval setup and 4.9x on a multimodal retrieval setup over vanilla PyTorch, with parity checks reported across all 12 plugins.
Real operational surface
CRUD, WAL, checkpointing, recovery, metadata, FastAPI, OpenAPI, Docker, and CPU/GPU execution are already part of the public stack — not roadmap promises.
Built for the next layer of retrieval trust.
The Retrieval Engine is expanding beyond search quality into search confidence — a forward-looking capability, not the primary proof on this page.
Customers must provide renewal notice 60 days prior to expiration, with payment reconciliation under clause 12.4.
Customer obligations on renewal include a notice period of 60 days and payment reconciliation clauses under §12.4.
Deploy it where your data lives.
Bring your own LLM. Latence is the retrieval layer beneath it — same retrieval contract across managed pilot, private VPC, self-hosted GPU node, and air-gapped fleet.
HTTP API
One contract for search, rerank, and ingestion. OpenAPI-described, ready for service meshes and gateways.
POST /search
{
"query": "Q4 renewal obligations",
"collections": ["contracts", "policy"],
"k": 10,
"rerank": true,
"graph_sidecar": true
}
CLI
Run searches and admin operations from a shell — useful for pilots, CI checks, and air-gapped operators.
latence search "Q4 renewal obligations" \
--collection contracts \
--k 10 --rerank --graph-sidecar
Agents + tool calling
Expose retrieval as a structured tool with stable arguments and grounded responses for agent runtimes.
# agent tool definition
tools:
- name: latence.search
args:
query: string
k: int = 10
rerank: bool = true
graph_sidecar: bool = true
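One way an agent runtime might map that tool schema onto a /search request body — the argument names and defaults mirror the schema above; the shim itself is hypothetical:

```python
# Hypothetical shim: merge tool-call arguments over the schema's defaults
# to produce the /search request body. Transport is left to the host runtime.
def to_search_request(tool_args: dict) -> dict:
    defaults = {"k": 10, "rerank": True, "graph_sidecar": True}
    return {**defaults, **tool_args}

call = {"query": "Q4 renewal obligations", "k": 5}
print(to_search_request(call))
# {'k': 5, 'rerank': True, 'graph_sidecar': True, 'query': 'Q4 renewal obligations'}
```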
POST /rerank
{
"query": "...",
"candidates": [{ "id": "...", "text": "..." }],
"model": "colbert-v2"
}
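Taken together, the two endpoints also support an explicit two-step flow: fetch candidates from /search, then re-score them with /rerank. The sketch below is illustrative — the transport is injected, and the per-hit `{"id", "text"}` response shape is an assumption drawn from the rerank request body above:

```python
# Illustrative two-step flow over the request shapes shown above.
# `post(path, body)` is a stand-in for whatever HTTP client the caller uses.
def search_then_rerank(post, query, collections, k=10, model="colbert-v2"):
    hits = post("/search", {"query": query, "collections": collections,
                            "k": k, "rerank": False})
    return post("/rerank", {
        "query": query,
        "candidates": [{"id": h["id"], "text": h["text"]} for h in hits],
        "model": model,
    })

# Offline usage with a stub transport:
def fake_post(path, body):
    if path == "/search":
        return [{"id": "c1", "text": "renewal obligations ..."}]
    return {"reranked": [c["id"] for c in body["candidates"]]}

print(search_then_rerank(fake_post, "Q4 renewal obligations", ["contracts"]))
# {'reranked': ['c1']}
```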
The same API works in managed pilot, private VPC, self-hosted single-node, and air-gapped deployments — across CPU and GPU modes.
Bring your own LLM.
Deploy where your data lives.
Start quickly in any environment. Test. Iterate. And move it behind your walls.
Managed pilot
Fastest path to validation with the same product architecture.
Customer cloud / private VPC
Controlled deployment for teams not ready for full self-hosting.
Self-hosted / air-gapped
Built for local GPU environments and high-control enterprise deployments.
If retrieval quality is the bottleneck, this is the layer to upgrade.
The next Latence release packages the Retrieval Engine into a cleaner, local-first enterprise product built on components that already exist today.
The waitlist is for the next unified enterprise release. The underlying retrieval technology already exists.