High-performance retrieval for real enterprise data
Serve multimodal, multi-vector search with built-in reranking, hybrid BM25, graph augmentation, and source-level traceability — optimized for local deployment, high throughput, and grounded enterprise results.
Customer renewal obligations include a 60-day notice period and payment reconciliation under §12.4.
Query in. Grounded results out.
Enterprise retrieval is not just vector search with a chatbot on top.
Enterprise corpora are messy, multimodal, and operationally constrained. Dense ANN alone is not enough. Teams need stronger ranking, hybrid search, real provenance, and an engine that can run on hardware they can actually approve.
Dense-only search leaves quality on the table
Enterprise retrieval needs stronger ranking than approximate dense similarity alone — late interaction stays in the serving path, not in a research notebook.
Real corpora are multimodal and metadata-rich
Search has to work across text, images, pages, fields, and structured context — under one retrieval contract, not three glued-together stacks.
A production engine needs more than benchmark wins
Fast search is not enough without APIs, durable CRUD, WAL recovery, provenance, and deployment flexibility for the environments enterprises can approve.
One retrieval engine for modern enterprise search
Built to serve stronger enterprise retrieval without forcing teams into a distributed search stack.
One serving lane runs late-interaction MaxSim alongside routing, hybrid BM25, optional graph augmentation, and built-in reranking — under a single retrieval contract for CPU and GPU.
Accept text, image, or page-level inputs under one retrieval contract.
Narrow the candidate set with compact routing — no graph build required.
Apply MaxSim as the truth scorer in exact or quantized mode, on CPU or GPU.
Fuse BM25, graph signals, and metadata — only when they improve retrieval.
Refine the top set with built-in reranking before the result leaves the engine.
Emit grounded results with source, page, and lane provenance for downstream trust.
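The six steps above can be sketched end to end. Everything in this sketch — function names, document fields, the trivial stand-in scorers — is illustrative, not the engine's actual API:

```python
# Illustrative sketch of the serving lane: route -> score -> rank -> emit
# grounded results with provenance. The stand-in stages are deliberately
# trivial; the real engine uses compact routing and MaxSim scoring.
def route(query, index):
    # Stand-in routing: cheap lexical overlap narrows the candidate set.
    return [d for d in index if set(query.split()) & set(d["text"].split())]

def truth_score(query, doc):
    # Stand-in for the truth scorer: fraction of query terms covered.
    terms = query.split()
    return sum(t in doc["text"] for t in terms) / len(terms)

def run_query(query, index, k=10):
    candidates = route(query, index)                           # narrow
    scored = [(d, truth_score(query, d)) for d in candidates]  # score
    scored.sort(key=lambda pair: pair[1], reverse=True)        # rank top set
    return [{"text": d["text"], "score": s,                    # grounded output
             "source": d["source"], "page": d["page"]}
            for d, s in scored[:k]]

index = [
    {"text": "renewal obligations and notice period", "source": "contracts", "page": 12},
    {"text": "travel policy for staff", "source": "policy", "page": 3},
]
results = run_query("renewal obligations", index)
print(results[0]["source"], results[0]["page"])  # contracts 12
```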
Built around late interaction as the truth scorer.
Most systems optimize the shortlist and treat late interaction as optional. Latence does the reverse: MaxSim remains the final scorer, while routing, pruning, fusion, and quantization make it practical in production.
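As a concrete reference, MaxSim itself is a few lines: score a document by taking, for each query token, its best-matching document token and summing those maxima. A minimal NumPy sketch, not the engine's implementation:

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction MaxSim.

    query_vecs: (q_tokens, dim), doc_vecs: (d_tokens, dim), both L2-normalized.
    """
    sim = query_vecs @ doc_vecs.T          # (q_tokens, d_tokens) cosine sims
    return float(sim.max(axis=1).sum())    # best doc token per query token, summed

# Toy example: a doc covering both query tokens outranks one covering only one.
q = np.eye(2)                              # two orthogonal "query tokens"
doc_a = np.eye(2)                          # covers both query tokens
doc_b = np.array([[1.0, 0.0]])             # covers only the first
print(maxsim(q, doc_a), maxsim(q, doc_b))  # 2.0 1.0
```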
Route candidates efficiently
Compact routing representations narrow the candidate set quickly without forcing a graph build, keeping cost predictable on every query.
Score with exact or quantized MaxSim
Late interaction stays in the serving path as the ranking truth, with exact or quantized MaxSim across CPU and GPU under one retrieval contract.
Fuse and augment when useful
Combine BM25, optional graph augmentation, and metadata-aware signals — additive, not bolted on, only when they lift retrieval quality.
Return grounded results
Every result carries source and page provenance, score breakdown, and lane attribution for downstream trust and auditability.
More than dense search. Less complexity than a distributed stack.
Generic vector search systems tend to optimize the first stage and stop there. The Retrieval Engine goes further, without forcing teams into a distributed search architecture.
Optimize the first stage. Bolt the rest on. Hope quality holds long enough to ship something that depends on retrieval being right.
Dense-only ranking
Approximate similarity is treated as the final answer instead of a candidate stage, capping quality before reranking can help.
Text-only mindset
Vision, layout, and structured fields get bolted on with separate stacks instead of sharing one retrieval contract.
External ops glue
Durability, recovery, and CRUD are stitched together with side services rather than included in the engine itself.
Approximate-only cascade
There is no exact mode to fall back on when answer quality matters more than another millisecond.
Hosted-by-default bias
Architecture assumes a managed, distributed deployment that procurement and security cannot actually approve.
Late interaction stays in the serving path
MaxSim is the truth scorer, not an optional rerank step that someone forgets to wire up.
Multimodal collections share the same contract as text
Vision, page, and structured inputs use one retrieval surface — no parallel stacks to maintain.
Hybrid BM25 and graph are additive
Lexical fusion and graph augmentation only fire when they lift retrieval — never required, never bolted on.
Exact and quantized MaxSim on the same engine
Routing and quantization improve performance without changing the retrieval contract callers depend on.
CPU and GPU under one API
Single-node local or private deployments behave the same as managed pilots — no rewrite when you move it behind the wall.
A retrieval engine is only as strong as the serving layer behind it.
Latence pairs the Retrieval Engine with a production encoder-serving layer for modern retrieval models — not ad hoc model hosting.
POST /v1/encode
{
"model": "colbert-v2",
"inputs": [
{ "type": "text", "text": "Q4 renewal obligations" }
],
"io": "structured-json"
}

{
"vectors": [{ "tokens": 32, "dim": 128 }],
"latency_ms": 14.6,
"batched": true
}

12 encoder plugins
Parity-tested across the shipped plugin set — ColBERT, ColPali, ColQwen3, GLiNER and more.
Continuous batching
Server-side scheduler keeps GPUs busy across concurrent encoder requests.
Server-side I/O
IOProcessor-based pre/post-processing keeps client code simple and structured.
No vLLM fork
Plugins ship against upstream vLLM — no custom serving runtime to maintain.
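For illustration, the encode call above can be driven from any HTTP client. The sketch below builds the request body shown earlier and posts it with the standard library; the host URL and the lack of error handling are assumptions, not part of the product:

```python
import json
from urllib import request

def build_encode_payload(texts, model="colbert-v2"):
    """Build the request body for POST /v1/encode (shape from the example above)."""
    return {
        "model": model,
        "inputs": [{"type": "text", "text": t} for t in texts],
        "io": "structured-json",
    }

def encode(texts, base_url="http://localhost:8000", model="colbert-v2"):
    # Assumed host; the response shape mirrors the example:
    # {"vectors": [...], "latency_ms": ..., "batched": ...}
    req = request.Request(
        f"{base_url}/v1/encode",
        data=json.dumps(build_encode_payload(texts, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```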
Already real. Publicly benchmarked. Built from working parts.
Numbers come from public READMEs and benchmark tables in voyager-index and vLLM Factory. Nothing on this page is roadmap-only.
GPU search-only latency on public BEIR shards (RTX A5000).
GPU search throughput on the same public production lane.
Production plugins shipped in vLLM Factory with parity checks.
Beta multi-instance throughput uplift on memory-bound encoders.
| Dataset | NDCG@10 | Recall@10 | P95 (ms) | CPU QPS | GPU QPS |
|---|---|---|---|---|---|
| FiQA-2018 | 0.421 | 0.612 | 4.1 | 41.6 | 164.8 |
| SciFact | 0.732 | 0.918 | 2.6 | 271.7 | 346.8 |
| TREC-COVID | 0.801 | 0.248 | 5.0 | 58.9 | 198.4 |
| NFCorpus | 0.378 | 0.301 | 3.2 | 142.3 | 287.6 |
| NQ | 0.579 | 0.871 | 4.8 | 88.4 | 211.2 |
vs vanilla PyTorch on the listed retrieval setup
vs vanilla PyTorch on the listed multimodal setup
parity checks reported across all shipped plugins
Search performance on modest on-prem hardware
Public BEIR shard benchmarks on an RTX A5000 show search-only GPU P95 latencies of 2.6–5.0 ms and GPU throughput of 164.8–346.8 QPS. CPU mode remains viable at 41.6–271.7 QPS on the same lane.
Research-grade techniques in the serving path
Routing, query-time pruning, rotational quantization, context optimization, and graph augmentation are integrated into the shipped engine — not left in research notebooks.
High-throughput encoder serving
vLLM Factory benchmark tables show 4.6x throughput on a ColBERT retrieval setup and 4.9x on a multimodal retrieval setup over vanilla PyTorch, with parity checks reported across all 12 plugins.
Real operational surface
CRUD, WAL, checkpointing, recovery, metadata, FastAPI, OpenAPI, Docker, and CPU/GPU execution are already part of the public stack — not roadmap promises.
Built for the next layer of retrieval trust.
The Retrieval Engine is expanding beyond search quality into search confidence — a forward-looking capability, not the primary proof on this page.
Customers must provide renewal notice 60 days prior to expiration, with payment reconciliation under clause 12.4.
Customer obligations on renewal include a notice period of 60 days and payment reconciliation clauses under §12.4.
Deploy it where your data lives.
Bring your own LLM. Latence is the retrieval layer beneath it — same retrieval contract across managed pilot, private VPC, self-hosted GPU node, and air-gapped fleet.
HTTP API
One contract for search, rerank, and ingestion. OpenAPI-described, ready for service meshes and gateways.
POST /search
{
"query": "Q4 renewal obligations",
"collections": ["contracts", "policy"],
"k": 10,
"rerank": true,
"graph_sidecar": true
}
CLI
Run searches and admin operations from a shell — useful for pilots, CI checks, and air-gapped operators.
latence search "Q4 renewal obligations" \
--collection contracts \
--k 10 --rerank --graph-sidecar
Agents + tool calling
Expose retrieval as a structured tool with stable arguments and grounded responses for agent runtimes.
# agent tool definition
tools:
- name: latence.search
args:
query: string
k: int = 10
rerank: bool = true
graph_sidecar: bool = true
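One way an agent runtime might map that tool schema onto a /search request body — the argument names and defaults mirror the schema above; the shim itself is hypothetical:

```python
# Hypothetical shim: merge tool-call arguments over the schema's defaults
# to produce the /search request body. Transport is left to the host runtime.
def to_search_request(tool_args: dict) -> dict:
    defaults = {"k": 10, "rerank": True, "graph_sidecar": True}
    return {**defaults, **tool_args}

call = {"query": "Q4 renewal obligations", "k": 5}
print(to_search_request(call))
# {'k': 5, 'rerank': True, 'graph_sidecar': True, 'query': 'Q4 renewal obligations'}
```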
POST /rerank
{
"query": "...",
"candidates": [{ "id": "...", "text": "..." }],
"model": "colbert-v2"
}
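Taken together, the two endpoints also support an explicit two-step flow: fetch candidates from /search, then re-score them with /rerank. The sketch below is illustrative — the transport is injected, and the per-hit `{"id", "text"}` response shape is an assumption drawn from the rerank request body above:

```python
# Illustrative two-step flow over the request shapes shown above.
# `post(path, body)` is a stand-in for whatever HTTP client the caller uses.
def search_then_rerank(post, query, collections, k=10, model="colbert-v2"):
    hits = post("/search", {"query": query, "collections": collections,
                            "k": k, "rerank": False})
    return post("/rerank", {
        "query": query,
        "candidates": [{"id": h["id"], "text": h["text"]} for h in hits],
        "model": model,
    })

# Offline usage with a stub transport:
def fake_post(path, body):
    if path == "/search":
        return [{"id": "c1", "text": "renewal obligations ..."}]
    return {"reranked": [c["id"] for c in body["candidates"]]}

print(search_then_rerank(fake_post, "Q4 renewal obligations", ["contracts"]))
# {'reranked': ['c1']}
```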
The same API works in managed pilot, private VPC, self-hosted single-node, and air-gapped deployments — across CPU and GPU modes.
Bring your own LLM.
Deploy where your data lives.
Start quickly in any environment. Test. Iterate. And move it behind your walls.
Managed pilot
Fastest path to validation with the same product architecture.
Customer cloud / private VPC
Controlled deployment for teams not ready for full self-hosting.
Self-hosted / air-gapped
Built for local GPU environments and high-control enterprise deployments.
If retrieval quality is the bottleneck, this is the layer to upgrade.
The next Latence release packages the Retrieval Engine into a cleaner, local-first enterprise product built on components that already exist today.
The waitlist is for the next unified enterprise release. The underlying retrieval technology already exists.