
RETRIEVAL ENGINE

High-performance retrieval for real enterprise data

Serve multimodal, multi-vector search with built-in reranking, hybrid BM25, graph augmentation, and source-level traceability — optimized for local deployment, high throughput, and grounded enterprise results.

Multimodal + multi-vector native · Local, private, and air-gapped ready · Built-in reranking and hybrid search · Source and page traceability
retrieval console · air-gapped · query 42 ms
lanes: MaxSim 0.94 · BM25 0.61 · graph 0.58 · rerank on
top-k results · MaxSim · k=3

01 Q4 renewal obligations · contract_q4.pdf · p.12 · score 0.94
"Customer renewal obligations include a 60-day notice period and payment reconciliation under §12.4."
source contract_q4.pdf · page p.12 · traceable yes

02 Pricing deltas vs. Q3 · pricing_global.xlsx · sheet 2 · score 0.87

03 RFP response precedent · rfp_response.eml · thread 7 · score 0.81

collection contracts · GPU A5000

Query in. Grounded results out.

WHY ENTERPRISE RETRIEVAL NEEDS ITS OWN ENGINE

Enterprise retrieval is not just vector search with a chatbot on top.

Enterprise corpora are messy, multimodal, and operationally constrained. Dense ANN alone is not enough. Teams need stronger ranking, hybrid search, real provenance, and an engine that can run on hardware they can actually approve.

01 · ranking gap

Dense-only search leaves quality on the table

Enterprise retrieval needs stronger ranking than approximate dense similarity alone — late interaction stays in the serving path, not in a research notebook.

late interaction first
02 · modality gap

Real corpora are multimodal and metadata-rich

Search has to work across text, images, pages, fields, and structured context — under one retrieval contract, not three glued-together stacks.

one retrieval contract
03 · operations gap

A production engine needs more than benchmark wins

Fast search is not enough without APIs, durable CRUD, WAL recovery, provenance, and deployment flexibility for the environments enterprises can approve.

API + durability + provenance
WHAT IT DOES

One retrieval engine for modern enterprise search

Built to serve stronger enterprise retrieval without forcing teams into a distributed search stack.

ENGINE LANE · query → route → score → augment → rerank → return · grounded out
Query in. Grounded results out.

One serving lane runs late-interaction MaxSim alongside routing, hybrid BM25, optional graph augmentation, and built-in reranking — under a single retrieval contract for CPU and GPU.

6 connected stages · one retrieval contract · single-node deployable
01 · Query · Multimodal in

Accept text, image, or page-level inputs under one retrieval contract.

text · image · page · metadata

02 · Route · LEMUR routing

Narrow the candidate set with compact routing — no graph build required.

coarse score · candidate set · budget aware

03 · Score · Late interaction

Apply MaxSim as the truth scorer in exact or quantized mode, on CPU or GPU.

exact MaxSim · quantized MaxSim · CPU / GPU

04 · Augment · Hybrid + graph

Fuse BM25, graph signals, and metadata — only when they improve retrieval.

BM25 fusion · graph sidecar · metadata-aware

05 · Rerank · Built-in

Refine the top set with built-in reranking before the result leaves the engine.

top-k refine · score blending · no extra hop

06 · Return · Grounded out

Emit grounded results with source, page, and lane provenance for downstream trust.

source · page · score · lane
multimodal, multi-vector · built-in reranking · hybrid BM25 · graph augmentation · exact + quantized MaxSim · source + page traceability
HOW IT WORKS

Built around late interaction as the truth scorer.

Most systems optimize the shortlist and treat late interaction as optional. Latence does the reverse: MaxSim remains the final scorer, while routing, pruning, fusion, and quantization make it practical in production.

Step 01

Route candidates efficiently

Compact routing representations narrow the candidate set quickly without forcing a graph build, keeping cost predictable on every query.

candidate pool
pool size 1.2M · modality text + page · budget k=512
routed candidates
kept 512 · lane LEMUR · cost O(N) → O(k)
compact routing · no graph build · budget aware
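
For intuition, here is a minimal routing sketch in NumPy. The pooled-vector pre-filter, names, and pool size are illustrative assumptions; the actual LEMUR routing representation is not described on this page.

routing sketch · python
import numpy as np

def route_candidates(query_pooled: np.ndarray,
                     doc_pooled: np.ndarray,
                     k: int = 512) -> np.ndarray:
    # Cheap coarse score per document: one dot product each, instead of a
    # full late-interaction comparison (assumed pre-filter, not LEMUR itself).
    coarse = doc_pooled @ query_pooled
    # Keep the k best candidate ids (unordered). Exact MaxSim then runs only
    # on this shortlist: O(N) cheap scores + O(k) expensive ones.
    return np.argpartition(-coarse, k)[:k]

# Toy usage, scaled down from the 1.2M pool above to keep it light.
rng = np.random.default_rng(0)
docs = rng.standard_normal((120_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)
print(route_candidates(query, docs).shape)  # (512,)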
Step 02

Score with exact or quantized MaxSim

Late interaction stays in the serving path as the ranking truth, with exact or quantized MaxSim across CPU and GPU under one retrieval contract.

exact MaxSim
mode exact · device GPU A5000 · P95 2.6 ms
quantized MaxSim
mode ROQ (6x) · device CPU + GPU · loss negligible
truth scorer kept on the serving path
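
For readers new to late interaction, here is a minimal MaxSim sketch in NumPy, plus a toy quantized path. Shapes, normalization, and the int8 scheme are illustrative assumptions; ROQ itself is more sophisticated than this.

MaxSim sketch · python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # Token-level similarity matrix: (query_tokens, doc_tokens).
    sim = query_vecs @ doc_vecs.T
    # Late interaction: best-matching doc token per query token, summed.
    return float(sim.max(axis=1).sum())

# Toy usage: 32 query tokens x 128 dims, matching the encoder response shown
# later on this page; vectors are L2-normalized so dot product == cosine.
rng = np.random.default_rng(0)
q = rng.standard_normal((32, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.standard_normal((180, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))

# Toy "quantized" path: int8 doc vectors, dequantized on the fly. This only
# shows that compressed vectors can approximate the exact score closely.
scale = np.abs(d).max() / 127.0
d_q = np.round(d / scale).astype(np.int8)
print(maxsim(q, d_q.astype(np.float32) * scale))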
Step 03

Fuse and augment when useful

Combine BM25, optional graph augmentation, and metadata-aware signals — additive, not bolted on, only when they lift retrieval quality.

hybrid + graph
BM25 fusion on · graph sidecar on · metadata aware yes
lane attribution
MaxSim 0.94 · BM25 0.61 · graph 0.58
additive — never required, never bolted on
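
As a rough illustration of additive fusion, the sketch below blends the lane scores shown above. The weights are assumptions for illustration, not product defaults.

fusion sketch · python
def fuse_scores(lanes: dict[str, float], weights: dict[str, float]) -> float:
    # Additive fusion: lanes that did not fire are simply absent from the
    # dict, so a missing lane never penalizes a result.
    return sum(weights.get(lane, 0.0) * score for lane, score in lanes.items())

# Lane attribution from the card above; weights are assumed.
lanes = {"MaxSim": 0.94, "BM25": 0.61, "graph": 0.58}
weights = {"MaxSim": 1.0, "BM25": 0.3, "graph": 0.2}
print(round(fuse_scores(lanes, weights), 3))  # 1.239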
Step 04

Return grounded results

Every result carries source and page provenance, score breakdown, and lane attribution for downstream trust and auditability.

grounded result · contract_q4.pdf · p.12

Customer renewal obligations include a 60-day notice period and payment reconciliation under §12.4.

score 0.94
source contract_q4.pdf · page 12 · lane MaxSim + BM25 · trace audit ready
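
Here is a sketch of a grounded result as a typed record. The field names mirror the provenance above, but the engine's actual response schema may differ.

grounded result sketch · python
from dataclasses import dataclass

@dataclass
class GroundedResult:
    # Field names mirror the provenance shown above (assumed,
    # not the engine's actual schema).
    text: str
    score: float
    source: str
    page: int
    lane: str

hit = GroundedResult(
    text=("Customer renewal obligations include a 60-day notice period "
          "and payment reconciliation under §12.4."),
    score=0.94,
    source="contract_q4.pdf",
    page=12,
    lane="MaxSim + BM25",
)
print(f"{hit.source} p.{hit.page} · {hit.lane} · {hit.score}")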
WHY IT IS BETTER

More than dense search. Less complexity than a distributed stack.

Generic vector search systems tend to optimize the first stage and stop there. The Retrieval Engine goes further, without forcing teams into a distributed search architecture.

generic vector stack

Optimize the first stage. Bolt the rest on. Hope quality holds long enough to ship something that depends on retrieval being right.

shortlist only: embed → ANN → ship
01

Dense-only ranking

Approximate similarity is treated as the final answer instead of a candidate stage, capping quality before reranking can help.

02

Text-only mindset

Vision, layout, and structured fields get bolted on with separate stacks instead of sharing one retrieval contract.

03

External ops glue

Durability, recovery, and CRUD are stitched together with side services rather than included in the engine itself.

04

Approximate-only cascade

There is no exact mode to fall back on when answer quality matters more than another millisecond.

05

Hosted-by-default bias

The architecture assumes a managed, distributed deployment that procurement and security cannot actually approve.

typical result
approx top-k · external rerank glue · no provenance
LATENCE RETRIEVAL ENGINE
01
scoring truth

Late interaction stays in the serving path

MaxSim is the truth scorer, not an optional rerank step that someone forgets to wire up.

02
modality

Multimodal collections share the same contract as text

Vision, page, and structured inputs use one retrieval surface — no parallel stacks to maintain.

03
augmentation

Hybrid BM25 and graph are additive

Lexical fusion and graph augmentation only fire when they lift retrieval — never required, never bolted on.

04
performance

Exact and quantized MaxSim on the same engine

Routing and quantization improve performance without changing the retrieval contract callers depend on.

05
deployment

CPU and GPU under one API

Single-node local or private deployments behave the same as managed pilots — no rewrite when you move it behind the wall.

retrieval quality and operational discipline in the same product

Routing and quantization improve performance without changing the retrieval contract. CPU and GPU modes share the same API-facing engine.

SERVING LAYER

A retrieval engine is only as strong as the serving layer behind it.

Latence pairs the Retrieval Engine with a production encoder-serving layer for modern retrieval models — not ad hoc model hosting.

vLLM FACTORY · ENCODER SERVING · continuous batching · structured I/O
active serving profile
ColBERT · 263 req/s
plugin colbert-v2 · parity 12/12 plugins
request · structured json
POST /v1/encode
{
  "model": "colbert-v2",
  "inputs": [
    { "type": "text", "text": "Q4 renewal obligations" }
  ],
  "io": "structured-json"
}
response · structured json
{
  "vectors": [{ "tokens": 32, "dim": 128 }],
  "latency_ms": 14.6,
  "batched": true
}
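
A minimal client sketch against the request/response contract shown above; the host and port are placeholders to adapt to your deployment.

encode client sketch · python
import requests

resp = requests.post(
    "http://localhost:8000/v1/encode",  # placeholder host/port
    json={
        "model": "colbert-v2",
        "inputs": [{"type": "text", "text": "Q4 renewal obligations"}],
        "io": "structured-json",
    },
    timeout=10,
)
resp.raise_for_status()
body = resp.json()
# Fields as in the structured response above: per-input token count and dim.
print(body["vectors"][0]["tokens"], "tokens ·", body["vectors"][0]["dim"], "dims")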

12 encoder plugins

Parity-tested across the shipped plugin set — ColBERT, ColPali, ColQwen3, GLiNER, and more.

Continuous batching

Server-side scheduler keeps GPUs busy across concurrent encoder requests.

Server-side I/O

IOProcessor-based pre/post-processing keeps client code simple and structured.

No vLLM fork

Plugins ship against upstream vLLM — no custom serving runtime to maintain.

PROOF

Already real. Publicly benchmarked. Built from working parts.

Numbers come from public READMEs and benchmark tables in voyager-index and vLLM Factory. Nothing on this page is roadmap-only.

2.6 – 5.0 ms P95

GPU search-only latency on public BEIR shards (RTX A5000).

164.8 – 346.8 GPU QPS

GPU search throughput on the same public production lane.

12 encoders

Production plugins shipped in vLLM Factory with parity checks.

~2x multi-instance

Beta multi-instance throughput uplift on memory-bound encoders.

BEIR · PER-DATASET · voyager-index / RTX A5000

Dataset       GPU QPS   NDCG@10   Recall@10   P95 (ms)   CPU QPS
FiQA-2018     164.8     0.421     0.612       4.1        41.6
SciFact       346.8     0.732     0.918       2.6        271.7
TREC-COVID    198.4     0.801     0.248       5.0        58.9
NFCorpus      287.6     0.378     0.301       3.2        142.3
NQ            211.2     0.579     0.871       4.8        88.4
vLLM Factory · ColBERT retrieval: 4.6x

vs vanilla PyTorch on the listed retrieval setup

vLLM Factory · Multimodal retrieval: 4.9x

vs vanilla PyTorch on the listed multimodal setup

vLLM Factory · Plugin parity: 12 / 12

parity checks reported across all shipped plugins

BETA

Built for the next layer of retrieval trust.

The Retrieval Engine is expanding beyond search quality into search confidence: a forward-looking capability, not the primary proof on this page.

RESPONSE vs CONTEXT · scoring · trust instrumentation
Beta capability · not GA · stability subject to change
generated response

Customers must provide renewal notice 60 days prior to expiration, with payment reconciliation under clause 12.4.

cited context · contract_q4.pdf · p.12

Customer obligations on renewal include a notice period of 60 days and payment reconciliation clauses under §12.4.

context score
0.92 · response grounded in cited context
hallucination risk
0.08 · low · evidence overlap high
evaluation loop
evidence overlap 0.87 · cited spans 2 / 2 · answer coverage full
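
To make the idea concrete, here is a toy evidence-overlap check. This naive token overlap is a stand-in for illustration only, not the engine's scoring method.

evidence overlap sketch · python
def evidence_overlap(response: str, context: str) -> float:
    # Fraction of distinct response tokens that also appear in the cited
    # context; a crude proxy for groundedness, nothing more.
    resp = {t.strip(".,;:§").lower() for t in response.split()} - {""}
    ctx = {t.strip(".,;:§").lower() for t in context.split()} - {""}
    return len(resp & ctx) / max(len(resp), 1)

response = ("Customers must provide renewal notice 60 days prior to "
            "expiration, with payment reconciliation under clause 12.4.")
context = ("Customer obligations on renewal include a notice period of "
           "60 days and payment reconciliation clauses under §12.4.")
print(round(evidence_overlap(response, context), 2))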
DEPLOYMENT + INTEGRATION

Deploy it where your data lives.

Bring your own LLM. Latence is the retrieval layer beneath it — same retrieval contract across managed pilot, private VPC, self-hosted GPU node, and air-gapped fleet.

HTTP API

One contract for search, rerank, and ingestion. OpenAPI-described, ready for service meshes and gateways.

POST /search
POST /search
{
  "query": "Q4 renewal obligations",
  "collections": ["contracts", "policy"],
  "k": 10,
  "rerank": true,
  "graph_sidecar": true
}
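
The same call from Python; the base URL is a placeholder, and the response field names are assumptions based on the provenance fields shown elsewhere on this page.

search client sketch · python
import requests

resp = requests.post(
    "http://localhost:8000/search",  # placeholder base URL
    json={
        "query": "Q4 renewal obligations",
        "collections": ["contracts", "policy"],
        "k": 10,
        "rerank": True,
        "graph_sidecar": True,
    },
    timeout=10,
)
resp.raise_for_status()
# Result fields (score, source, page, lane) are assumed from the provenance
# shown on this page; check the OpenAPI description for the actual schema.
for hit in resp.json().get("results", []):
    print(hit.get("score"), hit.get("source"), hit.get("page"), hit.get("lane"))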

CLI

Run searches and admin operations from a shell — useful for pilots, CI checks, and air-gapped operators.

latence search
latence search "Q4 renewal obligations" \
  --collection contracts \
  --k 10 --rerank --graph-sidecar

Agents + tool calling

Expose retrieval as a structured tool with stable arguments and grounded responses for agent runtimes.

agent.tools.search
# agent tool definition
tools:
  - name: latence.search
    args:
      query: string
      k: int = 10
      rerank: bool = true
      graph_sidecar: bool = true
POST /rerank
POST /rerank
{
  "query": "...",
  "candidates": [{ "id": "...", "text": "..." }],
  "model": "colbert-v2"
}
one retrieval contract

The same API works in managed pilot, private VPC, self-hosted single-node, and air-gapped deployments — across CPU and GPU modes.

managed pilot · private VPC · self-hosted GPU · air-gapped
03 · Deployment

Bring your own LLM. Deploy where your data lives.

Start quickly in any environment. Test. Iterate. And move it behind your walls.

01
Fastest validation

Managed pilot

Fastest path to validation with the same product architecture.

02
Controlled deployment

Customer cloud / private VPC

Controlled deployment for teams not ready for full self-hosting.

03
Production control

Self-hosted / air-gapped

Built for local GPU environments and high-control enterprise deployments.

EARLY ACCESS

If retrieval quality is the bottleneck, this is the layer to upgrade.

The next Latence release packages the Retrieval Engine into a cleaner, local-first enterprise product built on components that already exist today.

The waitlist is for the next unified enterprise release. The underlying retrieval technology already exists.