Trace is live — score groundedness in 280 ms.

Trace

New · Experimental

Groundedness & phantom-hallucination scoring

Catch hallucinations, drift, and unused context before your users do. Three lanes over one pod: RAG groundedness, agentic-code phantom scoring with cross-turn session chaining, and a stateless session rollup for live scoreboards.

Quickstart

Install the SDK, export your API key, and score your first RAG answer.

# 1. Install the SDK
pip install latence
# 2. Export your API key (get one from the portal)
export LATENCE_API_KEY="lat_..."

# 3. Score a RAG answer
from latence import Latence

client = Latence()  # reads LATENCE_API_KEY from the environment

r = client.experimental.trace.rag(
    response_text="Paris is the capital of France.",
    raw_context="France's capital city is Paris.",
)

print(r.score)                   # 0.0 - 1.0
print(r.band)                    # "green" | "amber" | "red" | "unknown"
print(r.context_coverage_ratio)  # how much of the answer is grounded in context
print(r.context_unused_ratio)    # how much retrieved context was dead weight

You now know whether the answer was grounded, how much of your retrieved context was actually used, and whether to trust it. Keep reading for the code and rollup lanes.

Three lanes, one mental model

Pick the lane that matches what your app is doing right now. The URL pins the lane server-side, so payloads cannot be cross-wired.

RAG groundedness

Did the answer actually come from the context you retrieved?

r = client.experimental.trace.rag(
    response_text=answer,
    raw_context=ctx,
)
print(r.band, r.score)
Code agents

Catch phantom APIs and drift across agentic coding turns, with opaque session chaining.

t = client.experimental.trace.code(
    response_text=patch,
    raw_context=repo,
    response_language_hint="python",
)
print(t.session_signals.recommendation)
Session rollup

Stateless, sub-ms aggregation of N per-turn outputs into one scoreboard.

rollup = client.experimental.trace.rollup(turns=[t1, t2])
print(rollup.noise_pct, rollup.retrieval_waste_pct)

Signals → actions

The response fields are routing rules, not diagnostics. Read them and know exactly what to upgrade next.

| Signal | What it means | Next step |
|---|---|---|
| `band` amber/red, low `context_coverage_ratio` | The answer isn't grounded in what you retrieved. | Upgrade data quality: clean upstream documents with the Data Intelligence Pipeline. |
| High `context_unused_ratio`, `retrieval_waste_pct` > 30% | Your retriever is shipping the wrong chunks. | Upgrade retrieval: swap in ColSearch, our OSS late-interaction retrieval engine. |
| `session_signals.recommendation` = `re_anchor` / `fresh_chat` | Session drift is compounding across agent turns. | Reset the agent's context on the next turn: hand back a fresh `session_state` or start a new session. |
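These routing rules are easy to encode client-side. The sketch below is illustrative, not part of the SDK: the `re_anchor` / `fresh_chat` values and the 30% waste cutoff come from the signals above, while the 0.5 coverage threshold is an assumed cutoff for "low" coverage, not a documented one.

```python
def next_action(band, coverage_ratio, unused_ratio, recommendation=None):
    """Map Trace signals to an upgrade action (illustrative thresholds)."""
    if recommendation in ("re_anchor", "fresh_chat"):
        return "reset-session"          # drift is compounding across turns
    if band in ("amber", "red") and (coverage_ratio or 0.0) < 0.5:
        return "upgrade-data-quality"   # answer not grounded in retrieved docs
    if (unused_ratio or 0.0) > 0.30:
        return "upgrade-retrieval"      # retriever is shipping the wrong chunks
    return "ok"
```

Wire this to `r.band`, `r.context_coverage_ratio`, and `r.context_unused_ratio` from any lane's response to drive alerts or dashboards.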

RAG lane — trace.rag()

Score a response for groundedness against retrieval context. At least one of raw_context, chunk_ids, or support_units must be supplied.

POST https://api.latence.ai/api/v1/trace/rag

Request parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `response_text` | `string` | required | — | The generated response text to score. |
| `query_text` | `string` | — | — | Optional query for query-conditioned diagnostics. |
| `raw_context` | `string` | — | — | Raw context string to segment and encode on demand. |
| `chunk_ids` | `list[str \| int]` | — | — | External chunk ids whose stored support vectors to reuse (fast path). |
| `support_units` | `list[SupportUnitInput \| dict]` | — | — | Structured premise lane with per-unit provenance. |
| `primary_metric` | `"reverse_context" \| "triangular"` | — | `reverse_context` | Headline metric selector. |
| `evidence_limit` | `int` (1-128) | — | `8` | Maximum top evidence links in the sparse response. |
| `coverage_threshold` | `float` (0.0-1.0) | — | `0.5` | Per-unit reverse-context threshold. |
| `segmentation_mode` | `"sentence" \| "sentence_packed" \| "paragraph"` | — | `sentence_packed` | How `raw_context` is segmented. |
| `attribution_mode` | `"closed_book" \| "open_domain"` | — | `closed_book` | Evidence policy. |
| `include_triangular_diagnostics` | `bool` | — | `true` | Include query-conditioned diagnostics. |
| `heatmap_format` | `"none" \| "data" \| "html"` | — | `data` | Heatmap surface. `html` also returns a self-contained `<div>`. |
| `verification_samples` | `list[str]` | — | — | Alternate responses for semantic-entropy fusion. |
| `session_id` | `string` | — | — | Opaque, hashed session identifier (never raw user text). |
| `request_id` | `string` | — | — | Optional tracking ID. |
| `verbose` | `bool` | — | `false` | Return full diagnostics instead of the compact summary. |
| `return_job` | `bool` | — | `false` | Return `JobSubmittedResponse` for async polling. |
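The documented ranges and enums can be pre-checked before spending a request. This helper is a sketch that mirrors the constraints in the table above; it is not part of the SDK, which performs its own validation.

```python
VALID_PRIMARY_METRICS = {"reverse_context", "triangular"}
VALID_SEGMENTATION = {"sentence", "sentence_packed", "paragraph"}

def check_rag_params(evidence_limit=8, coverage_threshold=0.5,
                     primary_metric="reverse_context",
                     segmentation_mode="sentence_packed"):
    """Return a list of problems against documented ranges (empty = OK)."""
    problems = []
    if not (1 <= evidence_limit <= 128):
        problems.append("evidence_limit must be in 1-128")
    if not (0.0 <= coverage_threshold <= 1.0):
        problems.append("coverage_threshold must be in 0.0-1.0")
    if primary_metric not in VALID_PRIMARY_METRICS:
        problems.append("unknown primary_metric")
    if segmentation_mode not in VALID_SEGMENTATION:
        problems.append("unknown segmentation_mode")
    return problems
```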

Response fields

| Field | Type | Description |
|---|---|---|
| `score` | `float \| null` | Top-level groundedness score (0-1). |
| `primary_metric` | `string \| null` | Metric backing `score`. Typically `reverse_context` or `triangular`; the pod may echo a derived name like `groundedness_v2`. |
| `band` | `string \| null` | Risk band: `green` / `amber` / `red` / `unknown`. |
| `structured_score` | `dict \| null` | Per-component score breakdown. |
| `nli_aggregate` | `float \| null` | Aggregate NLI entailment score. |
| `context_coverage_ratio` | `float \| null` | Fraction of the response grounded in context. |
| `context_usage_ratio` | `float \| null` | Fraction of context actually used. |
| `context_unused_ratio` | `float \| null` | Fraction of context left unused — the retrieval-quality signal. |
| `context_uncertain_ratio` | `float \| null` | Fraction with uncertain grounding. |
| `support_units_usage` | `object \| null` | Aggregate counts of used / unused / uncertain support units. |
| `support_units` | `list \| null` | Per-unit verdicts: `source_id`, `usage_state`, `coverage_score`. |
| `reason` | `string \| null` | Human-readable reason for the band. |
| `warnings` | `list[str] \| null` | Non-fatal scorer warnings. |
| `file_attribution` | `object \| null` | Per-file owner share and reason-code histogram. |
| `heatmap` | `object \| null` | Structured heatmap payload. |
| `heatmap_html` | `string \| null` | Self-contained `<div>` when `heatmap_format="html"`. |
| `latency_ms` | `float \| null` | Pod-side scoring latency. |
| `scoring_mode` | `"rag" \| "code"` | Lane echoed back by the pod. |
| `session_id` | `string \| null` | Echoed session id. |
| `version` | `string \| null` | Pod handler version tag. |

Structured premises with SupportUnitInput

Pass per-unit provenance (source_id, speaker, timestamp) and the scorer propagates it back onto every support_units verdict.

from latence import Latence, SupportUnitInput

client = Latence()

units = [
    SupportUnitInput(text="Paris is the capital of France.", source_id="doc-42"),
    SupportUnitInput(text="It sits on the Seine.",           source_id="doc-42"),
    {"text": "Population: 2.1M.", "source_id": "wiki"},
]

r = client.experimental.trace.rag(
    response_text="Paris, France's capital, sits on the Seine.",
    support_units=units,
)

for u in (r.support_units or []):
    print(u.source_id, u.usage_state, u.coverage_score)

Client-side validation (before the HTTP call)

  • ValueError: `response_text` must be a non-empty string.
  • ValueError: Trace scoring requires at least one of: raw_context, chunk_ids, or support_units.

Pricing: $0.008 per request, quantized per 32,000 context tokens. A 64k-token context counts as 2 requests. Token counts are server-measured with tiktoken across raw_context and support_units.
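The quantization works out as simple ceiling arithmetic. The sketch below reproduces the documented rule for estimation only; the server's tiktoken counts are authoritative.

```python
import math

PRICE_PER_REQUEST = 0.008   # USD, per documented pricing
QUANTUM_TOKENS = 32_000     # each started 32k-token block bills as one request

def rag_cost(context_tokens: int) -> float:
    """Estimated billed cost for one rag() call given total context tokens."""
    billed_requests = max(1, math.ceil(context_tokens / QUANTUM_TOKENS))
    return billed_requests * PRICE_PER_REQUEST
```

A 64k-token context bills as `ceil(64000 / 32000) = 2` requests, matching the example above.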

Code lane — trace.code()

Score an agentic-coding turn for phantom APIs and drift. Superset of RAG with three code-lane additions. Chain turns by round-tripping the opaque next_session_state into the next call — the payload is fully opaque to the client.

POST https://api.latence.ai/api/v1/trace/code

Additional request parameters (on top of all RAG fields)

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `response_language_hint` | `"python" \| "typescript" \| "go" \| "rust" \| ...` | — | — | Hint for the AST extractor. |
| `emit_chunk_ownership` | `bool` | — | `false` | Return per-unit ownership table (adds a few KB over the wire). |
| `session_state` | `SessionState \| dict` | — | — | Echo the previous turn's `next_session_state` verbatim to chain turns. |

Additional response fields (on top of all RAG response fields)

| Field | Type | Description |
|---|---|---|
| `code_lane` | `object \| null` | Composite / AST / NLI diagnostics for the turn. |
| `next_session_state` | `SessionState \| null` | Opaque state to pass into the next `code()` call. |
| `session_signals` | `object \| null` | EMA groundedness, drift, phantom rate, recommendation (`continue` / `re_anchor` / `fresh_chat`). |

Multi-turn session chaining

from latence import Latence

client = Latence()

turn1 = client.experimental.trace.code(
    response_text="def add(a, b): return a + b",
    raw_context="# utils.py\ndef sub(a, b): return a - b",
    response_language_hint="python",
)

turn2 = client.experimental.trace.code(
    response_text="def mul(a, b): return a * b",
    raw_context="# utils.py\ndef sub(a, b): return a - b",
    response_language_hint="python",
    session_state=turn1.next_session_state,  # chain turns
)

print(turn2.band)
print(turn2.session_signals.recommendation)   # continue | re_anchor | fresh_chat
print(turn2.session_signals.ema_groundedness)
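When the recommendation comes back as `re_anchor` or `fresh_chat`, the fix is to drop the chained state and let the next turn start clean. A small helper (a sketch, not an SDK API) makes that policy explicit; it only assumes the documented `next_session_state` and `session_signals.recommendation` fields.

```python
def state_for_next_turn(turn):
    """Return the session_state to chain into the next code() call,
    or None to reset the session when drift is flagged."""
    signals = getattr(turn, "session_signals", None)
    rec = getattr(signals, "recommendation", None) if signals else None
    if rec in ("re_anchor", "fresh_chat"):
        return None  # drop the chained state; start a fresh session
    return getattr(turn, "next_session_state", None)
```

Then each turn becomes `session_state=state_for_next_turn(previous_turn)` instead of chaining unconditionally.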

Client-side validation (before the HTTP call)

  • ValueError: `response_text` must be a non-empty string.
  • ValueError: Trace scoring requires at least one of: raw_context, chunk_ids, or support_units.

Pricing: $2.00 per 1M aggregate tokens, counted with tiktoken across response_text, raw_context, query_text, and support_units.
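Unlike the RAG lane's quantized pricing, the code lane is linear in aggregate tokens. A sketch of the arithmetic for estimation (server-side tiktoken counts are authoritative):

```python
RATE_PER_MILLION = 2.00  # USD per 1M aggregate tokens, per documented pricing

def code_cost(response_tokens, context_tokens, query_tokens=0, support_tokens=0):
    """Estimated cost of one code() call, summing all counted text fields."""
    total = response_tokens + context_tokens + query_tokens + support_tokens
    return total / 1_000_000 * RATE_PER_MILLION
```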

Rollup lane — trace.rollup()

Aggregate N per-turn outputs into a session scoreboard. Stateless, CPU-only, sub-millisecond on the pod — safe to call on every keystroke.

POST https://api.latence.ai/api/v1/trace/rollup

Request parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `turns` | `list[TraceResponse \| dict]` | required | — | Ordered per-turn outputs (response objects from `.rag()` / `.code()` work directly). |
| `session_id` | `string` | — | — | Echoed on the response. |
| `heatmap_format` | `"none" \| "data" \| "html"` | — | `data` | Session-level heatmap surface. |
| `request_id` | `string` | — | — | Optional tracking ID. |

Response fields

| Field | Type | Description |
|---|---|---|
| `turns_processed` | `int \| null` | Number of per-turn outputs aggregated. |
| `noise_pct` | `float \| null` | Fraction of turns flagged as noise. |
| `model_drift_pct` | `float \| null` | Fraction of turns with model drift. |
| `retrieval_waste_pct` | `float \| null` | Fraction of retrieved context left unused across the session. |
| `reason_code_histogram` | `dict[str, int] \| null` | Count of each reason code over the window. |
| `recommendations` | `list[str] \| null` | Session-level recommendations. |
| `risk_band_trail` | `list[str] \| null` | Risk band per turn, chronological. |
| `drift_trend` | `dict[str, Any] \| null` | Drift trajectory summary (`{last, max, mean, min, ...}`). |
| `top_dead_files` | `list \| null` | Files consistently marked dead-weight. |
| `heatmap` | `object \| null` | Session-level heatmap payload. |
| `heatmap_html` | `string \| null` | Self-contained `<div>` when `heatmap_format="html"`. |

Example

from latence import Latence

client = Latence()

# turns can be TraceResponse objects from .rag() / .code(), or plain dicts.
rollup = client.experimental.trace.rollup(turns=[turn1, turn2])

print(rollup.noise_pct)              # fraction of turns flagged as noise
print(rollup.retrieval_waste_pct)    # fraction of retrieved context left unused
print(rollup.model_drift_pct)        # fraction of turns with drift
print(rollup.reason_code_histogram)  # why the turns failed, aggregated
print(rollup.risk_band_trail)        # per-turn band, chronological
print(rollup.recommendations)        # actionable session-level advice
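The scoreboard fields are fractions over the turn list. As an intuition for what a field like `noise_pct` represents, here is an illustrative approximation over a band trail; this is not the pod's actual aggregation algorithm, which uses richer per-turn diagnostics.

```python
def noise_fraction(risk_band_trail):
    """Illustrative: fraction of turns whose band flags a problem turn."""
    if not risk_band_trail:
        return 0.0
    noisy = sum(1 for band in risk_band_trail if band in ("amber", "red"))
    return noisy / len(risk_band_trail)
```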

Client-side validation (before the HTTP call)

  • ValueError: `turns` must be a non-empty list.

Pricing: $0.001 flat per request. Lane-neutral (no scoring_mode injection) — you can mix RAG and code turns in the same rollup.

Async and background jobs

Every method has an awaitable twin under AsyncLatence. Pass return_job=True on either rag() or code() to fire-and-forget and poll later (rollup is always inline).

import asyncio

from latence import AsyncLatence, Latence

async def main():
    async with AsyncLatence() as client:
        r = await client.experimental.trace.rag(
            response_text="Paris is the capital of France.",
            raw_context="France's capital city is Paris.",
        )
        print(r.score, r.band)

asyncio.run(main())

# Background job (sync client): fire-and-forget, poll later via client.jobs.wait().
client = Latence()
job = client.experimental.trace.rag(
    response_text="...",
    raw_context="...",
    return_job=True,
)
result = client.jobs.wait(job.job_id)

Next steps

Full SDK reference

Every parameter, every response field, every validation rule.

docs/trace.md

Interactive API

Try every endpoint in the browser with your own API key.

Open API reference

Insights dashboard

See live Trace output across your production traffic.

Open Insights

Did Trace detect a data-quality bottleneck?

When context_coverage_ratio is low or bands trend amber/red, the fix is upstream. Run the Data Intelligence Pipeline over your documents to get clean markdown, resolved entities, and a typed knowledge graph in one call.