Dataset Intelligence
Experimental · 51.85 credits

Corpus-level knowledge graph construction, ontology induction, and incremental dataset ingestion. Transforms pipeline outputs into entities, relations, graph embeddings, and ontological concepts.
Production Recommendation
This is a direct endpoint for development and testing. For production workloads, use the Data Intelligence Pipeline -- it provides structured Data Packages with quality metrics, is async by default, and is covered by Enterprise SLAs.
Overview
The Dataset Intelligence service turns pipeline outputs into structured knowledge at corpus scale.
Key features:
- 3-tier processing: enrichment (tier 1), knowledge graph (tier 2), ontology (tier 3)
- Entity resolution and deduplication across documents
- RotatE link prediction for knowledge graph completion
- Concept clustering and SHACL ontology induction
- Delta-aware append mode for incremental ingestion
- Automatic B2 presigned upload for large payloads
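The last point can be made concrete: per the best practices below, payloads over 8 MB are routed through B2 presigned upload instead of being sent inline. A minimal sketch of that size check, assuming JSON serialization; the constant and function name are illustrative, not SDK API:

```python
import json

# Payloads above this size go through /api/v1/di/presign instead of
# being sent inline as input_data (8 MB threshold, per the docs).
PRESIGN_THRESHOLD_BYTES = 8 * 1024 * 1024

def needs_presign(pipeline_output: dict) -> bool:
    """Return True when the payload should be uploaded via a presigned URL."""
    return len(json.dumps(pipeline_output).encode("utf-8")) > PRESIGN_THRESHOLD_BYTES

print(needs_presign({"documents": []}))  # False: small payloads stay inline
```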
API Reference
https://api.latence.ai/api/v1/dataset_intelligence/process

Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| tier | string (`tier1`, `tier2`, `tier3`, `full`) | No | `full` | Processing tier |
| input_data | object | One of `input_data` / `input_url` | — | Pipeline output payload (inline). Mutually exclusive with `input_url`. |
| input_url | string | One of `input_data` / `input_url` | — | B2 presigned URL to pipeline output. Use `/api/v1/di/presign` for large payloads. |
| dataset_id | string | In `append` mode | — | Existing dataset ID for append mode (e.g. `ds_abc123`) |
| mode | string (`create`, `append`) | No | `create` | Ingestion mode: `create` or `append` |
| name | string | No | — | Human-readable dataset name |
| config_overrides | object | No | — | Override tier-specific configuration |
| total_pages | integer | No | — | Page count for cost estimation and billing |
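Assembled as raw HTTP, a request to the process endpoint might look like the sketch below. The `Authorization: Bearer` header scheme and the helper name are assumptions, not confirmed API details:

```python
import json
import urllib.request

def build_process_request(pipeline_output, tier="full", mode="create",
                          dataset_id=None, total_pages=None):
    """Assemble the JSON body for /api/v1/dataset_intelligence/process."""
    body = {"tier": tier, "mode": mode, "input_data": pipeline_output}
    if dataset_id is not None:
        body["dataset_id"] = dataset_id    # required when mode == "append"
    if total_pages is not None:
        body["total_pages"] = total_pages  # enables upfront cost estimation
    return body

body = build_process_request({"documents": []}, total_pages=1200)
req = urllib.request.Request(
    "https://api.latence.ai/api/v1/dataset_intelligence/process",
    data=json.dumps(body).encode("utf-8"),
    headers={"Authorization": "Bearer YOUR_API_KEY",
             "Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # response carries job_id, dataset_id, poll_url
```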
Response Fields
| Field | Type | Description |
|---|---|---|
| job_id | string | Job identifier (prefix: `di_`) |
| dataset_id | string | Dataset identifier (prefix: `ds_`) |
| status | string | Initial status: `QUEUED` |
| poll_url | string | URL to poll for job status |
| cost_estimated | number (nullable) | Estimated cost in USD |
| pre_billed | boolean | Whether the estimated cost was pre-deducted |
Response Example
```json
{
  "job_id": "di_abc123def456",
  "dataset_id": "ds_xyz789",
  "status": "QUEUED",
  "poll_url": "/api/v1/pipeline/di_abc123def456",
  "cost_estimated": 5.35,
  "pre_billed": true
}
```

https://api.latence.ai/api/v1/di/presign

Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| content_type | string | No | `application/json` | MIME type of the upload |
Response Fields
| Field | Type | Description |
|---|---|---|
| upload_url | string | `PUT` this URL with the pipeline output JSON |
| download_url | string | Pass this as `input_url` in the process request |
Response Example
```json
{
  "upload_url": "https://s3.us-west-004.backblazeb2.com/...",
  "download_url": "https://s3.us-west-004.backblazeb2.com/..."
}
```

Error Handling
All errors return a JSON body with error and details fields.
| Status | Code | Message | Description |
|---|---|---|---|
| 400 | INVALID_TIER | tier must be one of: full, tier1, tier2, tier3 | Unknown processing tier |
| 400 | INVALID_MODE | mode must be one of: append, create | Unknown ingestion mode |
| 400 | MISSING_INPUT | Provide input_data (inline) or input_url (presigned URL) | No input data provided |
| 400 | MISSING_DATASET_ID | dataset_id is required for append mode | Append mode requires an existing dataset |
| 402 | INSUFFICIENT_BALANCE | Insufficient balance | Not enough credits for the estimated cost |
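One way to consume this error shape on the client side is sketched below, assuming the `error` field carries the code and `details` the message; the exception class is illustrative, not part of the SDK:

```python
# Illustrative client-side mapping of documented error bodies to exceptions.

class DatasetIntelligenceError(Exception):
    """Raised for any documented 4xx error body ({"error": ..., "details": ...})."""
    def __init__(self, status, code, details):
        super().__init__(f"{status} {code}: {details}")
        self.status, self.code, self.details = status, code, details

def raise_for_error(status, body):
    """Raise when the response status indicates a documented error."""
    if status >= 400:
        raise DatasetIntelligenceError(status, body.get("error"), body.get("details"))

# Example: a 402 with INSUFFICIENT_BALANCE
try:
    raise_for_error(402, {"error": "INSUFFICIENT_BALANCE",
                          "details": "Insufficient balance"})
except DatasetIntelligenceError as e:
    print(e.code)  # INSUFFICIENT_BALANCE
```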
Billing
Pricing Formula
cost = (pages / 1,000) × tier_rate × mode_discount

Add-ons & Multipliers
| Option | Price | Description |
|---|---|---|
| Tier 1 (enrich) | $1.00 / 1K pages | Semantic enrichment, feature vectors |
| Tier 2 (build_graph) | $10.00 / 1K pages | Entity resolution, knowledge graph, RotatE link prediction |
| Tier 3 (build_ontology) | $50.00 / 1K pages | Concept clustering, ontology induction, SHACL shapes |
| Full (run) | $51.85 / 1K pages | All 3 tiers (15% bundle discount) |
| Append mode | −30% | Discount for incremental ingestion into existing dataset |
Pricing Examples
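Working the formula through for a few scenarios; the helper below is illustrative, not an SDK function, with rates taken from the table above:

```python
# Rates per 1K pages, from the pricing table above.
TIER_RATES = {"tier1": 1.00, "tier2": 10.00, "tier3": 50.00, "full": 51.85}

def estimate_cost(pages, tier="full", mode="create"):
    """cost = (pages / 1,000) x tier_rate x mode_discount"""
    discount = 0.70 if mode == "append" else 1.00  # append mode: -30%
    return round(pages / 1_000 * TIER_RATES[tier] * discount, 2)

print(estimate_cost(10_000))                 # full run of 10K pages: 518.5
print(estimate_cost(10_000, mode="append"))  # same run, incremental: 362.95
print(estimate_cost(2_500, tier="tier2"))    # graph-only on 2.5K pages: 25.0
```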
Code Examples
```python
from latence import Latence

client = Latence(api_key="YOUR_API_KEY")
di = client.experimental.dataset_intelligence_service

# Create a new dataset from pipeline output
job = di.run(input_data=pipeline_output, return_job=True)
print(f"Job: {job.job_id}")
# Poll at GET /api/v1/pipeline/{job.job_id}

# Append new documents to an existing dataset
delta = di.run(
    input_data=new_pipeline_output,
    dataset_id="ds_existing_id",
    return_job=True,
)

# Individual tiers
result = di.enrich(input_data=pipeline_output)          # Tier 1
result = di.build_graph(input_data=pipeline_output)     # Tier 2
result = di.build_ontology(input_data=pipeline_output)  # Tier 3
```

Best Practices
- Run a pipeline first — Dataset Intelligence requires pipeline output as input
- Use the full tier for best results; individual tiers are for when you only need specific outputs
- For large datasets (>8 MB payload), the SDK automatically uses B2 presigned upload
- Use append mode with dataset_id to incrementally update datasets without reprocessing everything
- Always pass total_pages for accurate cost estimation and upfront billing
- Use return_job=True for production workloads — synchronous calls may time out for large datasets
- The delta_summary in append mode responses shows exactly what changed
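The polling advice above can be sketched as a small loop. Status names other than QUEUED, the auth header scheme, and the function names are assumptions:

```python
import json
import time
import urllib.request

def is_terminal(status: str) -> bool:
    """True once the job has left the (assumed) in-flight states."""
    return status not in ("QUEUED", "RUNNING")

def wait_for_job(poll_url, api_key, base="https://api.latence.ai",
                 interval=5.0, timeout=600.0):
    """Poll poll_url (e.g. /api/v1/pipeline/di_abc123def456) until the job settles."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        req = urllib.request.Request(
            base + poll_url,
            headers={"Authorization": "Bearer " + api_key})
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)
        if is_terminal(job["status"]):
            return job
        time.sleep(interval)
    raise TimeoutError(f"job at {poll_url} still running after {timeout}s")
```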
Explore Tutorials & Notebooks
Deep-dive examples and interactive notebooks in our GitHub repository
Looking for production-grade processing?
The Data Intelligence Pipeline chains services automatically and returns structured Data Packages.