The pipeline has one job: take data in whatever format it arrives in and produce cloud-native, AI-ready artifacts plus the metadata that describes them — with the actual compute running on autoscaling infrastructure rather than a workstation.
The pipeline
1. Describe the dataset
You point the cng-datasets CLI at a
source and describe how it should be indexed. The package embeds the core
conversion logic — the format readers, the chunking and indexing strategy, the
memory model — so a single command captures the whole recipe. It does not process
data locally; it generates the Kubernetes job manifests that will.
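As a rough illustration of what "generating a job manifest" can mean, the sketch below builds a Kubernetes `batch/v1` Job as a plain Python dict and serializes it to JSON, which `kubectl apply -f` accepts. It is not the output of cng-datasets: the function name, job name, image, command, and resource values are all hypothetical placeholders, and the real manifests may be structured differently.

```python
import json


def build_job_manifest(name, image, command, slices, parallelism, memory):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict.

    Hypothetical helper for illustration only; not part of cng-datasets.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Indexed mode gives each pod a distinct completion index,
            # which a worker can use to pick its slice of the data.
            "completionMode": "Indexed",
            "completions": slices,        # total number of slices
            "parallelism": parallelism,   # pods running at the same time
            "backoffLimit": 3,            # retries before the Job is marked failed
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            "command": command,
                            "resources": {
                                "requests": {"memory": memory},
                                "limits": {"memory": memory},
                            },
                        }
                    ],
                }
            },
        },
    }


# All values below are placeholders, not real project settings.
manifest = build_job_manifest(
    name="example-convert",
    image="example.org/converter:latest",
    command=["python", "convert.py"],
    slices=50,
    parallelism=10,
    memory="8Gi",
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Keeping the manifest as data rather than running the work directly is what lets the same recipe be reviewed, versioned, and then applied to whichever cluster has capacity.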
2. Autoscale the compute
You apply those manifests to a Kubernetes cluster. The work fans out across many pods, each handling a slice of the data, and streams larger-than-memory inputs by spilling to disk instead of loading everything into RAM. A dataset that would never fit on a laptop is processed in parallel on cluster hardware. Memory and parallelism are tuned per job, and a failed slice is reprocessed on its own without rerunning the rest.
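One common way to fan work out across pods, sketched here under the assumption that the job runs in Kubernetes Indexed completion mode, is for each worker to read its index from the `JOB_COMPLETION_INDEX` environment variable and derive its slice from that. The function names and the even-split strategy are illustrative choices, not necessarily how cng-datasets partitions data.

```python
import os


def slice_bounds(total_items, num_slices, index):
    """Return the half-open range [start, stop) of items for one slice.

    Items are split as evenly as possible; the first `total_items % num_slices`
    slices receive one extra item each.
    """
    if not 0 <= index < num_slices:
        raise ValueError(f"index {index} is outside 0..{num_slices - 1}")
    base, extra = divmod(total_items, num_slices)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def current_slice_index():
    """Read the pod's completion index; defaults to 0 outside a cluster."""
    return int(os.environ.get("JOB_COMPLETION_INDEX", "0"))


# Show how 10 items would be divided among 4 workers.
all_bounds = [slice_bounds(10, 4, i) for i in range(4)]
for i, (start, stop) in enumerate(all_bounds):
    print(f"slice {i}: items {start}..{stop - 1}")
```

Because each slice is a pure function of its index, a failed slice can be rerun alone: rescheduling index 2 recomputes exactly the same range without touching the others.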
3. Produce cloud-native, AI-ready outputs
Outputs land on an object store in open, cloud-optimized formats: columnar Parquet for tables, chunked Zarr for arrays, plus derived spatial indexes. These are read by range request: a query streams only the bytes it needs rather than downloading the whole file, which is what makes terabyte-scale data tractable from a notebook.
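To make the range-request idea concrete, the sketch below simulates an object store in memory and reads only the tail of a Parquet-shaped file. It relies on one fact about the Parquet layout: a file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The `FakeObjectStore` class and the footer contents are stand-ins; a real client would send the `Range` header over HTTP and parse real Thrift-encoded metadata.

```python
import struct


def range_header(start, end):
    """Build an HTTP Range header for the inclusive byte span start..end."""
    return {"Range": f"bytes={start}-{end}"}


class FakeObjectStore:
    """In-memory stand-in for an object store that counts bytes served."""

    def __init__(self, blob):
        self.blob = blob
        self.bytes_served = 0

    def get_range(self, start, end):
        # HTTP byte ranges are inclusive at both ends.
        chunk = self.blob[start:end + 1]
        self.bytes_served += len(chunk)
        return chunk


# Build a fake file: magic, 10 000 bytes of "column data", footer, length, magic.
footer = b"hypothetical-footer-metadata"  # placeholder, not real Parquet metadata
blob = b"PAR1" + bytes(10_000) + footer + struct.pack("<I", len(footer)) + b"PAR1"
store = FakeObjectStore(blob)
size = len(blob)

# Request 1: the last 8 bytes give the footer length and the trailing magic.
tail = store.get_range(size - 8, size - 1)
footer_len = struct.unpack("<I", tail[:4])[0]
magic = tail[4:]

# Request 2: fetch only the footer, which sits just before those 8 bytes.
footer_bytes = store.get_range(size - 8 - footer_len, size - 9)

print(f"file size: {size} bytes, bytes transferred: {store.bytes_served}")
```

The two small reads locate the metadata without touching the column data. A query engine would then use that metadata to request only the row groups and columns the query needs.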
4. Describe it with STAC
Alongside the data, the pipeline records STAC (SpatioTemporal Asset Catalog) metadata: the column schema of each asset, its role and format, units, coded-value definitions, and provenance. This is the layer that lets an AI agent find the right dataset and interpret it correctly. Without it, a model has to guess at column meanings and can silently return wrong answers.
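For a sense of what such metadata can look like, here is a minimal sketch of a STAC Item written as a Python dict, loosely following the STAC Table extension's `table:columns` field. Every identifier, URL, column name, and description is invented for illustration; the pipeline's actual records may use different fields or extensions, and units and coded values are folded into the free-text descriptions here only to keep the sketch short.

```python
import json

# Hypothetical STAC Item; all ids, hrefs, and columns are made up.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [
        "https://stac-extensions.github.io/table/v1.2.0/schema.json"
    ],
    "id": "example-dataset",
    "geometry": None,
    "properties": {"datetime": "2024-01-01T00:00:00Z"},
    "links": [],
    "assets": {
        "data": {
            "href": "s3://example-bucket/example-dataset.parquet",
            "type": "application/vnd.apache.parquet",
            "roles": ["data"],
            "table:columns": [
                {
                    "name": "area_ha",
                    "type": "double",
                    "description": "Polygon area in hectares.",
                },
                {
                    "name": "status",
                    "type": "int32",
                    "description": "Coded value: 1 = protected, 2 = unprotected.",
                },
            ],
        }
    },
}


def column_descriptions(stac_item, asset_key):
    """Map each column name of one asset to its description."""
    columns = stac_item["assets"][asset_key].get("table:columns", [])
    return {col["name"]: col.get("description", "") for col in columns}


descriptions = column_descriptions(item, "data")
print(json.dumps(descriptions, indent=2))
```

An agent that reads these descriptions before writing a query knows that `status` is a code rather than a count, which is exactly the kind of mistake the metadata layer is meant to prevent.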
Why “agentic”
As coding assistants become the everyday interface for analysis, they reach by
default for the in-memory libraries they know best (e.g. pandas) — which fail at
scale and, lacking metadata, misread the data. This pipeline closes that gap from
both ends:
It produces the cloud-native artifacts and the STAC metadata an agent needs.
Companion tooling (see the ecosystem) then points agents at that metadata and confines them to validated cloud-native engines, so the agent a scientist already uses makes the move from in-memory tools to the streaming stack on the user’s behalf.
The result is an AI-driven path from a raw legacy file to data that researchers and their agents can query at scale — with no new toolchain to learn.