How it works

From a legacy source file to AI-ready data and a catalog entry

Authors
Affiliations
University of California, Berkeley

The pipeline has one job: take data in whatever format it arrives in and produce cloud-native, AI-ready artifacts plus the metadata that describes them — with the actual compute running on autoscaling infrastructure rather than a workstation.

The pipeline

1. Describe the dataset

You point the cng-datasets CLI at a source and describe how it should be indexed. The package embeds the core conversion logic — the format readers, the chunking and indexing strategy, the memory model — so a single command captures the whole recipe. It does not process data locally; it generates the Kubernetes job manifests that will.
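As a rough illustration of what a generated manifest might contain, the sketch below builds a Kubernetes `batch/v1` Indexed Job as a plain Python dict. It is not the output of cng-datasets: the function name `build_job_manifest`, the image name, and the `convert_slice.py` command are all hypothetical. Only the Job fields themselves (`completionMode`, `completions`, `parallelism`, `backoffLimit`, resource requests and limits) are standard Kubernetes.

```python
import json


def build_job_manifest(name, image, n_slices, parallelism, memory="8Gi"):
    """Return a Kubernetes batch/v1 Indexed Job as a plain dict.

    Each of the n_slices completions gets its own pod; Kubernetes exposes
    the slice number to the container as JOB_COMPLETION_INDEX.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completionMode": "Indexed",   # one pod per slice index
            "completions": n_slices,       # total number of slices
            "parallelism": parallelism,    # pods running at the same time
            "backoffLimit": 3,             # retries allowed for failed pods
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "convert",
                            "image": image,  # hypothetical image name
                            # hypothetical entry point, not a real cng-datasets command
                            "command": ["python", "convert_slice.py"],
                            "resources": {
                                "requests": {"memory": memory},
                                "limits": {"memory": memory},
                            },
                        }
                    ],
                }
            },
        },
    }


manifest = build_job_manifest(
    name="convert-demo",
    image="example.org/converter:latest",
    n_slices=100,
    parallelism=20,
)
manifest_json = json.dumps(manifest, indent=2)  # JSON is valid input for kubectl apply -f
```

Writing `manifest_json` to a file and applying it with `kubectl apply -f` would hand the work to the cluster, which matches the division of labour described above: the recipe is described locally, and the data is processed elsewhere.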

2. Autoscale the compute

You apply those manifests to a Kubernetes cluster. The work fans out across many pods, each handling one slice of the data. Larger-than-memory inputs are streamed and spilled to disk instead of being loaded entirely into RAM, so a dataset that would never fit on a laptop is processed in parallel on cluster hardware. Memory and parallelism are tuned per job, and a failed slice is reprocessed in isolation.
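The fan-out and isolated-retry idea can be sketched in a few lines of plain Python. This is a single-process toy under stated assumptions, not the pipeline's scheduler: `make_slices` and `run_with_retries` are hypothetical helpers, and on a real cluster each slice would run in its own pod rather than in a loop.

```python
def make_slices(n_items, n_slices):
    """Split range(n_items) into n_slices contiguous, near-equal ranges."""
    base, extra = divmod(n_items, n_slices)
    slices, start = [], 0
    for i in range(n_slices):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        slices.append(range(start, start + size))
        start += size
    return slices


def run_with_retries(slices, process, max_attempts=3):
    """Process every slice; on failure, retry only the slices that failed.

    Returns the sorted indexes of slices that still failed after max_attempts.
    """
    pending = dict(enumerate(slices))
    for _ in range(max_attempts):
        failed = {}
        for idx, item_range in pending.items():
            try:
                process(idx, item_range)
            except Exception:
                failed[idx] = item_range  # successful slices are never redone
        pending = failed
        if not pending:
            break
    return sorted(pending)
```

Because each slice is independent, a failure costs only the time needed to redo that slice, which is the property the step above relies on.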

3. Produce cloud-native, AI-ready outputs

Outputs land on an object store in open, cloud-optimized formats — columnar Parquet for tables, chunked Zarr for arrays, plus derived spatial indexes. These are read by range-request: a query streams only the bytes it needs rather than downloading the whole file, which is what makes terabyte data tractable from a notebook.
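To make "read by range-request" concrete, here is a minimal sketch using only the Python standard library. It builds an HTTP request with a `Range` header so the server returns just the requested bytes. The helper names and the example URL are hypothetical, and in practice a Parquet or Zarr library issues these requests for you.

```python
import urllib.request


def range_request(url, start, length):
    """Build a request for `length` bytes beginning at byte offset `start`."""
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def fetch_range(url, start, length):
    """Fetch one byte range; assumes the server honours Range requests (206 Partial Content)."""
    with urllib.request.urlopen(range_request(url, start, length)) as resp:
        return resp.read()


# Hypothetical URL; constructing the request sends nothing over the network.
req = range_request("https://example.org/bucket/data.parquet", 1_000_000, 4096)
```

A columnar reader applies the same idea repeatedly: it first fetches the small metadata block that records where each column chunk lives, then requests only the byte ranges for the columns and row groups a query touches.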

4. Describe it with STAC

Alongside the data, the pipeline records STAC (SpatioTemporal Asset Catalog) metadata: the column schema of each asset, its role and format, units, coded-value definitions, and provenance. This is the layer that lets an AI agent find the right dataset and interpret it correctly. Without it, a model has to guess at column meanings and can silently return wrong answers.
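The sketch below shows roughly what such a record could look like as a STAC Item built in plain Python. The top-level fields follow the STAC Item specification, and `table:columns` comes from the STAC table extension. Everything else is an assumption for illustration: the dataset, the URL, the column names, and in particular the `unit` and `values` keys, which are hypothetical and not the pipeline's actual schema.

```python
def column(name, dtype, description, **extra):
    """One column entry; `extra` carries hypothetical fields such as unit or coded values."""
    return {"name": name, "type": dtype, "description": description, **extra}


def build_stac_item(item_id, href, columns, when):
    """Return a minimal STAC Item describing one Parquet asset and its column schema."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": None,                 # no footprint recorded in this toy example
        "properties": {"datetime": when},
        "links": [],
        "assets": {
            "data": {
                "href": href,
                "type": "application/vnd.apache.parquet",
                "roles": ["data"],
                "table:columns": columns,  # column schema travels with the asset
            }
        },
    }


# All values below are made up for illustration.
item = build_stac_item(
    item_id="example-landcover",
    href="https://example.org/bucket/landcover.parquet",
    when="2024-01-01T00:00:00Z",
    columns=[
        column("area", "double", "Parcel area", unit="ha"),  # `unit` is a hypothetical key
        column(
            "lc_class", "int32", "Land-cover class code",
            values={"1": "forest", "2": "cropland"},          # `values` is a hypothetical key
        ),
    ],
)
```

With a record like this, an agent can look up that `lc_class` is a coded value and that `area` is in hectares before writing a query, instead of inferring both from the column names.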

Why “agentic”

As coding assistants become the everyday interface for analysis, they reach by default for the in-memory libraries they know best (e.g. pandas), which fail at scale and, lacking metadata, misread the data. This pipeline closes that gap from both ends.

The result is an AI-driven path from a raw legacy file to data that researchers and their agents can query at scale — with no new toolchain to learn.