Agentic Data Workflows

Turn disparate legacy and proprietary data into AI-ready, cloud-native data with rich STAC metadata.

An AI-driven pipeline that converts the formats science actually ships in — geodatabases, shapefiles, proprietary instrument exports, terabyte image stacks — into open, cloud-optimized serializations (Parquet, Zarr) described by machine-readable metadata, so that researchers and their AI agents can query data at scale without first downloading it or mastering a new toolchain.

The core conversion logic lives in our Python package, cng-datasets; the heavy compute autoscales on Kubernetes. You describe the dataset; the pipeline produces the cloud-native artifacts and the catalog entry.

See how it works → · Quickstart · GitHub

What it gives you

🤖 An AI-driven pipeline

Agents drive the workflow end to end — selecting the right cloud-native engine, generating the conversion jobs, and writing the metadata — instead of reaching for in-memory libraries that silently fail at scale.

📦 cng-datasets at the core

A single Python package holds the conversion logic. One command turns a source dataset into Parquet or Zarr plus a catalog entry, and emits the Kubernetes jobs that do the work.

🗂️ Rich STAC metadata

Every output is described with the SpatioTemporal Asset Catalog standard: column schemas, asset roles, and provenance that let agents find and correctly interpret the data.
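To make the three metadata pieces concrete, here is a minimal sketch of a STAC Item for one Parquet output, built with the standard library only. It is not the output of cng-datasets: the function names, the example URLs, and the column list are hypothetical, and the choices of the STAC table extension for column schemas, a `via` link for provenance, and `application/vnd.apache.parquet` as the media type are assumptions about one reasonable encoding.

```python
import json


def bbox_to_polygon(bbox):
    """Convert [west, south, east, north] into a closed GeoJSON Polygon."""
    west, south, east, north = bbox
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {"type": "Polygon", "coordinates": [ring]}


def build_stac_item(item_id, bbox, datetime_utc, parquet_href, columns, source_href):
    """Assemble a minimal STAC Item dict for one Parquet file (illustrative only)."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        # Assumption: column schemas are carried by the STAC table extension.
        "stac_extensions": [
            "https://stac-extensions.github.io/table/v1.2.0/schema.json"
        ],
        "id": item_id,
        "geometry": bbox_to_polygon(bbox),
        "bbox": list(bbox),
        "properties": {
            "datetime": datetime_utc,
            "table:columns": columns,  # column schema: name, type, description
        },
        # Assumption: provenance is recorded as a "via" link to the source data.
        "links": [
            {"rel": "via", "href": source_href, "title": "Original source dataset"}
        ],
        "assets": {
            "data": {
                "href": parquet_href,
                "type": "application/vnd.apache.parquet",
                "roles": ["data"],  # asset role tells an agent this is the payload
            }
        },
    }


# Hypothetical dataset, for illustration only.
item = build_stac_item(
    item_id="wetlands-2020",
    bbox=[-124.5, 32.5, -114.1, 42.0],
    datetime_utc="2020-01-01T00:00:00Z",
    parquet_href="s3://example-bucket/wetlands/wetlands.parquet",
    columns=[
        {"name": "geometry", "type": "binary", "description": "WKB geometry"},
        {"name": "area_ha", "type": "double", "description": "Area in hectares"},
    ],
    source_href="https://example.org/wetlands.gdb.zip",
)
print(json.dumps(item, indent=2))
```

An agent reading this item can learn the column names and types from `table:columns`, pick the payload by its `data` role, and trace the output back to its source through the `via` link, all without opening the Parquet file.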

☁️ Autoscaled on Kubernetes

Compute runs on the cluster, not your laptop. Jobs fan out across hundreds of pods and stream larger-than-memory data, so terabyte datasets are routine.
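One common way to fan work out across pods is a Kubernetes Indexed Job, where each pod receives its position in the `JOB_COMPLETION_INDEX` environment variable and processes only its share of the inputs. The sketch below illustrates that pattern in general terms; it is an assumption about how such a fan-out could look, not the scheduling logic cng-datasets actually emits, and the function names and file paths are hypothetical.

```python
import os


def worker_index_from_env(default=0):
    """Read this pod's index; Kubernetes sets JOB_COMPLETION_INDEX for Indexed Jobs."""
    return int(os.environ.get("JOB_COMPLETION_INDEX", default))


def shard_for_worker(paths, worker_index, worker_count):
    """Return the inputs assigned to one worker, using a round-robin split.

    Sorting first gives every pod the same ordering, so the shards are
    disjoint and together cover every input exactly once.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    if not 0 <= worker_index < worker_count:
        raise ValueError("worker_index must be in [0, worker_count)")
    return sorted(paths)[worker_index::worker_count]


# Hypothetical input listing, for illustration only.
paths = [f"tiles/part-{i:03d}.tif" for i in range(10)]
mine = shard_for_worker(paths, worker_index=1, worker_count=4)
print(mine)  # ['tiles/part-001.tif', 'tiles/part-005.tif', 'tiles/part-009.tif']
```

Because each pod derives its shard from the index alone, the pods need no coordination with each other, and each one can stream its files one at a time rather than loading the whole dataset into memory.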

General-purpose, proven at scale

The approach is domain-agnostic — any tabular or array data with a spatial, temporal, or otherwise indexable structure fits. We built and proved these methods at the largest earth-observation scales (working with NASA, The Nature Conservancy, and California Fish & Wildlife), and are extending them to the life sciences: spatial transcriptomics, whole-slide microscopy, and 3D-genome imaging at terabyte scale.

Browse the live catalog →