Agentic Data Workflows

AI-ready, cloud-native data with rich metadata — from any legacy source

Authors
Affiliations
University of California, Berkeley

Turn disparate legacy and proprietary data into AI-ready, cloud-native data with rich STAC metadata.

An AI-driven pipeline converts the formats science actually ships in (geodatabases, shapefiles, proprietary instrument exports, terabyte image stacks) into open, cloud-optimized serializations (Parquet, Zarr) described by machine-readable metadata. Researchers and their AI agents can then query the data at scale without first downloading it or mastering a new toolchain.

The core conversion logic lives in our Python package, cng-datasets; the heavy compute autoscales on Kubernetes. You describe the dataset; the pipeline produces the cloud-native artifacts and the catalog entry.
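
To make "the catalog entry" concrete, here is a minimal sketch of what a STAC Item for a converted Parquet file could look like. It uses only the Python standard library and is not the cng-datasets API: the helper `make_stac_item`, the bucket URL, the item id, and the bounding box are all hypothetical values chosen for illustration.

```python
import json
from datetime import datetime, timezone


def make_stac_item(item_id, bbox, asset_href, media_type, when=None):
    """Build a minimal STAC Item (a GeoJSON Feature) describing one asset.

    Hypothetical helper for illustration; not part of cng-datasets.
    """
    west, south, east, north = bbox
    when = when or datetime.now(timezone.utc)
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "bbox": [west, south, east, north],
        # Footprint polygon derived from the bounding box (closed ring).
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        "properties": {"datetime": when.isoformat()},
        "links": [],
        # One asset pointing at the cloud-native file.
        "assets": {
            "data": {"href": asset_href, "type": media_type, "roles": ["data"]},
        },
    }


item = make_stac_item(
    item_id="example-dataset",                        # assumed id
    bbox=(-124.5, 32.5, -114.1, 42.0),                # assumed extent
    asset_href="s3://example-bucket/example-dataset.parquet",  # assumed URL
    media_type="application/vnd.apache.parquet",
    when=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(json.dumps(item, indent=2))
```

Because the item is plain JSON, an agent or a researcher can read the asset location and extent from the catalog and then query the file in place, without downloading it first.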

See how it works →  ·  Quickstart  ·  GitHub

What it gives you

General-purpose, proven at scale

The approach is domain-agnostic — any tabular or array data with a spatial, temporal, or otherwise indexable structure fits. We built and proved these methods at the largest earth-observation scales (working with NASA, The Nature Conservancy, and California Fish & Wildlife), and are extending them to the life sciences: spatial transcriptomics, whole-slide microscopy, and 3D-genome imaging at terabyte scale.

Browse the live catalog →