The ecosystem

Teaching agents to drive data at scale

Affiliation: University of California, Berkeley

data-workflows is one of three open-source components that together let researchers and their AI agents work with terabyte-scale data through the tools they already use. Each is independently useful; together they form an AI-native bridge to the mature cloud-native stack the sciences are converging on.

Why a bridge, not a fork

The projects this bridge connects (Zarr, xarray, Parquet, DuckDB, Polars, STAC, Jupyter, and the Model Context Protocol) are individually mature and widely adopted open-source tools. What is new is the AI-native layer that lets researchers and their agents drive them together, at scale.

The leverage comes from the fact that Jupyter and agentic coding assistants are already the field’s everyday tools. The bridge builds on that existing base without asking anyone to learn a new toolchain: the agent a scientist already uses handles the move from in-memory libraries to the cloud-native stack on the user’s behalf. And because the MCP layer works with small, locally run open models, it reduces dependence on closed models and keeps sensitive data (clinical, patient, controlled-access) on local hardware.

From earth observation to the life sciences

These methods were built and proven at the largest earth-observation scales. The same architecture extends to the life sciences, where spatially resolved data (3D-genome imaging, spatial transcriptomics, whole-slide microscopy) now arrives at terabyte scale, well beyond the capacity of the in-memory tools most researchers rely on. The cloud-native foundation to handle it is mature; the missing piece is the agentic layer that makes it reachable. That is what this ecosystem provides.