The catalog - Agentic Data Workflows

The pipeline produces a growing, public catalog of cloud-native datasets, each described by STAC and queryable without download. The geospatial collections below are the proven track record — the same machinery now extends to other domains.

Browse the full catalog in STAC Browser →

What’s published

Datasets are hosted on NRP Nautilus object storage and span biodiversity, protected areas, census geographies, carbon, and earth-observation rasters — produced in partnership with NASA, The Nature Conservancy, and California Fish & Wildlife, among others.

Each collection ships with:

Cloud-native data — columnar Parquet for tables, COG/Zarr for rasters and arrays.
Derived spatial indexes — for fast joins and aggregation across datasets.
STAC metadata — schemas, asset roles, units, coded-value definitions, and provenance.
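
To make the metadata item concrete, here is a minimal sketch of how a consumer might pull the data assets and column descriptions out of a collection's STAC JSON. The collection shown is a made-up example, not a real catalog entry: the ID, hrefs, and column names are hypothetical. The `table:columns` field comes from the STAC table extension; that this catalog's collections use it is an assumption.

```python
# Hypothetical STAC collection, trimmed to the fields used below.
# The id, hrefs, and column names are illustrative, not real catalog entries.
collection = {
    "type": "Collection",
    "id": "example-protected-areas",
    "assets": {
        "data": {
            "href": "https://example.org/example-protected-areas.parquet",
            "type": "application/vnd.apache.parquet",
            "roles": ["data"],
        },
        "thumbnail": {
            "href": "https://example.org/thumb.png",
            "type": "image/png",
            "roles": ["thumbnail"],
        },
    },
    # Column documentation in the style of the STAC table extension (assumed).
    "table:columns": [
        {"name": "site_id", "type": "string", "description": "Unique site identifier"},
        {"name": "area_ha", "type": "double", "description": "Area in hectares"},
    ],
}


def data_assets(coll):
    """Return {asset_key: href} for assets whose roles include 'data'."""
    return {
        key: asset["href"]
        for key, asset in coll.get("assets", {}).items()
        if "data" in asset.get("roles", [])
    }


def column_docs(coll):
    """Return {column_name: description} from 'table:columns', if present."""
    return {
        col["name"]: col.get("description", "")
        for col in coll.get("table:columns", [])
    }


print(data_assets(collection))
print(column_docs(collection))
```

Filtering on asset roles rather than on file extension keeps the lookup independent of how a given dataset names its files.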

Query without downloading

Because outputs are cloud-native and read over HTTP range requests, you can query a multi-gigabyte dataset directly from a notebook, fetching only the columns and row groups a query touches:

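import duckdb

# In-memory database; nothing is written to local disk.
con = duckdb.connect()
# The httpfs extension lets DuckDB read remote files over HTTP(S).
con.execute("INSTALL httpfs; LOAD httpfs;")
# Replace <bucket> and <dataset> with the values from a catalog entry.
con.execute("""
    SELECT COUNT(*)
    FROM read_parquet('https://s3-west.nrp-nautilus.io/<bucket>/<dataset>.parquet')
""").fetchall()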

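The reason this works is that a Parquet file keeps its metadata in a footer at the end of the file, so a reader can locate it with a small range request and then fetch only the byte ranges it needs. The sketch below illustrates that principle for an unencrypted Parquet file. The helper names are hypothetical, and this is not a description of how DuckDB itself issues its requests.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def tail_range_header(n_bytes):
    """HTTP Range header value requesting the last n_bytes of an object."""
    return f"bytes=-{n_bytes}"


def footer_length(tail):
    """Given the final 8 bytes of a Parquet file, return the footer size.

    The file ends with a 4-byte little-endian footer length followed by
    the 4-byte magic 'PAR1'.
    """
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not the tail of a Parquet file")
    (length,) = struct.unpack("<I", tail[:4])
    return length


def footer_range_header(file_size, footer_len):
    """Range header covering only the footer, excluding the 8-byte tail."""
    start = file_size - 8 - footer_len
    end = file_size - 8 - 1  # HTTP byte ranges are inclusive
    return f"bytes={start}-{end}"


# Simulated tail of a 10,000-byte file whose footer is 1,234 bytes long.
fake_tail = struct.pack("<I", 1234) + PARQUET_MAGIC

print(tail_range_header(8))               # bytes=-8
print(footer_length(fake_tail))           # 1234
print(footer_range_header(10_000, 1234))  # bytes=8758-9991
```

The footer lists each row group and column chunk with its byte offset, which is what lets a query engine skip everything a query does not touch.
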
An AI agent connected through mcp-data-server does the same — but reads the STAC metadata first, so it knows which dataset to open and what each column means before it writes the query.
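
As a rough illustration of the metadata-first pattern, the sketch below checks requested columns against a collection's metadata before generating any SQL. It is not the mcp-data-server API: `build_query`, the collection contents, and the use of `table:columns` are all assumptions made for the example.

```python
# Hypothetical collection metadata; the href and column names are illustrative.
collection = {
    "id": "example-protected-areas",
    "assets": {
        "data": {
            "href": "https://example.org/example-protected-areas.parquet",
            "roles": ["data"],
        }
    },
    "table:columns": [
        {"name": "site_id", "description": "Unique site identifier"},
        {"name": "area_ha", "description": "Area in hectares"},
    ],
}


def build_query(coll, wanted_columns, limit=10):
    """Build a DuckDB SELECT, refusing columns the metadata does not document."""
    known = {col["name"] for col in coll.get("table:columns", [])}
    missing = [name for name in wanted_columns if name not in known]
    if missing:
        raise KeyError(f"columns not in metadata: {missing}")

    # Use the first asset whose roles include 'data'.
    href = next(
        asset["href"]
        for asset in coll["assets"].values()
        if "data" in asset.get("roles", [])
    )
    cols = ", ".join(f'"{name}"' for name in wanted_columns)
    return f"SELECT {cols} FROM read_parquet('{href}') LIMIT {int(limit)}"


print(build_query(collection, ["site_id", "area_ha"], limit=5))
```

Validating against the metadata first means a misspelled or guessed column name fails before any remote request is made.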