The catalog

Proven at scale on real, open datasets

Authors
Affiliations
University of California, Berkeley

The pipeline produces a growing, public catalog of cloud-native datasets, each described by STAC (SpatioTemporal Asset Catalog) metadata and queryable without downloading the files. The geospatial collections below are the proven track record; the same machinery now extends to other domains.

Browse the full catalog in STAC Browser →

What’s published

Datasets are hosted on NRP Nautilus object storage and span biodiversity, protected areas, census geographies, carbon, and earth-observation rasters — produced in partnership with NASA, The Nature Conservancy, and California Fish & Wildlife, among others.

Each collection ships with:

Query without downloading

Because outputs are cloud-native and read over HTTP range requests, you can query a multi-gigabyte dataset directly from a notebook, fetching only the parts of the file that hold the rows and columns a query touches:

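import duckdb

# In-memory DuckDB connection; no database file is created.
con = duckdb.connect()
# httpfs lets DuckDB read remote files over HTTP(S).
con.execute("INSTALL httpfs; LOAD httpfs;")
# Replace <bucket> and <dataset> with a path from the catalog.
con.execute("""
    SELECT COUNT(*)
    FROM read_parquet('https://s3-west.nrp-nautilus.io/<bucket>/<dataset>.parquet')
""").fetchall()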

An AI agent connected through mcp-data-server does the same — but reads the STAC metadata first, so it knows which dataset to open and what each column means before it writes the query.
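
The metadata-first step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the mcp-data-server implementation. It assumes a collection lists its Parquet file under `assets` and documents its columns with the STAC table extension field `table:columns`; in a real catalog those fields may live elsewhere, for example on the asset itself. The sample collection, its URLs and column names, and the helpers `find_parquet_asset`, `describe_columns`, and `build_query` are all hypothetical.

```python
# Hypothetical STAC collection, trimmed to the fields this sketch reads.
# The id, hrefs, and column names are invented for illustration.
SAMPLE_COLLECTION = {
    "type": "Collection",
    "id": "example-protected-areas",
    "description": "Illustrative collection; not a real catalog entry.",
    "assets": {
        "thumbnail": {
            "href": "https://example.org/thumb.png",
            "type": "image/png",
        },
        "data": {
            "href": "https://example.org/example-protected-areas.parquet",
            "type": "application/vnd.apache.parquet",
        },
    },
    "table:columns": [
        {"name": "site_id", "type": "string", "description": "Unique site identifier"},
        {"name": "area_ha", "type": "double", "description": "Area in hectares"},
    ],
}


def find_parquet_asset(collection):
    """Return the href of the first asset whose media type mentions parquet, else None."""
    for asset in collection.get("assets", {}).values():
        if "parquet" in asset.get("type", "").lower():
            return asset["href"]
    return None


def describe_columns(collection):
    """Map column name -> description from the `table:columns` list."""
    return {
        col["name"]: col.get("description", "")
        for col in collection.get("table:columns", [])
    }


def build_query(collection, columns):
    """Build a SELECT that names only columns documented in the metadata."""
    href = find_parquet_asset(collection)
    if href is None:
        raise ValueError("collection has no Parquet asset")
    known = describe_columns(collection)
    unknown = [c for c in columns if c not in known]
    if unknown:
        raise ValueError(f"columns not described in metadata: {unknown}")
    return f"SELECT {', '.join(columns)} FROM read_parquet('{href}')"


if __name__ == "__main__":
    # Read the column descriptions first, then write the query.
    print(describe_columns(SAMPLE_COLLECTION))
    print(build_query(SAMPLE_COLLECTION, ["site_id", "area_ha"]))
```

The string returned by `build_query` can be passed to `con.execute(...)` in the DuckDB example above. Naming columns explicitly, rather than using `SELECT *`, is what lets a columnar reader skip the columns the query does not need.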