The CLI runs on your laptop and only ever generates Kubernetes job manifests.
The cluster does all the processing; your laptop just talks to it through kubectl. Outputs
land on object storage as cloud-native Parquet and derived indexes, described by STAC.
1. Install the CLI
pip install cng-datasets
2. Generate a processing pipeline
Point the CLI at a source and describe how it should be indexed. This writes a set
of Kubernetes manifests to --output-dir; it does not move any data yet.
cng-datasets workflow \
  --dataset my-dataset \
  --source-url https://example.com/data.gdb \
  --bucket public-mydata \
  --layer MyLayer \
  --output-dir catalog/mydata/k8s/mylayer
3. Apply the workflow to the cluster
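Before applying anything, it can be worth checking the generated manifests with a dry run, which reports what would be created without changing the cluster. The sketch below is illustrative only: validate_manifests is a hypothetical helper, not part of the CLI, and it assumes every manifest in the output directory ends in .yaml.

```shell
# Assumption: the output directory passed to --output-dir in step 2.
MANIFEST_DIR="catalog/mydata/k8s/mylayer"

# Hypothetical helper: run a client-side dry run on each manifest and
# stop at the first one kubectl rejects. Nothing is created or modified.
validate_manifests() {
  dir="$1"
  for f in "$dir"/*.yaml; do
    kubectl apply --dry-run=client -f "$f" || return 1
  done
}
```

Calling validate_manifests "$MANIFEST_DIR" prints one "(dry run)" line per resource. A non-zero exit status means at least one manifest failed validation.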
# One-time RBAC setup (per cluster/namespace; often already done)
kubectl apply -f catalog/mydata/k8s/mylayer/workflow-rbac.yaml
# Run the workflow
kubectl apply \
  -f catalog/mydata/k8s/mylayer/configmap.yaml \
  -f catalog/mydata/k8s/mylayer/workflow.yaml
4. Monitor
kubectl get jobs | grep my-dataset
kubectl logs job/my-dataset-workflow
The cluster converts the source, writes the cloud-native outputs to the object store, and the catalog entry is published alongside them. When it finishes, the dataset is ready to query at scale and ready for agents to discover via its STAC metadata.
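Rather than polling kubectl get jobs by hand, a script can block until the job finishes. The following is a minimal sketch, assuming the job is named my-dataset-workflow as in the monitoring commands above; wait_for_workflow and its 30-minute default timeout are hypothetical and should be adjusted to the size of the dataset.

```shell
# Hypothetical helper: wait for a Job to reach the Complete condition.
# On timeout (or any kubectl wait failure), show recent logs and fail.
wait_for_workflow() {
  job="$1"
  timeout="${2:-30m}"   # assumed default; large sources may need longer
  if kubectl wait --for=condition=complete "job/$job" --timeout="$timeout"; then
    echo "job $job completed"
  else
    echo "job $job did not complete within $timeout; recent logs:" >&2
    kubectl logs "job/$job" --tail=50 >&2
    return 1
  fi
}
```

For example, wait_for_workflow my-dataset-workflow 2h waits up to two hours. Note that waiting on the Complete condition does not return early when a job fails, so a failed job is only reported once the timeout expires.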