Kubernetes Workflows

Generate and run complete Kubernetes workflows for large-scale geospatial processing.

Overview

The K8s workflow module generates a complete set of Kubernetes manifests for converting a dataset into cloud-native formats. Orchestration is PVC-based: the job definitions are stored on a Persistent Volume Claim, so the workflow runs statelessly and entirely within the cluster.

Architecture

PVC-Based Orchestration

The workflow uses a Persistent Volume Claim (PVC) to store job YAML files, enabling:

- Stateless execution: no dependency on a git repository
- Laptop disconnect: once the orchestrator starts, the workflow runs entirely in the cluster
- Clean separation: YAML files stay readable and separate from each other
- Reusability: the same RBAC serves all datasets in a namespace

Workflow Components

Processing Jobs:
- convert-job.yaml - Convert source format → GeoParquet
- pmtiles-job.yaml - Generate PMTiles vector tiles
- hex-job.yaml - H3 hexagonal tiling with automatic chunking
- repartition-job.yaml - Consolidate chunks by h0 partition

Orchestration Infrastructure:
- workflow-rbac.yaml - ServiceAccount/Role/RoleBinding (one per namespace)
- workflow-pvc.yaml - PVC for storing YAML files
- workflow-upload.yaml - Job for uploading YAMLs to PVC
- workflow.yaml - Orchestrator job that applies jobs from PVC
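
The ordering the orchestrator has to respect can be sketched as a small shell function. This is a hypothetical illustration, not the contents of the generated workflow.yaml: the function name run_workflow and its arguments are assumptions, and the sequence is inferred from the manual steps in the Quick Start (convert and pmtiles first, hex after convert, repartition after hex).

```shell
# Hypothetical sketch of the orchestrator's ordering (not the generated workflow.yaml).
# Usage: run_workflow <yaml-dir> <dataset> <namespace>
run_workflow() {
  local dir="$1" dataset="$2" ns="$3"

  # convert and pmtiles are both submitted up front
  kubectl apply -f "$dir/convert-job.yaml" || return 1
  kubectl apply -f "$dir/pmtiles-job.yaml" || return 1

  # hex starts only once convert has completed
  # (a negative timeout makes kubectl wait for up to a week instead of 30s)
  kubectl wait --for=condition=complete "job/${dataset}-convert" -n "$ns" --timeout=-1s || return 1
  kubectl apply -f "$dir/hex-job.yaml" || return 1

  # repartition starts only once every hex chunk has been written
  kubectl wait --for=condition=complete "job/${dataset}-hex" -n "$ns" --timeout=-1s || return 1
  kubectl apply -f "$dir/repartition-job.yaml"
}
```

Inside the cluster the directory would be the PVC mount (for example /yamls), which is why the orchestrator needs RBAC permission to create and watch jobs.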

Quick Start

1. Generate Workflow Files

cng-datasets workflow \
  --dataset my-dataset \
  --source-url https://example.com/data.gpkg \
  --bucket public-my-dataset \
  --h3-resolution 10 \
  --parent-resolutions "9,8,0" \
  --namespace biodiversity \
  --output-dir my-dataset/
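
Before applying anything, it can help to confirm that the output directory holds every file listed under Workflow Components. The helper below is a minimal sketch; check_workflow_files is a hypothetical name, and it only tests that the files exist, not that their contents are valid.

```shell
# Hypothetical pre-flight check: report any expected workflow file that is missing.
# Usage: check_workflow_files <output-dir>
check_workflow_files() {
  local dir="$1" missing=0 f
  for f in convert-job.yaml pmtiles-job.yaml hex-job.yaml repartition-job.yaml \
           workflow-rbac.yaml workflow-pvc.yaml workflow-upload.yaml workflow.yaml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Non-zero exit status if at least one file was not found
  return "$missing"
}
```

For example, `check_workflow_files my-dataset/ && echo "all workflow files present"`.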

2. Run Complete Workflow

# One-time setup per namespace
kubectl apply -f my-dataset/workflow-rbac.yaml

# Create PVC for YAML storage
kubectl apply -f my-dataset/workflow-pvc.yaml

# Start the uploader pod and wait until it is ready
kubectl apply -f my-dataset/workflow-upload.yaml
kubectl wait --for=condition=ready pod -l job-name=my-dataset-upload-yamls -n biodiversity

# Copy YAML files to the PVC through the uploader pod
POD=$(kubectl get pods -l job-name=my-dataset-upload-yamls -n biodiversity -o jsonpath='{.items[0].metadata.name}')
kubectl cp my-dataset/convert-job.yaml "$POD":/yamls/ -n biodiversity
kubectl cp my-dataset/pmtiles-job.yaml "$POD":/yamls/ -n biodiversity
kubectl cp my-dataset/hex-job.yaml "$POD":/yamls/ -n biodiversity
kubectl cp my-dataset/repartition-job.yaml "$POD":/yamls/ -n biodiversity

# Start orchestrator (laptop can disconnect after this)
kubectl apply -f my-dataset/workflow.yaml

# Monitor progress
kubectl logs -f job/my-dataset-workflow -n biodiversity
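
The copy step can also be wrapped in a loop that lists the PVC contents afterwards, so a partial upload is visible before the orchestrator starts. This is a sketch under assumptions: upload_yamls is a hypothetical helper, and the final listing assumes the uploader image provides ls.

```shell
# Hypothetical helper: copy the four job YAMLs to the uploader pod, then list /yamls.
# Usage: upload_yamls <yaml-dir> <dataset> <namespace>
upload_yamls() {
  local dir="$1" dataset="$2" ns="$3" pod f

  # Find the pod created by the <dataset>-upload-yamls job
  pod=$(kubectl get pods -l "job-name=${dataset}-upload-yamls" -n "$ns" \
        -o jsonpath='{.items[0].metadata.name}') || return 1
  [ -n "$pod" ] || { echo "no uploader pod found" >&2; return 1; }

  for f in convert-job.yaml pmtiles-job.yaml hex-job.yaml repartition-job.yaml; do
    kubectl cp "$dir/$f" "${pod}:/yamls/" -n "$ns" || return 1
  done

  # Assumes the uploader image ships `ls`
  kubectl exec "$pod" -n "$ns" -- ls /yamls
}
```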

3. Run Jobs Individually (Alternative)

# Apply RBAC once
kubectl apply -f my-dataset/workflow-rbac.yaml

# Run each job manually
kubectl apply -f my-dataset/convert-job.yaml
kubectl apply -f my-dataset/pmtiles-job.yaml

# Wait for convert to finish before hex
# (kubectl wait gives up after 30s by default; a negative timeout waits up to a week)
kubectl wait --for=condition=complete job/my-dataset-convert -n biodiversity --timeout=-1s

kubectl apply -f my-dataset/hex-job.yaml
kubectl wait --for=condition=complete job/my-dataset-hex -n biodiversity --timeout=-1s

kubectl apply -f my-dataset/repartition-job.yaml
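
One caveat when chaining jobs by hand: waiting on condition=complete does not return early if the job fails. A polling helper that checks both conditions is one way around this. The sketch below is an assumption, not part of cng-datasets; wait_for_job is a hypothetical name and the 30-second default poll interval is arbitrary.

```shell
# Hypothetical helper: block until a job is Complete (exit 0) or Failed (exit 1).
# Usage: wait_for_job <job-name> <namespace> [poll-seconds]
wait_for_job() {
  local job="$1" ns="$2" interval="${3:-30}" complete failed
  while true; do
    complete=$(kubectl get job "$job" -n "$ns" \
      -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}') || return 2
    failed=$(kubectl get job "$job" -n "$ns" \
      -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}') || return 2

    if [ "$complete" = "True" ]; then
      return 0
    fi
    if [ "$failed" = "True" ]; then
      echo "job $job failed" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

With this in place, `wait_for_job my-dataset-convert biodiversity && kubectl apply -f my-dataset/hex-job.yaml` stops the chain as soon as convert fails.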

Configuration

H3 Resolutions

--h3-resolution 10              # Primary resolution (default: 10)
--parent-resolutions "9,8,0"    # Parent hexes for aggregation

Resolution Reference (approximate average hexagon edge length):

- h12: ~9m (building-level)
- h11: ~25m (lot-level)
- h10: ~66m (street-level) - default
- h9: ~174m (block-level)
- h8: ~461m (neighborhood)
- h7: ~1.2km (district)
- h0: ~1,100km, continent-scale (partitioning key)
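
In H3, a parent cell is always coarser than its child, so every value in --parent-resolutions should be smaller than --h3-resolution, and all values must fall in the range 0 to 15. Whether the CLI enforces this is not stated here, so the check below is only a hypothetical pre-flight sketch (check_resolutions is an assumed name).

```shell
# Hypothetical pre-flight check for --h3-resolution / --parent-resolutions values.
# Usage: check_resolutions <h3-resolution> <comma-separated-parents>
check_resolutions() {
  local res="$1" parents="$2" p

  case "$res" in
    ''|*[!0-9]*) echo "resolution must be a non-negative integer" >&2; return 1 ;;
  esac
  if [ "$res" -gt 15 ]; then
    echo "H3 resolutions run from 0 to 15" >&2
    return 1
  fi

  # Split on commas without relying on bash arrays
  for p in $(echo "$parents" | tr ',' ' '); do
    case "$p" in
      *[!0-9]*) echo "bad parent resolution: $p" >&2; return 1 ;;
    esac
    if [ "$p" -ge "$res" ]; then
      echo "parent resolution $p must be coarser (smaller) than $res" >&2
      return 1
    fi
  done
  return 0
}
```

For the Quick Start values, `check_resolutions 10 "9,8,0"` succeeds, while `check_resolutions 10 "11"` is rejected.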

Namespace

--namespace biodiversity  # Kubernetes namespace (must exist)

All jobs and RBAC use the specified namespace.

Chunking Behavior

The hex job determines its chunking automatically:

- Uses GDAL to count the features at the source URL
- Sizes chunks to target 200 completions, run with a parallelism of 50
- Falls back to default settings if counting fails
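
The arithmetic behind the chunking can be sketched as follows. The exact formula used by the tool is not given here, so this assumes the chunk size is the feature count divided by the target number of completions, rounded up. Both function names are hypothetical, and count_features assumes ogrinfo is installed and the source is readable through GDAL's /vsicurl/ handler.

```shell
# Hypothetical sketch of the chunking arithmetic (assumed ceiling division).

# Sum the "Feature Count" lines that ogrinfo prints for each layer of a remote source.
# Usage: count_features <source-url>
count_features() {
  ogrinfo -ro -so -al "/vsicurl/$1" \
    | awk -F': ' '/^Feature Count:/ { n += $2 } END { print n + 0 }'
}

# Features per chunk so that roughly <completions> chunks cover the dataset.
# Usage: chunk_size <feature-count> [completions]
chunk_size() {
  local count="$1" completions="${2:-200}"
  echo $(( (count + completions - 1) / completions ))
}
```

Under this assumption, a source with 1,000,000 features and the default target of 200 completions gives `chunk_size 1000000`, that is 5000 features per chunk.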

Processing Details

Two-Pass H3 Approach

Chunking Phase (hex-job.yaml):
- Process source data in parallel chunks
- Write to temporary s3://bucket/dataset-name/chunks/
- Each chunk contains all H3 resolutions

Repartition Phase (repartition-job.yaml):
- Read all chunks
- Repartition by h0 hexagon (continent-scale)
- Write to final s3://bucket/dataset-name/hex/
- Delete temporary chunks/ directory

Output Structure

s3://bucket/
├── dataset-name.parquet         # GeoParquet with all attributes
├── dataset-name.pmtiles         # PMTiles vector tiles
└── dataset-name/
    └── hex/                     # H3-indexed parquet (partitioned by h0)
        ├── h0=0/
        │   └── *.parquet
        ├── h0=1/
        │   └── *.parquet
        └── ...
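
Once the repartition job has finished, the layout above can be spot-checked from a workstation. The sketch below assumes an rclone remote is configured for the bucket (the cleanup section uses one named nrp); check_hex_output is a hypothetical helper, and it only looks at the top level of hex/ and chunks/.

```shell
# Hypothetical check: hex/ should contain h0=* partitions and chunks/ should be gone.
# Usage: check_hex_output <remote:bucket> <dataset>
check_hex_output() {
  local root="$1" dataset="$2" partitions leftovers

  # Count partition directories directly under hex/
  partitions=$(rclone lsf --dirs-only "$root/$dataset/hex/" 2>/dev/null | grep -c '^h0=')

  # Anything still listed under chunks/ means the temporary data was not cleaned up
  leftovers=$(( $(rclone lsf "$root/$dataset/chunks/" 2>/dev/null | wc -l) ))

  echo "h0 partitions: $partitions"
  echo "leftover chunk files: $leftovers"
  [ "$partitions" -gt 0 ] && [ "$leftovers" -eq 0 ]
}
```

For example, `check_hex_output nrp:bucket-name dataset-name` exits non-zero if no partitions exist or temporary chunks remain.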

Monitoring & Debugging

Check Status

# List all jobs
kubectl get jobs -n biodiversity

# Check specific job
kubectl describe job my-dataset-hex -n biodiversity

# View logs (the orchestrator job, then a single hex pod by name)
kubectl logs job/my-dataset-workflow -n biodiversity
kubectl logs pod/my-dataset-hex-0-xxxxx -n biodiversity

# List pods for a job
kubectl get pods -n biodiversity | grep my-dataset-hex
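
For a job with many completions, such as the hex job, a one-line progress summary and a combined log tail are often more useful than inspecting pods one by one. The two helpers below are hypothetical conveniences built only on standard kubectl flags; counters that the job has not populated yet simply print as empty.

```shell
# Hypothetical helpers for monitoring a multi-pod job.

# Print succeeded/completions plus failed and active pod counts.
# Usage: job_progress <job-name> <namespace>
job_progress() {
  local job="$1" ns="$2"
  kubectl get job "$job" -n "$ns" \
    -o jsonpath='{.status.succeeded}/{.spec.completions} succeeded, {.status.failed} failed, {.status.active} active{"\n"}'
}

# Tail the last lines of every pod in the job, prefixed with the pod name.
# Usage: job_logs <job-name> <namespace> [lines-per-pod]
job_logs() {
  local job="$1" ns="$2" lines="${3:-20}"
  kubectl logs -l "job-name=$job" -n "$ns" \
    --prefix --tail="$lines" --max-log-requests=50
}
```

The --max-log-requests value of 50 mirrors the hex job's parallelism; kubectl's default of 5 would reject a selector that matches more pods than that.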

Common Issues

PVC Upload Fails:

# Check uploader pod status
kubectl get pod -l job-name=my-dataset-upload-yamls -n biodiversity

# Check PVC status
kubectl get pvc my-dataset-workflow-yamls -n biodiversity

Orchestrator Can’t Apply Jobs:

# Check RBAC permissions
kubectl get sa,role,rolebinding -n biodiversity | grep cng-datasets-workflow

# Check orchestrator logs
kubectl logs job/my-dataset-workflow -n biodiversity

Job Stuck Pending:

# Check resource limits and node capacity
kubectl describe pod <pod-name> -n biodiversity
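
The scheduler's reason for leaving a pod pending (insufficient CPU or memory, an unbound PVC, and so on) is recorded as events on the pod. A minimal sketch that pulls out just the phase and those events, using a hypothetical helper name:

```shell
# Hypothetical helper: show a pod's phase and its events, oldest first.
# Usage: why_pending <pod-name> <namespace>
why_pending() {
  local pod="$1" ns="$2"
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.status.phase}{"\n"}' || return 1
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```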

Cleanup

Delete Jobs and Resources

# Delete all dataset jobs and PVC
kubectl delete job my-dataset-convert my-dataset-pmtiles \
  my-dataset-hex my-dataset-repartition \
  my-dataset-upload-yamls my-dataset-workflow \
  -n biodiversity --ignore-not-found=true

kubectl delete pvc my-dataset-workflow-yamls -n biodiversity --ignore-not-found=true

Delete Output Data

# Using rclone (purge removes the entire bucket and everything in it)
rclone purge nrp:bucket-name

# Or specific paths
rclone deletefile nrp:bucket-name/dataset-name.parquet
rclone purge nrp:bucket-name/dataset-name/hex/

Advanced Usage

Custom Resource Limits

Edit individual job YAML files:

resources:
  requests:
    cpu: "2"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "16Gi"

Multiple Datasets

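The RBAC manifest is generic, so applying it once per namespace covers every dataset:

# Apply RBAC once per namespace
kubectl apply -f dataset1/workflow-rbac.yaml

# Run multiple datasets (after each dataset's PVC and upload steps)
kubectl apply -f dataset1/workflow.yaml
kubectl apply -f dataset2/workflow.yaml

Each dataset gets its own PVC for YAML files, so the PVC and upload steps from the Quick Start are repeated per dataset.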

Design Rationale

Why PVC Instead of Git?

- No dev repository dependency
- Stateless K8s execution
- No initContainer overhead

Why Not ConfigMaps?

- Better readability (files, not embedded strings)
- No 1 MiB ConfigMap size limit
- Easier to inspect and modify

Why Generic RBAC?

- One ServiceAccount per namespace
- Consistent permissions
- No per-dataset RBAC proliferation