Kubernetes Workflows

Generate and run complete Kubernetes workflows for large-scale geospatial processing.

Overview

The K8s workflow module generates the full set of Kubernetes manifests needed to process a dataset into cloud-native formats (GeoParquet, PMTiles, and H3-indexed Parquet). Job definitions are stored on a PVC, so the workflow runs statelessly and entirely within the cluster.

Architecture

PVC-Based Orchestration

The workflow uses a Persistent Volume Claim (PVC) to store job YAML files, enabling:

  • Stateless execution: No git repository dependencies

  • Laptop can disconnect: Workflows run entirely in the cluster

  • Clean separation: YAML files remain readable and separate

  • Reusability: Same RBAC across all datasets in a namespace

Workflow Components

  1. Processing Jobs:

    • convert-job.yaml - Convert source format → GeoParquet

    • pmtiles-job.yaml - Generate PMTiles vector tiles

    • hex-job.yaml - H3 hexagonal tiling with automatic chunking

    • repartition-job.yaml - Consolidate chunks by h0 partition

  2. Orchestration Infrastructure:

    • workflow-rbac.yaml - ServiceAccount/Role/RoleBinding (one per namespace)

    • workflow-pvc.yaml - PVC for storing YAML files

    • workflow-upload.yaml - Job for uploading YAMLs to PVC

    • workflow.yaml - Orchestrator job that applies jobs from PVC

Quick Start

1. Generate Workflow Files

cng-datasets workflow \
  --dataset my-dataset \
  --source-url https://example.com/data.gpkg \
  --bucket public-my-dataset \
  --h3-resolution 10 \
  --parent-resolutions "9,8,0" \
  --namespace biodiversity \
  --output-dir my-dataset/
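
Before applying anything, it can help to confirm that the generator produced every file listed under Workflow Components. The following is a minimal sketch, not part of cng-datasets; the helper name check_workflow_files is hypothetical, and the file names are taken from the list above.

```shell
# Hypothetical helper: verify that the output directory contains the
# eight YAML files listed under "Workflow Components".
check_workflow_files() {
  dir="$1"
  missing=0
  for f in convert-job.yaml pmtiles-job.yaml hex-job.yaml repartition-job.yaml \
           workflow-rbac.yaml workflow-pvc.yaml workflow-upload.yaml workflow.yaml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Non-zero exit status if any file was missing.
  return "$missing"
}
```

Usage: check_workflow_files my-dataset && echo "all files present"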

2. Run Complete Workflow

# One-time setup per namespace
kubectl apply -f my-dataset/workflow-rbac.yaml

# Create PVC for YAML storage
kubectl apply -f my-dataset/workflow-pvc.yaml

# Start the uploader pod, which mounts the PVC at /yamls
kubectl apply -f my-dataset/workflow-upload.yaml
kubectl wait --for=condition=ready pod -l job-name=my-dataset-upload-yamls -n biodiversity

# Copy YAML files to the PVC through the uploader pod
POD=$(kubectl get pods -l job-name=my-dataset-upload-yamls -n biodiversity -o jsonpath='{.items[0].metadata.name}')
kubectl cp my-dataset/convert-job.yaml "$POD":/yamls/ -n biodiversity
kubectl cp my-dataset/pmtiles-job.yaml "$POD":/yamls/ -n biodiversity
kubectl cp my-dataset/hex-job.yaml "$POD":/yamls/ -n biodiversity
kubectl cp my-dataset/repartition-job.yaml "$POD":/yamls/ -n biodiversity

# Start orchestrator (laptop can disconnect after this)
kubectl apply -f my-dataset/workflow.yaml

# Monitor progress
kubectl logs -f job/my-dataset-workflow -n biodiversity
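
The four copy commands above can be folded into a loop. This is a sketch only: the helper name upload_yamls is hypothetical, and it assumes the uploader pod mounts the PVC at /yamls and carries the job-name label used in the commands above.

```shell
# Hypothetical helper: copy the four processing-job YAMLs into the uploader pod.
# Arguments: dataset name, local directory with the YAMLs, namespace.
upload_yamls() {
  dataset="$1"; dir="$2"; ns="$3"
  # Look up the uploader pod by the job-name label.
  pod=$(kubectl get pods -l "job-name=${dataset}-upload-yamls" -n "$ns" \
        -o jsonpath='{.items[0].metadata.name}') || return 1
  for f in convert-job.yaml pmtiles-job.yaml hex-job.yaml repartition-job.yaml; do
    # Stop at the first failed copy so a partial upload is noticed.
    kubectl cp "$dir/$f" "$pod:/yamls/" -n "$ns" || return 1
  done
}
```

Usage: upload_yamls my-dataset my-dataset biodiversity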

3. Run Jobs Individually (Alternative)

# Apply RBAC once
kubectl apply -f my-dataset/workflow-rbac.yaml

# Run each job manually
kubectl apply -f my-dataset/convert-job.yaml
kubectl apply -f my-dataset/pmtiles-job.yaml

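# Wait for convert to finish before hex.
# kubectl wait gives up after 30s by default, so set a longer timeout
# (6h here is only an example; adjust to your dataset).
kubectl wait --for=condition=complete job/my-dataset-convert -n biodiversity --timeout=6h

kubectl apply -f my-dataset/hex-job.yaml
kubectl wait --for=condition=complete job/my-dataset-hex -n biodiversity --timeout=6h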

kubectl apply -f my-dataset/repartition-job.yaml

Configuration

H3 Resolutions

--h3-resolution 10              # Primary resolution (default: 10)
--parent-resolutions "9,8,0"    # Parent hexes for aggregation

Resolution Reference (approximate average hexagon edge length):

  • h12: ~9m (building-level)

  • h11: ~25m (lot-level)

  • h10: ~66m (street-level) - default

  • h9: ~175m (block-level)

  • h8: ~460m (neighborhood)

  • h7: ~1.2km (district)

  • h0: continent-scale (partitioning key)
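
To get a feel for how parent resolutions relate to the primary resolution: each H3 hexagon has 7 children at the next finer resolution, so one parent cell holds roughly 7^(child - parent) cells. The sketch below does that arithmetic; the function name is made up for illustration, and the count is exact for hexagons only (pentagon cells have fewer children).

```shell
# Approximate number of cells at resolution $1 contained in one parent
# cell at resolution $2 (exact for hexagons, slightly lower for pentagons).
h3_children_per_parent() {
  child="$1"; parent="$2"
  n=1
  i="$parent"
  # Multiply by 7 once per resolution step between parent and child.
  while [ "$i" -lt "$child" ]; do
    n=$((n * 7))
    i=$((i + 1))
  done
  echo "$n"
}
```

With the defaults above, h3_children_per_parent 10 9 prints 7, h3_children_per_parent 10 8 prints 49, and h3_children_per_parent 10 0 prints 282475249.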

Namespace

--namespace biodiversity  # Kubernetes namespace (must exist)

All jobs and RBAC use the specified namespace.

Chunking Behavior

The hex job automatically determines optimal chunking:

  • Uses GDAL to count features from source URL

  • Targets 200 completions with a parallelism of 50

  • Falls back to defaults if counting fails
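
The exact formula the hex job uses is not stated here. As a rough illustration, assuming chunk size is simply the feature count divided by the target number of completions and rounded up, the calculation could look like the sketch below. Both function names are hypothetical; ogrinfo and the /vsicurl/ prefix are standard GDAL.

```shell
# Hypothetical sketch: features per chunk for a target number of completions
# (ceiling division, default target 200).
chunk_size() {
  features="$1"; completions="${2:-200}"
  echo $(( (features + completions - 1) / completions ))
}

# Count features in a remote source with GDAL. /vsicurl/ reads over HTTP.
# Prints the number from the first "Feature Count:" line of ogrinfo's
# summary output, i.e. the first layer only.
count_features() {
  ogrinfo -so -al "/vsicurl/$1" | awk -F': ' '/^Feature Count:/ {print $2; exit}'
}
```

For example, chunk_size 1001 200 prints 6, so 1001 features would be split into chunks of at most 6 features each.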

Processing Details

Two-Pass H3 Approach

  1. Chunking Phase (hex-job.yaml):

    • Process source data in parallel chunks

    • Write to temporary s3://bucket/dataset-name/chunks/

    • Each chunk contains all H3 resolutions

  2. Repartition Phase (repartition-job.yaml):

    • Read all chunks

    • Repartition by h0 hexagon (continent-scale)

    • Write to final s3://bucket/dataset-name/hex/

    • Delete temporary chunks/ directory

Output Structure

s3://bucket/
├── dataset-name.parquet         # GeoParquet with all attributes
├── dataset-name.pmtiles         # PMTiles vector tiles
└── dataset-name/
    └── hex/                     # H3-indexed parquet (partitioned by h0)
        ├── h0=0/
        │   └── *.parquet
        ├── h0=1/
        │   └── *.parquet
        └── ...
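
After the repartition job finishes, the h0 partitions can be listed to confirm the layout above. This sketch assumes an rclone remote named nrp (the same remote name used in the Cleanup section); the helper name is hypothetical.

```shell
# Hypothetical helper: list the h0 partition directories under a dataset's
# hex/ prefix. Arguments: bucket name, dataset name.
list_h0_partitions() {
  bucket="$1"; dataset="$2"
  # --dirs-only restricts the listing to directories; keep only h0=N/ entries.
  rclone lsf --dirs-only "nrp:${bucket}/${dataset}/hex/" | grep '^h0='
}
```

Usage: list_h0_partitions public-my-dataset my-dataset | wc -l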

Monitoring & Debugging

Check Status

# List all jobs
kubectl get jobs -n biodiversity

# Check specific job
kubectl describe job my-dataset-hex -n biodiversity

# View logs for a job (kubectl picks one of the job's pods)
kubectl logs job/my-dataset-workflow -n biodiversity
# View logs for a specific pod of the hex job
kubectl logs my-dataset-hex-0-xxxxx -n biodiversity

# List pods for a job
kubectl get pods -n biodiversity | grep my-dataset-hex
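
With many parallel completions, it is often more useful to list only the pods that failed. The sketch below uses the job-name label that Kubernetes adds to a Job's pods, together with a field selector on pod phase; the helper name failed_pods is hypothetical.

```shell
# Hypothetical helper: print the names of failed pods belonging to a Job.
# Arguments: job name, namespace.
failed_pods() {
  job="$1"; ns="$2"
  kubectl get pods -n "$ns" -l "job-name=$job" \
    --field-selector=status.phase=Failed \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}
```

Usage: for p in $(failed_pods my-dataset-hex biodiversity); do kubectl logs "$p" -n biodiversity --tail=20; done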

Common Issues

PVC Upload Fails:

# Check uploader pod status
kubectl get pod -l job-name=my-dataset-upload-yamls -n biodiversity

# Check PVC status
kubectl get pvc my-dataset-workflow-yamls -n biodiversity

Orchestrator Can’t Apply Jobs:

# Check RBAC permissions
kubectl get sa,role,rolebinding -n biodiversity | grep cng-datasets-workflow

# Check orchestrator logs
kubectl logs job/my-dataset-workflow -n biodiversity

Job Stuck Pending:

# Check resource limits and node capacity
kubectl describe pod <pod-name> -n biodiversity
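
The scheduler's reasons for leaving a pod Pending (for example insufficient CPU or memory) also appear as events. A small sketch, with a hypothetical helper name, that shows only the events for one pod in chronological order:

```shell
# Hypothetical helper: show events for one pod, oldest first.
# Arguments: pod name, namespace.
pod_events() {
  pod="$1"; ns="$2"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```

Usage: pod_events <pod-name> biodiversity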

Cleanup

Delete Jobs and Resources

# Delete all dataset jobs and PVC
kubectl delete job my-dataset-convert my-dataset-pmtiles \
  my-dataset-hex my-dataset-repartition \
  my-dataset-upload-yamls my-dataset-workflow \
  -n biodiversity --ignore-not-found=true

kubectl delete pvc my-dataset-workflow-yamls -n biodiversity --ignore-not-found=true
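
When cleaning up several datasets, the same commands can be parameterized by dataset name. This sketch assumes the naming pattern shown above (dataset name plus a fixed suffix); both helper names are hypothetical.

```shell
# Hypothetical helper: print the job names used in this guide for a dataset.
dataset_job_names() {
  ds="$1"
  for suffix in convert pmtiles hex repartition upload-yamls workflow; do
    echo "${ds}-${suffix}"
  done
}

# Hypothetical helper: delete a dataset's jobs and its YAML PVC.
# Arguments: dataset name, namespace.
cleanup_dataset() {
  ds="$1"; ns="$2"
  # Word splitting of the name list is intentional here.
  kubectl delete job $(dataset_job_names "$ds") -n "$ns" --ignore-not-found=true
  kubectl delete pvc "${ds}-workflow-yamls" -n "$ns" --ignore-not-found=true
}
```

Usage: cleanup_dataset my-dataset biodiversity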

Delete Output Data

# Using rclone (this removes the bucket and everything in it)
rclone purge nrp:bucket-name

# Or specific paths
rclone delete nrp:bucket-name/dataset-name.parquet
rclone purge nrp:bucket-name/dataset-name/hex/

Advanced Usage

Custom Resource Limits

Edit individual job YAML files:

resources:
  requests:
    cpu: "2"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "16Gi"

Multiple Datasets

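The generic RBAC can be shared across datasets in the same namespace:

# Apply RBAC once (the generated file is the same for every dataset)
kubectl apply -f dataset1/workflow-rbac.yaml

# Run multiple datasets (after creating each PVC and uploading its YAMLs)
kubectl apply -f dataset1/workflow.yaml
kubectl apply -f dataset2/workflow.yaml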

Each dataset gets its own PVC for YAML files.

Design Rationale

Why PVC Instead of Git?

  • No dev repository dependency

  • Stateless K8s execution

  • No initContainer overhead

Why Not ConfigMaps?

  • Better readability (files not embedded strings)

  • No 1 MiB size limit (the maximum size of a ConfigMap)

  • Easier to inspect and modify

Why Generic RBAC?

  • One ServiceAccount per namespace

  • Consistent permissions

  • No per-dataset RBAC proliferation