# Kubernetes Workflows

Generate and run complete Kubernetes workflows for large-scale geospatial processing.

## Overview

The K8s workflow module generates complete workflows for processing datasets into cloud-native formats. The workflow uses a **PVC-based orchestration** approach that allows stateless execution entirely within Kubernetes.

## Architecture

### PVC-Based Orchestration

The workflow uses a Persistent Volume Claim (PVC) to store job YAML files, enabling:

- **Stateless execution**: No git repository dependencies
- **Laptop disconnect**: Workflows run entirely in the cluster
- **Clean separation**: YAML files remain readable and separate
- **Reusability**: Same RBAC across all datasets in a namespace

### Workflow Components

1. **Processing Jobs**:
   - `convert-job.yaml` - Convert source format → GeoParquet
   - `pmtiles-job.yaml` - Generate PMTiles vector tiles
   - `hex-job.yaml` - H3 hexagonal tiling with automatic chunking
   - `repartition-job.yaml` - Consolidate chunks by h0 partition

2. **Orchestration Infrastructure**:
   - `workflow-rbac.yaml` - ServiceAccount/Role/RoleBinding (one per namespace)
   - `workflow-pvc.yaml` - PVC for storing YAML files
   - `workflow-upload.yaml` - Job for uploading YAMLs to the PVC
   - `workflow.yaml` - Orchestrator job that applies jobs from the PVC

## Quick Start

### 1. Generate Workflow Files

```bash
cng-datasets workflow \
  --dataset my-dataset \
  --source-url https://example.com/data.gpkg \
  --bucket public-my-dataset \
  --h3-resolution 10 \
  --parent-resolutions "9,8,0" \
  --namespace biodiversity \
  --output-dir my-dataset/
```

### 2. Run Complete Workflow

```bash
# One-time setup per namespace
kubectl apply -f my-dataset/workflow-rbac.yaml

# Create PVC for YAML storage
kubectl apply -f my-dataset/workflow-pvc.yaml

# Upload YAML files to PVC
kubectl apply -f my-dataset/workflow-upload.yaml
kubectl wait --for=condition=ready pod -l job-name=my-dataset-upload-yamls -n biodiversity

# Copy YAML files to PVC
POD=$(kubectl get pods -l job-name=my-dataset-upload-yamls -n biodiversity -o jsonpath='{.items[0].metadata.name}')
kubectl cp my-dataset/convert-job.yaml $POD:/yamls/ -n biodiversity
kubectl cp my-dataset/pmtiles-job.yaml $POD:/yamls/ -n biodiversity
kubectl cp my-dataset/hex-job.yaml $POD:/yamls/ -n biodiversity
kubectl cp my-dataset/repartition-job.yaml $POD:/yamls/ -n biodiversity

# Start orchestrator (laptop can disconnect after this)
kubectl apply -f my-dataset/workflow.yaml

# Monitor progress
kubectl logs -f job/my-dataset-workflow -n biodiversity
```
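The four `kubectl cp` calls above can also be written as a loop. A minimal sketch, reusing the `$POD` variable set in step 2 (the file names and the `/yamls/` mount path are the ones generated above):

```bash
# Copy every generated job YAML to the PVC in one pass
for job in convert-job pmtiles-job hex-job repartition-job; do
  kubectl cp "my-dataset/${job}.yaml" "${POD}:/yamls/" -n biodiversity
done
```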
### 3. Run Jobs Individually (Alternative)

```bash
# Apply RBAC once
kubectl apply -f my-dataset/workflow-rbac.yaml

# Run each job manually
kubectl apply -f my-dataset/convert-job.yaml
kubectl apply -f my-dataset/pmtiles-job.yaml

# Wait for convert to finish before hex
kubectl wait --for=condition=complete job/my-dataset-convert -n biodiversity
kubectl apply -f my-dataset/hex-job.yaml
kubectl wait --for=condition=complete job/my-dataset-hex -n biodiversity
kubectl apply -f my-dataset/repartition-job.yaml
```

## Configuration

### H3 Resolutions

```bash
--h3-resolution 10             # Primary resolution (default: 10)
--parent-resolutions "9,8,0"   # Parent hexes for aggregation
```

**Resolution Reference:**

- h12: ~3m (building-level)
- h11: ~10m (lot-level)
- h10: ~15m (street-level) - **default**
- h9: ~50m (block-level)
- h8: ~175m (neighborhood)
- h7: ~600m (district)
- h0: continent-scale (partitioning key)

### Namespace

```bash
--namespace biodiversity   # Kubernetes namespace (must exist)
```

All jobs and RBAC use the specified namespace.

### Chunking Behavior

The hex job automatically determines optimal chunking:

- Uses GDAL to count features from the source URL
- Targets 200 completions with 50 parallelism
- Falls back to defaults if counting fails

## Processing Details

### Two-Pass H3 Approach

1. **Chunking Phase** (`hex-job.yaml`):
   - Process source data in parallel chunks
   - Write to temporary `s3://bucket/dataset-name/chunks/`
   - Each chunk contains all H3 resolutions

2. **Repartition Phase** (`repartition-job.yaml`):
   - Read all chunks
   - Repartition by h0 hexagon (continent-scale)
   - Write to final `s3://bucket/dataset-name/hex/`
   - Delete temporary `chunks/` directory

### Output Structure

```
s3://bucket/
├── dataset-name.parquet   # GeoParquet with all attributes
├── dataset-name.pmtiles   # PMTiles vector tiles
└── dataset-name/
    └── hex/               # H3-indexed parquet (partitioned by h0)
        ├── h0=0/
        │   └── *.parquet
        ├── h0=1/
        │   └── *.parquet
        └── ...
```
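Once the workflow finishes, the layout above can be checked directly from object storage. A quick sketch using rclone (assuming the same `nrp:` remote used in the Cleanup section below and the `public-my-dataset` bucket from the Quick Start):

```bash
# Top-level artifacts: the GeoParquet file, the PMTiles file, and the dataset prefix
rclone ls nrp:public-my-dataset --max-depth 1
rclone lsd nrp:public-my-dataset

# One h0 partition per subdirectory under hex/
rclone lsd nrp:public-my-dataset/my-dataset/hex
```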
## Monitoring & Debugging

### Check Status

```bash
# List all jobs
kubectl get jobs -n biodiversity

# Check specific job
kubectl describe job my-dataset-hex -n biodiversity

# View logs
kubectl logs job/my-dataset-workflow -n biodiversity
kubectl logs my-dataset-hex-0-xxxxx -n biodiversity   # a specific hex worker pod

# List pods for a job
kubectl get pods -n biodiversity | grep my-dataset-hex
```

### Common Issues

**PVC Upload Fails:**

```bash
# Check uploader pod status
kubectl get pod -l job-name=my-dataset-upload-yamls -n biodiversity

# Check PVC status
kubectl get pvc my-dataset-workflow-yamls -n biodiversity
```

**Orchestrator Can't Apply Jobs:**

```bash
# Check RBAC permissions
kubectl get sa,role,rolebinding -n biodiversity | grep cng-datasets-workflow

# Check orchestrator logs
kubectl logs job/my-dataset-workflow -n biodiversity
```

**Job Stuck Pending:**

```bash
# Check resource limits and node capacity
kubectl describe pod -n biodiversity
```

## Cleanup

### Delete Jobs and Resources

```bash
# Delete all dataset jobs and the PVC
kubectl delete job my-dataset-convert my-dataset-pmtiles \
  my-dataset-hex my-dataset-repartition \
  my-dataset-upload-yamls my-dataset-workflow \
  -n biodiversity --ignore-not-found=true

kubectl delete pvc my-dataset-workflow-yamls -n biodiversity --ignore-not-found=true
```

### Delete Output Data

```bash
# Using rclone
rclone purge nrp:bucket-name

# Or specific paths
rclone delete nrp:bucket-name/dataset.parquet
rclone purge nrp:bucket-name/hex/
```

## Advanced Usage

### Custom Resource Limits

Edit individual job YAML files:

```yaml
resources:
  requests:
    cpu: "2"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "16Gi"
```

### Multiple Datasets

The generic RBAC can be shared:

```bash
# Apply RBAC once
kubectl apply -f workflow-rbac.yaml

# Run multiple datasets
kubectl apply -f dataset1/workflow.yaml
kubectl apply -f dataset2/workflow.yaml
```

Each dataset gets its own PVC for YAML files.

## Design Rationale

### Why PVC Instead of Git?

- No dev repository dependency
- Stateless K8s execution
- No initContainer overhead

### Why Not ConfigMaps?

- Better readability (files, not embedded strings)
- No 1MB size limit
- Easier to inspect and modify

### Why Generic RBAC?

- One ServiceAccount per namespace
- Consistent permissions
- No per-dataset RBAC proliferation
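One practical consequence of the PVC-based design is that the workflow's inputs can always be inspected in-cluster rather than in a git checkout. A small sketch (assuming the uploader pod from the Quick Start is still running and mounts the PVC at `/yamls/`):

```bash
# List the job YAMLs currently stored on the workflow PVC
POD=$(kubectl get pods -l job-name=my-dataset-upload-yamls -n biodiversity -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -n biodiversity -- ls -l /yamls/
```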