# Vector Processing

Convert polygon and point datasets to H3-indexed GeoParquet format.

## Overview

The vector processing module provides tools to convert geospatial vector data into H3-indexed parquet files. This is particularly useful for:

- Large polygon datasets (e.g., protected areas, administrative boundaries)
- Point datasets (e.g., species observations)
- Datasets that need hierarchical aggregation at multiple H3 resolutions

## Basic Usage

### Python API

```python
from cng_datasets.vector import H3VectorProcessor

processor = H3VectorProcessor(
    input_url="s3://my-bucket/polygons.parquet",
    output_url="s3://my-bucket/h3-indexed/",
    h3_resolution=10,
    parent_resolutions=[9, 8, 0],
    chunk_size=500,
    intermediate_chunk_size=10
)

# Process all chunks
output_files = processor.process_all_chunks()

# Or process a specific chunk (useful for parallel processing)
output_file = processor.process_chunk(chunk_id=5)
```

### Command-Line Interface

```bash
# Process entire dataset
cng-datasets vector \
  --input s3://bucket/input.parquet \
  --output s3://bucket/output/ \
  --resolution 10 \
  --chunk-size 500

# Process specific chunk
cng-datasets vector \
  --input s3://bucket/input.parquet \
  --output s3://bucket/output/ \
  --chunk-id 0 \
  --intermediate-chunk-size 5
```

## Two-Pass Processing

The toolkit uses a memory-efficient two-pass approach to handle large polygons:

### Pass 1: Convert to H3 Arrays

- Converts geometries to H3 cell arrays (no unnesting)
- Writes to an intermediate file
- Memory-efficient for complex polygons

### Pass 2: Unnest and Write

- Reads the arrays in small batches
- Unnests them into individual H3 cells
- Writes the final output

This prevents OOM errors when processing large polygons at high H3 resolutions. If you still hit memory limits, reduce `intermediate_chunk_size` (default: 10).
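To make the two-pass idea concrete, here is a minimal, self-contained sketch of the pattern; it is not the toolkit's internal implementation. It assumes the h3 v4 Python API plus shapely, pandas, and pyarrow, and the toy polygon and file names (`intermediate.parquet`, `unnested.parquet`) are made up for illustration.

```python
import h3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from shapely.geometry import Polygon

# --- Pass 1: geometry -> array of H3 cells, kept nested (no unnesting yet) ---
df = pd.DataFrame({
    "name": ["demo_area"],  # illustrative attribute column
    "geometry": [Polygon([(-122.42, 37.77), (-122.39, 37.77),
                          (-122.39, 37.80), (-122.42, 37.80)])],
})
resolution = 10
df["h3_cells"] = df["geometry"].apply(
    lambda geom: h3.geo_to_cells(geom.__geo_interface__, resolution)
)
# Shapely objects can't go straight into parquet, so drop the geometry;
# each row now carries a compact array of H3 cell IDs instead.
pq.write_table(
    pa.Table.from_pandas(df.drop(columns="geometry"), preserve_index=False),
    "intermediate.parquet",
)

# --- Pass 2: read the arrays back in small batches and unnest them -----------
pf = pq.ParquetFile("intermediate.parquet")
writer = None
for batch in pf.iter_batches(batch_size=10):  # plays the role of intermediate_chunk_size
    exploded = (
        batch.to_pandas()
        .explode("h3_cells")                      # one row per H3 cell
        .rename(columns={"h3_cells": "h3_cell"})
    )
    table = pa.Table.from_pandas(exploded, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter("unnested.parquet", table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()
```

Because Pass 1 holds only one compact cell array per geometry and Pass 2 expands only a handful of rows at a time, peak memory stays bounded even when a single polygon covers millions of cells.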
## Parameters

### H3VectorProcessor

- `input_url` (str): Path to input GeoParquet file
- `output_url` (str): Path to output directory
- `h3_resolution` (int): Primary H3 resolution (default: 10)
- `parent_resolutions` (list[int]): Parent resolutions for aggregation (default: [9, 8, 0])
- `chunk_size` (int): Number of rows to process at once in Pass 1 (default: 500)
- `intermediate_chunk_size` (int): Number of array rows to unnest at once in Pass 2 (default: 10)
- `read_credentials` (dict, optional): S3 credentials for reading
- `write_credentials` (dict, optional): S3 credentials for writing

## Chunked Processing

For large datasets, process in chunks to avoid memory issues:

```python
# Process in parallel using Kubernetes
from cng_datasets.k8s import K8sJobManager

manager = K8sJobManager()
job = manager.generate_chunked_job(
    job_name="dataset-h3-tiling",
    script_path="/app/tile_vectors.py",
    num_chunks=100,
    parallelism=20
)
manager.save_job_yaml(job, "tiling-job.yaml")
```

## Memory Optimization

If you encounter out-of-memory (OOM) errors:

1. **Reduce `chunk_size`**: Processes fewer rows in Pass 1
2. **Reduce `intermediate_chunk_size`**: Unnests fewer arrays in Pass 2
3. **Use a lower H3 resolution**: Fewer cells per geometry
4. **Process in chunks**: Use the `chunk_id` parameter for parallel processing

Example for memory-constrained environments:

```python
processor = H3VectorProcessor(
    input_url="s3://bucket/large-polygons.parquet",
    output_url="s3://bucket/output/",
    h3_resolution=10,
    chunk_size=100,            # Reduced from default 500
    intermediate_chunk_size=5  # Reduced from default 10
)
```

## Output Format

Output is partitioned by h0 (continent-scale) H3 cells:

```
s3://bucket/output/
├── h0=0/
│   └── chunk_0.parquet
├── h0=1/
│   └── chunk_0.parquet
...
```

Each parquet file contains:

- `h3_cell`: H3 cell ID at the specified resolution
- Original attributes from the input dataset
- Parent H3 cells if `parent_resolutions` is specified

## Examples

See the following directories for complete examples:

- `redlining/` - Vector polygon processing with chunking
- `wdpa/` - Large-scale protected areas processing
- `pad-us/` - Protected areas database H3 tiling
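As a quick way to inspect the partitioned output described in the Output Format section, any engine that understands Hive-style partitions can read it directly. The sketch below assumes DuckDB and a local copy of the output directory; pointing it at the `s3://` URL works the same way once DuckDB's `httpfs` extension and credentials are configured.

```python
import duckdb

con = duckdb.connect()

# Count indexed cells per h0 partition. The h0 column comes from the
# Hive-style directory names (h0=<cell>/), not from the parquet files themselves.
per_h0 = con.execute("""
    SELECT h0, count(*) AS n_cells
    FROM read_parquet('output/**/*.parquet', hive_partitioning = true)
    GROUP BY h0
    ORDER BY n_cells DESC
""").df()
print(per_h0)
```

The same pattern works for pulling a single `h0` partition or joining the `h3_cell` column against other H3-indexed datasets.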