Vector Processing

Convert polygon and point datasets to H3-indexed GeoParquet format.

Overview

The vector processing module provides tools to convert geospatial vector data into H3-indexed parquet files. This is particularly useful for:

  • Large polygon datasets (e.g., protected areas, administrative boundaries)

  • Point datasets (e.g., species observations)

  • Datasets that need hierarchical aggregation at multiple H3 resolutions
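
Because H3 cells are hierarchical, a fine-resolution index can be rolled up to coarser parents without touching the original geometries. A minimal illustration using the h3-py v4 API (the coordinates and resolutions here are arbitrary):

import h3

# Index a point at resolution 10, then roll it up to coarser parent cells.
cell = h3.latlng_to_cell(37.7749, -122.4194, 10)
parents = {res: h3.cell_to_parent(cell, res) for res in (9, 8, 0)}
print(cell, parents)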

Basic Usage

Python API

from cng_datasets.vector import H3VectorProcessor

processor = H3VectorProcessor(
    input_url="s3://my-bucket/polygons.parquet",
    output_url="s3://my-bucket/h3-indexed/",
    h3_resolution=10,
    parent_resolutions=[9, 8, 0],
    chunk_size=500,
    intermediate_chunk_size=10
)

# Process all chunks
output_files = processor.process_all_chunks()

# Or process a specific chunk (useful for parallel processing)
output_file = processor.process_chunk(chunk_id=5)

Command-Line Interface

# Process entire dataset
cng-datasets vector \
    --input s3://bucket/input.parquet \
    --output s3://bucket/output/ \
    --resolution 10 \
    --chunk-size 500

# Process specific chunk
cng-datasets vector \
    --input s3://bucket/input.parquet \
    --output s3://bucket/output/ \
    --chunk-id 0 \
    --intermediate-chunk-size 5

Two-Pass Processing

The toolkit uses a memory-efficient two-pass approach to handle large polygons:

Pass 1: Convert to H3 Arrays

  • Converts geometries to H3 cell arrays (no unnesting)

  • Writes to intermediate file

  • Memory-efficient for complex polygons

Pass 2: Unnest and Write

  • Reads arrays in small batches

  • Unnests them into individual H3 cells

  • Writes final output

This prevents OOM errors when processing large polygons at high H3 resolutions. If you still hit memory limits, reduce intermediate_chunk_size (default: 10).
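
The sketch below illustrates this two-pass pattern outside the library, using geopandas, h3-py (v4 API), and pyarrow. File names, column names, and the resolution are placeholders; this is not the H3VectorProcessor implementation itself.

import geopandas as gpd
import h3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Pass 1: one H3 cell array per geometry -- nothing is unnested yet,
# so each input row stays a single, compact output row.
gdf = gpd.read_file("polygons.gpkg").to_crs(epsg=4326)
gdf["h3_cells"] = gdf.geometry.apply(lambda geom: h3.geo_to_cells(geom, 10))
intermediate = pa.Table.from_pandas(pd.DataFrame(gdf.drop(columns="geometry")))
pq.write_table(intermediate, "intermediate.parquet")

# Pass 2: unnest a few array rows at a time (here, 10 per batch, mirroring
# intermediate_chunk_size) and write each exploded batch to its own file.
table = pq.read_table("intermediate.parquet")
for i, start in enumerate(range(0, table.num_rows, 10)):
    batch = table.slice(start, 10).to_pandas()
    exploded = batch.explode("h3_cells").rename(columns={"h3_cells": "h3_cell"})
    exploded.to_parquet(f"h3_indexed_{i}.parquet", index=False)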

Parameters

H3VectorProcessor

  • input_url (str): Path to input GeoParquet file

  • output_url (str): Path to output directory

  • h3_resolution (int): Primary H3 resolution (default: 10)

  • parent_resolutions (list[int]): Parent resolutions for aggregation (default: [9, 8, 0])

  • chunk_size (int): Number of rows to process at once in Pass 1 (default: 500)

  • intermediate_chunk_size (int): Number of array rows to unnest at once in Pass 2 (default: 10)

  • read_credentials (dict, optional): S3 credentials for reading

  • write_credentials (dict, optional): S3 credentials for writing
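
When the input and output buckets require different access, read and write credentials can be supplied separately. The dictionary keys below ("key", "secret") follow common S3/fsspec conventions and are an assumption; check the library for the exact shape it expects.

# Illustrative only: the credential dictionary keys are assumed, not
# taken from the H3VectorProcessor API.
import os

from cng_datasets.vector import H3VectorProcessor

processor = H3VectorProcessor(
    input_url="s3://public-bucket/polygons.parquet",
    output_url="s3://private-bucket/h3-indexed/",
    h3_resolution=10,
    parent_resolutions=[9, 8, 0],
    write_credentials={
        "key": os.environ["AWS_ACCESS_KEY_ID"],
        "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
    },
)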

Chunked Processing

For large datasets, process in chunks to avoid memory issues:

# Process in parallel using Kubernetes
from cng_datasets.k8s import K8sJobManager

manager = K8sJobManager()
job = manager.generate_chunked_job(
    job_name="dataset-h3-tiling",
    script_path="/app/tile_vectors.py",
    num_chunks=100,
    parallelism=20
)
manager.save_job_yaml(job, "tiling-job.yaml")
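
Each pod then needs to know which chunk it owns. Below is a sketch of what a worker script such as /app/tile_vectors.py might look like, assuming the chunk index is delivered through the JOB_COMPLETION_INDEX environment variable (the Kubernetes Indexed Job convention); how generate_chunked_job actually passes the index may differ.

# Hypothetical worker sketch -- not the repository's tile_vectors.py.
# Assumes the chunk index arrives via JOB_COMPLETION_INDEX, which Kubernetes
# sets on each pod of an Indexed Job; adapt if the index is wired differently.
import os

from cng_datasets.vector import H3VectorProcessor

chunk_id = int(os.environ["JOB_COMPLETION_INDEX"])

processor = H3VectorProcessor(
    input_url="s3://my-bucket/polygons.parquet",
    output_url="s3://my-bucket/h3-indexed/",
    h3_resolution=10,
)
processor.process_chunk(chunk_id=chunk_id)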

Memory Optimization

If you encounter out-of-memory (OOM) errors:

  1. Reduce chunk_size: Processes fewer rows in Pass 1

  2. Reduce intermediate_chunk_size: Unnests fewer arrays in Pass 2

  3. Use a lower H3 resolution: Fewer cells per geometry

  4. Process in chunks: Use the chunk_id parameter for parallel processing

Example for memory-constrained environments:

processor = H3VectorProcessor(
    input_url="s3://bucket/large-polygons.parquet",
    output_url="s3://bucket/output/",
    h3_resolution=10,
    chunk_size=100,  # Reduced from default 500
    intermediate_chunk_size=5  # Reduced from default 10
)

Output Format

Output is partitioned by resolution-0 (h0, continent-scale) H3 cells:

s3://bucket/output/
├── h0=0/
│   └── chunk_0.parquet
├── h0=1/
│   └── chunk_0.parquet
└── ...

Each parquet file contains:

  • h3_cell: H3 cell ID at the configured h3_resolution

  • Original attributes from input dataset

  • Parent H3 cell columns, if parent_resolutions is specified
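
Because the partition directories use the key=value (Hive) layout, any engine with Hive-partitioning support can read the output directly. For example, with DuckDB (the bucket path is a placeholder):

# Counting cells per h0 partition. Reading from S3 requires DuckDB's
# httpfs extension and credentials for private buckets.
import duckdb

con = duckdb.connect()
counts = con.execute(
    """
    SELECT h0, count(*) AS n_cells
    FROM read_parquet('s3://bucket/output/**/*.parquet', hive_partitioning = true)
    GROUP BY h0
    ORDER BY n_cells DESC
    """
).df()
print(counts)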

Examples

See the following directories for complete examples:

  • redlining/ - Vector polygon processing with chunking

  • wdpa/ - Large-scale protected areas processing

  • pad-us/ - Protected areas database H3 tiling