Vector Processing

Convert polygon and point datasets to H3-indexed GeoParquet format.

Overview

The vector processing module provides tools to convert geospatial vector data into H3-indexed parquet files. This is particularly useful for:

  • Large polygon datasets (e.g., protected areas, administrative boundaries)

  • Point datasets (e.g., species observations)

  • Datasets that need hierarchical aggregation at multiple H3 resolutions
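
Because H3 cells are hierarchical, a fine-resolution index can be rolled up to coarser parents without touching the original geometries. A minimal illustration using the h3-py v4 API (the coordinates and resolutions here are arbitrary):

import h3

# Index a point at resolution 10, then roll it up to coarser parent cells.
cell = h3.latlng_to_cell(37.7749, -122.4194, 10)
parents = {res: h3.cell_to_parent(cell, res) for res in (9, 8, 0)}
print(cell, parents)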

Basic Usage

Python API

from cng_datasets.vector import H3VectorProcessor

processor = H3VectorProcessor(
    input_url="s3://my-bucket/polygons.parquet",
    output_url="s3://my-bucket/h3-indexed/",
    h3_resolution=10,
    parent_resolutions=[9, 8, 0],
    chunk_size=500,
    intermediate_chunk_size=10
)

# Process all chunks
output_files = processor.process_all_chunks()

# Or process a specific chunk (useful for parallel processing)
output_file = processor.process_chunk(chunk_id=5)

Command-Line Interface

# Process entire dataset
cng-datasets vector \
    --input s3://bucket/input.parquet \
    --output s3://bucket/output/ \
    --resolution 10 \
    --chunk-size 500

# Process specific chunk
cng-datasets vector \
    --input s3://bucket/input.parquet \
    --output s3://bucket/output/ \
    --chunk-id 0 \
    --intermediate-chunk-size 5

Two-Pass Processing

The toolkit uses a memory-efficient two-pass approach to handle large polygons:

Pass 1: Convert to H3 Arrays

  • Converts geometries to H3 cell arrays (no unnesting)

  • Writes to intermediate file

  • Memory-efficient for complex polygons

Pass 2: Unnest and Write

  • Reads arrays in small batches

  • Unnests them into individual H3 cells

  • Writes final output

This prevents OOM errors when processing large polygons at high H3 resolutions. If you still hit memory limits, reduce intermediate_chunk_size (default: 10).
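
The sketch below illustrates this two-pass pattern outside the library, using geopandas, h3-py (v4 API), and pyarrow. File names, column names, and the resolution are placeholders; this is not the H3VectorProcessor implementation itself.

import geopandas as gpd
import h3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Pass 1: one H3 cell array per geometry -- nothing is unnested yet,
# so each input row stays a single, compact output row.
gdf = gpd.read_file("polygons.gpkg").to_crs(epsg=4326)
gdf["h3_cells"] = gdf.geometry.apply(lambda geom: h3.geo_to_cells(geom, 10))
intermediate = pa.Table.from_pandas(pd.DataFrame(gdf.drop(columns="geometry")))
pq.write_table(intermediate, "intermediate.parquet")

# Pass 2: unnest a few array rows at a time (here, 10 per batch, mirroring
# intermediate_chunk_size) and write each exploded batch to its own file.
table = pq.read_table("intermediate.parquet")
for i, start in enumerate(range(0, table.num_rows, 10)):
    batch = table.slice(start, 10).to_pandas()
    exploded = batch.explode("h3_cells").rename(columns={"h3_cells": "h3_cell"})
    exploded.to_parquet(f"h3_indexed_{i}.parquet", index=False)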

Parameters

H3VectorProcessor

  • input_url (str): Path to input GeoParquet file

  • output_url (str): Path to output directory

  • h3_resolution (int): Primary H3 resolution (default: 10)

  • parent_resolutions (list[int]): Parent resolutions for aggregation (default: [9, 8, 0])

  • chunk_size (int): Number of rows to process at once in Pass 1 (default: 500)

  • intermediate_chunk_size (int): Number of array rows to unnest at once in Pass 2 (default: 10)

  • read_credentials (dict, optional): S3 credentials for reading

  • write_credentials (dict, optional): S3 credentials for writing
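
When the input and output buckets require different access, read and write credentials can be supplied separately. The dictionary keys below ("key", "secret") follow common S3/fsspec conventions and are an assumption; check the library for the exact shape it expects.

# Illustrative only: the credential dictionary keys are assumed, not
# taken from the H3VectorProcessor API.
import os

from cng_datasets.vector import H3VectorProcessor

processor = H3VectorProcessor(
    input_url="s3://public-bucket/polygons.parquet",
    output_url="s3://private-bucket/h3-indexed/",
    h3_resolution=10,
    parent_resolutions=[9, 8, 0],
    write_credentials={
        "key": os.environ["AWS_ACCESS_KEY_ID"],
        "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
    },
)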

Chunked Processing

For large datasets, process in chunks to avoid memory issues:

# Process in parallel using Kubernetes
from cng_datasets.k8s import K8sJobManager

manager = K8sJobManager()
job = manager.generate_chunked_job(
    job_name="dataset-h3-tiling",
    script_path="/app/tile_vectors.py",
    num_chunks=100,
    parallelism=20
)
manager.save_job_yaml(job, "tiling-job.yaml")
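
Each pod then needs to know which chunk it owns. Below is a sketch of what a worker script such as /app/tile_vectors.py might look like, assuming the chunk index is delivered through the JOB_COMPLETION_INDEX environment variable (the Kubernetes Indexed Job convention); how generate_chunked_job actually passes the index may differ.

# Hypothetical worker sketch -- not the repository's tile_vectors.py.
# Assumes the chunk index arrives via JOB_COMPLETION_INDEX, which Kubernetes
# sets on each pod of an Indexed Job; adapt if the index is wired differently.
import os

from cng_datasets.vector import H3VectorProcessor

chunk_id = int(os.environ["JOB_COMPLETION_INDEX"])

processor = H3VectorProcessor(
    input_url="s3://my-bucket/polygons.parquet",
    output_url="s3://my-bucket/h3-indexed/",
    h3_resolution=10,
)
processor.process_chunk(chunk_id=chunk_id)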

Memory Optimization

If you encounter out-of-memory (OOM) errors:

  1. Reduce chunk_size: Processes fewer rows in Pass 1

  2. Reduce intermediate_chunk_size: Unnests fewer arrays in Pass 2

  3. Use a lower H3 resolution: Fewer cells per geometry

  4. Process in chunks: Use the chunk_id parameter for parallel processing

Example for memory-constrained environments:

processor = H3VectorProcessor(
    input_url="s3://bucket/large-polygons.parquet",
    output_url="s3://bucket/output/",
    h3_resolution=10,
    chunk_size=100,  # Reduced from default 500
    intermediate_chunk_size=5  # Reduced from default 10
)

Output Format

Output is partitioned by resolution-0 (h0, continent-scale) H3 cells:

s3://bucket/output/
├── h0=0/
│   └── chunk_0.parquet
├── h0=1/
│   └── chunk_0.parquet
└── ...

Each parquet file contains:

  • h3_cell: H3 cell ID at the configured h3_resolution

  • Original attributes from input dataset

  • Parent H3 cell columns, if parent_resolutions is specified
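
Because the partition directories use the key=value (Hive) layout, any engine with Hive-partitioning support can read the output directly. For example, with DuckDB (the bucket path is a placeholder):

# Counting cells per h0 partition. Reading from S3 requires DuckDB's
# httpfs extension and credentials for private buckets.
import duckdb

con = duckdb.connect()
counts = con.execute(
    """
    SELECT h0, count(*) AS n_cells
    FROM read_parquet('s3://bucket/output/**/*.parquet', hive_partitioning = true)
    GROUP BY h0
    ORDER BY n_cells DESC
    """
).df()
print(counts)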

Examples

See the following directories for complete examples:

  • redlining/ - Vector polygon processing with chunking

  • wdpa/ - Large-scale protected areas processing

  • pad-us/ - Protected areas database H3 tiling