Vector Processing
Convert polygon and point datasets to H3-indexed GeoParquet format.
Overview
The vector processing module provides tools to convert geospatial vector data into H3-indexed parquet files. This is particularly useful for:
Large polygon datasets (e.g., protected areas, administrative boundaries)
Point datasets (e.g., species observations)
Datasets that need hierarchical aggregation at multiple H3 resolutions
Basic Usage
Python API
from cng_datasets.vector import H3VectorProcessor

processor = H3VectorProcessor(
    input_url="s3://my-bucket/polygons.parquet",
    output_url="s3://my-bucket/h3-indexed/",
    h3_resolution=10,
    parent_resolutions=[9, 8, 0],
    chunk_size=500,
    intermediate_chunk_size=10
)

# Process all chunks
output_files = processor.process_all_chunks()

# Or process a specific chunk (useful for parallel processing)
output_file = processor.process_chunk(chunk_id=5)
Command-Line Interface
# Process entire dataset
cng-datasets vector \
  --input s3://bucket/input.parquet \
  --output s3://bucket/output/ \
  --resolution 10 \
  --chunk-size 500

# Process specific chunk
cng-datasets vector \
  --input s3://bucket/input.parquet \
  --output s3://bucket/output/ \
  --chunk-id 0 \
  --intermediate-chunk-size 5
Two-Pass Processing
The toolkit uses a memory-efficient two-pass approach to handle large polygons:
Pass 1: Convert to H3 Arrays
Converts geometries to H3 cell arrays (no unnesting)
Writes to intermediate file
Memory-efficient for complex polygons
Pass 2: Unnest and Write
Reads arrays in small batches
Unnests them into individual H3 cells
Writes final output
This prevents OOM errors when processing large polygons at high H3 resolutions. If you still hit memory limits, reduce intermediate_chunk_size (default: 10).
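To make the two passes concrete, here is a minimal sketch of the idea, not the toolkit's internal code. It assumes point inputs with lat/lng columns, pandas, and the h3 v4 Python package; polygons would instead produce many cells per row via a polygon fill.

import h3
import pandas as pd

def pass_one(df: pd.DataFrame, resolution: int) -> pd.DataFrame:
    """Pass 1: attach an H3 cell array per row, without unnesting."""
    out = df.copy()
    # Points yield a single-cell array; polygons would yield many cells per row.
    out["h3_cells"] = [
        [h3.latlng_to_cell(lat, lng, resolution)]
        for lat, lng in zip(out["lat"], out["lng"])
    ]
    return out

def pass_two(df: pd.DataFrame, batch_size: int = 10):
    """Pass 2: unnest the cell arrays in small batches to bound peak memory."""
    for start in range(0, len(df), batch_size):
        batch = df.iloc[start:start + batch_size]
        yield batch.explode("h3_cells").rename(columns={"h3_cells": "h3_cell"})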
Parameters
H3VectorProcessor
input_url (str): Path to input GeoParquet file
output_url (str): Path to output directory
h3_resolution (int): Primary H3 resolution (default: 10)
parent_resolutions (list[int]): Parent resolutions for aggregation (default: [9, 8, 0]; see the sketch after this list)
chunk_size (int): Number of rows to process at once in Pass 1 (default: 500)
intermediate_chunk_size (int): Number of array rows to unnest at once in Pass 2 (default: 10)
read_credentials (dict, optional): S3 credentials for reading
write_credentials (dict, optional): S3 credentials for writing
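For intuition about parent_resolutions: each output row's parent cells are simply the coarser H3 cells containing the primary cell. The relationship can be reproduced with the h3 Python package (v4 API); the coordinates below are arbitrary example values.

import h3

# A resolution-10 cell at an arbitrary example location
cell = h3.latlng_to_cell(46.6, -110.0, 10)

# Its parents at the default parent_resolutions [9, 8, 0]
parents = {res: h3.cell_to_parent(cell, res) for res in (9, 8, 0)}
print(cell, parents)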
Chunked Processing
For large datasets, process in chunks to avoid memory issues:
# Process in parallel using Kubernetes
from cng_datasets.k8s import K8sJobManager

manager = K8sJobManager()
job = manager.generate_chunked_job(
    job_name="dataset-h3-tiling",
    script_path="/app/tile_vectors.py",
    num_chunks=100,
    parallelism=20
)
manager.save_job_yaml(job, "tiling-job.yaml")
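The script referenced by script_path needs to know which chunk its pod should handle. One common pattern, sketched below under the assumption that the generated Job runs as a Kubernetes Indexed Job (which sets JOB_COMPLETION_INDEX in each pod), is to map that index to chunk_id; the bucket paths are placeholders.

import os

from cng_datasets.vector import H3VectorProcessor

# Assumes an Indexed Job: Kubernetes injects JOB_COMPLETION_INDEX into each pod
chunk_id = int(os.environ["JOB_COMPLETION_INDEX"])

processor = H3VectorProcessor(
    input_url="s3://bucket/input.parquet",
    output_url="s3://bucket/output/",
    h3_resolution=10
)
processor.process_chunk(chunk_id=chunk_id)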
Memory Optimization
If you encounter Out-Of-Memory errors:
Reduce chunk_size: Processes fewer rows in Pass 1
Reduce intermediate_chunk_size: Unnests fewer arrays in Pass 2
Use lower H3 resolution: Fewer cells per geometry
Process in chunks: Use the chunk_id parameter for parallel processing
Example for memory-constrained environments:
processor = H3VectorProcessor(
    input_url="s3://bucket/large-polygons.parquet",
    output_url="s3://bucket/output/",
    h3_resolution=10,
    chunk_size=100,             # Reduced from default 500
    intermediate_chunk_size=5   # Reduced from default 10
)
Output Format
Output is partitioned by resolution-0 (h0, continent-scale) H3 cells:
s3://bucket/output/
├── h0=0/
│   └── chunk_0.parquet
├── h0=1/
│   └── chunk_0.parquet
└── ...
Each parquet file contains:
h3_cell: H3 cell ID at the specified resolution
Original attributes from the input dataset
Parent H3 cells if parent_resolutions is specified
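As a quick sanity check, the partitioned output can be read back with any parquet reader that understands Hive-style partitioning. A sketch with pyarrow follows; the path and partition value are placeholders.

import pyarrow.dataset as ds

# Discover the h0=... partitions as a single dataset
dataset = ds.dataset("s3://bucket/output/", format="parquet", partitioning="hive")

# Load a single partition (value shown for illustration);
# expect h3_cell, any parent cells, and the original attributes
table = dataset.to_table(filter=ds.field("h0") == 0)
print(table.column_names)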
Examples
See the following directories for complete examples:
redlining/ - Vector polygon processing with chunking
wdpa/ - Large-scale protected areas processing
pad-us/ - Protected areas database H3 tiling