# Examples
Real-world examples of using the CNG Datasets toolkit.
## Vector Processing Examples

### Protected Areas (WDPA)

Large-scale protected areas processing:
```python
from cng_datasets.vector import H3VectorProcessor

processor = H3VectorProcessor(
    input_url="s3://public-wdpa/wdpa.parquet",
    output_url="s3://public-wdpa/hex/",
    h3_resolution=10,
    parent_resolutions=[9, 8, 0],
    chunk_size=500,
)
output_files = processor.process_all_chunks()
```
See: `wdpa/` directory
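`chunk_size` controls how many input features each unit of work covers. As a mental model (the helper below is illustrative, not part of the toolkit), chunked processing splits the input rows into independent ranges:

```python
from math import ceil

def chunk_ranges(total_rows: int, chunk_size: int) -> list[tuple[int, int]]:
    """Illustrative only: split a row count into (start, stop) chunks."""
    return [(start, min(start + chunk_size, total_rows))
            for start in range(0, total_rows, chunk_size)]

# A hypothetical input of 300,000 polygons with chunk_size=500:
ranges = chunk_ranges(300_000, 500)
print(len(ranges))  # 600 independent chunks
```

Each range can then be processed (and retried) on its own, which is what makes the chunked workflows below parallelizable.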
### Redlining Data

Historical redlining polygon processing:
```python
from cng_datasets.vector import H3VectorProcessor

processor = H3VectorProcessor(
    input_url="s3://public-redlining/redlining.parquet",
    output_url="s3://public-redlining/hex/",
    h3_resolution=10,
    parent_resolutions=[9, 8, 7, 0],
    chunk_size=100,
)
output_files = processor.process_all_chunks()
```
See: `redlining/` directory

### PAD-US

Protected Areas Database of the United States:
```python
from cng_datasets.vector import H3VectorProcessor

processor = H3VectorProcessor(
    input_url="s3://public-padus/padus.parquet",
    output_url="s3://public-padus/hex/",
    h3_resolution=10,
    parent_resolutions=[9, 8, 0],
)
output_files = processor.process_all_chunks()
```
See: `pad-us/` directory
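In all three vector examples, `parent_resolutions` asks the processor to also emit coarser aggregates of the fine-resolution cells. As a rough illustrative model (the helper and toy parent function below are not toolkit code; real code would use `h3-py`'s `cell_to_parent`), rolling cells up to a parent resolution groups values under their parent cell:

```python
from collections import defaultdict

def rollup(cell_to_value, parent_of):
    """Group fine-resolution cell values under their coarser parent cell.
    `parent_of` stands in for h3.cell_to_parent at a chosen resolution."""
    groups = defaultdict(list)
    for cell, value in cell_to_value.items():
        groups[parent_of(cell)].append(value)
    return dict(groups)

# Toy parent function that truncates the cell id (illustration only):
values = {"8a1": 1, "8a2": 2, "8b1": 3}
print(rollup(values, lambda c: c[:2]))  # {'8a': [1, 2], '8b': [3]}
```

Every H3 cell has exactly one parent at each coarser resolution, so each fine cell lands in exactly one group.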
## Raster Processing Examples

### Global Wetlands (GLWD)

Global wetlands raster to H3:
```python
from cng_datasets.raster import RasterProcessor

# Create COG
processor = RasterProcessor(
    input_path="/vsis3/public-wetlands/glwd.tif",
    output_cog_path="s3://public-wetlands/glwd-cog.tif",
    compression="zstd",
)
processor.create_cog()

# Convert to H3 by h0 regions
for h0_index in range(122):
    processor = RasterProcessor(
        input_path="/vsis3/public-wetlands/glwd.tif",
        output_parquet_path="s3://public-wetlands/hex/",
        h0_index=h0_index,
        h3_resolution=8,
        parent_resolutions=[0],
        value_column="wetland_class",
        nodata_value=255,
    )
    processor.process_h0_region()
```
See: `wetlands/glwd/` directory
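The loop runs 122 iterations because there are exactly 122 resolution-0 H3 cells covering the globe, and each region is processed independently. A minimal sketch of running them concurrently, with a stub in place of the real `RasterProcessor` work (for CPU-bound raster conversion you would more likely use processes, or the Kubernetes jobs in the section that follows):

```python
from concurrent.futures import ThreadPoolExecutor

def process_region(h0_index: int) -> int:
    # Stand-in for constructing a RasterProcessor and calling
    # process_h0_region(), as in the loop above.
    return h0_index

# All 122 resolution-0 regions are independent, so they can run concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    done = list(pool.map(process_region, range(122)))

print(len(done))  # 122
```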
### IUCN Range Maps

Species range maps processing:
```python
import glob
from pathlib import Path

from cng_datasets.raster import RasterProcessor

# Process each species raster
for raster_file in glob.glob("species/*.tif"):
    species = Path(raster_file).stem  # derive the species name from the filename
    processor = RasterProcessor(
        input_path=raster_file,
        output_cog_path=f"s3://public-iucn/{species}-cog.tif",
        output_parquet_path=f"s3://public-iucn/{species}/hex/",
        h3_resolution=9,
        parent_resolutions=[8, 0],
        value_column="presence",
    )
    processor.create_cog()
    processor.process_all_h0_regions()
```
See: `iucn/` directory

## Kubernetes Examples

### Vector Dataset Workflow
```bash
# Generate workflow
cng-datasets workflow \
  --dataset wdpa \
  --source-url https://d1gam3xoknrgr2.cloudfront.net/current/WDPA_WDOECM_Nov2024_Public_all_gdb.zip \
  --bucket public-wdpa \
  --h3-resolution 10 \
  --parent-resolutions "9,8,0" \
  --namespace biodiversity \
  --output-dir wdpa/k8s/

# Run workflow
kubectl apply -f wdpa/k8s/workflow-rbac.yaml
kubectl apply -f wdpa/k8s/workflow-pvc.yaml
kubectl apply -f wdpa/k8s/workflow.yaml
```
### Raster Dataset Workflow
```python
from cng_datasets.k8s import K8sJobManager

# Generate indexed job for h0 regions
manager = K8sJobManager(namespace="biodiversity")
job = manager.generate_chunked_job(
    job_name="wetlands-h3",
    script_path="/app/wetlands/glwd/job.py",
    num_chunks=122,  # One per h0 region
    base_args=[
        "--input-url", "/vsis3/public-wetlands/glwd.tif",
        "--output-url", "s3://public-wetlands/hex/",
        "--parent-resolutions", "8,0",
    ],
    parallelism=61,
    cpu="4",
    memory="34Gi",
)
manager.save_job_yaml(job, "wetlands/k8s/hex-job.yaml")
```
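A likely mechanism for a chunked job like this (an assumption about `generate_chunked_job`, not confirmed here) is a Kubernetes indexed Job, where each pod receives its chunk number through the standard `JOB_COMPLETION_INDEX` environment variable. A job script can map that straight to an h0 region index:

```python
import os

def h0_index_from_env(default: int = 0) -> int:
    """Kubernetes indexed Jobs expose the pod's completion index in
    JOB_COMPLETION_INDEX; use it directly as the h0 region to process."""
    return int(os.environ.get("JOB_COMPLETION_INDEX", default))

# Simulate the environment an indexed-Job pod would see:
os.environ["JOB_COMPLETION_INDEX"] = "17"
print(h0_index_from_env())  # 17
```

With `num_chunks=122` and `parallelism=61`, the cluster runs two waves of 61 pods, each handling one region.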
## Multi-Provider Sync

Sync datasets across cloud providers:
```python
from cng_datasets.storage import RcloneSync

syncer = RcloneSync()

# Sync from AWS to Cloudflare R2
syncer.sync(
    source="aws:public-wdpa/",
    destination="cloudflare:public-wdpa/",
    args=["--progress"],
)

# Sync from AWS to Google Cloud Storage
syncer.sync(
    source="aws:public-wdpa/",
    destination="gcs:public-wdpa/",
)
```
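`RcloneSync` presumably shells out to the `rclone` CLI. As an illustrative sketch (the helper below is hypothetical, not the toolkit's implementation), the first sync above corresponds to a command line like this:

```python
def build_rclone_cmd(source: str, destination: str, extra_args=None) -> list[str]:
    """Hypothetical helper: the rclone command an rclone-based sync would run."""
    return ["rclone", "sync", source, destination, *(extra_args or [])]

cmd = build_rclone_cmd("aws:public-wdpa/", "cloudflare:public-wdpa/", ["--progress"])
print(" ".join(cmd))  # rclone sync aws:public-wdpa/ cloudflare:public-wdpa/ --progress
```

The remote names (`aws:`, `cloudflare:`, `gcs:`) refer to remotes configured in your `rclone.conf`.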
See: Individual dataset directories for complete examples and job scripts.