Configuration¶
Configure credentials and settings for cloud storage and processing.
S3 Credentials¶
The toolkit supports multiple authentication methods for S3 access.
Environment Variables¶
Set AWS credentials as environment variables:
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-west-2"
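If the AWS CLI is available, a quick sanity check confirms these credentials resolve to a valid identity (the CLI is not required by the toolkit; this is only a verification step):
aws sts get-caller-identity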
Using cng.utils¶
If you have the cng package installed, its helpers can set up a DuckDB connection and attach your S3 credentials to it:
from cng.utils import set_secrets, setup_duckdb_connection
con = setup_duckdb_connection()
set_secrets(con)
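Assuming setup_duckdb_connection returns a standard DuckDB connection with S3 support enabled, you can then query remote Parquet directly; the path below is a placeholder:
# Count rows in a Parquet file on S3 (placeholder path)
con.sql("SELECT COUNT(*) FROM read_parquet('s3://bucket/input.parquet')").show()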
Manual Configuration¶
Pass credentials directly to processors:
from cng_datasets.vector import H3VectorProcessor
processor = H3VectorProcessor(
    input_url="s3://bucket/input.parquet",
    output_url="s3://bucket/output/",
    read_credentials={
        "key": "ACCESS_KEY",
        "secret": "SECRET_KEY",
        "region": "us-west-2",
    },
    write_credentials={
        "key": "ACCESS_KEY",
        "secret": "SECRET_KEY",
        "region": "us-west-2",
    },
)
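Rather than hard-coding keys, you can pull them from the environment; a sketch (any credential source that yields plain strings works):
import os

creds = {
    "key": os.environ["AWS_ACCESS_KEY_ID"],
    "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
    "region": os.environ.get("AWS_DEFAULT_REGION", "us-west-2"),
}

processor = H3VectorProcessor(
    input_url="s3://bucket/input.parquet",
    output_url="s3://bucket/output/",
    read_credentials=creds,
    write_credentials=creds,
)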
Kubernetes Secrets¶
For Kubernetes workflows, store the credentials in a Secret and expose them to jobs as environment variables:
# Create secret
kubectl create secret generic aws-credentials \
--from-literal=AWS_ACCESS_KEY_ID=your-key \
--from-literal=AWS_SECRET_ACCESS_KEY=your-secret \
-n biodiversity
# Reference in job
env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_SECRET_ACCESS_KEY
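Alternatively, envFrom injects every key in the Secret as an environment variable in a single stanza (standard Kubernetes syntax, shown as a sketch of the same job spec):
envFrom:
  - secretRef:
      name: aws-credentials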
Rclone Configuration¶
Configure rclone for syncing between cloud providers.
Configuration File¶
Create ~/.config/rclone/rclone.conf:
[aws]
type = s3
provider = AWS
access_key_id = your-access-key
secret_access_key = your-secret-key
region = us-west-2
[cloudflare]
type = s3
provider = Cloudflare
access_key_id = your-r2-access-key
secret_access_key = your-r2-secret-key
endpoint = https://your-account-id.r2.cloudflarestorage.com
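Before syncing, you can confirm both remotes are configured and reachable with rclone itself:
rclone listremotes
rclone lsd aws:
rclone lsd cloudflare: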
Python API¶
from cng_datasets.storage import RcloneSync
# Use default config
syncer = RcloneSync()
# Or specify custom config
syncer = RcloneSync(config_path="/path/to/rclone.conf")
# Sync between remotes
syncer.sync(
source="aws:public-dataset/",
destination="cloudflare:public-dataset/"
)
Command-Line¶
cng-datasets storage sync \
--source aws:bucket/data \
--destination cloudflare:bucket/data
Bucket CORS Configuration¶
Configure CORS for public bucket access:
from cng_datasets.storage import configure_bucket_cors
configure_bucket_cors(
bucket="my-public-bucket",
endpoint="https://s3.amazonaws.com"
)
Or use the command line:
cng-datasets storage cors \
--bucket my-public-bucket \
--endpoint https://s3.amazonaws.com
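To verify the CORS rules took effect, request an object with an Origin header and look for Access-Control-Allow-Origin in the response headers (a generic curl check; substitute a real object path):
curl -s -o /dev/null -D - -H "Origin: https://example.org" \
  https://my-public-bucket.s3.amazonaws.com/path/to/object | grep -i access-control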
Docker Configuration¶
Build Custom Image¶
FROM ghcr.io/boettiger-lab/datasets:latest
# Add custom dependencies
RUN pip install my-package
# Copy custom scripts
COPY scripts/ /app/
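Build and tag the image as usual (the tag name is arbitrary):
docker build -t my-datasets:latest .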
Mount Credentials¶
# Mount AWS credentials
docker run --rm \
-v ~/.aws:/root/.aws:ro \
-v $(pwd):/data \
ghcr.io/boettiger-lab/datasets:latest \
cng-datasets raster --input /data/input.tif
# Use environment variables
docker run --rm \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-v $(pwd):/data \
ghcr.io/boettiger-lab/datasets:latest \
cng-datasets raster --input /data/input.tif
GDAL Configuration¶
Raster processing relies on GDAL. The settings below control how GDAL reads from and writes to remote storage.
Virtual File Systems¶
Use /vsis3/ for direct S3 access:
from cng_datasets.raster import RasterProcessor
processor = RasterProcessor(
input_path="/vsis3/bucket/data.tif", # Direct S3 access
output_cog_path="s3://bucket/output.tif",
)
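The /vsis3/ driver reads the same AWS_* environment variables described above. If the GDAL command-line tools are installed, a quick way to confirm access is:
gdalinfo /vsis3/bucket/data.tif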
GDAL Options¶
Configure GDAL behavior through environment variables:
import os

# Skip directory listings when opening remote files (faster S3/HTTP access)
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
# Restrict remote (vsicurl/vsis3) reads to these file extensions
os.environ["CPL_VSIL_CURL_ALLOWED_EXTENSIONS"] = ".tif,.tiff"
# Retry transient HTTP failures: up to 3 attempts, 5 seconds apart
os.environ["GDAL_HTTP_MAX_RETRY"] = "3"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "5"
Performance Tuning¶
Memory Settings¶
# Vector processing
processor = H3VectorProcessor(
input_url="s3://bucket/data.parquet",
output_url="s3://bucket/output/",
chunk_size=100, # Reduce for memory-constrained environments
intermediate_chunk_size=5,
)
# Raster processing
from cng_datasets.raster import RasterProcessor
processor = RasterProcessor(
input_path="data.tif",
blocksize=256, # Smaller tiles for less memory
)
Parallelism¶
# Kubernetes job parallelism
cng-datasets workflow \
--dataset my-dataset \
--parallelism 50 # Adjust based on cluster capacity
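To see whether the cluster keeps up with the chosen parallelism, watch the resulting pods (namespace as in the secret example above):
kubectl get pods -n biodiversity -w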
Compression¶
# Vector output
processor = H3VectorProcessor(
input_url="s3://bucket/data.parquet",
output_url="s3://bucket/output/",
compression="zstd", # or "snappy", "gzip"
)
# COG compression
processor = RasterProcessor(
input_path="data.tif",
output_cog_path="s3://bucket/output.tif",
compression="zstd", # or "deflate", "lzw"
)