# Configuration

Configure credentials and settings for cloud storage and processing.

## S3 Credentials

The toolkit supports multiple authentication methods for S3 access.

### Environment Variables

Set AWS credentials as environment variables:

```bash
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-west-2"
```

### Using cng.utils

If you have the `cng` package installed:

```python
from cng.utils import set_secrets, setup_duckdb_connection

con = setup_duckdb_connection()
set_secrets(con)
```

### Manual Configuration

Pass credentials directly to processors:

```python
from cng_datasets.vector import H3VectorProcessor

processor = H3VectorProcessor(
    input_url="s3://bucket/input.parquet",
    output_url="s3://bucket/output/",
    read_credentials={
        "key": "ACCESS_KEY",
        "secret": "SECRET_KEY",
        "region": "us-west-2"
    },
    write_credentials={
        "key": "ACCESS_KEY",
        "secret": "SECRET_KEY",
        "region": "us-west-2"
    }
)
```

### Kubernetes Secrets

For Kubernetes workflows, store credentials in a Secret and expose them to job pods as environment variables:

```bash
# Create the secret
kubectl create secret generic aws-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=your-key \
  --from-literal=AWS_SECRET_ACCESS_KEY=your-secret \
  -n biodiversity
```

Reference the secret in the job's environment:

```yaml
env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_SECRET_ACCESS_KEY
```

## Rclone Configuration

Configure rclone for syncing between cloud providers.

### Configuration File

Create `~/.config/rclone/rclone.conf`:

```ini
[aws]
type = s3
provider = AWS
access_key_id = your-access-key
secret_access_key = your-secret-key
region = us-west-2

[cloudflare]
type = s3
provider = Cloudflare
access_key_id = your-r2-access-key
secret_access_key = your-r2-secret-key
endpoint = https://your-account-id.r2.cloudflarestorage.com
```

### Python API

```python
from cng_datasets.storage import RcloneSync

# Use default config
syncer = RcloneSync()

# Or specify custom config
syncer = RcloneSync(config_path="/path/to/rclone.conf")

# Sync between remotes
syncer.sync(
    source="aws:public-dataset/",
    destination="cloudflare:public-dataset/"
)
```

### Command-Line

```bash
cng-datasets storage sync \
  --source aws:bucket/data \
  --destination cloudflare:bucket/data
```

## Bucket CORS Configuration

Configure CORS for public bucket access:

```python
from cng_datasets.storage import configure_bucket_cors

configure_bucket_cors(
    bucket="my-public-bucket",
    endpoint="https://s3.amazonaws.com"
)
```

Or use the command line:

```bash
cng-datasets storage cors \
  --bucket my-public-bucket \
  --endpoint https://s3.amazonaws.com
```
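If you need to set or inspect a bucket's CORS rule outside the toolkit, an equivalent policy can be applied directly with boto3. The sketch below is illustrative only: the bucket name, endpoint, and rule values are placeholders and may differ from the defaults `configure_bucket_cors` applies.

```python
# Hypothetical sketch: apply a permissive read-only CORS rule with boto3.
# Bucket name, endpoint, and rule values are illustrative placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.amazonaws.com")
s3.put_bucket_cors(
    Bucket="my-public-bucket",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedOrigins": ["*"],            # any origin may read
                "AllowedMethods": ["GET", "HEAD"],  # read-only access
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3600,
            }
        ]
    },
)
```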
## Docker Configuration

### Build Custom Image

```dockerfile
FROM ghcr.io/boettiger-lab/datasets:latest

# Add custom dependencies
RUN pip install my-package

# Copy custom scripts
COPY scripts/ /app/
```

### Mount Credentials

```bash
# Mount AWS credentials
docker run --rm \
  -v ~/.aws:/root/.aws:ro \
  -v $(pwd):/data \
  ghcr.io/boettiger-lab/datasets:latest \
  cng-datasets raster --input /data/input.tif

# Use environment variables
docker run --rm \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -v $(pwd):/data \
  ghcr.io/boettiger-lab/datasets:latest \
  cng-datasets raster --input /data/input.tif
```

## GDAL Configuration

For raster processing with GDAL:

### Virtual File Systems

Use `/vsis3/` for direct S3 access:

```python
from cng_datasets.raster import RasterProcessor

processor = RasterProcessor(
    input_path="/vsis3/bucket/data.tif",  # Direct S3 access
    output_cog_path="s3://bucket/output.tif",
)
```

### GDAL Options

Configure GDAL behavior:

```python
import os

# Set GDAL options
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
os.environ["CPL_VSIL_CURL_ALLOWED_EXTENSIONS"] = ".tif,.tiff"
os.environ["GDAL_HTTP_MAX_RETRY"] = "3"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "5"
```

## Performance Tuning

### Memory Settings

```python
# Vector processing
processor = H3VectorProcessor(
    input_url="s3://bucket/data.parquet",
    output_url="s3://bucket/output/",
    chunk_size=100,  # Reduce for memory-constrained environments
    intermediate_chunk_size=5,
)

# Raster processing
from cng_datasets.raster import RasterProcessor

processor = RasterProcessor(
    input_path="data.tif",
    blocksize=256,  # Smaller tiles for less memory
)
```

### Parallelism

```bash
# Kubernetes job parallelism
cng-datasets workflow \
  --dataset my-dataset \
  --parallelism 50  # Adjust based on cluster capacity
```

### Compression

```python
# Vector output
processor = H3VectorProcessor(
    input_url="s3://bucket/data.parquet",
    output_url="s3://bucket/output/",
    compression="zstd",  # or "snappy", "gzip"
)

# COG compression
processor = RasterProcessor(
    input_path="data.tif",
    output_cog_path="s3://bucket/output.tif",
    compression="zstd",  # or "deflate", "lzw"
)
```
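To confirm which codec actually ended up in the vector output, the Parquet footer records it. Below is a minimal sketch using pyarrow; the file path is a placeholder for one of the written output files, and pyarrow is assumed to be available in your environment.

```python
# Sketch: read the compression codec recorded in a Parquet file's metadata.
# "part-0.parquet" is a placeholder path; point it at a real output file.
import pyarrow.parquet as pq

meta = pq.ParquetFile("part-0.parquet").metadata
for rg in range(meta.num_row_groups):
    codec = meta.row_group(rg).column(0).compression
    print(f"row group {rg}: compression={codec}")
```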