Contributing

Thank you for your interest in contributing to the CNG Datasets toolkit!

Development Setup

  1. Clone the repository:

git clone https://github.com/boettiger-lab/datasets.git
cd datasets

  2. Create a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

  3. Install in development mode:

pip install -e ".[dev]"

  4. Install pre-commit hooks (optional):

pip install pre-commit
pre-commit install

Code Style

We use:

  • black for code formatting (line length: 100)

  • ruff for linting

  • mypy for type checking

Format and check your code before committing:

black cng_datasets/
ruff check cng_datasets/
mypy cng_datasets/

Testing

Run tests with pytest:

# Run all tests
pytest

# Run with coverage
pytest --cov=cng_datasets

# Run specific test file
pytest tests/test_vector.py

# Run specific test
pytest tests/test_vector.py::test_h3_processor

Writing Tests

  • Place tests in the tests/ directory

  • Name test files test_*.py

  • Use descriptive test names: test_processor_handles_empty_input

  • Use fixtures for common setup

  • Mock external dependencies (S3, Kubernetes); see the fixture and mocking sketch after the example below

Example:

import pytest
from cng_datasets.vector import H3VectorProcessor


def test_processor_validates_resolution():
    with pytest.raises(ValueError):
        H3VectorProcessor(
            input_url="test.parquet",
            output_url="output/",
            h3_resolution=20,  # Invalid: H3 resolutions range from 0 to 15
        )
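
To exercise fixtures and mocked external services, here is a minimal sketch; the sample_paths and fake_s3 fixtures and the upload_outputs helper are illustrative stand-ins, not part of the toolkit's API:

from unittest.mock import MagicMock

import pytest


@pytest.fixture
def sample_paths():
    # Shared setup reused across tests
    return ["chunk0.parquet", "chunk1.parquet"]


@pytest.fixture
def fake_s3():
    # Stand-in for a boto3 S3 client so tests never touch the network
    return MagicMock()


def upload_outputs(paths, s3_client, bucket):
    # Hypothetical helper representing code that writes results to S3
    for path in paths:
        s3_client.upload_file(path, bucket, f"output/{path}")


def test_upload_outputs_calls_s3(sample_paths, fake_s3):
    upload_outputs(sample_paths, fake_s3, "my-bucket")
    assert fake_s3.upload_file.call_count == 2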

Documentation

Documentation is built with Sphinx and hosted on GitHub Pages.

Build Documentation Locally

cd docs/
pip install sphinx furo myst-parser
make html

View at docs/_build/html/index.html

Documentation Guidelines

  • Use Markdown for user guides

  • Use reStructuredText for API docs

  • Include code examples

  • Add docstrings to all public functions/classes

  • Follow Google docstring style

Example docstring:

def process_chunk(self, chunk_id: int) -> str:
    """Process a specific chunk of the dataset.
    
    Args:
        chunk_id: Zero-based chunk index to process
        
    Returns:
        Path to the output parquet file
        
    Raises:
        ValueError: If chunk_id is out of range
        
    Example:
        >>> processor = H3VectorProcessor(...)
        >>> output = processor.process_chunk(0)
    """

Pull Request Process

  1. Create a feature branch:

git checkout -b feature/my-feature

  2. Make your changes:

    • Write clean, documented code

    • Add tests for new functionality

    • Update documentation

  3. Run tests and linting:

pytest
black cng_datasets/
ruff check cng_datasets/

  4. Commit your changes:

git add .
git commit -m "Add feature: description"

  5. Push and create PR:

git push origin feature/my-feature

Then create a Pull Request on GitHub.

PR Checklist

  • Code follows style guidelines

  • Tests pass

  • New tests added for new features

  • Documentation updated

  • CHANGELOG.md updated

  • Commit messages are clear

Reporting Issues

Use GitHub Issues to report bugs or request features.

Bug Reports

Include:

  • Description of the bug

  • Steps to reproduce

  • Expected behavior

  • Actual behavior

  • Python version and OS (see the snippet after this list)

  • Relevant logs or error messages
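
If helpful, the Python version and OS can be collected with a short standard-library snippet:

import platform
import sys

print("Python:", sys.version)
print("OS:", platform.platform())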

Feature Requests

Include:

  • Clear description of the feature

  • Use cases

  • Example API (if applicable)

Code of Conduct

  • Be respectful and inclusive

  • Welcome newcomers

  • Focus on constructive feedback

  • Assume good intentions

Questions?

  • Open a GitHub Issue for bugs/features

  • Start a Discussion for questions

  • Check existing issues before creating new ones

License

By contributing, you agree that your contributions will be licensed under the MIT License.