# Contributing Thank you for your interest in contributing to the CNG Datasets toolkit! ## Development Setup 1. **Clone the repository:** ```bash git clone https://github.com/boettiger-lab/datasets.git cd datasets ``` 2. **Create a virtual environment:** ```bash python -m venv .venv source .venv/bin/activate # On Windows: .venv\Scripts\activate ``` 3. **Install in development mode:** ```bash pip install -e ".[dev]" ``` 4. **Install pre-commit hooks (optional):** ```bash pip install pre-commit pre-commit install ``` ## Code Style We use: - **black** for code formatting (line length: 100) - **ruff** for linting - **mypy** for type checking Format your code before committing: ```bash black cng_datasets/ ruff check cng_datasets/ mypy cng_datasets/ ``` ## Testing Run tests with pytest: ```bash # Run all tests pytest # Run with coverage pytest --cov=cng_datasets # Run specific test file pytest tests/test_vector.py # Run specific test pytest tests/test_vector.py::test_h3_processor ``` ### Writing Tests - Place tests in the `tests/` directory - Name test files `test_*.py` - Use descriptive test names: `test_processor_handles_empty_input` - Use fixtures for common setup - Mock external dependencies (S3, Kubernetes) Example: ```python import pytest from cng_datasets.vector import H3VectorProcessor def test_processor_validates_resolution(): with pytest.raises(ValueError): processor = H3VectorProcessor( input_url="test.parquet", output_url="output/", h3_resolution=20, # Invalid resolution ) ``` ## Documentation Documentation is built with Sphinx and hosted on GitHub Pages. ### Build Documentation Locally ```bash cd docs/ pip install sphinx furo myst-parser make html ``` View at `docs/_build/html/index.html` ### Documentation Guidelines - Use Markdown for user guides - Use reStructuredText for API docs - Include code examples - Add docstrings to all public functions/classes - Follow Google docstring style Example docstring: ```python def process_chunk(self, chunk_id: int) -> str: """Process a specific chunk of the dataset. Args: chunk_id: Zero-based chunk index to process Returns: Path to the output parquet file Raises: ValueError: If chunk_id is out of range Example: >>> processor = H3VectorProcessor(...) >>> output = processor.process_chunk(0) """ ``` ## Pull Request Process 1. **Create a feature branch:** ```bash git checkout -b feature/my-feature ``` 2. **Make your changes:** - Write clean, documented code - Add tests for new functionality - Update documentation 3. **Run tests and linting:** ```bash pytest black cng_datasets/ ruff check cng_datasets/ ``` 4. **Commit your changes:** ```bash git add . git commit -m "Add feature: description" ``` 5. **Push and create PR:** ```bash git push origin feature/my-feature ``` Then create a Pull Request on GitHub. ### PR Checklist - [ ] Code follows style guidelines - [ ] Tests pass - [ ] New tests added for new features - [ ] Documentation updated - [ ] CHANGELOG.md updated - [ ] Commit messages are clear ## Reporting Issues Use GitHub Issues to report bugs or request features. ### Bug Reports Include: - Description of the bug - Steps to reproduce - Expected behavior - Actual behavior - Python version and OS - Relevant logs or error messages ### Feature Requests Include: - Clear description of the feature - Use cases - Example API (if applicable) ## Code of Conduct - Be respectful and inclusive - Welcome newcomers - Focus on constructive feedback - Assume good intentions ## Questions? - Open a GitHub Issue for bugs/features - Start a Discussion for questions - Check existing issues before creating new ones ## License By contributing, you agree that your contributions will be licensed under the MIT License.