# Getting started

## Requirements
- Read: Cookiecutter Data Science Docs for a general overview
- Python 3.8+
- Conda
- Cookiecutter Python package >= 2.1.1. This can be installed with `pip` or `conda`, depending on how you manage your Python packages. For example, via `conda` (the second command also installs `conda-lock`, used later for locking the environment, into your `base` environment):

```shell
conda install -c conda-forge cookiecutter
conda install -c conda-forge conda-lock -n base
```
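To confirm that the tools are available (and that `cookiecutter` meets the version requirement), you can check their versions; both CLIs support the standard version flag:

```shell
cookiecutter --version
conda-lock --version
```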
## Creating a new project

Run:

```shell
cookiecutter https://github.com/xultaeculcis/ml-project-cookiecutter
```
You will be prompted to provide project info one argument at a time:

```
project_name [project_name]: My ML project
repo_name [my-ml-project]:
src_dir_name [my_ml_project]:
author_name [Your name (or your organization/company/team)]: xultaeculcis
repo_url [https://github.com/xultaeculcis/my-ml-project]:
project_description [A short description of the project]: Just an ML project :)
Select license:
1 - MIT
2 - Apache 2.0
3 - BSD-3-Clause
4 - Beerware
5 - GLWTS
6 - Proprietary
7 - Empty license file
Choose from 1, 2, 3, 4, 5, 6, 7 [1]: 1
```
The `repo_name`, `src_dir_name` and `repo_url` will be automatically standardized and provided for you. You can change them to your liking, though.
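If you prefer to skip the interactive prompts, `cookiecutter` also accepts values on the command line via `--no-input`; any variable you omit falls back to its default. A minimal sketch using the variable names from the transcript above:

```shell
cookiecutter https://github.com/xultaeculcis/ml-project-cookiecutter --no-input \
    project_name="My ML project" \
    author_name="xultaeculcis" \
    project_description="Just an ML project :)"
```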
## Working with the project

### Project directory structure

The resulting project structure will look like this:
```
my-ml-project/
├── data
│   ├── analysis             <- EDA artifacts.
│   ├── auxiliary            <- The auxiliary, third-party data.
│   ├── inference            <- Inference results from your models.
│   ├── interim              <- Intermediate data that has been transformed.
│   ├── processed            <- The final, canonical data sets for modeling.
│   └── raw                  <- The original, immutable data dump.
├── Dockerfile               <- Dockerfile definition.
├── docs                     <- The mkdocs documentation sources.
│   ├── api_ref              <- Source package docs.
│   │   ├── consts.md
│   │   ├── core
│   │   │   ├── configs.md
│   │   │   └── settings.md
│   │   └── utils.md
│   ├── guides               <- How-to guides.
│   │   ├── contributing.md
│   │   ├── makefile-usage.md
│   │   ├── setup-dev-env.md
│   │   └── tests.md
│   ├── index.md             <- Docs homepage.
│   └── __init__.py
├── env-dev.yaml             <- Conda environment definition with development dependencies.
├── env.yaml                 <- Main Conda environment definition with only the necessary packages.
├── LICENSE                  <- The license file.
├── Makefile                 <- Makefile with commands like `make docs` or `make pc`.
├── mkdocs.yml
├── my_ml_project            <- Project source code. This will be different
│   │                           depending on your input during project creation.
│   ├── consts               <- Constants to be used across the project.
│   │   ├── __init__.py
│   │   ├── directories.py
│   │   ├── logging.py
│   │   └── reproducibility.py
│   ├── core                 <- Core project stuff, e.g. the base classes
│   │   │                       for step entrypoint configs.
│   │   ├── configs
│   │   │   ├── __init__.py
│   │   │   ├── argument_parsing.py
│   │   │   └── base.py
│   │   ├── __init__.py
│   │   └── settings.py
│   ├── __init__.py
│   ├── py.typed
│   └── utils                <- Utility functions and classes.
│       ├── __init__.py
│       ├── gpu.py
│       ├── logging.py
│       ├── mlflow.py
│       └── serialization.py
├── notebooks                <- Jupyter notebooks. Naming convention is a
│                               number (for ordering), the creator's initials,
│                               and a short `-` delimited description, e.g.
│                               `1.0-jqp-initial-data-exploration`.
├── pyproject.toml           <- Contains build system requirements
│                               and information, which are used by pip to build
│                               the package, and project tooling configs.
├── README.md
├── setup.py
└── tests                    <- The tests directory.
    ├── conftest.py          <- Contains test fixtures and utility functions.
    ├── e2e                  <- Contains end-to-end tests.
    ├── __init__.py
    ├── integration          <- Contains integration tests.
    └── unit                 <- Contains unit tests.
```
Most of those folders are described in detail in the Cookiecutter Data Science Docs.
### Environment setup

You'll need to initialize a Git repo in your newly created project:

```shell
make git-init
```

Or:

```shell
git init
git add .
```
#### Via Makefile

Right after creating a new project from the cookiecutter template you'll need to freeze the dependencies. The initial Conda `env.yaml` has a minimal set of dependencies needed for the helper functions, test execution and docs creation.

Note that most of the conda dependencies in `env.yaml` are not pinned. This is done on purpose in order to ensure that new projects can be created with the most up-to-date packages. Once you create the lock file, you can pin specific versions.
By default, the `Makefile` only supports the `linux-64` platform. If your team works on multiple platforms, you can add those platforms to the `conda-lock` command yourself.

To lock the environment run:

```shell
make lock-file
```
After creating the lock file you can create the conda environment by running:

```shell
make env
```

This command will set up the environment for you. It will also install `pre-commit` hooks and the project in editable mode.
Once done, you can activate the environment by running:

```shell
conda activate <env-name>
```

By default, the `<env-name>` created using the `Makefile` will be equal to the `cookiecutter.repo_name` variable.
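Once the environment exists, you can inspect the exact versions that were resolved; this is handy when you want to pin specific versions in `env.yaml`, as mentioned above. The package name below is just an example:

```shell
# List everything resolved into the environment
conda list -n <env-name>

# Or check a single package (example name)
conda list -n <env-name> numpy
```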
!!! note

    If you are on Windows, the `make` command will be unavailable. We recommend working with WSL in that case.
For example, for `linux-64` the full list of commands (using the `Makefile`) would look like so:

```shell
make git-init
make lock-file
make env
conda activate <env-name>
```
!!! note

    If you want to initialize the Git repository, create the lock file and the development environment in one go, you can run:

    ```shell
    make init-project
    conda activate <env-name>
    ```
#### Manually

If you are not on Linux, the setup via the `Makefile` might not work. In that case, run the following commands manually. But before that, please determine your platform:

- `win-64`
- `osx-64`
- `osx-arm64`
- `linux-64`
- `linux-aarch64`
- `linux-ppc64le`
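If you're not sure which platform string applies to your machine, `conda info` reports it (look for the `platform` field in the output):

```shell
conda info | grep platform
```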
To set up your local env from scratch run:

1. Create the `conda-lock` file:

    ```shell
    conda-lock --mamba -f ./env.yaml -p <your-platform>
    ```

    You can also create a lock file for multiple platforms:

    ```shell
    conda-lock --mamba -f ./env.yaml -p linux-64 -p osx-arm64 -p win-64
    ```

2. Create the environment using `conda-lock`:

    ```shell
    conda-lock install --mamba -n <env-name> conda-lock.yml
    ```

3. Activate the env:

    ```shell
    conda activate <env-name>
    ```

4. Install `pre-commit` hooks:

    ```shell
    pre-commit install
    ```

5. Install the project in editable mode:

    ```shell
    pip install -e .
    ```

This will use your `conda-lock` installation to create a lock file and a brand new conda environment named after your repository.
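As a quick sanity check that the editable install worked, you can try importing the package; the module name below comes from the example project created earlier (yours will match your `src_dir_name`):

```shell
python -c "import my_ml_project; print(my_ml_project.__file__)"
```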
!!! note

    Once you've initialized the Git repo, created the lock file(s) and pinned the package versions, you should commit the changes and push them to a remote repository as an *Initial commit*.
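For example (the remote URL is a placeholder, and this assumes your default branch is `main`):

```shell
git add .
git commit -m "Initial commit"
git remote add origin <your-remote-url>
git push -u origin main
```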
### Pre-commit hooks

This project uses the `pre-commit` package for managing and maintaining pre-commit hooks.

To ensure code quality, please make sure that you have it configured.
- Install `pre-commit` and the following packages: `isort`, `black`, `flake8`, `mypy`, `pytest`.
- Install the `pre-commit` hooks by running:

    ```shell
    pre-commit install
    ```

- The command above will make Git automatically run the formatters, code checks and other steps defined in the `.pre-commit-config.yaml`.
- All of those checks will be run whenever a new commit is being created, i.e. when you run `git commit -m "blah"`.
- You can also run them manually with this command:

    ```shell
    pre-commit run --all-files
    ```

You can manually disable `pre-commit` hooks by running:

```shell
pre-commit uninstall
```

Use this only in exceptional cases.
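You can also run a single hook instead of the whole suite; for example, assuming a hook with the id `black` is defined in `.pre-commit-config.yaml`:

```shell
pre-commit run black --all-files
```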
Environment variables
Ask your colleagues for .env
files which aren't included in this repository and put them inside the repo's
root directory. Please, never put secrets in the source control. Always align with your IT department security
practices.
To see what variables you need see the .env-sample
file.
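A common workflow (assuming the sample file lists every required variable) is to copy the sample and fill in the values you received from your colleagues:

```shell
cp .env-sample .env
# then edit .env and fill in the actual values - never commit this file
```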
### CI pipelines

Currently, the project supports only Azure DevOps Pipelines.

By default, the project comes with a single CI pipeline that runs a set of simplified pre-commit hooks on each PR commit that targets the `main` branch.
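To catch failures before pushing, you can approximate what the pipeline does locally; note that the exact hook set the CI runs may differ from this sketch:

```shell
pre-commit run --all-files --show-diff-on-failure
```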
### Documentation

We use MkDocs with the Material theme.

To build the docs run:

```shell
make docs
```

If you want to verify the docs locally use:

```shell
mkdocs serve
```

The docs should then be available to you under: http://127.0.0.1:8000/
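If port 8000 is already taken, `mkdocs serve` lets you bind a different address via its standard `-a`/`--dev-addr` flag:

```shell
mkdocs serve -a 127.0.0.1:8001
```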
!!! note

    Please note that Google-style docstrings are used throughout the repo.