Reproducibility of results
This guide contains the steps necessary to reproduce the competition results.
Before running any of the commands in this section, please make sure you have configured your local development environment by following the development environment setup guide.
You have two options for reproducing the results:
- Running from scratch
- Using model checkpoints and generated dataset files
I want to train from scratch
If you want to fully reproduce the solution results starting from the raw training data, follow these steps.
Preparing the data
To prepare the data, follow the steps outlined in this section.
Download the data
Download and extract the data to the following directories in the root of the repo:
kelp-wanted-competition/
└── data
└── raw
├── train
│ ├── images <= place training images here
│ └── masks <= place training masks here
├── test
│ └── images <= place test images here
└── metadata_fTq0l2T.csv <= place the metadata file directly in the `raw` dir
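Assuming the downloaded archives sit in the repo root, the layout can be created with a sketch like the following (the archive names are placeholders; use the actual file names from the competition platform):

```bash
# Placeholder archive names - substitute the actual downloads.
mkdir -p data/raw/train/images data/raw/train/masks data/raw/test/images
unzip train_images.zip -d data/raw/train/images   # training images
unzip train_masks.zip  -d data/raw/train/masks    # training masks
unzip test_images.zip  -d data/raw/test/images    # test images
mv metadata_fTq0l2T.csv data/raw/                 # metadata file goes directly in `raw`
```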
Run in order:

- Plot samples - for a better understanding of the data and quick visual inspection:
  `make sample-plotting`
- AOI Grouping - groups similar images into AOIs and uses those groups to generate CV folds:
  `make aoi-grouping`
- EDA - runs Exploratory Data Analysis to visualize statistical distributions of different image features:
  `make eda`
- Calculate band statistics - calculates per-band statistics such as min, max, mean, and std (including spectral indices):
  `make calculate-band-stats`
- Train-Val-Test split with Stratified K-Fold Cross-Validation:
  `make train-val-test-split-cv`
The generated `train_val_test_dataset_strategy=cross_val.parquet` metadata lookup file and the `YYYY-MM-DD-Thh:mm:ss-stats-fill_value=nan-mask_using_qa=True-mask_using_water_mask=True.json` stats file will be used as inputs for the training scripts, both locally and in the Azure ML Pipelines.
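For convenience, the whole preparation sequence can be chained in one go (a sketch; each target can also be run individually as listed above):

```bash
# Run the full data preparation pipeline in order; stop on the first failure.
make sample-plotting && \
make aoi-grouping && \
make eda && \
make calculate-band-stats && \
make train-val-test-split-cv
```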
Training the models
For model training you have two options: train the models on Azure ML or train them locally.
Via Azure ML (AML)
Note: You'll need an Azure Subscription and an Azure DevOps Organization for this option, as well as basic knowledge of Azure services such as Entra ID, Blob Storage, and Azure ML.
- Create an Azure ML Workspace
- Set up a Service Principal with access to the Azure ML Workspace
- Set up an Azure DevOps variable group (see .env-sample for the required variables)
- Set up Service Connections for GitHub, the AML Workspace, and the ARM Resource Group (use the Service Principal created earlier)
- Set up Azure DevOps Pipelines
- In Azure ML, set up the following:
  - Datasets datastore
  - Training Dataset Data Asset - upload the training data to Blob Storage and register it as a Folder Asset
  - Dataset Stats Data Asset - upload the stats file to Blob Storage and register it as a File Asset
  - Dataset Metadata Data Asset - upload the metadata parquet file (generated by the `train-val-test-split-cv` Makefile command) to Blob Storage and register it as a File Asset
  - Compute Clusters with spot instances
  - Training Environment
- Once done, modify the versions and names in the AML components and AML pipelines to match the resource names you have just created. I recommend using the Azure ML CLI to set them up from the terminal; see the yaml files in the aml folder for details.
- You can now trigger the Azure ML Hyperparameter Search or Model Training Pipelines via Azure DevOps Pipelines.
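For illustration, registering the data assets and a spot-instance compute cluster with the Azure ML CLI (v2) could look roughly like this (asset names, versions, paths, and the VM size are assumptions; substitute your own):

```bash
# Hypothetical names, paths, and VM size - substitute your own resources.
az ml data create --name kelp-train-data --version 1 \
  --type uri_folder --path ./data/raw/train \
  --workspace-name <aml-workspace> --resource-group <resource-group>

az ml data create --name kelp-dataset-stats --version 1 \
  --type uri_file --path ./data/processed/<stats-file>.json \
  --workspace-name <aml-workspace> --resource-group <resource-group>

# Spot (low-priority) compute cluster for training.
az ml compute create --name gpu-spot-cluster --type AmlCompute \
  --size Standard_NC6s_v3 --min-instances 0 --max-instances 2 \
  --tier low_priority \
  --workspace-name <aml-workspace> --resource-group <resource-group>
```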
Locally
- Run training for all folds:
  `make train-all-folds`
- Run training for a single fold:
  `make train FOLD_NUMBER=<fold-number>`

See the Makefile definition for all available options.
Note: Both the Azure ML Pipelines and the Makefile commands are already configured with the hyperparameters used for the best model submission.
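For example, assuming the folds are numbered 0-9 (a 10-fold CV split), the single-fold target can be looped over, roughly mimicking `make train-all-folds`:

```bash
# Assumes folds numbered 0-9; roughly equivalent to `make train-all-folds`.
for fold in $(seq 0 9); do
  make train FOLD_NUMBER="${fold}"
done
```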
Making predictions
Once the models have been trained, you can generate submission files by adjusting the `run_dir` paths in the Makefile and running the following commands:
Single model
Run:
`make predict-and-submit`
The submission directory will be created under `data/submissions/single-model`.
Ensemble
Running ensemble prediction is just as easy with the following command:
`make cv-predict`
The submission directories will be created under `data/submissions/avg`.
This will run prediction with each fold's model individually and then average the predictions using the weights specified in the Makefile. The weights and decision thresholds in the Makefile commands already match those of the winning submissions; you only need to adjust the fold run directory paths.
Note: All folds were used for the best submissions, so you'll need to train every fold!
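Alternatively, instead of editing the Makefile, `make` accepts variable overrides on the command line; the variable name below is hypothetical, so check the Makefile for the real one:

```bash
# Hypothetical variable name - check the Makefile for the actual one.
make predict-and-submit RUN_DIR=<path-to-your-run-dir>
```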
I want to use model checkpoints
If you just want to reproduce the final submissions without running everything from scratch, follow the steps in this section.
Download models and dataset
Download the following files.
Download checkpoints
The full model training run directories are hosted here:
- Best single model (private LB 0.7264): best-single-model.zip
- Best submission #1 (private LB 0.7318): top-submission-1.zip
- Best submission #2 (private LB 0.7318): top-submission-2.zip
Download them and extract them to the `models` directory (a download-and-extract sketch follows the tree below). The final directory structure should look like this:
models/
├── best-single-model
│ └── Job_sincere_tangelo_dm0xsbhc_OutputsAndLogs
├── top-submission-1
│ ├── Job_elated_atemoya_31s98pwg_OutputsAndLogs
│ ├── Job_hungry_loquat_qkrw2n2p_OutputsAndLogs
│ ├── Job_icy_market_4l11bvw2_OutputsAndLogs
│ ├── Job_keen_evening_3xnlbrsr_OutputsAndLogs
│ ├── Job_model_training_exp_65_OutputsAndLogs
│ ├── Job_model_training_exp_67_OutputsAndLogs
│ ├── Job_nice_cheetah_grnc5x72_OutputsAndLogs
│ ├── Job_strong_door_yrq9zpmd_OutputsAndLogs
│ ├── Job_willing_pin_72ss6cnc_OutputsAndLogs
│ └── Job_yellow_evening_cmy9cnv7_OutputsAndLogs
└── top-submission-2
├── Job_boring_foot_hb224t08_OutputsAndLogs
├── Job_coral_lion_x39ft9cb_OutputsAndLogs
├── Job_dreamy_nut_fkwzmgxh_OutputsAndLogs
├── Job_icy_airport_7r8h9q3c_OutputsAndLogs
├── Job_lemon_drop_cxncbygc_OutputsAndLogs
├── Job_loving_insect_hvd7v5p9_OutputsAndLogs
├── Job_plum_angle_0f163gk5_OutputsAndLogs
├── Job_plum_kettle_36dw15zk_OutputsAndLogs
├── Job_tender_foot_07bt1687_OutputsAndLogs
└── Job_wheat_tongue_mjzjpvjw_OutputsAndLogs
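A minimal download-and-extract sketch (fetch the zips via the links above first; assumes each archive contains its top-level directory):

```bash
# Assumes each zip was downloaded via the links above and contains
# its top-level directory (e.g. best-single-model/...).
mkdir -p models
for archive in best-single-model top-submission-1 top-submission-2; do
  unzip "${archive}.zip" -d models/
done
```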
Download competition data
Download the competition data as described in the Download the data section above.
Download prepped files
If you don't want to run all the data preparation steps, you can download the metadata and dataset stats files from here:
- Dataset metadata (train, val, and test 10-fold CV split): train_val_test_dataset.parquet - place it in the `data/processed` directory
- Dataset per-band statistics: ds-stats.zip - extract it and place the JSON file in the `data/processed` directory
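Placing the files could look like this (a sketch, assuming ds-stats.zip holds the stats JSON at its top level):

```bash
# Place the prepped files into data/processed.
mkdir -p data/processed
mv train_val_test_dataset.parquet data/processed/
unzip ds-stats.zip -d data/processed/   # assumes the JSON sits at the zip's top level
```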
The final `data` directory should have the following structure:
data/
├── auxiliary
├── interim
├── predictions <= single model predictions from ensembles will be saved here
├── processed
│ ├── 2023-12-31T20:30:39-stats-fill_value=nan-mask_using_qa=True-mask_using_water_mask=True.json
│ └── train_val_test_dataset.parquet
├── raw
│ ├── train
│ │ ├── images <= place training images here
│ │ └── masks <= place training masks here
│ ├── test
│ │ └── images <= place test images here
│ └── metadata_fTq0l2T.csv <= place the metadata file directly in the `raw` dir
└── submissions
├── avg <= ensemble submissions will be saved here
└── single-model <= single model submissions will be saved here
Making predictions
To reproduce the best submissions, follow these steps.
NOTE: The Makefile commands expect the model and dataset files to be in the correct directories!
Single model
Run:
`make repro-best-single-model-submission`
The submission directory will be created under `data/submissions/single-model`.
Ensemble
To reproduce the top #1 submission (private LB score 0.7318), run:
`make repro-top-1-submission`
To reproduce the top #2 submission (private LB score 0.7318), run:
`make repro-top-2-submission`
The submission directories will be created under `data/submissions/avg`. Individual model predictions will be saved under `data/predictions/top-1-submission` and `data/predictions/top-2-submission`.
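As a quick sanity check, you can list the generated artifacts afterwards:

```bash
# Verify that the submission and per-model prediction outputs exist.
ls data/submissions/avg
ls data/predictions/top-1-submission data/predictions/top-2-submission
```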