Reproducibility of results
This guide contains the steps necessary to reproduce the competition results.
Before running any of the commands in this section, please make sure you have configured your local development environment by following the development environment setup guide.
You have two options for reproducing the results:
- Running from scratch
- Using model checkpoints and generated dataset files
I want to train from scratch
If you want to fully reproduce the solution results starting from the raw training data, follow these steps.
Preparing the data
To prepare the data, follow the steps outlined in this section.
Download the data
Download and extract the data to the following directories in the root of the repo:
kelp-wanted-competition/
└── data
└── raw
├── train
│ ├── images <= place training images here
│ └── masks <= place training masks here
├── test
│ └── images <= place test images here
└── metadata_fTq0l2T.csv <= place the metadata file directly in the `raw` dir
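Assuming the downloaded archives sit in the repo root, the layout can be created with a sketch like the following (the archive names are placeholders; use the actual file names from the competition platform):

```bash
# Placeholder archive names - substitute the actual downloads.
mkdir -p data/raw/train/images data/raw/train/masks data/raw/test/images
unzip train_images.zip -d data/raw/train/images   # training images
unzip train_masks.zip  -d data/raw/train/masks    # training masks
unzip test_images.zip  -d data/raw/test/images    # test images
mv metadata_fTq0l2T.csv data/raw/                 # metadata file goes directly in `raw`
```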
Run in order:

- Plot samples - for a better understanding of the data and quick visual inspection:
  `make sample-plotting`
- AOI Grouping - groups similar images into AOIs and uses those groups to generate CV folds:
  `make aoi-grouping`
- EDA - runs Exploratory Data Analysis to visualize statistical distributions of different image features:
  `make eda`
- Calculate band statistics - calculates per-band statistics such as min, max, mean, and std (including spectral indices):
  `make calculate-band-stats`
- Train-Val-Test split with Stratified K-Fold Cross-Validation:
  `make train-val-test-split-cv`
The generated `train_val_test_dataset_strategy=cross_val.parquet` metadata lookup file and the `YYYY-MM-DD-Thh:mm:ss-stats-fill_value=nan-mask_using_qa=True-mask_using_water_mask=True.json` stats file will be used as inputs for the training scripts, both locally and in the Azure ML Pipelines.
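For convenience, the whole preparation sequence can be chained in one go (a sketch; each target can also be run individually as listed above):

```bash
# Run the full data preparation pipeline in order; stop on the first failure.
make sample-plotting && \
make aoi-grouping && \
make eda && \
make calculate-band-stats && \
make train-val-test-split-cv
```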
Training the models
For model training you have two options: train the models on Azure ML or train them locally.
Via Azure ML (AML)
Note: You'll need an Azure Subscription and an Azure DevOps Organization for this option, as well as basic knowledge of Azure services such as Entra ID, Blob Storage, and Azure ML.
- Create an Azure ML Workspace
- Set up a Service Principal with access to the Azure ML Workspace
- Set up an Azure DevOps variable group (see .env-sample for the required variables)
- Set up Service Connections for GitHub, the AML Workspace, and the ARM Resource Group (use the Service Principal created earlier)
- Set up Azure DevOps Pipelines
- In Azure ML, set up the following:
  - Datasets datastore
  - Training Dataset Data Asset - upload the training data to Blob Storage and register it as a Folder Asset
  - Dataset Stats Data Asset - upload the stats file to Blob Storage and register it as a File Asset
  - Dataset Metadata Data Asset - upload the metadata parquet file (generated by the `train-val-test-split-cv` Makefile command) to Blob Storage and register it as a File Asset
  - Compute Clusters with spot instances
  - Training Environment
- Once done, modify the versions and names in the AML components and AML pipelines to match the resource names you have just created. I recommend using the Azure ML CLI to set them up from the terminal; see the yaml files in the aml folder for details.
- You can now trigger the Azure ML Hyperparameter Search or Model Training Pipelines via Azure DevOps Pipelines.
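For illustration, registering the data assets and a spot-instance compute cluster with the Azure ML CLI (v2) could look roughly like this (asset names, versions, paths, and the VM size are assumptions; substitute your own):

```bash
# Hypothetical names, paths, and VM size - substitute your own resources.
az ml data create --name kelp-train-data --version 1 \
  --type uri_folder --path ./data/raw/train \
  --workspace-name <aml-workspace> --resource-group <resource-group>

az ml data create --name kelp-dataset-stats --version 1 \
  --type uri_file --path ./data/processed/<stats-file>.json \
  --workspace-name <aml-workspace> --resource-group <resource-group>

# Spot (low-priority) compute cluster for training.
az ml compute create --name gpu-spot-cluster --type AmlCompute \
  --size Standard_NC6s_v3 --min-instances 0 --max-instances 2 \
  --tier low_priority \
  --workspace-name <aml-workspace> --resource-group <resource-group>
```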
Locally
- Run training for all folds:
  `make train-all-folds`
- Run training for a single fold:
  `make train FOLD_NUMBER=<fold-number>`

See the Makefile definition for all available options.
Note: Both the Azure ML Pipelines and the Makefile commands are already configured with the hyperparameters used for the best model submission.
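For example, assuming the folds are numbered 0-9 (a 10-fold CV split), the single-fold target can be looped over, roughly mimicking `make train-all-folds`:

```bash
# Assumes folds numbered 0-9; roughly equivalent to `make train-all-folds`.
for fold in $(seq 0 9); do
  make train FOLD_NUMBER="${fold}"
done
```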
Making predictions
Once the models have been trained, you can generate submission files by adjusting the `run_dir` paths in the Makefile and running the following commands:
Single model
Run:
`make predict-and-submit`
The submission directory will be created under `data/submissions/single-model`.
Ensemble
Running ensemble prediction is just as easy with the following command:
`make cv-predict`
The submission directories will be created under `data/submissions/avg`.
This will run prediction with each fold's model individually and then average the predictions using the weights specified in the Makefile. The weights and decision thresholds in the Makefile commands already match those of the winning submissions; you only need to adjust the fold run directory paths.
Note: All folds were used for the best submissions, so you'll need to train every fold!
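Alternatively, instead of editing the Makefile, `make` accepts variable overrides on the command line; the variable name below is hypothetical, so check the Makefile for the real one:

```bash
# Hypothetical variable name - check the Makefile for the actual one.
make predict-and-submit RUN_DIR=<path-to-your-run-dir>
```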
I want to use model checkpoints
If you just want to reproduce the final submissions without running everything from scratch, follow the steps in this section.
Download models and dataset
Download the following files.
Download checkpoints
The full model training run directories are hosted here:
- Best single model (private LB 0.7264): best-single-model.zip
- Best submission #1 (private LB 0.7318): top-submission-1.zip
- Best submission #2 (private LB 0.7318): top-submission-2.zip
Download them and extract them to the `models` directory (a download-and-extract sketch follows the tree below). The final directory structure should look like this:
models/
├── best-single-model
│ └── Job_sincere_tangelo_dm0xsbhc_OutputsAndLogs
├── top-submission-1
│ ├── Job_elated_atemoya_31s98pwg_OutputsAndLogs
│ ├── Job_hungry_loquat_qkrw2n2p_OutputsAndLogs
│ ├── Job_icy_market_4l11bvw2_OutputsAndLogs
│ ├── Job_keen_evening_3xnlbrsr_OutputsAndLogs
│ ├── Job_model_training_exp_65_OutputsAndLogs
│ ├── Job_model_training_exp_67_OutputsAndLogs
│ ├── Job_nice_cheetah_grnc5x72_OutputsAndLogs
│ ├── Job_strong_door_yrq9zpmd_OutputsAndLogs
│ ├── Job_willing_pin_72ss6cnc_OutputsAndLogs
│ └── Job_yellow_evening_cmy9cnv7_OutputsAndLogs
└── top-submission-2
├── Job_boring_foot_hb224t08_OutputsAndLogs
├── Job_coral_lion_x39ft9cb_OutputsAndLogs
├── Job_dreamy_nut_fkwzmgxh_OutputsAndLogs
├── Job_icy_airport_7r8h9q3c_OutputsAndLogs
├── Job_lemon_drop_cxncbygc_OutputsAndLogs
├── Job_loving_insect_hvd7v5p9_OutputsAndLogs
├── Job_plum_angle_0f163gk5_OutputsAndLogs
├── Job_plum_kettle_36dw15zk_OutputsAndLogs
├── Job_tender_foot_07bt1687_OutputsAndLogs
└── Job_wheat_tongue_mjzjpvjw_OutputsAndLogs
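A minimal download-and-extract sketch (fetch the zips via the links above first; assumes each archive contains its top-level directory):

```bash
# Assumes each zip was downloaded via the links above and contains
# its top-level directory (e.g. best-single-model/...).
mkdir -p models
for archive in best-single-model top-submission-1 top-submission-2; do
  unzip "${archive}.zip" -d models/
done
```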
Download competition data
Download the competition data as described in the Download the data section above.
Download prepped files
If you don't want to run all the data preparation steps, you can download the metadata and dataset stats files from here:
- Dataset metadata (train, val, and test 10-fold CV split): train_val_test_dataset.parquet - place it in the `data/processed` directory
- Dataset per-band statistics: ds-stats.zip - extract it and place the JSON file in the `data/processed` directory
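Placing the files could look like this (a sketch, assuming ds-stats.zip holds the stats JSON at its top level):

```bash
# Place the prepped files into data/processed.
mkdir -p data/processed
mv train_val_test_dataset.parquet data/processed/
unzip ds-stats.zip -d data/processed/   # assumes the JSON sits at the zip's top level
```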
The final `data` directory should have the following structure:
data/
├── auxiliary
├── interim
├── predictions <= single model predictions from ensembles will be saved here
├── processed
│ ├── 2023-12-31T20:30:39-stats-fill_value=nan-mask_using_qa=True-mask_using_water_mask=True.json
│ └── train_val_test_dataset.parquet
├── raw
│ ├── train
│ │ ├── images <= place training images here
│ │ └── masks <= place training masks here
│ ├── test
│ │ └── images <= place test images here
│ └── metadata_fTq0l2T.csv <= place the metadata file directly in the `raw` dir
└── submissions
├── avg <= ensemble submissions will be saved here
└── single-model <= single model submissions will be saved here
Making predictions
To reproduce the best submissions, follow these steps.
NOTE: The Makefile commands expect the model and dataset files to be in the correct directories!
Single model
Run:
`make repro-best-single-model-submission`
The submission directory will be created under `data/submissions/single-model`.
Ensemble
To reproduce the top #1 submission (private LB score 0.7318), run:
`make repro-top-1-submission`
To reproduce the top #2 submission (private LB score 0.7318), run:
`make repro-top-2-submission`
The submission directories will be created under `data/submissions/avg`. Individual model predictions will be saved under `data/predictions/top-1-submission` and `data/predictions/top-2-submission`.
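As a quick sanity check, you can list the generated artifacts afterwards:

```bash
# Verify that the submission and per-model prediction outputs exist.
ls data/submissions/avg
ls data/predictions/top-1-submission data/predictions/top-2-submission
```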