Kelp Wanted: Segmenting Kelp Forests - 2nd place solution
Username: xultaeculcis
Overhead drone footage of giant kelp canopy. Image Credit: Tom Bell, All Rights Reserved.
Introduction
In the realm of environmental conservation, the urgency to deploy innovative solutions for monitoring critical ecosystems has never been more pronounced. Among these ecosystems, kelp forests stand out due to their vital role in marine biodiversity and their substantial contribution to global economic value. These underwater habitats, predominantly formed by giant kelp, are foundational to coastal marine ecosystems, supporting thousands of species while also benefiting human industries significantly. However, the sustainability of kelp forests is under threat from climate change, overfishing, and unsustainable harvesting practices. The pressing need for effective monitoring methods to preserve these ecosystems is evident.
In response to this challenge, the Kelp Wanted: Segmenting Kelp Forests competition on the DrivenData platform presented an opportunity for machine learning enthusiasts and experts to contribute to the conservation efforts of these vital marine habitats. The competition aimed to leverage the power of machine learning to analyze coastal satellite imagery, enabling the estimation of kelp forests' extent over large areas and over time. This innovative approach marks a significant advancement in the field of environmental monitoring, offering a cost-effective and scalable solution to understand and protect kelp forest dynamics.
The challenge required participants to develop algorithms capable of detecting the presence or absence of kelp canopy using Landsat satellite imagery. This binary semantic segmentation task demanded not only technical expertise in machine learning and image processing but also a deep understanding of the environmental context and the data's geospatial nuances.
The competition underscored the importance of applying a combination of advanced machine learning techniques and domain-specific knowledge to address environmental challenges. The solution, which secured 2nd place, leveraged a comprehensive strategy that included the use of pre-trained weights, data preprocessing methods, and a carefully optimized model architecture. By integrating additional spectral indices, adjusting the learning strategy, and employing a robust model ensemble, it was possible to achieve significant accuracy in segmenting kelp forests from satellite imagery.
This report details the methodologies and technologies that underpinned the best submissions. From data preparation and feature engineering to model development and validation, it outlines the steps taken to develop the solution. The proposed approach demonstrates the potential of machine learning to make a meaningful contribution to environmental conservation efforts, paving the way for further research and application in this critical field.
The Author extends his deepest gratitude to the organizers of the "Kelp Wanted: Segmenting Kelp Forests" competition, including DrivenData, Kelpwatch.org, UMass Boston, and Woods Hole Oceanographic Institution, for their visionary approach in bridging machine learning with environmental conservation. Their dedication to addressing the urgent need for innovative solutions in monitoring kelp forests has not only spotlighted the critical state of these ecosystems but has also paved the way for groundbreaking research and development. The meticulous organization of the competition, the provision of a rich dataset, and the support offered throughout the challenge were instrumental in fostering a collaborative and inspiring environment for all participants. In the Author's opinion, this competition has not only contributed significantly to the field of environmental science but has also empowered the machine learning community to apply their skills for a cause that extends far beyond technological advancement, aiming for a profound and positive impact on our planet's future.
TL;DR - solution summary
This section contains a TL;DR summary for the 2nd place solution.
What worked
- Pre-trained weights
- Reorder channels into R,G,B,SWIR,NIR,QA,DEM
- AdamW instead of Adam or SGD
- Weight decay = 1e-4
- Learning rate = 3e-4
- Batch size = 32 <- full utilization of Tesla T4
- `16-mixed` precision training
- `bf16-mixed` precision inference
- AOI grouping (removing leakage between CV folds)
- Quantile normalization
- Dice Loss
- Weighted sampler @10240 samples per epoch:
  - `has_kelp_importance_factor=3.0`
  - `kelp_pixels_pct_importance_factor=0.2`
  - `qa_ok_importance_factor=0.0`
  - `qa_corrupted_pixels_pct_importance_factor=-1.0`
  - `almost_all_water_importance_factor=0.5`
  - `dem_nan_pixels_pct_importance_factor=0.25`
  - `dem_zero_pixels_pct_importance_factor=-1.0`
- Masking indices with QA and DEM Water Mask
- Extra spectral indices: DEMWM, NDVI, ATSAVI, AVI, CI, ClGreen, GBNDVI, GVMI, IPVI, KIVU, MCARI, MVI, NormNIR, PNDVI, SABI, WDRVI, mCRIG
- UNet + EfficientNet-B5 <- best combo
- Decision threshold change to 0.45-0.48
- `OneCycleLR` LR scheduler
- 10-fold CV
- Training for 50 epochs (best model ensemble)
- Mixing models with best DICE per split in the ensemble
- Soft labels (second-best model used them)
- Mixing model architectures and encoders such as:
  - `unet` <- was the best
  - `linknet`
  - `unet++`
  - `tu-efficientnet_b5` <- was the best
  - `tu-mobilevitv2_150.cvnets_in22k_ft_in1k_384`
  - `tu-maxvit_small_tf_384.in1k`
  - `tu-seresnextaa101d_32x8d.sw_in12k_ft_in1k`
  - `tu-rexnetr_200.sw_in12k_ft_in1k`
  - `tu-seresnext26d_32x4d`
  - `tu-gcresnet33ts.ra2_in1k`
  - `tu-seresnext101_32x4d`
What did not work
- Training from scratch
- Larger or smaller `weight_decay`
- Larger or smaller `lr`
- `decoder_attention_type="scse"`
- Losses other than DICE (CE with weighting was close)
- Compiling the model and `torch-ort`
- Normalization strategies other than `quantile`/`z-score`
- A bunch of different index combinations
- TTA (for leaderboard)
- LR Schedulers other than `OneCycleLR`
- Random split
- XGBoost and other tree-based models predicting on pixel level
- More decoder channels
- SAHI
- Resize strategy different than `pad`
- Training with larger images
- Bigger batches
- More frequent val checks
- Smaller batch sizes and `accumulate_grad_batches` > 1
- Stochastic Weights Averaging (SWA)
What was not tested
- Prithvi-100M - did not have the time to verify it; also, it was trained on Harmonized Landsat Sentinel-2 (HLS) data - additional data transforms would need to be implemented.
- The Pretrained Remote Sensing Transformer (Presto) - again, no time; also, it requires additional data and was not directly trained on Landsat. Sentinel-2 was the main data source for it.
- Transformer-based models such as Swin Transformer or SegFormer - the dev env was already bloated. Including those models would mean adding the MMSegmentation package, which has a completely different training configuration than `pytorch-lightning`.
Most impactful decisions
Some of the most impactful decisions are listed in the table below. Please note that this is not a full list. Along the way there were other methods that are not mentioned here.
Method | Public LB score | Score increase |
---|---|---|
Baseline UNet + ResNet-50 encoder | 0.6569 | N/A |
+ AOI grouping with cosine similarity and more robust CV strategy | 0.6648 | 0.0079 |
+ Dice loss instead of Cross Entropy | 0.6824 | 0.0176 |
+ Weighted sampler | 0.6940 | 0.0116 |
+ Appending 17 extra spectral indices + using water mask to mask land | 0.7083 | 0.0143 |
+ EfficientNet-B5 instead of ResNet-50 | 0.7101 | 0.0018 |
+ Decision threshold optimization + bf16-mixed inference | 0.7132 | 0.0031 |
+ 10-fold model ensemble | 0.7169 | 0.0037 |
+ Training for 50 epochs + weighted average + soft labels | 0.7210 | 0.0041 |
Best single model
The best single model had a private LB score of 0.7264, which would have resulted in 5th place in the final ranking.
NOTE: All submissions were created using a Linux machine with 8 core AMD CPU, 32 GB RAM, 2TB SSD disk and RTX 3090 GPU.
The model was a UNet with an EfficientNet-B5 encoder pre-trained on ImageNet, trained in the following fashion:
- Trained with Azure ML using `Standard_NC4as_T4_v3` spot instances (4 cores, 28 GB RAM, 176 GB disk, Tesla T4 GPU with 16 GB VRAM)
- CV-Fold: #7
- Channel order: R,G,B,SWIR,NIR,QA,DEM
- Additional spectral indices: DEMWM, NDVI, ATSAVI, AVI, CI, ClGreen, GBNDVI, GVMI, IPVI, KIVU, MCARI, MVI, NormNIR, PNDVI, SABI, WDRVI, mCRIG
- Default decoder channel numbers
- Batch size: 32
- Accumulate Grad Batches: 1
- Val Check Interval: 1
- Loss: Dice
- Weighted Sampler: @10240 samples / epoch
- Per-image weights based on the following config:
  - `almost_all_water_importance_factor`: 0.5
  - `dem_nan_pixels_pct_importance_factor`: 0.25
  - `dem_zero_pixels_pct_importance_factor`: -1.0
  - `has_kelp_importance_factor`: 3.0
  - `kelp_pixels_pct_importance_factor`: 0.2
  - `qa_corrupted_pixels_pct_importance_factor`: -1.0
  - `qa_ok_importance_factor`: 0.0
- Image size: 352 (resized using padding)
- Image normalization strategy: `quantile`
- Masking spectral indices with DEMWM and QA
- Replacing missing pixels with `torch.nan`
- Epochs: 10
- Optimizer: AdamW
- Weight decay: 1e-4
- Max learning rate: 3e-4
- Learning rate scheduler: `OneCycleLR`:
  - `onecycle_div_factor`: 2.0
  - `onecycle_final_div_factor`: 100.0
  - `onecycle_pct_start`: 0.1
- Precision: `16-mixed`
- No `torch-ort` or `torch.compile`
- No `benchmark`
- Seed: 42
- Training a single model took: ~1:20h
The submission used:
- No TTA
- `bf16-mixed` precision
- Decision threshold = 0.45
- Inference took: ~30s
Best submissions
Two submissions with the same private LB score of 0.7318 were produced. Both were ensembles of 10 models, trained on all 10 CV folds. Both submissions used the best checkpoints, not the last ones.
Submission #1
Trained in the same way as the single model submission, with the following exceptions:
- Epochs increased to 50
- Mixture of all 10 CV-Folds
- Training a single model took: ~6-7h
NOTE: The model trained on fold=5 was, by mistake, trained for 10 epochs instead of 50
Inference:
- No soft labels
- No TTA
- Decision threshold for a single model: 0.48
- Ensemble model weights:
- fold=0: 1.0
- fold=1: 1.0
- fold=2: 1.0
- fold=3: 1.0
- fold=4: 1.0
- fold=5: 1.0
- fold=6: 1.0
- fold=7: 1.0
- fold=8: 1.0
- fold=9: 1.0
- Ensemble decision threshold: 0.48
- Inference took: ~4:10 min
Submission #2
Trained in the same way as the single model submission, with the following exceptions:
- Trained with Azure ML using `Standard_NC24ads_A100_v4` spot instances (24 cores, 220 GB RAM, 64 GB disk, A100 GPU with 80 GB VRAM)
- 50 epochs
- Loss: Jaccard
- Decoder channels: 512, 256, 128, 64, 32
- `bf16-mixed` precision
- Mixture of all 10 CV-Folds
- Training a single model took: ~2h
Inference:
- Soft labels
- No TTA
- `bf16-mixed` precision
- Ensemble model weights:
- fold=0: 0.666
- fold=1: 0.5
- fold=2: 0.666
- fold=3: 0.88
- fold=4: 0.637
- fold=5: 0.59
- fold=6: 0.733
- fold=7: 0.63
- fold=8: 1.0
- fold=9: 0.2
- Ensemble decision threshold: 0.45
- Inference took: ~4:10 min
Private leaderboard scores
In the end, 11 submissions had scores that would have qualified as 2nd place solutions - 8 of those had a score of over 0.73. Furthermore, an additional 5 submissions had the same score as the 3rd place solution.
All of those submissions used model ensembles. The best single model was submission #41 with a score of 0.7264, which would have been a 5th place solution.
Code
The code is available on GitHub: https://github.com/xultaeculcis/kelp-wanted-competition
Project documentation
Interactive documentation page is available here: https://xultaeculcis.github.io/kelp-wanted-competition/
Dev Log
A detailed development log has been kept during the competition. You can review what was done to train the model in great detail here.
How-to guides
The best place to start with the solution is to review the How-To guides:
- Setting up dev environment
- Contributing
- Running tests
- Using Makefile commands
- Reproducibility of results
- Preparing data
- Training models
- MLFlow artifacts
- Evaluating models
- Running inference
- Making submissions
- XGBoost
- SAHI
API docs
To learn more about how the code is structured and what its functionalities are, go to: Code docs.
Methodology
This section aims to provide a detailed exposition of the methods employed to tackle the complex task of detecting kelp canopy presence using satellite imagery. The overall approach was multifaceted, integrating advanced machine learning techniques with domain-specific insights to develop a robust and effective model.
Issues during training
Before discussing the methods and techniques used in this solution, the Author would like to acknowledge a few issues with the presented approach.
- Misaligned DEM and spectral layers - the misalignment became apparent in the last weeks of the competition. That being said, aligning the images was not performed, since per the competition rules: "Eligible solutions need to be able to run on test samples automatically using the test data as provided". Test set modification was prohibited. Designing an automated band alignment, although possible, would not have been ideal if additional verification and checks of the final models were needed. Running alignment on the fly would be too expensive to run for each submission.
- DICE score discrepancy between local runs and LB - the `torchmetrics` package was used for fast on-GPU metrics calculation. While local DICE scores were in the range 0.84-0.86, the public LB scores were in the range 0.70-0.72. This discrepancy was attributed to the test set having a different distribution and possibly a higher number of false positive kelp labels than the training set. Since the scores were consistent over time, and an increase in local DICE usually resulted in higher LB scores, a decision was made to keep using the `torchmetrics` DICE implementation. After seeing some scores on the competition forum, perhaps further investigation into the score alignment issues would have yielded better results in the end.
- The fold number used to compare the locally trained models changed from fold=0 to fold=6 and finally to fold=8. Fold #0 was used mainly during the baseline training phase, where different trainer and dataloader hyperparameters were tested. Once all folds were evaluated against the public LB, fold #6 was used, as it showed the best performance on the public LB. Fold #8 was only used after training larger, more capable models with a larger number of samples per epoch and a larger number of input channels. The latest submissions were always evaluated against fold #8.
- Spot instances on Azure ML can be evicted, leading to failed Jobs - in such cases manual Job re-submission is needed.
Software setup
All experiments were conducted on Linux-based machines. Azure ML was used to scale the training jobs to the cloud. The local PC ran Ubuntu 22.04, while jobs running on Azure ML used Ubuntu 20.04 based Docker images.
The code is developed in Python. Environment management is performed using `conda` and `conda-lock`. Please refer to the Local env setup guide for setting up the local development environment.
NOTE: The dev environment and code was not tested on Windows or macOS.
A short guide on setting up Azure ML Workspace and Azure DevOps organization is available in Azure ML reproducibility guide. Scheduling AML Jobs is described in Training on Azure ML guide.
Full list of packages with their versions can be found in the conda-lock.yml file. Specs for Azure ML Docker based Environment are here: acpt_train_env
Some packages that were used for this competition include:
- pytorch
- torchvision
- pytorch-lightning
- kornia
- rasterio
- scikit-learn
- pandas
- ttach
- segmentation-models-pytorch
- timm
- dask
- matplotlib
Hardware setup
For running quick experiments, data preparation, training models and making submissions a local PC with Ubuntu 22.04 was used. This machine was the only one used to make submission files.
Specs:
- OS: Ubuntu 22.04
- Python 3.10 environment with PyTorch 2.1.2 and CUDA 12.1
- CPU: 8-Core AMD Ryzen 7 2700X
- RAM: 32 GB RAM
- Disk: 2 TB SSD
- GPU: RTX 3090 with 24 GB VRAM
Larger training jobs and hyperparameter optimization sweeps were performed on Azure ML using the following spot instances:
- Standard_NC4as_T4_v3:
- Ubuntu 20.04 docker image
- Python 3.8 environment with PyTorch 2.1.2 and CUDA 12.1 (the Azure base image forces py38 usage)
- 4 cores
- 28 GB RAM
- 176 GB disk
- Tesla T4 GPU with 16 GB VRAM
- Standard_NC24ads_A100_v4:
- Ubuntu 20.04 docker image
- Python 3.8 environment with PyTorch 2.1.2 and CUDA 12.1 (the Azure base image forces py38 usage)
- 24 cores
- 220 GB RAM
- 64 GB disk
- A100 GPU with 80 GB VRAM
Initial Data processing
Before the baseline submission was made, Exploratory Data Analysis (EDA) was performed. Samples were plotted in various composite configurations, together with QA, DEM, NDVI and the Kelp Mask, in single figures. Samples were also visualized on the map using QGIS. The organizers chose not to include CRS information - probably to eliminate cheating, where one could exploit overlapping image extents and use the leakage for model training. This, however, hindered robust train-val-test split strategies.
Baseline
For the baseline, a combo of the UNet architecture from Segmentation Models PyTorch and a ResNet-50 encoder was used. Initially, a 10-Fold Stratified Cross Validation was used. Stratification was done using the following per-image flag combination: `qa_ok`, `has_kelp`, `dem_has_nans`, `high_corrupted_pixels_pct`.
Band statistics were calculated and the `z-score` normalization strategy was used. NDVI was appended to the input Tensor. Pixels with -32k values were filled with `0.0`. Bands used in this model were: SWIR, NIR, R, G, B, QA, DEM and NDVI.
The model was trained with a batch size of 32, no LR Scheduler, for 10 epochs, without any form of weighted sampling, using `torch.nn.CrossEntropyLoss`. The optimizer was set to Adam with a 3e-4 learning rate. Since UNet expects the input tensor W and H dimensions to be divisible by 32, the input images were padded to 352x352.
This model (after fixing MLFlow checkpoint loading) achieved 0.6569 on public LB.
Data fiddling
After it became apparent that there are "duplicates" in the validation datasets - images captured from different orbits, different platforms or at different times - a need for a more robust train-test split arose. You can view a sample of "duplicated" images below (DEM layer).
You can see that the same Area of Interest is presented on those images. In order to create a robust CV split strategy, each DEM layer was used for embedding generation. The DEM layer was chosen for AOI grouping since it does not contain any striping artifacts, corrupted or saturated pixels, or clouds. Missing values in this layer mark the extent of the DEM layer, i.e. the DEM only extends a couple hundred meters into the sea - there is no point in calculating DEM over the water. Missing values were replaced with zeroes. Every DEM image was passed through a pre-trained ResNet network. The resulting embeddings were then compared with each other - if the cosine similarity between two images was over 0.97, they were placed into a single group. The de-duplication resulted in 3313 unique AOI groups. Those groups were then used to perform the 10-Fold Stratified CV Split. A sketch of this grouping procedure is shown below.
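A hypothetical sketch of this procedure follows; the choice of ResNet variant and the greedy grouping strategy are illustrative assumptions, not the exact code from the repo.

```python
# Hypothetical sketch: embed each DEM layer with a pre-trained ResNet, then
# greedily group images whose embeddings have cosine similarity > 0.97.
import numpy as np
import torch
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = torch.nn.Identity()  # use the network as a feature extractor
backbone.eval()

@torch.no_grad()
def embed_dem(dem: torch.Tensor) -> torch.Tensor:
    """dem: (H, W) DEM layer -> L2-normalized (2048,) embedding."""
    x = dem.nan_to_num(0.0)                          # missing values -> zeroes
    x = x.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)  # fake 3-channel image
    return torch.nn.functional.normalize(backbone(x), dim=1).squeeze(0)

def group_aois(embeddings: torch.Tensor, threshold: float = 0.97) -> np.ndarray:
    """Greedily assign images to AOI groups; rows must be L2-normalized."""
    sims = embeddings @ embeddings.T  # cosine similarity matrix
    groups = -np.ones(len(embeddings), dtype=int)
    next_group = 0
    for i in range(len(embeddings)):
        if groups[i] == -1:
            groups[i] = next_group
            next_group += 1
        for j in (sims[i] > threshold).nonzero().ravel().tolist():
            if groups[j] == -1:
                groups[j] = groups[i]  # duplicate AOI -> same group
    return groups
```

The resulting group IDs can then be passed as the `groups` argument to something like scikit-learn's `StratifiedGroupKFold`, so that all images of one AOI land in the same fold.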
What's more, a few images had been mislabelled as kelp. Those images were filtered out of the training dataset.
Input channels were also re-ordered so that the first channels match those used by pre-trained networks trained on ImageNet and other natural-image datasets. The R, G, B, SWIR, NIR, QA, DEM band order consistently outperformed the default SWIR, NIR, R, G, B, QA, DEM order by a small margin - val/dice 0.762 vs 0.760.
Extra channels - spectral indices
In the Author's other projects, utilizing additional input channels such as NDVI, EVI, Water Masks and others greatly improved the models' predictive capabilities. To see a list of all implemented spectral indices, see the indices page.
The best models all used 17 extra spectral indices appended to the input tensor of R, G, B, SWIR, NIR, QA, DEM bands: DEMWM, NDVI, ATSAVI, AVI, CI, ClGreen, GBNDVI, GVMI, IPVI, KIVU, MCARI, MVI, NormNIR, PNDVI, SABI, WDRVI, mCRIG.
To see more visualizations of those indices, refer to the MLFlow spectral indices artifacts guide.
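As a minimal illustration of how such an index is appended as an extra channel, consider NDVI (the helper below is a sketch; the repo computes all 17 indices, with masking, inside its own GPU transforms):

```python
# Minimal sketch: append NDVI = (NIR - R) / (NIR + R) as an extra channel.
# Band positions follow the R,G,B,SWIR,NIR,QA,DEM order used in this solution.
import torch

def append_ndvi(batch: torch.Tensor, red_idx: int = 0, nir_idx: int = 4,
                eps: float = 1e-8) -> torch.Tensor:
    """batch: (N, C, H, W) -> (N, C+1, H, W) with NDVI appended."""
    red, nir = batch[:, red_idx], batch[:, nir_idx]
    ndvi = (nir - red) / (nir + red + eps)
    return torch.cat([batch, ndvi.unsqueeze(1)], dim=1)

batch = torch.rand(4, 7, 352, 352)  # R,G,B,SWIR,NIR,QA,DEM
print(append_ndvi(batch).shape)     # torch.Size([4, 8, 352, 352])
```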
Normalization strategy
The following normalization strategies were evaluated early on using split=6:
- `z-score`: 0.834168
- `quantile`: 0.834134
- `min-max`: 0.831865
- `per-sample-quantile`: 0.806227
- `per-sample-min-max`: 0.801893
In the end, `quantile` normalization was used, since it produces the most visually appealing samples and was more robust against outliers; the learning curve also seemed to converge faster. The idea behind quantile normalization is rather simple: instead of using the global min-max, we calculate per-channel quantile values for q01 and q99. During training, Min-Max normalization is used with q01 and q99 as the min and max values respectively.
You can view the effects of normalization strategies in the figures below.
- Z-Score:
- Quantile:
- Min-Max:
- Per-Sample Min-Max:
- Per-Sample Quantile:
Additionally, replacing corrupted, missing and land pixels with `0.0` and with `torch.nan` was tested. Masking with `torch.nan` was better than using zeroes (Public LB 0.7062 vs 0.7083). A rationale here is that using zeroes can lead to suboptimal normalization, so masking with `torch.nan` is performed instead.
After the spectral indices are calculated and appended to the input tensor, for each channel the `nan` and `-inf` pixels are replaced with the minimal value of that channel, while `inf` pixels are replaced with the maximal value per channel.
Masking land and corrupted pixels in the indices bumps the performance by 1-2%.
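A minimal sketch of that per-channel fill step (assuming each channel has at least one finite pixel):

```python
# Minimal sketch: NaN and -Inf pixels become the channel minimum,
# +Inf pixels become the channel maximum.
import torch

def fill_nan_inf(x: torch.Tensor) -> torch.Tensor:
    """x: (C, H, W) possibly containing NaN/Inf values."""
    out = x.clone()
    for c in range(x.shape[0]):
        channel = x[c]
        finite = channel[torch.isfinite(channel)]  # assumes non-empty
        lo, hi = finite.min().item(), finite.max().item()
        out[c] = torch.nan_to_num(channel, nan=lo, neginf=lo, posinf=hi)
    return out

x = torch.randn(24, 352, 352)
x[0, :10, :10] = torch.nan                    # simulate masked pixels
print(torch.isfinite(fill_nan_inf(x)).all())  # tensor(True)
```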
Loss functions
Various loss functions were evaluated:
- `torch.nn.CrossEntropyLoss` (`weight=[0.4,0.6]`)
- `smp.losses.DiceLoss`
- `smp.losses.JaccardLoss`
- `smp.losses.TverskyLoss`
- `torch.nn.CrossEntropyLoss` (`weight=[0.3,0.7]`)
- `torch.nn.CrossEntropyLoss` (`weight=None`)
- `smp.losses.SoftCrossEntropyLoss` (`smooth_factor=0.1`)
- `smp.losses.FocalLoss`
- `smp.losses.LovaszLoss`
- `smp.losses.SoftCrossEntropyLoss` (`smooth_factor=0.2`)
- `smp.losses.SoftCrossEntropyLoss` (`smooth_factor=0.3`)
- `torch.nn.CrossEntropyLoss` (`weight=[0.1,0.9]`)
- `kelp.nn.models.losses.XEDiceLoss`
- `kelp.nn.models.losses.ComboLoss`
- `kelp.nn.models.losses.LogCoshDiceLoss`
- `kelp.nn.models.losses.HausdorffLoss`
- `kelp.nn.models.losses.TLoss`
- `kelp.nn.models.losses.ExponentialLogarithmicLoss`
- `kelp.nn.models.losses.SoftDiceLoss`
- `kelp.nn.models.losses.BatchSoftDice`
In the end, Dice loss was used, since it performed best on the public leaderboard. The two best model ensembles, with the same private LB score of 0.7318, used DICE and Jaccard losses respectively.
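A minimal usage sketch with segmentation-models-pytorch; `mode="multiclass"` is an assumption based on the two-class (kelp / no-kelp) setup with class-weighted CE mentioned above:

```python
# Minimal sketch of the Dice loss setup via segmentation-models-pytorch.
import torch
import segmentation_models_pytorch as smp

criterion = smp.losses.DiceLoss(mode="multiclass")

logits = torch.randn(4, 2, 352, 352)         # (batch, classes, H, W)
target = torch.randint(0, 2, (4, 352, 352))  # hard labels: 0 = no kelp, 1 = kelp
loss = criterion(logits, target)
```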
Weighted sampler
Since over 2000 images do not have any kelp pixels in them, it was apparent that those images would not contribute much to the training. A Weighted Sampler was therefore used and tuned via hparam search on Azure ML.
The following metadata stats and flags were used to determine the optimal per-image weight:
- `has_kelp` - a flag indicating if the image has kelp in it
- `kelp_pixels_pct` - percentage of all pixels marked as kelp
- `dem_nan_pixels_pct` - percentage of all DEM pixels marked as NaN
- `dem_zero_pixels_pct` - percentage of all DEM pixels with value=zero
- `almost_all_water` - a flag indicating that over 98% of the DEM layer pixels are water
- `qa_ok` - a flag indicating that no pixels are corrupted in the QA band
- `qa_corrupted_pixels_pct` - percentage of corrupted pixels in the QA band

Some results after a whole night of training:
samples_per_epoch | has_kelp | kelp_pixels_pct | qa_ok | qa_corrupted_pixels_pct | almost_all_water | dem_nan_pixels_pct | dem_zero_pixels_pct | val/dice |
---|---|---|---|---|---|---|---|---|
5120 | 2 | 0.5 | 0.5 | -0.5 | -1 | 0 | -0.25 | 0.84405 |
5120 | 0.2 | 0 | -0.5 | -0.5 | -0.5 | 0.25 | 0.5 | 0.84402 |
5120 | 3 | 0.5 | -1 | 0 | -1 | 0.25 | 0 | 0.84396 |
5120 | 3 | 0.2 | 0 | -1 | 0.5 | 0.25 | -1 | 0.84396 |
5120 | 2 | 0 | 0.5 | 0 | 0.25 | 0 | -0.5 | 0.84391 |
5120 | 0.5 | 0.2 | -0.5 | 0.75 | 0.5 | -1 | -0.25 | 0.84390 |
5120 | 3 | 0.2 | 0.5 | -0.25 | 0.75 | -0.25 | 0.5 | 0.84382 |
5120 | 3 | 0.5 | 1 | -0.25 | 0.5 | -0.5 | 0 | 0.84382 |
5120 | 3 | 2 | -0.25 | -0.5 | -0.5 | -1 | 0.75 | 0.84380 |
5120 | 2 | 1 | 0.25 | -1 | 0.75 | 0.75 | 1 | 0.84377 |
5120 | 2 | 0 | -0.5 | 0 | -1 | -1 | -1 | 0.84374 |
5120 | 0.5 | 0 | -1 | -0.25 | 0.25 | -0.25 | -0.5 | 0.84373 |
5120 | 0.2 | 0 | 0 | 0.25 | -1 | 0.25 | 0.5 | 0.84370 |
5120 | 2 | 0.5 | 0.25 | -0.5 | 0.25 | 0.5 | -0.5 | 0.84369 |
After retraining using 10240 samples per epoch:
samples_per_epoch | has_kelp | kelp_pixels_pct | qa_ok | qa_corrupted_pixels_pct | almost_all_water | dem_nan_pixels_pct | dem_zero_pixels_pct | val/dice |
---|---|---|---|---|---|---|---|---|
10240 | 2 | 0.5 | 0.5 | -0.5 | -1 | 0 | -0.25 | 0.84459 |
10240 | 0.2 | 0 | -0.5 | -0.5 | -0.5 | 0.25 | 0.5 | 0.84456 |
10240 | 3 | 0.5 | -1 | 0 | -1 | 0.25 | 0 | 0.84501 |
10240 | 3 | 0.2 | 0 | -1 | 0.5 | 0.25 | -1 | 0.84801 |
10240 | 2 | 0 | 0.5 | 0 | 0.25 | 0 | -0.5 | 0.84641 |
10240 | 0.5 | 0.2 | -0.5 | 0.75 | 0.5 | -1 | -0.25 | 0.84622 |
10240 | 3 | 0.2 | 0.5 | -0.25 | 0.75 | -0.25 | 0.5 | 0.84546 |
10240 | 3 | 0.5 | 1 | -0.25 | 0.5 | -0.5 | 0 | 0.84619 |
10240 | 3 | 2 | -0.25 | -0.5 | -0.5 | -1 | 0.75 | 0.84500 |
10240 | 2 | 1 | 0.25 | -1 | 0.75 | 0.75 | 1 | 0.84508 |
10240 | 2 | 0 | -0.5 | 0 | -1 | -1 | -1 | 0.84430 |
10240 | 0.5 | 0 | -1 | -0.25 | 0.25 | -0.25 | -0.5 | 0.84496 |
10240 | 0.2 | 0 | 0 | 0.25 | -1 | 0.25 | 0.5 | 0.84522 |
10240 | 2 | 0.5 | 0.25 | -0.5 | 0.25 | 0.5 | -0.5 | 0.84538 |
Training with 10240 samples / epoch yielded better results. Assigning higher importance to the `has_kelp` flag while keeping the `almost_all_water` flag low, or even negative, was the best combination. Zeroes and NaN values in the DEM layer seemed not to be as important - weights for those stats usually ranged from -0.25 to 0.25. The QA band stats also seemed to be only slightly important, with weights in the range -1.0 to 1.0.
For the public LB, the following configuration worked best (a sketch of how these factors can be combined into sampling weights follows below):
- 10240 samples / epoch
- `has_kelp_importance_factor=3.0`
- `kelp_pixels_pct_importance_factor=0.2`
- `qa_ok_importance_factor=0.0`
- `qa_corrupted_pixels_pct_importance_factor=-1.0`
- `almost_all_water_importance_factor=0.5`
- `dem_nan_pixels_pct_importance_factor=0.25`
- `dem_zero_pixels_pct_importance_factor=-1.0`
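The sketch below is a hypothetical illustration of turning the importance factors into weights for `WeightedRandomSampler`; the simple linear combination is an assumption, not the repo's exact formula:

```python
# Hypothetical sketch: per-image stats/flags -> sampling weights.
import pandas as pd
import torch
from torch.utils.data import WeightedRandomSampler

factors = {
    "has_kelp": 3.0,
    "kelp_pixels_pct": 0.2,
    "qa_ok": 0.0,
    "qa_corrupted_pixels_pct": -1.0,
    "almost_all_water": 0.5,
    "dem_nan_pixels_pct": 0.25,
    "dem_zero_pixels_pct": -1.0,
}

def per_image_weights(metadata: pd.DataFrame) -> torch.Tensor:
    """Combine per-image stats into a single positive sampling weight."""
    weights = torch.ones(len(metadata))
    for column, factor in factors.items():
        weights += factor * torch.as_tensor(
            metadata[column].to_numpy(), dtype=torch.float32
        )
    return weights.clamp(min=1e-3)  # sampler weights must stay positive

# 10240 samples drawn per epoch, with replacement, as in the best runs.
# `metadata_df` (one row per training image) is assumed to exist:
# sampler = WeightedRandomSampler(per_image_weights(metadata_df),
#                                 num_samples=10240, replacement=True)
```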
Training configuration
Multiple training configurations were tested, including:
- Training from scratch - very slow convergence; training using a pre-trained model is a must in this case. Final DICE after 10 epochs = 0.736, compared to 0.760 with `imagenet` weights
- Optimizers: Adam, SGD, AdamW - in the end AdamW was the best on the local validation set -> +0.02 for AdamW
- OneCycleLR vs no LR scheduler: 0.76640 vs 0.764593, but overall stability is better with 1Cycle
- `weight_decay`:
  - 1e-2: 0.759
  - 1e-3: 0.763
  - 1e-4: 0.765
  - 1e-5: 0.762
- Tried to add `decoder_attention_type="scse"`, but it gives worse performance (val/dice=0.755)
- Learning rate set to 3e-4 worked best locally (a sketch of the optimizer and scheduler setup is shown after this list)
- UNet requires image dimensions to be divisible by 32 - the input images were resized to 352x352 using zero-padding; during inference the padding is removed
- Using other resize strategies and image sizes did not generate better results
- Batch size of 32 was selected mainly to keep Tesla T4 GPUs fully saturated, almost at the edge of running into OOM errors
- Training with `batch_size=8` and `accumulate_grad_batches=4` resulted in better local eval scores, but did not improve leaderboard scores
- The train augmentations were rather simple:
- Random Horizontal Flip
- Random Vertical Flip
- Random Rotation 0-90 deg.
- Using Random Resized Crops did not work better than the base augmentations mentioned above
- Transforms (including appending spectral indices) were performed on batches of images on the GPU using the kornia library
- Compiling the model using `torch-ort` or `torch.compile` did not yield much of a speedup; in some configurations the compilation phase took 50% of the training time. It is better suited for large multi-day and multi-GPU training runs, where those few % of speedup can really be beneficial.
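A minimal sketch of the optimizer and LR scheduler setup described above (AdamW with lr=3e-4 and weight_decay=1e-4, plus OneCycleLR with div_factor=2.0, final_div_factor=100.0, pct_start=0.1); the model here is just a stand-in placeholder:

```python
# Minimal sketch: AdamW + OneCycleLR with the hyperparameters from this report.
import torch

model = torch.nn.Conv2d(24, 2, kernel_size=3, padding=1)  # placeholder network
epochs, steps_per_epoch = 10, 10240 // 32  # 10240 samples/epoch, batch size 32

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-4,
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    div_factor=2.0,          # initial_lr = max_lr / 2
    final_div_factor=100.0,  # min_lr = initial_lr / 100
    pct_start=0.1,           # 10% of steps spent warming up
)
```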
TTA
Test time augmentations were also tested; the ttach library was used for them. The following augmentations were used during testing (a minimal setup sketch follows the list):
- Vertical flip
- Horizontal flip
- Rotation: 0, 90, 180, 270
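```python
# Minimal sketch of the TTA setup with the ttach library; the placeholder
# model stands in for the trained segmentation network.
import torch
import ttach as tta

transforms = tta.Compose([
    tta.HorizontalFlip(),
    tta.VerticalFlip(),
    tta.Rotate90(angles=[0, 90, 180, 270]),
])

model = torch.nn.Conv2d(24, 2, kernel_size=3, padding=1)  # placeholder
tta_model = tta.SegmentationTTAWrapper(model, transforms, merge_mode="mean")

batch = torch.randn(4, 24, 352, 352)
with torch.no_grad():
    predictions = tta_model(batch)  # augment -> predict -> de-augment -> merge
```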
Results with different `tta_merge_mode` values:
- Baseline (no TTA) val/dice=0.85339:
- max: 0.85490
- mean: 0.85458
- sum: 0.85458
- min: 0.85403
- gmean: 0.15955
- tsharpen: 0.00468 - loss was nan
On public LB:
- no-tta: 0.7083
- max: 0.7076
- mean: 0.7073
It was decided that TTA was not working, and it was never used again...
Thresholding
Multiple decision thresholds were verified on the local validation set.
threshold | val/dice |
---|---|
0.3 | 0.85301 |
0.31 | 0.85305 |
0.32 | 0.85306 |
0.33 | 0.85307 |
0.34 | 0.85309 |
0.35 | 0.85314 |
0.36 | 0.85314 |
0.37 | 0.85313 |
0.38 | 0.85315 |
0.39 | 0.85317 |
0.4 | 0.85316 |
0.41 | 0.85317 |
0.42 | 0.85317 |
0.43 | 0.85320 |
0.44 | 0.85319 |
0.45 | 0.85320 |
0.46 | 0.85318 |
0.47 | 0.85314 |
0.48 | 0.85316 |
0.49 | 0.85317 |
0.5 | 0.85316 |
0.51 | 0.85315 |
0.52 | 0.85314 |
0.53 | 0.85313 |
0.54 | 0.85311 |
0.55 | 0.85309 |
0.56 | 0.85305 |
0.57 | 0.85303 |
0.58 | 0.85300 |
0.59 | 0.85300 |
0.6 | 0.85296 |
0.61 | 0.85296 |
0.62 | 0.85290 |
0.63 | 0.85287 |
0.64 | 0.85285 |
0.65 | 0.85281 |
0.66 | 0.85278 |
0.67 | 0.85274 |
0.68 | 0.85267 |
0.69 | 0.85259 |
0.7 | 0.85253 |
- Leaderboard using dt=0.45: 0.7077
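A hypothetical sketch of such a sweep; `probs` and `targets` are placeholders for stacked per-pixel kelp probabilities and ground-truth masks, and the DICE helper is a plain binary implementation rather than the `torchmetrics` one used in the actual runs:

```python
# Hypothetical sketch: sweep decision thresholds and report binary DICE.
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> float:
    """Binary DICE over all pixels."""
    intersection = (pred * target).sum()
    return ((2 * intersection + eps) / (pred.sum() + target.sum() + eps)).item()

probs = torch.rand(64, 350, 350)                      # per-pixel probabilities
targets = (torch.rand(64, 350, 350) > 0.97).float()   # binary kelp masks

for threshold in torch.arange(0.30, 0.71, 0.01):
    t = threshold.item()
    hard = (probs > t).float()
    print(f"{t:.2f} | {dice_score(hard, targets):.5f}")
```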
Model Architectures and Encoders
Various model architectures and encoders were tested to find the best performing ones. An Azure ML Sweep Job was used to verify encoder + architecture pairs. The encoders came from both the segmentation-models-pytorch and timm libraries. Sample job results are visible below.
General observations:
- ResUnet++ often results in `NaN` loss
- Unet and Unet++ were often the best
- Unet++ training was not deterministic, even though `pl.seed_everything(42)` was used
- Bigger models often resulted in OOM errors during training - batch size had to be reduced and gradient accumulation applied
- Some models expected the input image to be divisible by 8, 24, 128 etc. - the training config class had to be adjusted to change the input `image_size` parameter to allow for training those models
- In general, bigger models worked better
- FCN - the model collapses
- ConvNext and SWIN Transformer models are not supported
- The best combo was UNet + EfficientNet-B5
Results after overnight training on split=6 (Top 5 runs only):
encoder | architecture | val/dice |
---|---|---|
tu-efficientnet_b5 | unet | 0.85854 |
tu-seresnext101_32x4d | unet | 0.85807 |
tu-resnest50d_4s2x40d | unet | 0.85787 |
tu-rexnetr_300 | unet | 0.85749 |
tu-seresnext26d_32x4d | unet | 0.85728 |
As you might have noticed in the screenshot in the Private leaderboard subsection, some of the top performing model ensembles used a mixture of EfficientNet, SE-ResNeXt, ReXNet-R and other encoders trained as part of LinkNet, UNet and UNet++ architectures. However, the two best submissions used an ensemble of just UNet + EfficientNet-B5 models.
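A minimal sketch of that winning combination via segmentation-models-pytorch with a timm (`tu-`) encoder; the 24 input channels assume the 7 reordered bands plus 17 spectral indices, and `classes=2` assumes the two-class setup:

```python
# Minimal sketch: UNet + timm EfficientNet-B5 encoder via smp.
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="tu-efficientnet_b5",  # timm EfficientNet-B5 backbone
    encoder_weights="imagenet",         # pre-trained weights were essential
    in_channels=24,                     # R,G,B,SWIR,NIR,QA,DEM + 17 indices
    classes=2,                          # kelp / no-kelp
)
```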
Stochastic Weights Averaging (SWA)
SWA was one of the last things that were tested during the competition. In short, SWA performs an equal average of the weights traversed by SGD with a modified learning rate schedule (see the left panel of the figure below). SWA solutions end up in the center of a wide, flat region of the loss, while SGD tends to converge to the boundary of the low-loss region, making it susceptible to the shift between train and test error surfaces (see the middle and right panels of the figure below).
Source: Stochastic Weight Averaging in PyTorch Illustrations of SWA and SGD with a Preactivation ResNet-164 on CIFAR-100 [1]. Left: test error surface for three FGE samples and the corresponding SWA solution (averaging in weight space). Middle and Right: test error and train loss surfaces showing the weights proposed by SGD (at convergence) and SWA, starting from the same initialization of SGD after 125 training epochs. Please see [1] for details on how these figures were constructed.
The figure below presents an illustration of the learning rate schedule adopted by SWA. A standard decaying schedule is used for the first 75% of the training, and then a high constant value is used for the remaining 25%. The SWA averages are formed during the last 25% of training.
The best model ensemble trained with SWA had public and private LB scores of 0.7159 and 0.7285 respectively - thus, it was not the best. The best single model with SWA had public and private LB scores of 0.7147 and 0.7226 respectively.
[1] Averaging Weights Leads to Wider Optima and Better Generalization; Pavel Izmailov, Dmitry Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson; Uncertainty in Artificial Intelligence (UAI), 2018
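For completeness, a minimal sketch of enabling SWA in PyTorch Lightning (the training framework used here); the `swa_lrs` value and start point are illustrative assumptions, as the exact SWA hyperparameters are not stated in this report:

```python
# Minimal sketch: SWA via the PyTorch Lightning callback.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import StochasticWeightAveraging

trainer = pl.Trainer(
    max_epochs=50,
    precision="16-mixed",
    callbacks=[
        # average weights over the last 25% of training, per the schedule above
        StochasticWeightAveraging(swa_lrs=1e-4, swa_epoch_start=0.75),
    ],
)
```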
XGBoost
Another idea that did not work was training an XGBoost Classifier on all channels and spectral indices. The goal was not to use XGBoost to establish channel and spectral index importance, but rather to completely replace the deep learning approach with a classical tree-based model. Unfortunately, the best models on the public LB had a score of 0.5125, which was way too low to be worth optimizing. Since predictions are done per pixel, TTA cannot be used. Optimizing the decision threshold did not improve the scores much. Even applying post-processing operations such as erosion or dilation would not bump the performance by 20 p.p.
Sample predictions:
If you want to learn more about XGBoost approach and to reproduce the results, please see: XGBoost guide.
The XGB feature importance at least gave the Author confidence that the original approach with appending spectral indices to the input tensor was somewhat justified.
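A hypothetical sketch of the per-pixel tabular formulation: each pixel becomes a row whose features are the band and spectral index values. Array names and shapes are illustrative:

```python
# Hypothetical sketch: per-pixel XGBoost classification.
import numpy as np
from xgboost import XGBClassifier

n_images, n_channels, height, width = 4, 24, 350, 350
images = np.random.rand(n_images, n_channels, height, width).astype(np.float32)
masks = (np.random.rand(n_images, height, width) > 0.97).astype(np.int64)

# flatten to (n_pixels, n_channels) / (n_pixels,)
X = images.transpose(0, 2, 3, 1).reshape(-1, n_channels)
y = masks.reshape(-1)

clf = XGBClassifier(n_estimators=200, max_depth=6, tree_method="hist")
clf.fit(X, y)

# per-pixel probabilities reshaped back into mask form
probs = clf.predict_proba(X)[:, 1].reshape(n_images, height, width)
```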
SAHI
Slicing Aided Hyper Inference, or SAHI, is a technique that helps overcome the problem of detecting and segmenting small objects in large images by running inference on image slices and merging the predictions. Because of this, it is slower than running inference on the full image, but it usually ends up having better performance, especially for smaller features.
This was the last idea that was tested.
Source: https://github.com/obss/sahi
The idea was simple:
- Generate sliced dataset of small 128x128 non-overlapping tiles from the bigger 350x350 input images
- Use this dataset to train new model
- During training resize the crops to e.g. 320x320 resolution and train on those
- When running inference, generate overlapping tiles, run inference on those tiles, and merge the predicted masks by averaging the predictions in the overlapping areas
- Profit?
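A hypothetical sketch of the tiling step (non-overlapping 128x128 training tiles from the 350x350 inputs, with edge tiles clipped at the border); helper names are illustrative:

```python
# Hypothetical sketch: slice a (C, H, W) image into 128x128 tiles.
import numpy as np

def slice_image(image: np.ndarray, tile: int = 128):
    """Yield (top, left, tile) tuples covering a (C, H, W) image."""
    _, h, w = image.shape
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            yield top, left, image[:, top : top + tile, left : left + tile]

image = np.random.rand(24, 350, 350).astype(np.float32)
tiles = list(slice_image(image))
print(len(tiles))  # 3x3 = 9 tiles; edge tiles are smaller than 128x128
```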
A sample of small tiles used for training are visualized below:
The best model trained on 128x128 crops with a 320x320 resize and nearest interpolation resulted in a public LB score of 0.6848. As it turns out, the input tile size of 350x350 is already too small for SAHI to shine. Sliced inference did not result in any meaningful bump in performance.
If you want to learn more about SAHI approach and to reproduce the results, please see: SAHI guide.
Ensemble
Averaging model predictions was a natural step to take in the later stages of the competition. The prediction scripts were written in a way that allowed optimizing both the decision threshold and the usage of TTA. The final prediction averaging was done using per-model weights. The weights were obtained by applying `MinMaxScaler` with a feature range of `(0.2, 1.0)` to the public LB scores obtained for each model. This resulted in the following weights:
Fold # | Public LB Score | Calculated Weight |
---|---|---|
0 | 0.7110 | 0.666 |
1 | 0.7086 | 0.5 |
2 | 0.7110 | 0.666 |
3 | 0.7139 | 0.88 |
4 | 0.7106 | 0.637 |
5 | 0.7100 | 0.59 |
6 | 0.7119 | 0.733 |
7 | 0.7105 | 0.63 |
8 | 0.7155 | 1.0 |
9 | 0.7047 | 0.2 |
The 2nd best submission used those weights. The best submission used the following weights:
Fold # | Public LB Score | Calculated Weight |
---|---|---|
0 | 0.7110 | 1.0 |
1 | 0.7086 | 1.0 |
2 | 0.7110 | 1.0 |
3 | 0.7139 | 1.0 |
4 | 0.7106 | 1.0 |
5 | 0.7100 | 1.0 |
6 | 0.7119 | 1.0 |
7 | 0.7105 | 1.0 |
8 | 0.7155 | 1.0 |
9 | 0.7047 | 1.0 |
Model stacking was not performed.
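A minimal sketch of the weight derivation with scikit-learn, using the fold scores from the table above (it reproduces the tabulated weights up to rounding):

```python
# Minimal sketch: rescale public LB scores into (0.2, 1.0) per-fold weights.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

lb_scores = np.array([
    0.7110, 0.7086, 0.7110, 0.7139, 0.7106,
    0.7100, 0.7119, 0.7105, 0.7155, 0.7047,
]).reshape(-1, 1)  # folds 0-9

weights = MinMaxScaler(feature_range=(0.2, 1.0)).fit_transform(lb_scores).ravel()
print(weights.round(3))  # e.g. fold=8 -> 1.0 (best), fold=9 -> 0.2 (worst)
```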
Soft labels
Once the ensemble prediction pipeline was in place, soft label prediction was implemented. Using soft labels is rather simple - instead of using hard labels, i.e. 0 for `no kelp` and 1 for `kelp`, we use the probability of a certain pixel belonging to the `kelp` class. We can then use a weighted average and decision threshold adjustment to make the final prediction.
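A minimal sketch of that weighted soft-label averaging; the array shapes are placeholders and the weights are those of Submission #2:

```python
# Minimal sketch: weighted average of per-model probability maps + threshold.
import numpy as np

fold_weights = np.array([0.666, 0.5, 0.666, 0.88, 0.637, 0.59, 0.733, 0.63, 1.0, 0.2])
soft_preds = np.random.rand(10, 350, 350)  # soft predictions, one map per fold model

ensemble_probs = np.average(soft_preds, axis=0, weights=fold_weights)
final_mask = (ensemble_probs > 0.45).astype(np.uint8)  # ensemble decision threshold
```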
As seen in the LB screenshot in the Private LB scores subsection, most of the top performing submissions used soft labels - the 2nd best submission included.
Conclusion
In this highly competitive and challenging contest, the approach of segmenting kelp forests using Landsat satellite imagery and deep learning demonstrated the power of combining advanced machine learning techniques with domain-specific knowledge. The methodology propelled the solution to second place, showcasing a robust and effective approach to the critical environmental task of mapping and monitoring kelp canopy presence.
The utilization of pre-trained weights laid a strong foundation, allowing the models to learn from relevant, pre-existing patterns and significantly accelerating the learning process. The introduction of the Normalized Difference Vegetation Index (NDVI) and other spectral indices, such as ATSAVI, AVI, CI and more, provided the models with enhanced capabilities to distinguish between kelp and non-kelp regions effectively. These indices, especially when coupled with the strategy of reordering channels to align more closely with natural-image inputs, significantly improved model performance.
The decision to employ the AdamW optimizer, accompanied by a carefully chosen weight decay and learning rate scheduler, further optimized the training process, striking a fine balance between fast convergence and avoiding overfitting. The batch size of 32 fully leveraged the computational capacity of the Tesla T4 GPUs, ensuring efficient use of resources.
The implementation of mixed-precision training and inference not only reduced the models' memory footprint but also accelerated computation, making the solution both effective and efficient. By adopting a weighted sampler with tailored per-image importance factors, the models could focus more on significant samples, reducing the bias towards dominant classes.
The UNet architecture, augmented with an EfficientNet-B5 encoder, proved to be the best combination for the task, striking a balance between accuracy and computational efficiency. Adjusting the decision threshold allowed for fine-tuning the model's sensitivity to kelp presence, which was crucial for achieving high evaluation scores.
The comprehensive approach, which included ten-fold cross-validation, training models for 50 epochs, and creating an ensemble of the best-performing models, ensured that the solution was not only robust across different data splits but also generalized well to unseen data.
To conclude, this work underscores the potential of machine learning to contribute significantly to environmental conservation efforts. By employing a mix of techniques, from spectral analysis to advanced deep learning models, a scalable and accurate method for monitoring kelp forests was developed. This solution offers a practical tool for ecosystem management and conservation. The success of the deep learning approach in this challenge reaffirms the importance of machine learning in addressing some of the most pressing environmental issues of our time.