Evaluating models

Models trained locally or in Azure ML can be evaluated afterward to compare different inference settings, such as:

  • Influence of decision_threshold
  • Adding or removing test-time augmentation (TTA)
  • Checking different tta_merge_mode
  • Trying out different precision settings

Single model

To evaluate a single model, run:

make eval

Or run the Python script directly:

python ./kelp/nn/training/ \
    --data_dir data/raw \
    --metadata_dir data/processed \
    --dataset_stats_dir data/processed \
    --run_dir $(RUN_DIR) \
    --output_dir mlruns \
    --precision bf16-mixed \
    --decision_threshold=0.48 \
    --experiment_name model-eval-exp
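
To make the effect of decision_threshold concrete, here is a minimal sketch. It is not the project's evaluation code: the helper names `binarize` and `dice`, the sample values, and the choice of `>=` for the comparison are all assumptions made for illustration. It binarizes a few predicted probabilities at two thresholds and scores each mask against the ground truth with the Dice coefficient.

```python
from typing import List


def binarize(probs: List[float], threshold: float) -> List[int]:
    """Turn per-pixel probabilities into a 0/1 mask (assumes >= comparison)."""
    return [1 if p >= threshold else 0 for p in probs]


def dice(pred: List[int], target: List[int]) -> float:
    """Dice coefficient between two flat binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly, so score them as 1.0.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical probabilities for six pixels and their ground-truth labels.
probs = [0.10, 0.47, 0.49, 0.52, 0.90, 0.30]
target = [0, 0, 1, 1, 1, 0]

for threshold in (0.48, 0.50):
    score = dice(binarize(probs, threshold), target)
    print(f"threshold={threshold:.2f} dice={score:.3f}")
```

In this toy case the pixel with probability 0.49 is kept at a threshold of 0.48 and dropped at 0.50, which is enough to change the score. Real masks behave the same way at a larger scale, which is why the threshold is worth sweeping.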

To apply TTA:

python ./kelp/nn/training/ \
    --data_dir data/raw \
    --metadata_dir data/processed \
    --dataset_stats_dir data/processed \
    --run_dir $(RUN_DIR) \
    --output_dir mlruns \
    --precision bf16-mixed \
    --tta \
    --tta_merge_mode max \
    --decision_threshold=0.48 \
    --experiment_name model-eval-exp
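
The idea behind --tta and --tta_merge_mode can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's implementation: the set of augmentations (identity, horizontal flip, vertical flip), the `toy_model` stand-in, and the function `tta_predict` are hypothetical. Each augmented input is predicted, the prediction is mapped back to the original orientation, and the results are merged per pixel.

```python
from typing import Callable, List

Image = List[List[float]]


def identity(img: Image) -> Image:
    return [row[:] for row in img]


def hflip(img: Image) -> Image:
    return [row[::-1] for row in img]


def vflip(img: Image) -> Image:
    return [row[:] for row in img[::-1]]


def toy_model(img: Image) -> Image:
    """Hypothetical stand-in for a model: it down-weights the first column,
    so its output changes when the input is flipped horizontally."""
    return [[v * (0.5 if c == 0 else 1.0) for c, v in enumerate(row)] for row in img]


def tta_predict(model: Callable[[Image], Image], img: Image, merge_mode: str = "mean") -> Image:
    # Every transform used here is its own inverse, so applying it again
    # restores the original orientation of the prediction.
    transforms = [identity, hflip, vflip]
    preds = [t(model(t(img))) for t in transforms]
    merged: Image = []
    for r in range(len(img)):
        row = []
        for c in range(len(img[0])):
            values = [p[r][c] for p in preds]
            if merge_mode == "mean":
                row.append(sum(values) / len(values))
            elif merge_mode == "max":
                row.append(max(values))
            else:
                raise ValueError(f"unsupported merge_mode: {merge_mode}")
        merged.append(row)
    return merged


image = [[0.2, 0.8], [0.6, 0.4]]
print(tta_predict(toy_model, image, merge_mode="mean"))
print(tta_predict(toy_model, image, merge_mode="max"))
```

With max merging, a pixel is kept if any augmented view is confident about it, while mean merging smooths out disagreement between views. Which one scores better depends on the model and on the decision threshold, so the two are worth comparing together.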

All eval runs are logged to MLflow, where you can inspect and compare different eval configurations and select the best local models.

Multiple models with the same eval config

To evaluate multiple models at once with the same eval config, run:

make eval-many

From folders

If you have a folder with ground-truth masks and a folder with predictions, you can run:

make eval-from-folders

This is equivalent to running:

python kelp/nn/training/ \
    --gt_dir=$(GT_DIR) \
    --preds_dir=$(PREDS_DIR) \
    --tags fold_0_run_dir=$(FOLD_0_RUN_DIR) \
        fold_1_run_dir=$(FOLD_1_RUN_DIR) \
        fold_2_run_dir=$(FOLD_2_RUN_DIR) \
        fold_3_run_dir=$(FOLD_3_RUN_DIR) \
        fold_4_run_dir=$(FOLD_4_RUN_DIR) \
        fold_5_run_dir=$(FOLD_5_RUN_DIR) \
        fold_6_run_dir=$(FOLD_6_RUN_DIR) \
        fold_7_run_dir=$(FOLD_7_RUN_DIR) \
        fold_8_run_dir=$(FOLD_8_RUN_DIR) \
        fold_9_run_dir=$(FOLD_9_RUN_DIR) \
        fold_0_weight=$(FOLD_0_WEIGHT) \
        fold_1_weight=$(FOLD_1_WEIGHT) \
        fold_2_weight=$(FOLD_2_WEIGHT) \
        fold_3_weight=$(FOLD_3_WEIGHT) \
        fold_4_weight=$(FOLD_4_WEIGHT) \
        fold_5_weight=$(FOLD_5_WEIGHT) \
        fold_6_weight=$(FOLD_6_WEIGHT) \
        fold_7_weight=$(FOLD_7_WEIGHT) \
        fold_8_weight=$(FOLD_8_WEIGHT) \
        fold_9_weight=$(FOLD_9_WEIGHT) \
        soft_labels=True \
        split_decision_threshold=None \
        decision_threshold=0.48 \
        tta=False \
        tta_merge_mode=mean

In that case, the evaluation script compares corresponding masks and predictions and calculates performance metrics. The --tags values are converted to key-value pairs and logged to MLflow so you can keep track of model parameters.
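
As a rough illustration of how key=value tags can be turned into a dictionary, consider the sketch below. It assumes the tags arrive as a list of strings from argparse; the `parse_tags` helper is hypothetical, and the project's own parsing may differ (for example in how it handles types or malformed entries).

```python
import argparse
from typing import Dict, List


def parse_tags(tags: List[str]) -> Dict[str, str]:
    """Split each 'key=value' string on the first '=' sign.

    Values are kept as strings; no type conversion is attempted.
    """
    parsed: Dict[str, str] = {}
    for tag in tags:
        key, sep, value = tag.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got: {tag!r}")
        parsed[key] = value
    return parsed


parser = argparse.ArgumentParser()
# nargs="*" collects every whitespace-separated token after --tags.
parser.add_argument("--tags", nargs="*", default=[])
args = parser.parse_args(
    ["--tags", "decision_threshold=0.48", "tta=False", "tta_merge_mode=mean"]
)

tags = parse_tags(args.tags)
print(tags)
```

Keeping the values as strings is a deliberate simplification here: tags such as split_decision_threshold=None or tta=False are then recorded exactly as they were typed on the command line.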

Ensemble

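Evaluating an ensemble is tricky because it requires a separate evaluation dataset. In the example below we use the fold=8 validation images to evaluate the ensemble across all folds. This is not ideal, because models trained on the other folds may have seen these images during training, so the reported metrics can be optimistic.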

make eval-ensemble

Under the hood the following steps are executed:

rm -rf data/predictions/eval_results
make cv-predict AVG_PREDS_VERSION=eval PREDS_INPUT_DIR=data/raw/splits/split_8/images AVG_PREDS_OUTPUT_DIR=data/predictions/eval_results
make average-predictions AVG_PREDS_VERSION=eval PREDS_INPUT_DIR=data/raw/splits/split_8/images AVG_PREDS_OUTPUT_DIR=data/predictions/eval_results
make eval-from-folders GT_DIR=data/raw/splits/split_8/masks PREDS_DIR=data/predictions/eval_results
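
The averaging step can be pictured as a weighted mean of per-fold probabilities followed by thresholding. The sketch below is written under that assumption only; this document does not show what make average-predictions does internally, and the function `average_predictions` and the sample numbers are hypothetical.

```python
from typing import List


def average_predictions(
    fold_preds: List[List[float]],
    weights: List[float],
    decision_threshold: float,
) -> List[int]:
    """Weighted mean of per-fold probabilities, then a hard 0/1 decision."""
    if len(fold_preds) != len(weights):
        raise ValueError("need exactly one weight per fold")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    n_pixels = len(fold_preds[0])
    averaged = [
        sum(w * preds[i] for w, preds in zip(weights, fold_preds)) / total_weight
        for i in range(n_pixels)
    ]
    return [1 if value >= decision_threshold else 0 for value in averaged]


# Three hypothetical folds predicting two pixels; the third fold counts double.
fold_preds = [[0.9, 0.2], [0.4, 0.6], [0.1, 0.7]]
weights = [1.0, 1.0, 2.0]

mask = average_predictions(fold_preds, weights, decision_threshold=0.48)
print(mask)
```

Setting a fold's weight to zero removes it from the ensemble without changing the rest of the pipeline, which makes it easy to test whether a weak fold is hurting the averaged result.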