Preparing data

There are a few scripts that run all necessary data preparations for training and inference.

Sample plotting

This script plots composite images (true color, color infrared and shortwave infrared) for quick visual inspection. For each image it also generates a combined plot that shows these composites together with NDVI, DEM, QA Mask and Kelp Mask. An example can be seen below:

AF233037_plot

make sample-plotting

Or:

python ./kelp/data_prep/sample_plotting.py \
    --data_dir data/raw \
    --metadata_fp data/raw/metadata_fTq0l2T.csv \
    --output_dir data/processed

For the same tile, the following individual plots are generated:

  • DEM:

AF233037_plot

  • True Color:

AF233037_plot

  • Color Infrared:

AF233037_plot

  • Shortwave Infrared:

AF233037_plot
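
The composites and NDVI are simple recombinations of the input bands. A minimal sketch, assuming a hypothetical band order of SWIR, NIR, red, green, blue and commonly used band triplets for each composite (the order and triplets used by the script may differ):

```python
from typing import List, Sequence

# Hypothetical band order; check the dataset description for the real one.
BANDS = {"swir": 0, "nir": 1, "red": 2, "green": 3, "blue": 4}

# Band triplets mapped to the (R, G, B) display channels of each composite.
COMPOSITES = {
    "true_color": ("red", "green", "blue"),
    "color_infrared": ("nir", "red", "green"),
    "shortwave_infrared": ("swir", "nir", "red"),
}


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    denom = nir + red
    # Avoid division by zero for pixels where both bands are zero.
    return 0.0 if denom == 0 else (nir - red) / denom


def composite(image: Sequence[Sequence[float]], name: str) -> List[Sequence[float]]:
    """Pick the three bands that make up the named composite."""
    return [image[BANDS[band]] for band in COMPOSITES[name]]
```

Here `image` is a sequence of bands, each band holding its pixel values; in practice the same indexing would be applied to an array or tensor.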

AOI Grouping

This script groups similar images into AOIs (Areas Of Interest), which can then be used to perform a stratified K-fold cross-validation split.

make aoi-grouping

Or:

python ./kelp/data_prep/aoi_grouping.py \
    --dem_dir data/processed/dem \
    --output_dir data/processed/grouped_aoi_results/sim_th=0.97 \
    --metadata_fp data/raw/metadata_fTq0l2T.csv \
    --batch_size 128 \
    --similarity_threshold 0.97

The script will save results in the specified output_dir.

.
├── final_image_groups_similarity_threshold=0.97.json           <- final list of similar images
├── intermediate_image_groups_similarity_threshold=0.97.json    <- intermediate result of similarity calculation
├── merged_image_groups_similarity_threshold=0.97.json          <- deduplicated list of similar images
└── metadata_similarity_threshold=0.97.parquet                  <- final metadata parquet file created from the final list of similar images

The metadata parquet file is just the metadata CSV file, but with an additional column aoi_id denoting the AOI ID.
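
To illustrate how a similarity threshold turns pairwise comparisons into AOI groups, here is a minimal sketch. It assumes each tile is represented by a feature vector (for example a flattened DEM), compares tiles with cosine similarity, and merges them transitively with a union-find structure. The function names are hypothetical, and the script's actual similarity measure and merging logic may differ.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0 else dot / norm


def group_aois(features: Dict[str, Sequence[float]], threshold: float = 0.97) -> Dict[str, int]:
    """Map each tile ID to an aoi_id; tiles linked by similarity >= threshold share an ID."""
    tile_ids = list(features)
    parent = {t: t for t in tile_ids}

    def find(t: str) -> str:
        # Follow parent links to the group root, shortening the path on the way.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    # Compare every pair once and merge the groups of similar tiles.
    for i, a in enumerate(tile_ids):
        for b in tile_ids[i + 1:]:
            if cosine_similarity(features[a], features[b]) >= threshold:
                parent[find(a)] = find(b)

    # Number the group roots consecutively in order of first appearance.
    roots: Dict[str, int] = {}
    return {t: roots.setdefault(find(t), len(roots)) for t in tile_ids}
```

The returned mapping plays the role of the aoi_id column: every tile in the same group receives the same integer.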

EDA

The Exploratory Data Analysis (EDA) script calculates basic statistics for each image and then plots their distributions.

Run it with:

make eda

Or:

python ./kelp/data_prep/eda.py \
    --data_dir data/raw \
    --metadata_fp data/processed/grouped_aoi_results/sim_th=0.97/metadata_similarity_threshold=0.97.parquet \
    --output_dir data/processed/stats_97

For each image, the script calculates the following statistics:

  • has_kelp - a flag indicating if the image has kelp in it
  • non_kelp_pixels - number of non-kelp pixels
  • kelp_pixels - number of kelp pixels
  • kelp_pixels_pct - percentage of all pixels marked as kelp
  • high_kelp_pixels_pct - a flag indicating that kelp pixels make up over 40% of the whole image
  • dem_nan_pixels - number of NaN pixels in the DEM layer
  • dem_has_nans - a flag indicating if DEM layer has NaN values
  • dem_nan_pixels_pct - percentage of all DEM pixels marked as NaN
  • dem_zero_pixels - number of zero valued pixels in the DEM layer
  • dem_zero_pixels_pct - percentage of all DEM pixels with value=zero
  • water_pixels - estimated number of water pixels (pixels with value <= zero)
  • water_pixels_pct - percentage of water pixels in the DEM layer
  • almost_all_water - a flag indicating that over 98% of the DEM layer pixels are water
  • qa_corrupted_pixels - number of corrupted pixels in the QA band
  • qa_ok - a flag indicating that no pixels are corrupted in the QA band
  • qa_corrupted_pixels_pct - percentage of corrupted pixels in the QA band
  • high_corrupted_pixels_pct - a flag indicating that over 40% of the QA band's pixels are corrupted
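
The flags and percentages in this list follow directly from pixel counts. A minimal sketch covering a subset of them, assuming the kelp mask, DEM and QA band are flat sequences of equal length, that a value of 1 marks a kelp pixel or a corrupted QA pixel (this encoding is an assumption), and that percentages are stored as fractions between 0 and 1:

```python
import math
from typing import Dict, Sequence, Union


def image_stats(
    kelp_mask: Sequence[int], dem: Sequence[float], qa: Sequence[int]
) -> Dict[str, Union[int, float, bool]]:
    total = len(kelp_mask)
    kelp_pixels = sum(1 for v in kelp_mask if v == 1)
    dem_nan_pixels = sum(1 for v in dem if math.isnan(v))
    # NaN <= 0 evaluates to False, so NaN pixels are not counted as water.
    water_pixels = sum(1 for v in dem if v <= 0)
    qa_corrupted_pixels = sum(1 for v in qa if v == 1)

    kelp_pct = kelp_pixels / total
    water_pct = water_pixels / total
    qa_pct = qa_corrupted_pixels / total
    return {
        "has_kelp": kelp_pixels > 0,
        "non_kelp_pixels": total - kelp_pixels,
        "kelp_pixels": kelp_pixels,
        "kelp_pixels_pct": kelp_pct,
        "high_kelp_pixels_pct": kelp_pct > 0.4,
        "dem_nan_pixels": dem_nan_pixels,
        "dem_has_nans": dem_nan_pixels > 0,
        "water_pixels": water_pixels,
        "water_pixels_pct": water_pct,
        "almost_all_water": water_pct > 0.98,
        "qa_corrupted_pixels": qa_corrupted_pixels,
        "qa_ok": qa_corrupted_pixels == 0,
        "qa_corrupted_pixels_pct": qa_pct,
        "high_corrupted_pixels_pct": qa_pct > 0.4,
    }
```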

Calculated statistics will be saved in the specified output directory in parquet format.

Apart from figures, the script will also display and save descriptive statistics such as min, max, median etc. for each of the numerical statistics.

statistic   aoi_id     non_kelp_pixels   kelp_pixels   ...   qa_corrupted_pixels_pct
std         662.475    2695.318          2695.318      ...   0.138
min         0          8937              0             ...   0
mean        1479.905   121670.837        829.163       ...   0.07
max         2947       122500            113563        ...   0.999
count       5635       5635              5635          ...   7061
75%         1895.5     122500            880           ...   0.074
50%         1407       122388            112           ...   0.013
25%         1196       121620            0             ...   0.001

Example plots that are generated in this step:

  • AOI Images distribution

aoi_images_distribution

  • AOI Images distribution (filtered - without groups with single image)

aoi_images_distribution_filtered

  • Correlation matrix

corr_matrix

  • DEM has NaNs

dem_has_nans

  • DEM NaN pixels distribution

dem_nan_pixels_distribution

  • Has Kelp

has_kelp

  • High Kelp Pixels distribution

high_kelp_pixels_pct

  • Kelp Pixels Distribution

kelp_pixels_distribution

  • QA corrupted pixels percentage

qa_corrupted_pixels_pct

  • QA OK

qa_ok

  • Images per Splits

splits

Calculate band statistics

This script calculates per-band statistics used to normalize input images during training and inference. Spectral indices are automatically appended to the input tensor, so statistics are computed for all possible channels.

Note: The script will automatically try to use GPU to speed up calculation. Expect ~45x slowdowns if running on CPU!
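
Per-band statistics over masked data can be accumulated in a streaming fashion. A minimal sketch, assuming masked pixels (for example those excluded by the QA or water mask) have already been replaced with NaN and should be skipped. It covers only mean and standard deviation and works on plain Python lists rather than GPU tensors, so it illustrates the idea rather than the script's implementation:

```python
import math
from typing import Dict, Iterable, List, Sequence


def band_stats(images: Iterable[Sequence[Sequence[float]]]) -> List[Dict[str, float]]:
    """Each image is a sequence of bands; each band is a flat sequence of pixel values."""
    counts: List[int] = []
    sums: List[float] = []
    sq_sums: List[float] = []
    for image in images:
        if not counts:
            # Size the accumulators from the first image's band count.
            counts = [0] * len(image)
            sums = [0.0] * len(image)
            sq_sums = [0.0] * len(image)
        for b, band in enumerate(image):
            for v in band:
                if math.isnan(v):
                    continue  # masked pixel, ignore
                counts[b] += 1
                sums[b] += v
                sq_sums[b] += v * v

    stats: List[Dict[str, float]] = []
    for n, s, sq in zip(counts, sums, sq_sums):
        if n == 0:
            # A band with no valid pixels yields NaN statistics.
            stats.append({"mean": math.nan, "std": math.nan})
            continue
        mean = s / n
        variance = max(sq / n - mean * mean, 0.0)  # clamp tiny negative rounding errors
        stats.append({"mean": mean, "std": math.sqrt(variance)})
    return stats
```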

Run it with:

make calculate-band-stats

Or:

python ./kelp/data_prep/calculate_band_stats.py \
    --data_dir data/raw \
    --mask_using_qa \
    --mask_using_water_mask \
    --fill_missing_pixels_with_torch_nan \
    --output_dir data/processed

Note: The script is not perfect - for certain configurations, the resulting statistics can have NaN or Inf in them. Please adjust them by manually setting those items to reasonable values.
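
One way to make such a manual adjustment less error-prone is to detect non-finite entries and replace them with explicit fallbacks. A minimal sketch, assuming the statistics have been loaded into a nested dictionary of band name to statistic name to value (the layout of the actual output file may differ) and using hypothetical fallback values:

```python
import math
from typing import Dict


def sanitize_stats(
    stats: Dict[str, Dict[str, float]], fallbacks: Dict[str, float]
) -> Dict[str, Dict[str, float]]:
    """Replace NaN and +/-Inf entries with the fallback for that statistic."""
    cleaned: Dict[str, Dict[str, float]] = {}
    for band, values in stats.items():
        cleaned[band] = {
            name: value if math.isfinite(value) else fallbacks[name]
            for name, value in values.items()
        }
    return cleaned


# Hypothetical example: an index band whose statistics came out non-finite.
raw = {"NDVI": {"mean": float("nan"), "std": float("inf"), "min": -1.0}}
fixed = sanitize_stats(raw, fallbacks={"mean": 0.0, "std": 1.0, "min": 0.0})
```

Finite values are left untouched, so only the problematic entries change.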

Stratified 10-Fold CV split

This script performs a stratified 10-fold cross-validation split using the metadata files generated earlier.

Run it with:

make train-val-test-split-cv

Or:

python ./kelp/data_prep/train_val_test_split.py \
    --dataset_metadata_fp data/processed/stats/dataset_stats.parquet \
    --split_strategy cross_val \
    --seed 42 \
    --splits 10 \
    --output_dir data/processed

Note: By default, the has_kelp, almost_all_water, qa_ok and high_corrupted_pixels_pct columns are used to build the stratification column.
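
Building a stratification column from several flags and assigning folds can be sketched as follows. This is a minimal illustration, assuming each metadata row is a dictionary holding the four flags. It joins the flags into one label and, after a seeded shuffle, deals the rows of each label round-robin into folds. The script itself may rely on a library splitter instead.

```python
import random
from collections import defaultdict
from typing import Dict, List

STRATIFICATION_COLUMNS = ["has_kelp", "almost_all_water", "qa_ok", "high_corrupted_pixels_pct"]


def stratification_label(row: Dict[str, bool]) -> str:
    """Combine the flag columns into a single label, e.g. 'has_kelp=True-...'."""
    return "-".join(f"{col}={row[col]}" for col in STRATIFICATION_COLUMNS)


def assign_folds(rows: List[Dict[str, bool]], splits: int = 10, seed: int = 42) -> List[int]:
    """Return one fold index per row, spreading each label evenly across folds."""
    by_label: Dict[str, List[int]] = defaultdict(list)
    for idx, row in enumerate(rows):
        by_label[stratification_label(row)].append(idx)

    rng = random.Random(seed)
    folds = [0] * len(rows)
    next_fold = 0  # running counter keeps overall fold sizes balanced too
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        for idx in indices:
            folds[idx] = next_fold % splits
            next_fold += 1
    return folds
```

With this scheme, the number of rows per label differs by at most one between any two folds, and the same seed always reproduces the same assignment.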