train_val_test_split
Train, validation and test dataset split logic.
kelp.data_prep.train_val_test_split.TrainTestSplitConfig
Bases: ConfigBase
A config for generating train and test splits.
Source code in kelp/data_prep/train_val_test_split.py
kelp.data_prep.train_val_test_split.filter_data
Filters dataset by removing images with high kelp pixel percentage.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The dataset metadata dataframe. | required |
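The filtering step can be illustrated with a minimal sketch. The column name `kelp_pixel_pct`, the 0.5 cutoff, and the function name `filter_high_kelp` are assumptions made for illustration; this page does not state which column or threshold `filter_data` actually uses.

```python
import pandas as pd

# Hypothetical names: the source does not give the column or the cutoff.
KELP_PCT_COLUMN = "kelp_pixel_pct"
MAX_KELP_PCT = 0.5


def filter_high_kelp(df, column=KELP_PCT_COLUMN, max_pct=MAX_KELP_PCT):
    """Return a new dataframe without rows whose kelp coverage exceeds max_pct."""
    keep = df[column] <= max_pct
    return df.loc[keep].reset_index(drop=True)


metadata = pd.DataFrame(
    {
        "tile_id": ["a", "b", "c", "d"],
        "kelp_pixel_pct": [0.0, 0.12, 0.93, 0.5],
    }
)
filtered = filter_high_kelp(metadata)
print(filtered["tile_id"].tolist())
```

The input dataframe is left unchanged, so the unfiltered metadata stays available for inspection.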
kelp.data_prep.train_val_test_split.k_fold_split
Runs Stratified K-Fold Cross Validation split on dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The dataset metadata dataframe. | required |
splits | int | The number of splits to perform. | 5 |
seed | int | The seed for reproducibility. | SEED |
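One way to realize a stratified K-fold split over a metadata dataframe is sketched below with scikit-learn's `StratifiedKFold`. The `stratification` column, the `split_<i>` output columns, the placeholder value of `SEED`, and the function name are assumptions; the page does not show how `k_fold_split` records fold membership.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

SEED = 42  # placeholder; the actual value of SEED is not given on this page


def k_fold_split_sketch(df, splits=5, seed=SEED, stratify_on="stratification"):
    """Add one split_<i> column per fold, marking each row 'train' or 'val'."""
    out = df.copy()
    skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=seed)
    for i, (_, val_idx) in enumerate(skf.split(out, out[stratify_on])):
        labels = np.full(len(out), "train", dtype=object)
        labels[val_idx] = "val"  # val_idx holds positional row indices
        out[f"split_{i}"] = labels
    return out


metadata = pd.DataFrame(
    {
        "tile_id": [f"tile_{n:02d}" for n in range(20)],
        "stratification": ["kelp"] * 10 + ["no_kelp"] * 10,
    }
)
folds = k_fold_split_sketch(metadata, splits=5)
print(folds.filter(like="split_").head())
```

Every row lands in the validation part of exactly one fold, and each validation part preserves the class proportions of the stratification column.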
kelp.data_prep.train_val_test_split.load_data
Loads dataset metadata parquet file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fp | Path | The path to the metadata parquet file. | required |
kelp.data_prep.train_val_test_split.main
Main entry point for running train/val/test dataset split.
kelp.data_prep.train_val_test_split.make_stratification_column
Creates a stratification column from dataset metadata and specified metadata columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The dataset metadata dataframe. | required |
stratification_columns | List[str] | The metadata columns to use for the stratification. | required |
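A stratification column is commonly built by joining the chosen metadata columns into a single label per row. The sketch below assumes that approach; the `stratification` output column, the `-` separator, the example columns, and the function name are illustrative and not confirmed by this page.

```python
import pandas as pd


def make_stratification_column_sketch(df, stratification_columns):
    """Join the given columns into one string label per row."""
    out = df.copy()
    out["stratification"] = (
        out[stratification_columns].astype(str).apply("-".join, axis=1)
    )
    return out


metadata = pd.DataFrame(
    {
        "region": ["north", "north", "south"],  # hypothetical metadata column
        "year": [2020, 2021, 2020],             # hypothetical metadata column
    }
)
strat = make_stratification_column_sketch(metadata, ["region", "year"])
print(strat["stratification"].tolist())
```

Rows that share the same combination of values receive the same label, which a stratified splitter can then use as its class vector.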
kelp.data_prep.train_val_test_split.parse_args
Parse command line arguments.
Returns: An instance of TrainTestSplitConfig.
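The pattern of parsing command line arguments into a config object can be sketched as follows. Everything here is hypothetical: the flag names simply mirror the `split_dataset` parameters, `dataset_metadata_fp` and `output_dir` are invented for illustration, and a plain dataclass stands in for `TrainTestSplitConfig`, whose real fields and `ConfigBase` parent are not shown on this page.

```python
import argparse
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SplitConfigSketch:
    """Stand-in for TrainTestSplitConfig; field names are assumptions."""

    dataset_metadata_fp: Path
    split_strategy: str
    random_split_train_size: float
    splits: int
    seed: int
    output_dir: Path


def parse_args_sketch(argv=None):
    """Parse hypothetical CLI flags into a SplitConfigSketch instance."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset_metadata_fp", type=Path, required=True)
    parser.add_argument(
        "--split_strategy", choices=["cross_val", "random"], default="cross_val"
    )
    parser.add_argument("--random_split_train_size", type=float, default=0.95)
    parser.add_argument("--splits", type=int, default=5)
    parser.add_argument("--seed", type=int, default=42)  # placeholder for SEED
    parser.add_argument("--output_dir", type=Path, default=Path("."))
    args = parser.parse_args(argv)
    # Flag names match the dataclass fields, so the namespace maps directly.
    return SplitConfigSketch(**vars(args))


cfg = parse_args_sketch(["--dataset_metadata_fp", "metadata.parquet", "--splits", "3"])
print(cfg)
```

Passing `argv` explicitly keeps the parser testable; calling it with no arguments falls back to `sys.argv`.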
kelp.data_prep.train_val_test_split.run_cross_val_split
Runs Stratified K-Fold Cross Validation split on training samples. The test samples will be marked as test split.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train_samples | DataFrame | The dataframe with training samples. | required |
test_samples | DataFrame | The dataframe with test samples. | required |
splits | int | The number of splits to perform. | 5 |
seed | int | The seed for reproducibility. | SEED |
kelp.data_prep.train_val_test_split.run_random_split
Runs random split on train_samples. The test samples will be marked as test split.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train_samples | DataFrame | The dataframe with training samples. | required |
test_samples | DataFrame | The dataframe with test samples. | required |
random_split_train_size | float | The size of training split as a fraction of the whole dataset. | 0.95 |
seed | int | The seed for reproducibility. | SEED |
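A seeded random train/validation split with test rows passed through can be sketched as below. This is an illustration under assumptions: the `split` column name, the `train`/`val`/`test` labels, the placeholder seed, and the choice to apply the fraction to `train_samples` only are not confirmed by this page.

```python
import pandas as pd


def run_random_split_sketch(
    train_samples, test_samples, random_split_train_size=0.95, seed=42
):
    """Label rows as train, val or test; assumes train_samples has a unique index."""
    train = train_samples.sample(frac=random_split_train_size, random_state=seed)
    val = train_samples.drop(train.index)  # whatever was not sampled becomes val
    return pd.concat(
        [
            train.assign(split="train"),
            val.assign(split="val"),
            test_samples.assign(split="test"),
        ],
        ignore_index=True,
    )


train_df = pd.DataFrame({"tile_id": [f"train_{n}" for n in range(20)]})
test_df = pd.DataFrame({"tile_id": [f"test_{n}" for n in range(3)]})
result = run_random_split_sketch(train_df, test_df)
print(result["split"].value_counts())
```

Reusing the same seed reproduces the same assignment, which is the point of exposing `seed` as a parameter.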
kelp.data_prep.train_val_test_split.save_data
Saves the specified dataframe under specified output path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The dataframe to save. | required |
output_path | Path | The path to save the dataframe under. | required |
kelp.data_prep.train_val_test_split.split_dataset
Performs dataset split into training, validation and test sets using specified split strategy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The metadata dataframe containing the training and test records. | required |
split_strategy | Literal['cross_val', 'random'] | The strategy to use. | 'cross_val' |
random_split_train_size | float | The size of training split as a fraction of the whole dataset. | 0.95 |
splits | int | The number of CV splits. | 5 |
seed | int | The seed for reproducibility. | SEED |