This file holds functions to load the data and create folds for cross-validation.

For some of our tests, we need access to the competition's data, the location of which is stored in our config module.

Load Data

This reads the training CSV into a pandas DataFrame. Set with_folds=True to load the CSV that already includes the cross-validation (CV) folds (if you've created it). Pass pseudo_labels_path to also add pseudo-labeled examples to the training data (if you've created a CSV of pseudo-labels).

load_data

load_data(data_path:Path, with_folds:bool=False, pseudo_labels_path:str=None)

Load data (with/without cross-validation folds) into DataFrame.

path, df = load_data(TEST_DATA_PATH, with_folds=False)
df.head()
image_id healthy multiple_diseases rust scab
0 Train_0 0 0 0 1
1 Train_1 0 1 0 0
2 Train_2 1 0 0 0
3 Train_3 0 0 1 0
4 Train_4 1 0 0 0
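
To make the options above concrete, here is a minimal sketch of how a loader with this signature could behave. It is not the project's actual implementation: the function name load_data_sketch and the file names train.csv and train_folds.csv are assumptions chosen for illustration, and the (path, DataFrame) return value is inferred only from the usage shown above.

```python
from pathlib import Path
from typing import Optional, Tuple

import pandas as pd


def load_data_sketch(
    data_path: Path,
    with_folds: bool = False,
    pseudo_labels_path: Optional[str] = None,
) -> Tuple[Path, pd.DataFrame]:
    """Hypothetical stand-in for load_data; the CSV file names are assumptions."""
    data_path = Path(data_path)
    # Assumed naming: "train_folds.csv" once folds were created, else "train.csv".
    csv_name = "train_folds.csv" if with_folds else "train.csv"
    df = pd.read_csv(data_path / csv_name)
    if pseudo_labels_path is not None:
        # Append the pseudo-labeled rows below the original training rows.
        pseudo_df = pd.read_csv(pseudo_labels_path)
        df = pd.concat([df, pseudo_df], ignore_index=True)
    return data_path, df
```

In this sketch the pseudo-labeled rows are simply concatenated, so the pseudo-label CSV is assumed to use the same columns as the training CSV.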

Average Predictions

average_preds

average_preds(dfs:List[DataFrame])

Average predictions on test examples across prediction DataFrames in dfs.

Let's test this by confirming that if the predictions for everything are 0.0, the average of all the predictions should also be 0.0.

len(all_zeros_prediction_dfs)
5
all_zeros_prediction_dfs[0]
image_id healthy multiple_diseases rust scab
0 Test_0 0.0 0.0 0.0 0.0
1 Test_1 0.0 0.0 0.0 0.0
2 Test_2 0.0 0.0 0.0 0.0
3 Test_3 0.0 0.0 0.0 0.0
4 Test_4 0.0 0.0 0.0 0.0
averaged_preds_df = average_preds(all_zeros_prediction_dfs); averaged_preds_df
healthy multiple_diseases rust scab
image_id
Test_0 0.0 0.0 0.0 0.0
Test_1 0.0 0.0 0.0 0.0
Test_2 0.0 0.0 0.0 0.0
Test_3 0.0 0.0 0.0 0.0
Test_4 0.0 0.0 0.0 0.0
assert np.all(averaged_preds_df == 0.)    # Average of a bunch of 0's is 0
test_eq(averaged_preds_df.shape, (5, 4))  # 5 examples, 4 classes
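
For readers who want to see the averaging step itself, the following is one possible way to write it with pandas. It is a sketch under the assumption that every prediction DataFrame has an image_id column plus one probability column per class, as in the output above; average_preds_sketch, fold_a, and fold_b are hypothetical names, not part of the library.

```python
from typing import List

import pandas as pd


def average_preds_sketch(dfs: List[pd.DataFrame]) -> pd.DataFrame:
    """Average class probabilities per image_id across prediction DataFrames."""
    stacked = pd.concat(dfs, ignore_index=True)
    # Grouping by image_id makes it the index of the result;
    # sort=False keeps the ids in order of first appearance.
    return stacked.groupby("image_id", sort=False).mean()


# Two made-up fold predictions for two test images.
fold_a = pd.DataFrame(
    {"image_id": ["Test_0", "Test_1"], "healthy": [0.2, 0.8], "scab": [0.8, 0.2]}
)
fold_b = pd.DataFrame(
    {"image_id": ["Test_0", "Test_1"], "healthy": [0.4, 0.6], "scab": [0.6, 0.4]}
)
averaged = average_preds_sketch([fold_a, fold_b])
```

Stacking the frames and grouping by image_id avoids relying on all folds listing the test images in the same row order.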

Save Averaged Preds

Utility function to load and average all test set prediction CSVs matching the naming pattern "predictions_fold_[0-4].csv", which is the default naming scheme when running the training script with 5-fold cross-validation.

get_averaged_preds

get_averaged_preds(path:Path, verbose:bool=False)

Returns a DataFrame of the averaged predictions from the prediction CSVs in the path directory.
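
As a rough illustration of the load-and-average behavior described here, the sketch below globs for the per-fold CSVs and averages them per image. It is an assumption about how such a utility might work, not the actual source: get_averaged_preds_sketch is a hypothetical name, and raising FileNotFoundError when no files match is a design choice made only for this example.

```python
from pathlib import Path

import pandas as pd


def get_averaged_preds_sketch(path: Path, verbose: bool = False) -> pd.DataFrame:
    """Load every predictions_fold_[0-4].csv in `path` and average per image_id."""
    # Path.glob accepts shell-style character classes such as [0-4].
    csv_paths = sorted(Path(path).glob("predictions_fold_[0-4].csv"))
    if not csv_paths:
        raise FileNotFoundError(f"No prediction CSVs found in {path}")
    if verbose:
        for csv_path in csv_paths:
            print(f"Averaging {csv_path.name}")
    stacked = pd.concat([pd.read_csv(p) for p in csv_paths], ignore_index=True)
    # image_id becomes the index; sort=False keeps first-appearance order.
    return stacked.groupby("image_id", sort=False).mean()
```

Because the glob pattern only matches fold numbers 0 through 4, other CSVs in the same directory are left out of the average.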