For some of our tests, we need access to the competition's data, which is stored in our config module.
`load_data` reads the training CSV into a pandas DataFrame. You can choose to load the CSV that already has the cross-validation (CV) folds added (if you've created it). You can also choose to load the training data with pseudo-labeled examples added (if you've created a CSV of pseudo-labels).
path, df = load_data(TEST_DATA_PATH, with_folds=False)
df.head()
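For readers who want a picture of what such a loader might do, here is a minimal sketch. It is not the project's actual `load_data`: the function name `load_data_sketch`, the `pseudo_labels_path` parameter, and the file names `train.csv` and `train_folds.csv` are all assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd


def load_data_sketch(data_path, with_folds=False, pseudo_labels_path=None):
    """Hypothetical loader returning (path, DataFrame), mirroring the call above."""
    path = Path(data_path)
    # Assumed file names; the real project may name these differently.
    csv_name = "train_folds.csv" if with_folds else "train.csv"
    df = pd.read_csv(path / csv_name)
    if pseudo_labels_path is not None:
        # Append pseudo-labeled rows below the original training rows.
        pseudo_df = pd.read_csv(pseudo_labels_path)
        df = pd.concat([df, pseudo_df], ignore_index=True)
    return path, df
```

Returning the path alongside the DataFrame keeps later steps (for example, locating image files) independent of where the data lives on disk.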
Let's test `average_preds` by confirming that if the predictions for everything are 0.0, the average of all the predictions should also be 0.0.
len(all_zeros_prediction_dfs)
all_zeros_prediction_dfs[0]
averaged_preds_df = average_preds(all_zeros_prediction_dfs); averaged_preds_df
assert np.all(averaged_preds_df == 0.) # Average of a bunch of 0's is 0
test_eq(averaged_preds_df.shape, (5, 4)) # 5 examples, 4 classes
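The all-zeros check cannot distinguish a mean from, say, a sum, so a non-trivial case is a useful complement. The sketch below uses a stand-in `average_preds_sketch`, which is an assumption about how an element-wise average could be implemented, not the project's `average_preds`. It also assumes every DataFrame has the same shape and only numeric columns.

```python
import numpy as np
import pandas as pd


def average_preds_sketch(prediction_dfs):
    """Hypothetical element-wise mean over a list of same-shaped DataFrames."""
    # Stack into an array of shape (n_folds, n_examples, n_classes).
    stacked = np.stack([df.to_numpy(dtype=float) for df in prediction_dfs])
    first = prediction_dfs[0]
    # Average over the fold axis and restore the original labels.
    return pd.DataFrame(stacked.mean(axis=0), index=first.index, columns=first.columns)


fold_a = pd.DataFrame({"class_0": [0.2, 0.6], "class_1": [0.8, 0.4]})
fold_b = pd.DataFrame({"class_0": [0.4, 0.2], "class_1": [0.6, 0.8]})
averaged = average_preds_sketch([fold_a, fold_b])

# Each cell should be the mean of the corresponding cells in the two folds.
expected = np.array([[0.3, 0.7], [0.4, 0.6]])
assert np.allclose(averaged.to_numpy(), expected)
```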
Utility function to load and average all test set prediction CSVs matching the naming pattern "predictions_fold_[0-4].csv", which is the default naming scheme when running the training script using 5-fold cross-validation.
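As a rough illustration of the idea, the following sketch globs for files matching that pattern and averages them. The function name `load_and_average_fold_preds` and its parameters are hypothetical, and the sketch assumes every CSV has the same rows in the same order and contains only numeric columns.

```python
from pathlib import Path

import pandas as pd


def load_and_average_fold_preds(preds_dir, pattern="predictions_fold_[0-4].csv"):
    """Hypothetical helper: average all prediction CSVs matching `pattern`."""
    # `[0-4]` is a glob character class, so only folds 0 through 4 match.
    csv_paths = sorted(Path(preds_dir).glob(pattern))
    if not csv_paths:
        raise FileNotFoundError(f"No files matching {pattern!r} in {preds_dir}")
    dfs = [pd.read_csv(p) for p in csv_paths]
    # Element-wise mean across folds.
    return sum(dfs) / len(dfs)
```

Raising when nothing matches avoids silently dividing by zero if the training script was run with a different naming scheme.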