Evaluating the Amazon

We’ve been exploring a problem in image classification applicable to land use management. This is the fourth and final post in a series that also illustrates the application of the CRISP-DM framework. For the more technical portions of this post, I again give huge kudos to fast.ai for their technology and courses. Here is the Kaggle notebook1 that goes with this post. Note that fast.ai is a fast-moving library and some of its functions have changed since my last post, so the code from that post no longer fully works!

As I mentioned at the end of the last post, there are many opportunities to improve the model. Some of these are explained in fast.ai lessons 1 and 3, and many more are mentioned in the Kaggle forums. I won’t cover them all here, but will briefly mention a few possibilities.

First, data augmentation via image transformations is often very helpful in deep learning, because it helps the network generalize to new images. A related technique is to manipulate the image resolution: a recent approach, sometimes called progressive resizing, is to start out, as we did, with low-resolution images, then refine the trained network with higher-resolution images in later training cycles.
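To make this concrete, here is a minimal sketch using the fast.ai v1 data block API. It assumes the labeled image list (`src`) and the trained learner (`learn`) from the earlier posts are in scope, and the transform and training parameters are illustrative rather than the exact recipe from my notebook:

> # A sketch only: assumes fast.ai v1, plus `src` (the labeled image list) and
> # `learn` (the trained learner) from the earlier posts; values are illustrative.
> tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)
> data_256 = (src.transform(tfms, size=256)      # same images, higher resolution
>             .databunch().normalize(imagenet_stats))
> learn.data = data_256                          # swap the larger images into the learner
> learn.freeze()
> learn.fit_one_cycle(5, slice(1e-2))            # continue training at the new resolution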

A second category of model improvements would be to tune the hyper-parameters. These are the parts of the model or training regimen that don’t get directly set by training on the data. Deep learning has more of these than most other machine learning model types. Examples are the learning rate, momentum, and accuracy threshold for the output layer. Even the network architecture could be thought of as a hyper-parameter.
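To make a couple of these concrete, here is a hedged sketch in fast.ai v1: the output-layer threshold appears as the `thresh` argument to the metrics, and the learning rate can be explored with the learning-rate finder. The 0.2 threshold and the resnet50 architecture are illustrative choices, not necessarily what my notebook uses:

> # A sketch only; the threshold and architecture are illustrative choices.
> from functools import partial
> acc_02 = partial(accuracy_thresh, thresh=0.2)   # accuracy with a 0.2 output threshold
> f_02 = partial(fbeta, thresh=0.2)               # fbeta defaults to beta=2, i.e. F2
> learn = cnn_learner(data, models.resnet50, metrics=[acc_02, f_02])
> learn.lr_find()            # short mock training run across a range of learning rates
> learn.recorder.plot()      # plot loss vs. learning rate to pick a sensible value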

Third, there are different training regimens for the network itself. One-cycle training is an example: it varies the learning rate and momentum on a schedule over the course of a training cycle, which reduces how much hand-tuning those hyper-parameters need. This paper explains it quite thoroughly, but you don’t have to understand it completely to use fast.ai, which implements it all under the covers in a method called fit_one_cycle.
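In practice that is a single call; here is a sketch (the cycle length, learning-rate range, and momentum bounds below are examples, not a recommendation):

> # A sketch; cycle length, learning-rate range, and momentum bounds are examples.
> learn.unfreeze()                        # train all layers, not just the new head
> learn.fit_one_cycle(5, max_lr=slice(1e-5, 1e-3), moms=(0.95, 0.85))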

In a longer study of the problem we would actually do this and revisit the modeling step, but I’ll let you explore all of this on your own.

Looking at Errors

More to the point of evaluation, a good step is to look at what the model is most confused about. Fortunately for us, fast.ai provides some tools to help with this task, the main one being ClassificationInterpretation:

> interp = ClassificationInterpretation.from_learner(learn)  # build an interpretation object from the trained learner
> interp.plot_multi_top_losses(9)                             # show the 9 validation images with the highest losses

Note that this is applied to the validation set. Let’s look at some of the results, where each image includes the predicted and actual labels, the loss, and the probability of that prediction. Intuitively, loss is a metric capturing how far off the network’s prediction is from the correct answer. These first three are interesting: it looks like similar images aren’t always labeled consistently as haze, cloudy, or partly cloudy.
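For this multi-label problem the loss under the hood is essentially binary cross-entropy; here is a toy sketch in plain PyTorch (made-up numbers, not from our data) showing why a confidently wrong prediction accumulates a much larger loss than a confidently right one:

> # Toy example: three possible labels, and only the first is actually present.
> import torch
> import torch.nn.functional as F
> target = torch.tensor([[1., 0., 0.]])
> confident_right = torch.tensor([[4., -4., -4.]])   # logits favoring the correct label
> confident_wrong = torch.tensor([[-4., 4., -4.]])   # logits favoring a wrong label
> print(F.binary_cross_entropy_with_logits(confident_right, target))  # ~0.02: low loss
> print(F.binary_cross_entropy_with_logits(confident_wrong, target))  # ~2.7: high loss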

This next one is also confusing to the network – I can see how it’d be tough to tell this is agriculture:

Finally, I don’t see how anyone could categorize these next two with any meaningful label:

When examining images like this, there’s an opportunity to throw out bad data and re-run the evaluation.

A second useful tool is a confusion matrix, which tabulates how often each actual label is predicted as each possible label. Confusion matrices are harder to interpret in our multi-label case, and fast.ai doesn’t implement them for it yet anyway. Similarly, the function that lists the most-confused label pairs isn’t available yet for multi-label data (interp.most_confused is what I would use in a regular classification scenario).
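For reference, here is roughly what those two calls look like in a plain single-label scenario (a sketch only; as noted above, they don’t apply to our multi-label data yet):

> # Single-label scenario only; neither call works for our multi-label setup.
> interp.plot_confusion_matrix(figsize=(8, 8))   # grid of actual vs. predicted label counts
> interp.most_confused(min_val=2)                # (actual, predicted) pairs confused at least twice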

Test Set Performance

The last task in evaluating our model is to examine performance on the test set. Recall that we held out some of our labeled data. Kaggle provides a test set that we could use instead, but since the competition is over, I can’t submit my labels to get a score. So let’s look at the held-out data instead. Note that I waited until near the very end to do this; I didn’t want to bias my model exploration by looking at the test data too soon.

The approach needed to get the test set labels is a bit convoluted due to the way that fast.ai reads in its data. I’ll let you examine all the gritty details in the Kaggle notebook accompanying this post. Suffice it to say that I compare the predictions of the network on the test images to their actual labels. We find that the test set F2 score is 0.607, compared to 0.866 on the validation set.
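If you want the flavor of that final comparison, here is a minimal sketch. It assumes `test_preds` and `test_targets` are tensors holding the model’s predicted probabilities and the one-hot actual labels for the held-out images (the notebook shows how to actually build them):

> # Assumes `test_preds` (probabilities) and `test_targets` (one-hot labels) exist,
> # each of shape (n_images, n_labels); see the notebook for how they're produced.
> from fastai.metrics import fbeta
> score = fbeta(test_preds, test_targets, thresh=0.2, beta=2, sigmoid=False)
> print(f'Test set F2: {score.item():.3f}')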

Revisiting CRISP-DM


Image Attribution: Kenneth Jensen [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)]

Phew! We’ve done a lot in this series. My goal all along has been to illustrate the application of the CRISP-DM framework to a problem relevant to Drawdown. Here’s a nice visual illustration of the steps. Hopefully this post, with our discussion of hyper-parameters and examining errors on the validation set, has illustrated why the figure includes arrows pointing back to earlier stages. Many of the discussion points here could be thought of as data preparation, others as further steps in modeling.

We’ve now covered all the CRISP-DM steps except the last, deployment. This is often the most challenging step. It’s not clear how, if at all, Planet used the results of the competition. If you look at their blog, however, it’s obvious that agriculture is a key customer sector for them. Deployment in the Kaggle world is simple in comparison: all that’s needed is to produce labels for the test set and upload them, after which you receive a score.
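For the curious, producing that upload might look roughly like this (a sketch: `test_preds` and `test_image_names` are assumed to exist, 0.2 is an illustrative threshold, and the image_name/tags columns follow the competition’s sample submission):

> # Sketch: threshold the predicted probabilities and write them out as a submission CSV.
> import pandas as pd
> thresh = 0.2
> tags = [' '.join(learn.data.classes[i] for i, p in enumerate(pred) if p > thresh)
>         for pred in test_preds]
> submission = pd.DataFrame({'image_name': test_image_names, 'tags': tags})
> submission.to_csv('submission.csv', index=False)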

Thanks for following along in this series! I’d love to hear who else uses CRISP-DM to guide them, and how. By the way, after this post I’ll be posting less frequently, more like every other week or every few weeks.

Photo by rawpixel on Unsplash

Footnotes

1. This points to a Jupyter notebook that can be imported into Kaggle Notebooks and run. I have successfully run it there, but was unable to commit and share it within Kaggle Kernels themselves, possibly due to the file copying or some technicality with fast.ai.

Written by cindi.thompson