Deep Amazon Image Classification

Welcome to the next post in my ongoing series covering image classification for land use management, a key driver in carbon sequestration. We’re approaching this problem using the CRISP-DM methodology and have reached step four, modeling. We build up from a simple model to a more complex one and discuss how we will assess the models; we’ll cover their full evaluation in a future post. The code for this post is divided between this Jupyter notebook and this Kaggle kernel, for reasons we’ll discuss below.

Baseline Model

When modeling, I always like to start out from a baseline: what’s the simplest thing I can think of? I want to create something without looking much at my data, which I can use as a sanity check that I’m doing better than random guessing. To create a baseline for this problem, let’s look at just the training labels, without the associated images. Recall from our label frequency chart (reproduced at right) that primary (rain forest) is the most common label. So a naive approach would be to label every new image as primary. However, we can be a little smarter and add the second most prominent label, since we know an image can have multiple labels, and also that weather is always labeled. Clear is the most common weather label, so we can add that in. Note that we are working here with the data splits that we created in the prior post:

> import numpy as np
> import pandas as pd
> valid_true = validation_data[label_list]  # get the actual labels for the validation set
> # set up an empty prediction vector, length the number of classes
> preds = np.zeros(len(label_list))
> # set the primary and clear labels to positive
> preds[label_list.index('primary')] = 1
> preds[label_list.index('clear')] = 1
> # now populate a prediction matrix: every example gets the primary and clear labels
> valid_baseline_pred = pd.DataFrame([preds] * len(validation_data), columns=label_list)

Evaluation in a Multi-label Scenario

How well does such a naive approach perform? We’ll cover evaluation more thoroughly in a future post, but a preliminary evaluation is needed to determine which model might perform best on unseen images. To decide how to evaluate performance, let’s revisit our business/science goal. Part of it is the desire to discover possible problems such as illegal mining or burning. Because we want to find as many of these as possible, we’d rather err on the side of false positives than false negatives. Also, the Kaggle competition provides an evaluation metric for us: the F2 score.

Briefly, this score is based on the F-score, which is the harmonic mean of precision and recall. As I used to tell my students, intuitively, recall measures how well we capture the whole truth; precision measures how well we capture nothing but the truth. The F2 score emphasizes recall over precision. It’s implemented for us in sklearn. Let’s apply it to our baseline predictions:
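For reference, the underlying formula is the standard F-beta definition (a textbook fact, not specific to this competition):

Fβ = (1 + β²) · precision · recall / (β² · precision + recall)

Setting β = 2 weights recall more heavily than precision, which matches our preference for false positives over false negatives.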

> from sklearn.metrics import fbeta_score
> fbeta_score(valid_true, valid_baseline_pred, beta=2, average='samples')
0.6433861442943

We could likely perform better by, for example, setting each bit in the prediction vector to one based on that label’s proportion in the training data (a sketch of that idea follows), but we’ll stop with this simpler model so we can move on to applying some machine learning.
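As a minimal sketch of that frequency-based idea (hypothetical code I didn’t run in the notebook; `label_freqs`, `rand_bits`, and `valid_freq_pred` are names I’m introducing here), we could flip a biased coin per label using the training frequencies:

> # hypothetical sketch: sample each label with its training-set frequency
> np.random.seed(0)
> label_freqs = train_data[label_list].mean().values  # per-label training frequency
> rand_bits = (np.random.rand(len(validation_data), len(label_list)) < label_freqs).astype(int)
> valid_freq_pred = pd.DataFrame(rand_bits, columns=label_list)
> fbeta_score(valid_true, valid_freq_pred, beta=2, average='samples')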

Next Best Model

I want something a bit better to compare deep learning to, and something I can run on any machine (as we’ll see, with great learning power comes great compute needs). I’ll use logistic regression, as it’s one of the first classifiers taught in machine learning classes. For today’s walkthrough, I start from one Kaggle competitor’s discussion here, and modify and expand on it a bit. I’m wrapping the logistic regression classifier with sklearn’s `OneVsRestClassifier`. This fits one binary classifier per class, so multiple classes can be labeled positive when predicting.

> from sklearn.multiclass import OneVsRestClassifier
> from sklearn.linear_model import LogisticRegression
> # specifying the solver lets us run this without warnings:
> clf = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
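To see what the wrapper buys us, here’s a toy illustration with made-up arrays (nothing to do with the competition data; `toy_X`, `toy_y`, and `toy_clf` are names I’m introducing): each column of the label matrix gets its own binary classifier, so a single predicted row can contain several ones.

> # toy data: 6 samples, 2 features, 3 independent binary labels
> toy_X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [.5, 0.], [0., .5]])
> toy_y = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 1, 1], [0, 0, 1], [1, 0, 0]])
> toy_clf = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
> toy_clf.fit(toy_X, toy_y)
> toy_clf.predict(toy_X)  # each row is a multi-hot label vector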

I downscale the images to 32×32. Anything much larger, and my laptop gets really loaded down.

> import os
> import cv2
> import numpy as np
> import sklearn.preprocessing
> from skimage import io
> rescaled_dim = 32
> X_train = np.squeeze(np.array([cv2.resize(io.imread(os.path.join(PLANET_KAGGLE_ROOT, 'train-jpg', name + '.jpg')),
>                                           (rescaled_dim, rescaled_dim), interpolation=cv2.INTER_LINEAR).reshape(1, -1)
>                                for name in train_data['image_name'].values]))

Now get the correct labels for the training data:

> y_train = train_data[label_list]

... and the validation set:

> X_valid = np.squeeze(np.array([cv2.resize(io.imread(os.path.join(PLANET_KAGGLE_ROOT, 'train-jpg', name + '.jpg')),
>                                           (rescaled_dim, rescaled_dim), interpolation=cv2.INTER_LINEAR).reshape(1, -1)
>                                for name in validation_data['image_name'].values]))
> y_valid = validation_data[label_list]

Then concatenate and normalize (zero mean, unit variance per pixel), and split back apart:

> ntrain = len(train_data)  # size of the training split from the prior post
> allXs = np.concatenate((X_train, X_valid))
> allXs = sklearn.preprocessing.scale(allXs)
> X_train = allXs[:ntrain]
> X_valid = allXs[ntrain:]

Finally, fit the model:

> clf.fit(X_train, y_train)
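We can then score the fitted model on the validation images the same way we scored the baseline; this is a sketch mirroring that evaluation (`valid_lr_pred` is just a name I’m introducing):

> valid_lr_pred = clf.predict(X_valid)  # multi-hot predictions, one row per image
> fbeta_score(y_valid, valid_lr_pred, beta=2, average='samples')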

How does this model perform? It turns out we obtain an F2 score of only 0.676, a few points higher than the baseline model [1]. Not very impressive for all that computing work! But I’m not too surprised, as I didn’t create any features from the images, and threw away all the structural information: the separation into RGB channels and the x-y coordinates of the pixel values. These are things I would turn to next, if it weren’t for deep learning, which can learn about those features with very little work on my part.

Deep Learning

As noted in the prior post in this series, deep learning (DL) has taken over image classification along with several other AI application areas. A great source for learning DL is fastai.com, and I’m using their package to illustrate an approach for our problem. The latest course even devotes part of a lesson to this problem, and I’m borrowing some of that in addition to the beginning tutorial provided by fastai.

A typical laptop isn’t powerful enough to run deep learning algorithms in a reasonable period of time. Even if it were, my MacBook doesn’t have a GPU, which is a requirement for the best deep learning libraries out there. However, it’s super easy to use free or cheap cloud resources; I’m using this Kaggle kernel for this part of the post.

I’m illustrating an approach that is a “baseline” from the DL point of view. There is a lot more we could do for this problem, but my goal is to show that not much tweaking is required to outperform the baseline models we reviewed above. We start by setting up the training and validation sets as before.

> labels_df = pd.read_csv(f'{PATH}train_v2.csv')
> num_exs = len(labels_df)
> ntrain = int(num_exs * .6)  # 60% for training, matching the prior post's split
> nval = int((num_exs - ntrain) / 2)  # half the remainder for validation

Next we read in the data, then resize and normalize it. `ImageItemList` deals with multi-label learning for us, and `DataBunch` is the data type that fastai requires for its input data.

> src = (ImageItemList.from_csv(PATH, 'train_v2.csv', folder="train-jpg", suffix=".jpg")
>        .split_by_idxs(list(range(ntrain)), valid_idx=list(range(ntrain, ntrain + nval)))  # same training split as the baseline
>        .label_from_df(sep=' '))  # one-hot encoding
> # bs (batch size) and num_workers (loader threads) are set earlier in the kernel
> data = (src.transform(tfms=None, size=rescaled_dim)  # resize
>         .databunch(bs=bs, num_workers=num_workers)  # format needed for training
>         .normalize(imagenet_stats))  # like sklearn.preprocessing.scale, with some twists
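As an optional sanity check (in a notebook environment), fastai v1 can render a few of the transformed images with their labels; a one-liner sketch:

> data.show_batch(rows=3, figsize=(9, 9))  # displays a grid of training images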

We next set up the neural network, its architecture, and an evaluation metric to apply. This step is slightly more involved. In short, the network has different layers with different interconnections between them. A convolutional neural network (CNN) is a type of DL network typically used in computer vision; there’s a course on them at Stanford. The second thing we need is an initialization of the network’s parameters – intuitively, these are weights on the connections between its nodes. We’re leveraging the ImageNet dataset by initializing the parameters with a pre-trained network called resnet50. Then, we again use F2 to evaluate the learned network on the validation data.

> arch = models.resnet50
> # fastai's fbeta applies a sigmoid and thresholds predictions at 0.2 before scoring
> def f2_score(pred, act, **kwargs):
>     return fbeta(pred, act, beta=2, thresh=0.2, **kwargs)
> learn = create_cnn(data, arch, metrics=[f2_score], model_dir='/tmp/models')

Finally, we’ll fit our network to the Amazon training images. Usually, we’d run the dataset through the network multiple times (or “epochs”), but I’m going to just do it once.

> learn.fit(1)

The output shows the training loss, validation loss, and validation F2 score. Intuitively, the loss penalizes examples the network was confident and wrong about, as well as those it wasn’t confident about. And just like that, we’ve outperformed our logistic regression classifier by a huge amount, with an F2 score of 0.91 versus 0.68!
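If you want to recompute that score outside the training loop, fastai v1’s `Learner.validate` re-runs the validation set and returns the loss followed by each metric; a minimal sketch:

> val_loss, val_f2 = learn.validate()  # [validation loss, F2 score]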

There is a lot more we could do to improve this score. The Kaggle forums include many sample approaches to this problem, with the top-ranked solutions getting quite complex. However, I hope this post gives a general flavor for applying modeling to this type of problem. Perhaps it whetted your appetite enough to explore image classification in more depth on your own!

Footnotes

0. Photo by David Clode on Unsplash

1. I later tried sklearn’s ExtraTreesClassifier, and the performance was quite a bit better, 0.704; still a lot worse than the deep learning model.

Written by cindi.thompson