top of page

Building an Accurate Image Classifier Using Tiny Training Sets: Getting Sneaky with Image Embeddings

Introduction to Building an Image Classifier

Image classification is a term that has become nearly ubiquitous across the IoT industry and is the task of having a computer decide the contents of an image without direct human interference. While this is a relatively easy process today – as few as ten years ago this was considered an incredibly difficult task. Today, image classification is more-or-less a solved problem. Take a large set of images representative of the classes you’re interested in, throw them at your favorite convolutional neural net or one of your own design, and let gradient descent take care of the rest. A few short GPU cycles later you’ll have a perfectly adequate classification model, likely with performance exceeding the state of the art five years ago.  

However – one piece of information here is often glossed over: the number of images required for reasonable generalization. Even in the least data intensive cases transfer learning from a larger pre-trained network to your specific task still takes hundreds to thousands of example images to train a sufficiently general classifier for most purposes. This task gets even harder if you’re concerned with correctly classifying more than one or two different classes of object. But what if you only have a dozen or less examples of each class you’re hoping to train? And you need to deploy a model tomorrow that will be able to correctly classify images supplied by your end user, and there’s no off the shelf model for you to take credit for? What are your options then? In this case, we find ourselves in the most exciting part of data science: an area where you need to get creative with what you have.  

We need to remember a few facts about how image classification neural networks typically work. Let’s suppose we are just considering some massive open-source image classification network that has been trained on billions of images – more than you or I without the funding of Google or Microsoft could ever hope to throw into training. Most of the work that these neural networks are doing is to turn the image into something you or I as humans would find totally incomprehensible: a vector embedding representing the contents of this image.

In layman terms, the first sections of any image classification network’s job are to take all the information encoded in our familiar RGB pixels and condense it down into some dense vector that means absolutely nothing to anyone or anything outside of the specific network you’re training. However, hiding in the zoo of floating-point numbers of this dense vector is all the information about that image. Shapes. Colors. Edges. Hopes. Dreams. All nicely encoded into some N-dimensional vector space normalized to have values between zero and one. These dense vectors are typically known as embeddings, and they have some incredibly useful properties.  

Our model’s one job is to find generalizable solutions to classify images with the embeddings of images that it generates. As a result, the embeddings generated to classify object A vs object B should “point” in different directions. Similarly, the multiple potential views, shapes, and sizes of object A should all point in a similar direction. In other words – the angle of the embedding between images which contain object A should be small, and the angle between the embeddings of images which contain object B and object A should be large. This is typically referred to as the cosine distance, and if our embeddings are properly normalized, this is simply the dot product between the vector of embedding A and embedding B. As it turns out, this is the only fact we need to know to build a, somewhat unconventional, but highly effective classification model using a small fraction of the example images typically needed to construct an image classifier.  

To ground the scene a little more, let’s assume that we are building an image classifier that will help us distinguish between different types of mining equipment such as different dozers, excavators, and other mobile equipment that one may find in a mine for a total of 12 unique classes. For a little extra challenge, let’s further assume that we have a few hard cases, such as trying to distinguish between different types of dozers such as the Caterpillar D8, D10 and D11, shown in Figure 1.

Figure 1: Three different types of dozer that we hope our classifier will be able to distinguish between.

In Figure 1, we notice that these represent a difficult classification case at the best of times – even a human layman may have trouble correctly classifying these pieces of equipment. Let’s see how well we can do using a little linear algebra and some machine learning (which is also linear algebra).

Building an Image Classifier

Let’s formalize our scenario a little more. Suppose we have (I) total images but can only use (N) for “training,” and for simplicity, let’s say we have (M) pictures of each of our C classes in (N). Let’s further assume that we can generate a reliable embedding for each of our (I) images. Now, the simplest thing we could do is assemble our (N) embeddings from our “training” set, or alternatively, reference embeddings, into a rectangular matrix (A), and define our classifier as the following:

Where (X) is the embedding of our image needing classification, and (C) is the predicted class or the row index of the embedding of our (N) reference images that is most like our image (X) now. This by itself will work… okay, especially if you’re looking to classify very different-looking things such as planets vs. refrigerators, as we would expect our reference embeddings to have a very small self-similarity or, the embeddings of our different classes will have a large angle between them. However, what if we have classes that may look quite similar, such as different varieties of industrial machinery? In a case like this, we are very likely to have frequent misclassification between “close” classes if we’ve chosen our (N) embeddings poorly. The question remains: how do we choose an appropriate subset from (N) such that we minimize the likelihood of image misclassification from our reference embeddings being too self-similar between classes?

Note that we can compute the self-similarity between our reference embeddings with this simple matrix multiplication (or dot-product):

Where (S) is the self-similarity between each embedding and each other embedding. For example, the first column of (S) will have entries representing the cosine distance between our first embedding and each other embedding contained within our reference set. The further from 1 each of these entries are, the less similar that embedding is to each other embedding, an example is shown in Figure 2.

We want to choose some subset of (N) that maximizes the cosine distance of our chosen embeddings from all other embeddings that are not contained within its own class. Truthfully, even within its own class – the idea being that we want to maximize how much information we must identify individual classes, while minimizing misclassification. How do we go about this?

Certainly, we could take a brute force approach and simply iterate through all possible subsets, come back in six months, and choose the best possible subset. If you’re paid by the hour, I would recommend this approach. In the event that you are not and hope to iterate quickly, it can be helpful now to no longer think of the matrix (S)as any other box of numbers, but rather, as a dense fully connected graph.

From this point of view, (S) defines an adjacency matrix of a graph where each node represents a reference image, and each edge is weighted by the cosine similarity of that image’s embedding and each other embedding. We are now interested in which nodes are “least connected” to each other node – in other words, we are looking for the least central set of nodes — the set which will minimize the self-similarity of our reference matrix (S). There are numerous measures of centrality, however for computational speed, an excellent choice for us is eigenvector centrality. Eigenvector centrality is defined as

Where the above is known as an eigenvalue equation, (S) is our self-similarity adjacency matrix, (C) is a so-called eigenvector of (S), and  (λ) is a scalar eigenvalue associated with the eigenvector (X). For eigenvector centrality, we are only concerned with the eigenvector associated with the greatest eigenvalue, and as such, after solving the above equation and choosing the largest eigenvalue and its corresponding eigenvectors, the centrality of the node  is defined as: 

Where (C) is the centrality of the ith node, and (SIJ) is, in our case, the cosine similarity between node i and j. So, to choose our set of images, we take some number of the embeddings with the smallest centrality score in each class.

Now, this on its own makes a better classifier than simply using all our reference images in a large matrix (A), however, this still requires us to make a choice of how many low-centrality reference images we should choose for each class – and a static number for each class may not have the performance that we are hoping for. If we recall from earlier, we asserted that we left a large set (compared to our “training” set) of images alone for testing. We will now use this training set to help us select the “best” number of training images by optimizing our selections against some metric of performance on the test set, and for notational ease, accuracy. Formally,

Here, (A) is a reduced embedding matrix which satisfies

Where (am,p) is a column of (A), p is an index representing the class of the image, the sum implies vector concatenation into a rectangular matrix, (cm,p) is the centrality measure of (am,p), C is an indicator function defining the class of the largest argument from the dot product, (αp) is a hyperparameter indicating the maximum centrality of reference images we’re going to allow in our search space for that class, and (Tn) is the true label associated with image (n).

We are maximizing the set {} images where the correct classifications, The above describes a discrete optimization problem where we are trying to find the subset of our images {i} which maximize classification accuracy. There exists a world of methodologies that will help us find local (and, if we’re lucky, global) optima to this equation, but for ease of implementation, we choose Bayesian optimization with several heuristics to facilitate convergence and a good selection of images. 


After following this optimization process to choose the embeddings which make up our classifier, we can view the distribution of self-similarity from the original set of images we can choose from, and our specially selected subset shown in Figure 3.

Figure 3: Here we display the histogram of reference image self-similarity from the original set of images that we were able to choose from, and the final selection of images. Note that the self-similarity of images within the same class (such as the block diagonal terms in Figure 1) have been removed from each distribution to only show the “cross class” self-similarity in the above histogram.

Of interest is that the distribution of self-similarity in our final classifier has shifted to the left, with the notable exception of the disjointed peak which represents the distribution of three very similar looking bulldozers that we are hoping to distinguish between.

After completing the optimization process, we find that we can correctly classify between 12 separate objects with an accuracy of 79% and a weighted F1 of 78% on the unseen validation set.

This is a stark improvement from using all available reference images which merely obtained an accuracy and weighted F1 of 64% and 60% respectively. However, it is worth noting that this final classification system is made up of a total of 71 reference images, with between 2 and 8 images for each class to achieve these results. Significantly fewer then would be required by alternative methodologies.


Here we demonstrate that by using the embeddings of an open-source pre-trained image classification network, it is possible to build an image classifier which requires but a pittance of images for considerable performance given the size of the data set available. While to a data scientist, more data is always better, but you’ll find in most situations getting more – and sometimes even any – data is not possible. And while we should be clear – if you have a large enough data set available to you, more conventional image classifiers will converge to state-of-the-art performance. However, often, you may find yourself without a large, clean, and specially crafted data set. In those cases, one should not give up hope, but rather, one should look at this as an opportunity for non-conventional approaches and a little creativity. After all – if every problem was as straightforward as the tutorial, we would all be replaced by the very AI we’re building.



bottom of page