Machine learning has proven to be very efficient at classifying images and other unstructured data, a task that is very difficult to accomplish with classic rule-based software. Before machine learning models can perform classification tasks, however, they must be trained on many annotated examples. Data annotation is a slow, manual process in which a human reviews the training examples one by one and gives each the correct label.
In fact, data annotation is such an important part of machine learning that the growing popularity of the technology has created a huge market for labeled data. From Amazon Mechanical Turk to startups like LabelBox, ScaleAI and Samasource, there are dozens of platforms and companies whose job it is to annotate data in order to train machine learning systems.
Fortunately, for some classification tasks, you don’t need to label all of your training examples. Instead, you can use semi-supervised learning, a machine learning technique that can automate the data labeling process with a bit of help.
Supervised versus unsupervised versus semi-supervised machine learning
You only need labeled examples for supervised machine learning tasks, which require you to provide the ground truth for your AI model during training. Examples of supervised learning tasks are image classification, face recognition, sales forecasting, customer churn prediction, and spam detection.
Unsupervised learning, on the other hand, deals with situations where you don’t know the ground truth and want to use machine learning models to find relevant patterns. Examples of unsupervised learning are customer segmentation, network traffic anomaly detection, and content recommendation.
Semi-supervised learning stands somewhere between the two. It solves classification problems, which means that you ultimately need a supervised learning algorithm for the task. At the same time, you want to train your model without labeling every single training example, and for that you get help from unsupervised machine learning techniques.
Semi-supervised learning with clustering and classification algorithms
One way to do semi-supervised learning is to combine clustering and classification algorithms. Clustering algorithms are unsupervised machine learning techniques that group data together based on their similarities. The clustering model helps us find the most relevant samples in our data set. We can then tag these and use them to train our supervised machine learning model for the classification task.
Suppose we want to train a machine learning model to classify handwritten digits. However, all we have is a large data set of unlabeled digit images. Annotating every example is out of the question, so we’d like to use semi-supervised learning to build our AI model.
First, we use k-means clustering to group our samples. K-means is a fast and efficient unsupervised learning algorithm that does not require labels. K-means calculates the similarity between our samples by measuring the distance between their features. In our handwritten digits, each pixel is considered a feature, so a 20 × 20 pixel image consists of 400 features.
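To make the idea of pixels as features concrete, here is a minimal sketch. It assumes Euclidean distance, which is what standard k-means uses, and the two images are hypothetical random arrays that stand in for real 20 × 20 digit scans.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical 20x20 grayscale images (random values, not real digits).
img_a = rng.random((20, 20))
img_b = rng.random((20, 20))

# Flatten each image so that every pixel becomes one feature.
feat_a = img_a.reshape(-1)
feat_b = img_b.reshape(-1)

# Euclidean distance between the two feature vectors:
# the smaller the distance, the more similar the images.
distance = np.linalg.norm(feat_a - feat_b)

print(feat_a.shape)  # one feature per pixel
print(distance)
```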
K-means clustering is a machine learning algorithm that arranges unlabeled data points around a specified number of clusters.
When training the k-means model, you need to specify how many clusters you want to divide your data into. Of course, since these are digits, our first impulse could be to choose ten clusters for our model. Note, however, that some digits can be drawn in different ways. For example, here are different ways you can draw the digits 4, 7, and 2. You can also think of different ways that you can draw 1, 3, and 9.
Therefore, the number of clusters you choose for the k-means machine learning model should generally be larger than the number of classes. In our case, we select 50 clusters, which should be enough to cover the different ways digits are drawn.
After training the k-means model, our data is divided into 50 clusters. Every cluster in a k-means model has a centroid, a set of values that represent the average of all features in that cluster. In each cluster, we select the most representative image, the one closest to the centroid. This leaves us with 50 pictures of handwritten digits.
Now we can label these 50 images and use them to train our second machine learning model, the classifier, which can be a logistic regression model, an artificial neural network, a support vector machine, a decision tree, or any other type of supervised learning algorithm.
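The clustering step can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the article's exact code: it uses scikit-learn's bundled digits data set, whose images are 8 × 8 pixels (64 features) rather than 20 × 20, and the variable names are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Assumption: scikit-learn's small digits data set (8x8 images, 64 features).
# The labels are discarded here because clustering does not need them.
X, _ = load_digits(return_X_y=True)

k = 50  # more clusters than classes, to cover different writing styles
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)

# fit_transform returns the distance from every sample to every centroid,
# as an array of shape (n_samples, k).
dist_to_centroids = kmeans.fit_transform(X)

# For each centroid (column), take the index of the sample closest to it.
representative_idx = np.argmin(dist_to_centroids, axis=0)
X_representative = X[representative_idx]

print(X_representative.shape)
```

Each row of `X_representative` is one image that a person would then label by hand.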
Training a machine learning model on 50 examples instead of thousands of images might sound like a terrible idea. However, since the k-means model selected the 50 images that were most representative of the distribution of our training data set, the resulting model performs remarkably well. The above example, taken from the excellent book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, shows that training a logistic regression model on just the 50 samples selected by the clustering algorithm results in 92 percent accuracy (the implementation in Python can be found in this Jupyter notebook). In contrast, training the model on 50 randomly selected samples results in an accuracy of 80 to 85 percent.
But we can still get more out of our semi-supervised learning system. After we have labeled the representative sample of each cluster, we can propagate the same label to the other samples in that cluster. With this method, we can annotate thousands of training examples with just a few lines of code. This will further improve the performance of our machine learning model.
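A hedged sketch of this comparison is shown below. It assumes scikit-learn's bundled 8 × 8 digits data set and simulates the manual labeling step by looking up the known labels of the selected images; the accuracy you get depends on the library version, the random seed, and the train/test split, so it will not necessarily match the figures quoted above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pick the training image closest to each of the 50 centroids.
k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
representative_idx = np.argmin(kmeans.fit_transform(X_train), axis=0)

# Assumption: reading the known labels stands in for a human labeling 50 images.
X_rep, y_rep = X_train[representative_idx], y_train[representative_idx]
clf_rep = LogisticRegression(max_iter=5000).fit(X_rep, y_rep)
acc_representative = clf_rep.score(X_test, y_test)

# Baseline: 50 training images chosen at random.
rng = np.random.default_rng(42)
random_idx = rng.choice(len(X_train), size=k, replace=False)
clf_rand = LogisticRegression(max_iter=5000).fit(
    X_train[random_idx], y_train[random_idx]
)
acc_random = clf_rand.score(X_test, y_test)

print(f"representative samples: {acc_representative:.3f}")
print(f"random samples:         {acc_random:.3f}")
```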
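Label propagation inside clusters really does take only a few lines. The sketch below again assumes scikit-learn's 8 × 8 digits data set and uses the known labels of the representative images as a stand-in for manual annotation; the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
representative_idx = np.argmin(kmeans.fit_transform(X_train), axis=0)

# Assumption: these 50 labels would come from a human annotator.
y_representative = y_train[representative_idx]

# kmeans.labels_ holds the cluster index of every training sample, so this
# indexing gives each sample the label of its cluster's representative.
y_propagated = y_representative[kmeans.labels_]

# Fraction of propagated labels that agree with the true labels.
propagation_accuracy = np.mean(y_propagated == y_train)

# Train the classifier on the full training set with propagated labels.
clf = LogisticRegression(max_iter=5000).fit(X_train, y_propagated)
test_accuracy = clf.score(X_test, y_test)

print(f"propagated labels correct: {propagation_accuracy:.3f}")
print(f"test accuracy:             {test_accuracy:.3f}")
```

Some propagated labels will be wrong, especially for samples near cluster boundaries, so one possible refinement is to propagate only to the samples closest to each centroid.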
Other semi-supervised machine learning techniques
There are other ways to perform semi-supervised learning, including semi-supervised support vector machines (S3VM), a technique that was introduced at the 1998 NIPS conference. S3VM is a complicated technique and is beyond the scope of this article. However, the general idea is simple and not much different from what we just saw: you have a training data set made up of labeled and unlabeled samples. S3VM uses the information from the labeled samples to estimate the classes of the unlabeled samples and then uses this new information to further refine the training data set.
The Semi-Supervised Support Vector Machine (S3VM) uses labeled data to approximate and adjust the classes of unlabeled data.
If you’re interested in semi-supervised support vector machines, read the original paper and chapter 7 of Machine Learning Algorithms, which explores different variations of support vector machines (an implementation of S3VM in Python can be found here).
An alternative approach is to train a machine learning model on the labeled part of your data set and then use the same model to generate labels for the unlabeled part of your data set. You can then use the full data set to train a new model.
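This approach is often called self-training or pseudo-labeling. A minimal sketch follows; the split of 100 labeled samples, the choice of logistic regression, and the use of scikit-learn's 8 × 8 digits data set are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Assumption: only the first 100 training samples come with labels.
n_labeled = 100
X_labeled, y_labeled = X_train[:n_labeled], y_train[:n_labeled]
X_unlabeled = X_train[n_labeled:]

# Step 1: train a model on the labeled part only.
base_model = LogisticRegression(max_iter=5000).fit(X_labeled, y_labeled)
base_accuracy = base_model.score(X_test, y_test)

# Step 2: use that model to generate labels for the unlabeled part.
pseudo_labels = base_model.predict(X_unlabeled)

# Step 3: train a new model on the full set (true labels + generated labels).
X_full = np.vstack([X_labeled, X_unlabeled])
y_full = np.concatenate([y_labeled, pseudo_labels])
final_model = LogisticRegression(max_iter=5000).fit(X_full, y_full)
final_accuracy = final_model.score(X_test, y_test)

print(f"labeled part only: {base_accuracy:.3f}")
print(f"with pseudo-labels: {final_accuracy:.3f}")
```

The second model is not guaranteed to beat the first, because mistakes in the generated labels are fed back into training. A common variation keeps only the predictions the first model is most confident about.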
The Limits of Semi-Supervised Machine Learning
Semi-supervised learning does not apply to all supervised learning tasks. As with the handwritten digits, your classes should be separable by clustering techniques. Alternatively, as in S3VM, you must have enough labeled examples, and those examples must adequately represent the data generation process of the problem space.
However, if the problem is complicated and your labeled data is not representative of the entire distribution, semi-supervised learning will not help. For example, if you want to classify color images of objects that look different from various angles, semi-supervised learning might not help much unless you have a good deal of labeled data (but if you already have a large volume of labeled data, why use semi-supervised learning?). Unfortunately, many real-world applications fall into the latter category, which is why data labeling jobs are not going away anytime soon.
However, semi-supervised learning still has many uses in areas such as simple image classification and document classification where automation of the data labeling process is possible.
Semi-supervised learning is a brilliant technique that can come in handy once you know when to use it.
This article was originally published by Ben Dickson on TechTalks, a publication that examines technology trends, how they affect the way we live and do business, and what problems they solve. But we also discuss the evil side of technology, the darker effects of the new technology, and what to look out for. You can read the original article here.
Published on January 18, 2021 – 11:00 UTC