Understanding Semi-supervised Learning

Jesus Rodriguez
3 min read · Feb 14, 2017

Semi-supervised learning (SSL) is one of the artificial intelligence (AI) methods that has become popular in the last few months. Companies such as Google have been advancing the tools and frameworks relevant for building semi-supervised learning applications. Google Expander is a great example of a tool that reflects these advancements.

Conceptually, semi-supervised learning can be positioned halfway between unsupervised and supervised learning models. A semi-supervised learning problem starts with a series of labeled data points as well as some data points for which labels are not known. The goal of a semi-supervised model is to classify some of the unlabeled data using the labeled information set.

Some AI practitioners see semi-supervised learning as a form of supervised learning with additional information. Ultimately, the goal of semi-supervised learning models is the same as that of supervised ones: to predict a target value for a given input dataset. Alternatively, other segments of the AI community see semi-supervised learning as a form of unsupervised learning with constraints. You can pick your favorite school of thought ;)
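To make that setup concrete, here is a toy sketch in Python using scikit-learn's LabelSpreading estimator (just one of several possible choices, and purely illustrative). The labeled and unlabeled points live in the same training set, with unlabeled points marked as -1:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

# Toy dataset: 300 points, 2 classes.
X, y = make_classification(n_samples=300, n_features=4, random_state=42)

# Pretend labels are known for only ~10% of the points;
# -1 marks "label unknown" by scikit-learn convention.
rng = np.random.RandomState(42)
unlabeled_mask = rng.rand(len(y)) > 0.1
y_partial = np.copy(y)
y_partial[unlabeled_mask] = -1

# Fit on the mix of labeled and unlabeled data, then read off
# the labels the model infers for the unlabeled points.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)
inferred = model.transduction_[unlabeled_mask]
print("Inferred labels for the first 10 unlabeled points:", inferred[:10])
```

The key point is that a single model consumes both the small labeled set and the much larger unlabeled set, which is exactly the halfway position described above.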

Semi-Supervised Learning in the Real World

Semi-supervised learning models are becoming widely applicable in scenarios across a large variety of industries. Let’s explore a few of the most well-known examples:

— Speech Analysis: Speech analysis is a classic example of the value of semi-supervised learning models. Labeling audio files is typically a very intensive task that requires a lot of human resources. Applying SSL techniques can really help improve traditional speech analysis models.

— Protein Sequence Classification: Inferring the function of proteins typically requires active human intervention, even though large volumes of unlabeled protein sequences are readily available, which makes the problem a natural fit for SSL.

— Web Content Classification: Organizing the knowledge available in billions of web pages will advance different segments of AI. Unfortunately, that task typically requires human intervention to classify the content.

There are plenty of other scenarios for SSL models. However, not all AI scenarios can be directly tackled using SSL. There are a few essential characteristics that should be present in a problem for it to be effectively solvable using SSL.

1 — Sizable Unlabeled Dataset: In SSL scenarios, the size of the unlabeled dataset should be substantially bigger than the labeled dataset. Otherwise, the problem could simply be addressed using supervised algorithms.

2 — Input-Output Proximity Symmetry: SSL operates by inferring classifications for unlabeled data based on proximity to labeled data points. Stated differently, SSL scenarios entail that if two data points are part of the same cluster (determined by a k-means algorithm or similar), their outputs are likely to be in close proximity as well (see the sketch after this list). Complementarily, if two data points are separated by a low-density area, their outputs should not be close.

3 — Relatively Simple Labeling & Low-Dimension Nature of the Problem: In SSL scenarios, it is important that inferring the labels doesn't become a problem more complicated than the original one. This is known in AI circles as the “Vapnik Principle”, which essentially states that in order to solve a problem we should not pick an intermediate problem of a higher order of complexity. Also, problems that use datasets with many dimensions or attributes are likely to become really challenging for SSL algorithms, as the labeling task will become very complex.
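To illustrate the cluster assumption from point 2 above, here is a rough, toy sketch (again in Python, and again just one possible illustration, not a production SSL algorithm): cluster the labeled and unlabeled points together with k-means, then give each unlabeled point the majority label of the labeled points that land in its cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 3 well-separated blobs, with labels hidden for most points.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
rng = np.random.RandomState(0)
labeled_mask = rng.rand(len(y_true)) < 0.05   # ~5% of points keep their labels
y = np.where(labeled_mask, y_true, -1)        # -1 = label unknown

# Cluster all points, labeled and unlabeled alike.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Assign each unlabeled point the majority label of the labeled
# points in its cluster (if the cluster has any labeled members).
y_pred = y.copy()
for c in np.unique(clusters):
    in_cluster = clusters == c
    known = y[in_cluster & labeled_mask]
    if known.size:
        majority = np.bincount(known).argmax()
        y_pred[in_cluster & ~labeled_mask] = majority

accuracy = (y_pred[~labeled_mask] == y_true[~labeled_mask]).mean()
print(f"Accuracy on the unlabeled points: {accuracy:.2f}")
```

When the low-density separation assumption holds, as it does for well-separated blobs, even this naive scheme recovers most of the hidden labels; when it doesn't, proximity-based inference breaks down, which is exactly why this characteristic matters.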

In a future post, I will cover some of the fundamental types of SSL algorithms.
