Google Research recently announced another important addition to its AI portfolio: Expander, a framework to enable semi-supervised learning in machine learning (ML) algorithms.
For years, the ML world has been divided into two main types of algorithms: supervised and unsupervised. Supervised algorithms can be seen as models whose predictive capacity comes from training on large amounts of labeled data. Unsupervised models, such as clustering algorithms, are instead able to operate without labeled training data.
While many experts see unsupervised algorithms as the future of ML, supervised models are more common and efficient in the current state of the market. Despite their popularity, supervised algorithms often run into the challenge of having to collect enough high-quality data for the training process. This is precisely the problem Google is trying to solve with Expander.
Conceptually, the goal of Expander is to power ML algorithms with minimal supervision. To enable that capability, Expander leverages a family of techniques known as semi-supervised learning, which bridges the gap between known (labeled) and unfamiliar (unlabeled) data. Semi-supervised learning merges known and novel data as part of the training process and infers relationships between those data sets. This approach contrasts sharply with fully supervised ML models, such as conventionally trained neural networks, that need to be trained upfront using high-quality, well-labeled data.
Functionally, Google Expander leverages large-scale, graph-based learning to infer knowledge about a specific data source. Specifically, Expander builds a multi-graph representation of a data source in which nodes correspond to objects or concepts and edges connect nodes that share similarities. The graph contains both known (labeled) and unknown (unlabeled) data.
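To make the graph idea concrete, here is a minimal sketch in Python. It is not Expander's implementation: it assumes a single edge type rather than a multi-graph, uses Jaccard similarity over feature sets as a stand-in similarity measure, and the names `jaccard`, `build_similarity_graph`, and the toy documents are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_similarity_graph(features, threshold=0.15):
    """Return {node: {neighbor: weight}}, linking nodes whose similarity >= threshold."""
    graph = {node: {} for node in features}
    for u, v in combinations(features, 2):
        weight = jaccard(features[u], features[v])
        if weight >= threshold:
            graph[u][v] = weight
            graph[v][u] = weight
    return graph


# Hypothetical toy data: each object is described by a set of features.
features = {
    "doc_a": {"cheap", "flight", "deal"},
    "doc_b": {"cheap", "flight", "hotel"},
    "doc_c": {"hotel", "booking", "deal"},
    "doc_d": {"football", "score"},
}
# Only some nodes carry a known label; the rest are the "unknown" data.
labels = {"doc_a": "travel", "doc_d": "sports"}

graph = build_similarity_graph(features)
```

In this toy graph, `doc_a`, `doc_b`, and `doc_c` end up connected to one another, while `doc_d` shares no features with the others and stays isolated. The labeled and unlabeled nodes live side by side in the same structure, which is what lets labels flow across edges later.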
The magic of semi-supervised learning lies in labeling the unknown data efficiently by leveraging the characteristics of each node's neighbors in the graph. Expander tackles this problem by using an optimization technique called streaming approximation. This technique uses a streaming algorithm to process information propagated from neighboring nodes in a way that can be scaled to large graphs.
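The core idea of borrowing labels from neighbors can be illustrated with plain iterative label propagation. Note that this is a simpler, swapped-in technique and not Expander's streaming approximation: it keeps the full label distribution of every node in memory and makes no attempt to scale. The function `propagate_labels` and the toy chain graph are assumptions made for illustration.

```python
def propagate_labels(graph, seeds, classes, iterations=20):
    """Iterative label propagation on a weighted graph.

    graph: {node: {neighbor: weight}}
    seeds: {node: class} for the labeled nodes
    Returns {node: {class: score}}, with scores summing to 1 per node.
    """
    uniform = {c: 1.0 / len(classes) for c in classes}
    # Seeds start with all mass on their known class; other nodes start uniform.
    dist = {
        n: ({c: float(c == seeds[n]) for c in classes} if n in seeds else dict(uniform))
        for n in graph
    }
    for _ in range(iterations):
        new_dist = {}
        for node, nbrs in graph.items():
            if node in seeds or not nbrs:
                # Seeds are clamped; isolated nodes have nothing to learn from.
                new_dist[node] = dist[node]
                continue
            total = sum(nbrs.values())
            # Each unlabeled node takes the weighted average of its neighbors.
            new_dist[node] = {
                c: sum(w * dist[nb][c] for nb, w in nbrs.items()) / total
                for c in classes
            }
        dist = new_dist
    return dist


# Hypothetical toy graph: a chain a - b - c - d with labeled nodes at both ends.
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
seeds = {"a": "pos", "d": "neg"}
result = propagate_labels(graph, seeds, classes=["pos", "neg"])
predicted = {n: max(scores, key=scores.get) for n, scores in result.items()}
```

Here the unlabeled node `b` sits closer to the positive seed and `c` closer to the negative one, so each inherits the label of its nearer seed. A streaming variant would avoid materializing every neighbor's full distribution at once, which is the kind of cost that matters on very large graphs.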
To get an idea of how semi-supervised learning works, let’s use the example of a sentiment analysis process. When analyzing a text, the first step is to create a graph in which the nodes are the words and the edges are the relationships between them. After that, the algorithm labels the known words that express specific emotions and then applies the streaming approximation technique to label the unclassified words. During that process, the algorithm creates links between the newly labeled words and the emotion-bearing words they are related to.
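The sentiment walkthrough above can be sketched roughly as follows. Several assumptions are made here that the original description does not specify: the relationship between words is taken to be co-occurrence within a sentence, polarity is a single score in [-1, 1], and propagation is a simple neighbor average rather than streaming approximation. The helper names and the sample sentences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(sentences):
    """Words are nodes; an edge weight counts how often two words share a sentence."""
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = set(sentence.lower().split())
        for u, v in combinations(sorted(words), 2):
            graph[u][v] += 1
            graph[v][u] += 1
    return graph


def propagate_polarity(graph, seed_polarity, iterations=30):
    """Spread polarity scores in [-1, 1] from seed words to their neighbors."""
    score = {word: seed_polarity.get(word, 0.0) for word in graph}
    for _ in range(iterations):
        updated = {}
        for word, nbrs in graph.items():
            if word in seed_polarity:
                updated[word] = seed_polarity[word]  # known words keep their label
            else:
                total = sum(nbrs.values())
                updated[word] = sum(w * score[nb] for nb, w in nbrs.items()) / total
        score = updated
    return score


sentences = [
    "great camera superb",
    "superb screen great",
    "awful battery terrible",
    "terrible charger awful",
]
seeds = {"great": 1.0, "awful": -1.0}  # words with a known emotion
scores = propagate_polarity(cooccurrence_graph(sentences), seeds)
```

With only two seed words, unclassified words such as "superb" drift toward a positive score and "terrible" toward a negative one, purely because of which labeled words they appear alongside. Real text would need tokenization, stop-word handling, and a richer notion of word similarity than this sketch uses.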
Google Expander is already being used in large-scale systems such as the Allo assistant, and we should expect to see more of these techniques in the near future. Semi-supervised learning offers an exciting middle ground that can help improve the applicability of many well-known ML techniques to practical problems.