Microsoft’s Project Petridish Helps Find the Best Neural Network for a Given Dataset

The new algorithm takes a novel approach to neural architecture search.


A Brief History of NAS

Given the recent popularity of NAS methods, many might think that NAS is a recent discipline. NAS has unquestionably experienced a renaissance since 2016, with the publication of Google’s famous paper on NAS with reinforcement learning. However, many of its origins trace back to the late 1980s. One of the earliest NAS papers was the 1988 “Self Organizing Neural Networks for the Identification Problem”. From there, the space saw a handful of publications outlining interesting techniques, but it wasn’t until the Google push that NAS got the attention of the mainstream machine learning community. If you are interested in the publication history of NAS methods, the AutoML Freiburg-Hannover website provides one of the most complete compilations to date.

The Two Types of NAS: Forward Search vs. Backward Search

When exploring the NAS space, there are two fundamental types of techniques: backward-search and forward-search. Backward-search methods have been the most common approach to implementing NAS. Conceptually, a backward-search NAS method starts with a super-graph that is the union of all possible architectures and learns to down-weight the unnecessary edges gradually via gradient descent or reinforcement learning. While such approaches drastically cut down the search time of NAS, they have a major limitation: human domain knowledge is needed to create the super-graph in the first place.
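
To make the down-weighting idea concrete, here is a minimal sketch in plain Python. It is a toy and not any published NAS method: the single edge, the three candidate operations (identity, double, zero), the data, and the learning rate are all assumptions made for illustration. The edge outputs a softmax-weighted mix of its candidate operations, and gradient descent on the mixing logits gradually shifts weight away from the operations that do not help.

```python
import math

# Hypothetical candidate operations on one edge of a toy super-graph.
OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}
NAMES = list(OPS)

XS = [1.0, 2.0, 3.0]
YS = [2.0 * x for x in XS]  # only the "double" op fits this target exactly


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def search(steps=200, lr=0.5):
    """Run gradient descent on the edge logits and return the final weights."""
    logits = [0.0] * len(NAMES)  # every op starts with the same weight
    for _ in range(steps):
        probs = softmax(logits)
        # Gradient of the mean squared error w.r.t. each mixing weight.
        grad_p = [0.0] * len(NAMES)
        for x, y in zip(XS, YS):
            outs = [OPS[name](x) for name in NAMES]
            err = sum(p * o for p, o in zip(probs, outs)) - y
            for i, o in enumerate(outs):
                grad_p[i] += 2.0 * err * o / len(XS)
        # Chain rule through the softmax, then one descent step on the logits.
        mean_grad = sum(p * g for p, g in zip(probs, grad_p))
        for i in range(len(NAMES)):
            logits[i] -= lr * probs[i] * (grad_p[i] - mean_grad)
    return dict(zip(NAMES, softmax(logits)))
```

Calling search() returns the learned weight of each operation. In this toy the weight of "double" grows toward 1 while the other two shrink. Note that the list of operations had to be written down by hand before the search could start, which is the domain-knowledge limitation described above.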


Petridish is a forward-search NAS method inspired by feature selection and gradient boosting techniques. Rather than pruning a super-graph, it starts from a parent model and grows it by adding candidate layers. The algorithm produces a gallery of models to choose from as its search output, uses stop-forward and stop-gradient layers to identify beneficial candidates for that gallery more efficiently, and trains models asynchronously.

  • PHASE 1: Petridish connects the candidate layers to the parent model using stop-gradient and stop-forward layers and partially trains it. The candidate layers can be any bag of operations in the search space. The stop-gradient and stop-forward layers allow gradients with respect to the candidates to be accumulated without affecting the model’s forward activations and backward gradients. Without them, it would be difficult to determine what each candidate layer contributes to the parent model’s performance; measuring those contributions would require training each candidate separately, which increases cost.
  • PHASE 2: If a particular candidate or set of candidates is found to be beneficial to the model, Petridish removes the stop-gradient and stop-forward layers along with the other candidates and trains the model to convergence. The training results are added to a scatterplot, naturally creating an estimate of the Pareto frontier.
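
The two phases can be sketched on a toy regression problem in plain Python. This is an illustration under stated assumptions, not Microsoft’s implementation: the one-weight parent model, the three candidate operations, the data, and the rule of picking the candidate with the largest accumulated gradient are all hypothetical simplifications. The gradient magnitudes also depend on the scale of each operation, which a real system would have to account for.

```python
import math

# Toy data: the target has a quadratic term that the parent cannot express.
XS = [-2.0, -1.0, 0.0, 1.0, 2.0]
YS = [1.5 * x + 0.8 * x * x for x in XS]

# Hypothetical search space: each candidate is a single extra operation.
CANDIDATES = {
    "square": lambda x: x * x,
    "cube": lambda x: x ** 3,
    "sine": math.sin,
}


def phase1_accumulate(w, lr=0.05, steps=10):
    """Partially train the parent (prediction w * x) while accumulating
    the loss gradient w.r.t. each candidate's weight.

    Stop-forward effect: candidate outputs are never added to the
    prediction, so the residuals come from the parent alone.
    Stop-gradient effect: the update of w uses only the parent's own
    gradient; the candidates contribute nothing to it.
    """
    accumulated = {name: 0.0 for name in CANDIDATES}
    for _ in range(steps):
        residuals = [w * x - y for x, y in zip(XS, YS)]
        for name, op in CANDIDATES.items():
            # Gradient of the mean squared error for the prediction
            # w * x + theta * op(x), evaluated at theta = 0.
            grad = sum(2.0 * r * op(x) for r, x in zip(residuals, XS)) / len(XS)
            accumulated[name] += grad
        grad_w = sum(2.0 * r * x for r, x in zip(residuals, XS)) / len(XS)
        w -= lr * grad_w
    return w, accumulated


def select_candidate(accumulated):
    """Assumed selection rule: largest accumulated gradient magnitude."""
    return max(accumulated, key=lambda name: abs(accumulated[name]))


def phase2_train(w, op, lr=0.05, steps=200):
    """Connect the chosen candidate for real and train both weights."""
    theta = 0.0
    for _ in range(steps):
        residuals = [w * x + theta * op(x) - y for x, y in zip(XS, YS)]
        grad_w = sum(2.0 * r * x for r, x in zip(residuals, XS)) / len(XS)
        grad_theta = sum(2.0 * r * op(x) for r, x in zip(residuals, XS)) / len(XS)
        w -= lr * grad_w
        theta -= lr * grad_theta
    loss = sum((w * x + theta * op(x) - y) ** 2 for x, y in zip(XS, YS)) / len(XS)
    return w, theta, loss
```

A run would call phase1_accumulate(w=1.0), pass the accumulated gradients to select_candidate, and hand the winning operation to phase2_train. All candidates are scored in a single pass over the parent’s training, which is the cost saving that the stop layers are meant to provide.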

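The Pareto frontier estimate mentioned in Phase 2 can be sketched as follows. The axes are an assumption: each trained model is summarized here as a (cost, error) pair, where cost could be something like parameter count or compute, and lower is better on both. A model is on the frontier if no other model is at least as cheap and more accurate.

```python
def pareto_frontier(points):
    """Return the (cost, error) points not dominated by any other point,
    sorted by increasing cost. Lower is better on both axes."""
    frontier = []
    best_error = float("inf")
    # After sorting by cost, a point is on the frontier only if it
    # improves on the best error seen among cheaper-or-equal models.
    for cost, error in sorted(points):
        if error < best_error:
            frontier.append((cost, error))
            best_error = error
    return frontier


# Hypothetical gallery of trained models as (cost, error) pairs.
gallery = [(1.0, 0.30), (2.0, 0.25), (2.5, 0.27), (3.0, 0.20), (3.5, 0.20)]
frontier = pareto_frontier(gallery)
```

Each new training result can simply be appended to the gallery and the frontier recomputed, so the estimate improves as the search proceeds.
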
CEO of IntoTheBlock, Chief Scientist at Invector Labs, I write The Sequence Newsletter, Guest lecturer at Columbia University, Angel Investor, Author, Speaker.
