Technology Fridays: DataPrep Completes Google Cloud Data Science Pipeline
Welcome to Technology Fridays! Today, I would like to discuss a recent addition to the Google Cloud platform that provides a fundamental building block for data science applications, one that is missing from most platform as a service (PaaS) stacks. Data wrangling and curation have been the missing elements in PaaS technologies such as Google Cloud. Now the search engine giant is addressing that limitation with the launch of Cloud DataPrep and a very intriguing partnership with one of the new generation of data quality technology providers.
Conceptually, Cloud DataPrep is a native cloud service that enables the preparation and curation of datasets. Within the Google Cloud platform, DataPrep sits between data storage services such as BigQuery or Bigtable and advanced data science services such as DataView or Cloud Machine Learning.
In a somewhat surprising move, Google didn’t build DataPrep from the ground up. Instead, the cloud incumbent decided to adapt the technology from data wrangling market leader Trifacta to its suite of cloud services. We’ve discussed Trifacta in this blog before, but you should think of it as one of the new generation of data quality management platforms that have emerged as an alternative to traditional solutions from incumbents such as Informatica Data Services or SQL Server Data Quality Services. Trifacta leverages machine learning algorithms to streamline traditionally manual data curation processes. The result is a sophisticated suite of data wrangling tools that streamlines data quality curation across an enterprise data pipeline. Cloud DataPrep is a version of the Trifacta suite optimized for Google Cloud data services.
Developers can start using Google Cloud DataPrep directly from the Google Cloud Console. The first step in a Cloud DataPrep solution is selecting the datasets that need to be wrangled. DataPrep supports two main types of datasets: Imported and Wrangled. Imported datasets are references to source data residing in data storage systems such as files, databases, big data file systems, etc. Wrangled datasets, in turn, are used to transform the source data using Recipes.
Atomic data cleansing and transformation steps in Cloud DataPrep are abstracted using Recipes. Typically, Recipes are authored using the Transform Editor tool and converted into Wrangle, which is DataPrep’s domain-specific language for data transformations. Wrangle provides a flexible syntax that expresses structural transformations across single and multiple data source models. At runtime, Cloud DataPrep interprets the Recipes and executes the corresponding Wrangle code.
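To make the Recipe idea more concrete, here is a minimal Python sketch of a recipe as an ordered list of atomic steps applied to an imported dataset. This is not Wrangle and not the DataPrep API; every function name below (trim_whitespace, drop_if_empty, uppercase, apply_recipe) and the sample rows are hypothetical, chosen only to illustrate the concept.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


def trim_whitespace(column: str) -> Step:
    """Hypothetical atomic step: strip surrounding whitespace in one column."""
    def step(rows: List[Row]) -> List[Row]:
        return [{**r, column: r[column].strip()} for r in rows]
    return step


def drop_if_empty(column: str) -> Step:
    """Hypothetical atomic step: remove rows whose column is an empty string."""
    def step(rows: List[Row]) -> List[Row]:
        return [r for r in rows if r[column] != ""]
    return step


def uppercase(column: str) -> Step:
    """Hypothetical atomic step: normalize one column to upper case."""
    def step(rows: List[Row]) -> List[Row]:
        return [{**r, column: r[column].upper()} for r in rows]
    return step


def apply_recipe(rows: List[Row], recipe: List[Step]) -> List[Row]:
    """Run each step in order; the input rows are never modified in place."""
    for step in recipe:
        rows = step(rows)
    return rows


# Stand-in for an Imported dataset (illustrative data only).
imported = [
    {"name": " Ada ", "country": "uk"},
    {"name": "", "country": "us"},
    {"name": "Grace", "country": " us"},
]

recipe = [
    trim_whitespace("name"),
    trim_whitespace("country"),
    drop_if_empty("name"),
    uppercase("country"),
]

# Stand-in for a Wrangled dataset: the source plus the recipe applied to it.
wrangled = apply_recipe(imported, recipe)
print(wrangled)
```

Keeping each step small and side-effect free mirrors the distinction described above: the imported data stays untouched, while the wrangled dataset is simply the source plus an ordered list of transformations.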
While Recipes represent atomic data cleansing steps, DataPrep Flows are orchestrations involving multiple Recipes. Flows are typically used in more complex data cleansing tasks such as merging multiple datasets or identifying statistical outliers in the source data, among many other data curation processes.
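The two Flow examples mentioned above, merging datasets and identifying statistical outliers, can be sketched in plain Python. This is only an illustration of the idea, not how DataPrep implements Flows; the helper names (join_on, flag_outliers, run_flow), the z-score rule, the 1.5 threshold and the sample data are all assumptions made for this example.

```python
from statistics import mean, stdev
from typing import Any, Dict, List

Row = Dict[str, Any]


def join_on(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Inner join: keep left rows that have a match in right, merging columns."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


def flag_outliers(rows: List[Row], column: str, threshold: float) -> List[Row]:
    """Add an 'outlier' flag using a simple z-score rule (an assumed heuristic)."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), stdev(values)
    return [
        {**r, "outlier": sigma > 0 and abs(r[column] - mu) / sigma > threshold}
        for r in rows
    ]


def run_flow(orders: List[Row], customers: List[Row]) -> List[Row]:
    """A flow here is just several steps chained over more than one dataset."""
    merged = join_on(orders, customers, "customer_id")
    return flag_outliers(merged, "amount", threshold=1.5)


# Illustrative data only.
orders = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c2", "amount": 12.0},
    {"customer_id": "c1", "amount": 11.0},
    {"customer_id": "c3", "amount": 13.0},
    {"customer_id": "c2", "amount": 12.0},
    {"customer_id": "c3", "amount": 95.0},
    {"customer_id": "c9", "amount": 11.0},  # no matching customer, dropped by the join
]
customers = [
    {"customer_id": "c1", "region": "EMEA"},
    {"customer_id": "c2", "region": "AMER"},
    {"customer_id": "c3", "region": "APAC"},
]

result = run_flow(orders, customers)
for row in result:
    print(row)
```

A z-score is a deliberately simple choice; with small samples it is sensitive to the very outliers it is trying to detect, so a real pipeline would likely use a more robust rule.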
Visual Profiling is another key capability of Cloud DataPrep. The platform’s profiling tools include interactive visualizations that highlight key statistical patterns in datasets and recommend appropriate transformations. A rich user experience combined with sophisticated statistical and machine learning algorithms makes Visual Profiling an almost enjoyable experience in Cloud DataPrep, in sharp contrast with the cumbersome profiling process in traditional data quality management stacks.
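As a rough idea of what profiling computes behind the visualizations, the following Python sketch summarizes a single column and derives a couple of suggested transformations. It is a simplified stand-in, not DataPrep’s profiling logic; the functions profile_column and suggest, the rules they apply and the sample data are assumptions made for illustration.

```python
from collections import Counter
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[str]]


def is_number(value: str) -> bool:
    """Return True when the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def profile_column(rows: List[Row], column: str) -> Dict[str, Any]:
    """Compute a few summary statistics for one column (hypothetical helper)."""
    values = [r.get(column) for r in rows]
    present = [v for v in values if v not in (None, "")]
    numeric = [v for v in present if is_number(v)]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "numeric_ratio": len(numeric) / len(present) if present else 0.0,
        "top": Counter(present).most_common(1),
    }


def suggest(profile: Dict[str, Any]) -> List[str]:
    """Turn a profile into suggested transformations (assumed rules)."""
    suggestions = []
    if profile["missing"] > 0:
        suggestions.append("handle missing values")
    if 0 < profile["numeric_ratio"] < 1:
        suggestions.append("standardize mixed types")
    return suggestions


# Illustrative data only.
rows = [{"age": "34"}, {"age": "41"}, {"age": ""}, {"age": "n/a"}, {"age": "34"}]

age_profile = profile_column(rows, "age")
print(age_profile)
print(suggest(age_profile))
```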
Cloud DataPrep completes the circle in Google Cloud’s data science pipeline, which includes services in areas such as storage, transformation, analytics and data science. More importantly, data wrangling services such as DataPrep are a strong differentiator for Google Cloud against competing platforms such as Azure or AWS.
Google Cloud DataPrep is a unique offering in the PaaS market. However, the platform faces competition from a new, fast-growing group of data quality management startups such as Paxata, Alation or Tamr. Additionally, Cloud DataPrep is likely to be perceived as an alternative to traditional data quality management solutions from incumbents such as Informatica Data Services or Microsoft SQL Server Data Quality Services.