Machine Learning and Data Governance 2.0

Data governance and data quality management are some of the most painful aspects of enterprise data solutions. For decades, data platform vendors such as Informatica, Oracle and Microsoft have tried to solve the data governance and quality management challenges, but none of their solutions has achieved mainstream success.

Traditional data governance and data quality management platforms have typically relied on static rules that execute against well-defined data sources. Products such as SQL Server Data Quality Services or Informatica Data Services provide Excel-like interfaces that allow data stewards to identify patterns in datasets and create rules for data filtering and curation. Not surprisingly, these types of platforms have proven too constrained to operate in modern data environments.

The limitations of traditional data governance and quality management platforms, together with their lack of mainstream adoption, made the entire market fall out of favor in the enterprise space. Recently, a new generation of startups has been reimagining the data governance and quality management market using new techniques such as machine learning and advanced data visualizations. Platforms such as Trifacta, Paxata, Tamr or Alation are great examples of this generation of technologies. Let’s call this movement data governance 2.0.

Conceptually, data governance 2.0 can be seen as a combination of three fundamental capabilities: data capture and quality management, data discovery and exploration, and data security and governance. I know that data governance purists might have another 100 capabilities to add to my definition but, in my experience, most of those can be encapsulated into one of these three main groups.

1 — Data Capture and Quality Management

Data governance 2.0 platforms such as Trifacta or Paxata remove the friction of human-created data quality rules by using machine learning and sophisticated statistical models that facilitate the exploration of data sources, detect hidden patterns and formulate rules that curate and control data quality. By leveraging machine learning, data governance 2.0 platforms are not only able to create more advanced data quality management models but also adapt to changes in registered data sources without requiring heavy human intervention.
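
To make the idea of formulating rules from the data itself a bit more concrete, here is a minimal sketch in Python. It is a deliberately simple statistical stand-in, not how Trifacta or Paxata actually work: it reduces each value to a character "shape", treats the dominant shape as the inferred rule, and flags values that break it. The function names, the 0.8 support threshold and the sample ZIP codes are all my own assumptions for illustration.

```python
from collections import Counter


def shape(value: str) -> str:
    """Reduce a value to its character-class shape, e.g. 'AB-123' -> 'AA-999'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep punctuation and spaces as-is
    return "".join(out)


def infer_rule(values, min_support=0.8):
    """Return the dominant shape if it covers at least min_support of the values.

    Returns None when no single shape is common enough to be treated as a rule.
    The threshold is a hypothetical default, not a recommended setting.
    """
    if not values:
        return None
    counts = Counter(shape(v) for v in values)
    pattern, hits = counts.most_common(1)[0]
    return pattern if hits / len(values) >= min_support else None


def violations(values, rule):
    """List the values whose shape does not match the inferred rule."""
    return [v for v in values if shape(v) != rule]


# Hypothetical sample column: one entry has a letter 'O' typed instead of a zero.
zip_codes = ["10001", "94105", "60601", "02139", "3O301", "73301"]
rule = infer_rule(zip_codes)
suspect = violations(zip_codes, rule)
```

Nobody wrote the "five digits" rule by hand here; it falls out of the data, and re-running the inference on fresh data is how a rule like this could follow changes in the source.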

2 — Data Exploration and Discovery

How can we implement advanced analytics or machine intelligence (MI) solutions when we don’t even know what data sources are available in our organization? Data catalogs have become a very powerful mechanism to enable data exploration and discovery in enterprise settings. Tamr Data Catalog is one of the most innovative data discovery platforms in the market. Azure Data Catalog is another interesting initiative in the PaaS space.
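
At its core, a data catalog is a searchable registry of metadata about data sources. The sketch below shows that core idea in a few lines of Python. It is a toy in-memory model and does not reflect the API of Tamr or Azure Data Catalog; the class names, fields and sample datasets are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetEntry:
    """Metadata describing one registered data source (hypothetical fields)."""
    name: str
    owner: str
    description: str
    columns: List[str] = field(default_factory=list)


class DataCatalog:
    """A toy in-memory catalog: register datasets, then search their metadata."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Return sorted names of datasets whose name, description or columns mention keyword."""
        needle = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in c.lower() for c in e.columns)
        )


catalog = DataCatalog()
catalog.register(DatasetEntry("crm.customers", "sales-ops",
                              "Customer master records",
                              ["customer_id", "email", "region"]))
catalog.register(DatasetEntry("web.clickstream", "analytics",
                              "Raw page view events",
                              ["session_id", "customer_id", "url"]))
catalog.register(DatasetEntry("finance.invoices", "finance",
                              "Issued invoices",
                              ["invoice_id", "amount"]))

customer_sources = catalog.search("customer")
```

Searching for "customer" surfaces both the CRM table and the clickstream data, which is exactly the kind of discovery an analyst needs before starting a new model.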

3 — Data Governance and Security

Data privacy and access control policies are an essential element of data governance 2.0 platforms. Machine learning again plays an important role in this capability, as models can detect vulnerabilities in data sources and apply or recommend the appropriate security policies. Alation is a platform that has been pioneering this new type of data governance and security model.
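
As a rough illustration of the detect-then-recommend loop, the Python sketch below scans sample values from each column, labels columns that look sensitive and maps each label to a policy. Regular expressions stand in for the trained models a real platform would use, and nothing here reflects how Alation works; the detectors, policy names and 0.9 threshold are assumptions of mine.

```python
import re

# Hypothetical detectors: simple regexes standing in for a learned classifier.
DETECTORS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

# Hypothetical mapping from a detected data type to a recommended policy.
POLICIES = {"email": "mask", "us_ssn": "restrict"}


def classify_column(samples, threshold=0.9):
    """Return the first label whose detector matches at least threshold of the samples."""
    if not samples:
        return None
    for label, pattern in DETECTORS.items():
        matches = sum(1 for s in samples if pattern.match(s))
        if matches / len(samples) >= threshold:
            return label
    return None


def recommend_policies(table):
    """Map each column name to a recommended policy; 'allow' when nothing sensitive is found."""
    return {
        column: POLICIES.get(classify_column(samples), "allow")
        for column, samples in table.items()
    }


# Hypothetical sample of column values pulled from a registered data source.
table = {
    "contact": ["ana@example.com", "li@example.org"],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "city": ["Lisbon", "Osaka"],
}
recommendations = recommend_policies(table)
```

The recommendations are driven by what the data looks like rather than by what the column is called, so a sensitive field hiding behind an innocent name such as "contact" still gets flagged.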

CEO of IntoTheBlock, Chief Scientist at Invector Labs, I write The Sequence Newsletter, Guest lecturer at Columbia University, Angel Investor, Author, Speaker.
