The Sequence Scope: Synthetic Data in Machine Learning is Becoming Real

Weekly newsletter that discusses impactful ML research papers, cool tech releases, the money in AI, and real-life implementations.


The Sequence Scope is a summary of the most important published research papers, released technology and startup news in the AI ecosystem in the last week. This compendium is part of TheSequence newsletter. Data scientists, scholars, and developers from Microsoft Research, Intel Corporation, Linux Foundation AI, Google, Lockheed Martin, Cardiff University, Mellon College of Science, Warsaw University of Technology, Universitat Politècnica de València and other companies and universities are already subscribed to TheSequence.

📝 Editorial: Synthetic Data in Machine Learning is Becoming Real

Supervised learning dominates the current ecosystem of machine learning solutions. Among the many challenges of supervised methods, the dependency on large labeled datasets ranks highest. Compiling high-quality labeled datasets is expensive and hard to scale, and many machine learning projects fail for lack of them. The problem is more acute for smaller organizations. Deep generative models have the potential to produce synthetic datasets that match the distribution of real labeled data, which can accelerate the training of machine learning models.
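To make "matching the distribution of labeled datasets" concrete, here is a minimal sketch in Python. It deliberately swaps the deep generative model for a much simpler stand-in, a Gaussian fitted per class, and the toy dataset, function names, and parameters are all assumptions made for illustration rather than anything taken from a specific platform.

```python
import random
import statistics

def fit_class_gaussians(samples):
    """Estimate a per-class mean and standard deviation for a 1-D feature.

    `samples` is a list of (feature_value, label) pairs.
    """
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {
        label: (statistics.mean(values), statistics.stdev(values))
        for label, values in by_label.items()
    }

def generate_synthetic(params, n_per_class, rng):
    """Draw labeled synthetic samples from the fitted per-class Gaussians."""
    return [
        (rng.gauss(mean, std), label)
        for label, (mean, std) in params.items()
        for _ in range(n_per_class)
    ]

rng = random.Random(0)
# Toy "real" labeled dataset (hypothetical): two classes with different means.
real = [(rng.gauss(0.0, 1.0), "a") for _ in range(200)] + \
       [(rng.gauss(5.0, 1.0), "b") for _ in range(200)]

params = fit_class_gaussians(real)
# Generate more labeled samples than were originally collected.
synthetic = generate_synthetic(params, n_per_class=1000, rng=rng)
```

The synthetic set is larger than the real one yet follows the same per-class statistics, which is the property a production generative model would aim for on far richer data such as images.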

Synthetic data generation techniques have mostly remained confined to research efforts, but that’s changing rapidly. This week, machine learning startup Synthetaic announced a new round of funding for its synthetic data generation platform. If synthetic datasets prove successful in mainstream machine learning applications, they could help bridge the gap between the large technology companies that dominate the AI space and the rest of the world. It is certainly an important trend to follow in the next few years.


🔺🔻TheSequence Scope — our Sunday edition with the industry’s development overview — is free. To receive high-quality educational content every Tuesday and Thursday, please subscribe to TheSequence Edge 🔺🔻

🗓 Next week in TheSequence Edge:

Edge#33: the concept of secure multi-party computation (sMPC); Microsoft CRYPTFLOW, an architecture for using sMPC in TensorFlow; Facebook’s CrypTen framework for sMPC implementations in PyTorch.

Edge#34: the concept of homomorphic encryption; Intel’s nGraph-HE that shows how neural networks can operate on homomorphically encrypted data; and Microsoft’s SEAL.

Now, let’s review the most important developments in the AI industry this week.

🔎 ML Research

New Language Milestone

Microsoft’s Turing Universal Language Representation achieved a new milestone by topping the XTREME benchmark ->read more on Microsoft Research blog

Translating Lost Languages

Researchers from MIT published an impressive paper proposing a symbolic method for deciphering ancient languages ->read more on MIT News

Identifying Machine-Generated Text

Salesforce Research published an insightful blog post detailing a RoBERTa-based model for detecting machine-generated text ->read more on Salesforce Research blog

PPL Benchmark

Facebook AI Research published a paper proposing a benchmark for probabilistic programming languages ->read more on the FAIR team blog

🤖 Cool AI Tech Releases


M2M-100

Facebook is open-sourcing M2M-100, a machine translation model capable of translating text between any pair of 100 languages ->read more on the FAIR team blog


FermiNet

DeepMind open-sourced FermiNet, a new framework for simulating quantum physics and chemistry scenarios ->read more on DeepMind blog

Adversarial Threat Matrix

Microsoft and MITRE open-sourced the Adversarial ML Threat Matrix, a framework enabling security analysts to detect, respond to, and remediate threats against ML systems ->read more on Microsoft blog

💸 Money in AI

  • Juniper Networks plans to acquire startup 128 Technology for $450 million. For Juniper, the deal is another step in the consolidation of the SD-WAN sector. A software-defined wide area network (SD-WAN) controls connectivity, management, and services among data centers, remote offices, and cloud instances across a wide geographical area, rather than within a single defined space. Juniper plans to blend 128 Technology into its AI-driven enterprise network portfolio. 128 Technology is the second AI-based networking company Juniper has bought, after its acquisition of Mist Systems in March 2019 for $405 million.
  • Distributed computing startup Anyscale raised $40 million in funding. Its solution is cloud-agnostic, enabling developers to share and run distributed projects, including the large-scale distributed computing involved in AI and machine learning.
  • AI-powered visual search startup Syte raised $40 million. Visual search and recommendation engines are among the most common use cases for AI, and Syte aims to stand out on accuracy. Having amassed what it describes as the largest vertical-specific lexicon in the fashion industry, its system not only recognizes objects within an image but also assigns them a variety of detailed product tags based on their visual attributes, which significantly improves recommendations.
  • AI-powered anti-bias software startup Zest AI has closed a $15 million funding round. To minimize potential model bias, the startup uses a technique called adversarial debiasing. In this setup, one model attempts to predict the creditworthiness of a client, while a second model tries to infer the client’s race, gender, and other attributes that can bias the score. The two models are trained against each other until the second model can no longer predict the attributes of a client.
  • Suspicious activity monitoring startup Unit21 has raised $13 million in funding. Unit21 provides a no-code service to detect fraud, money laundering, and other cross-industry risks. The company uses “100+ out-of-the-box machine learning and threshold-based rules, driven by the flexible case management system”.
  • AI-powered data analytics startup TeleSense raised $10.2 million. TeleSense focuses on crops: it analyzes data to predict crop quality in storage and transit. Sensors collect the information, and the software analyzes it to identify compromised storage conditions such as mold, insects, and moisture, notifying stakeholders if anything is detected.
  • Digital footprint analyzer Mine has raised $9.5 million in a Series A round. It works like a smart data assistant that, once connected to an inbox, uses algorithms to identify companies a user has interacted with via email and generates visualizations showing the type of data collected and the corresponding risk level. It identifies services a client no longer uses and removes the client’s personal data from them.
  • Synthetaic, which we mentioned in the editorial, has raised $3.5 million in seed funding. The startup synthesizes data, specifically images, for training AI models. It is an important trend to observe.
  • Computer vision startup Tiliter raised a $7.5 million Series A round. The startup uses AI to recognize products without barcodes and powers cashierless checkout tech.
  • Visual data security and privacy company Pimloc has raised $1.8 million in a seed funding round. On its website, the team explains in detail which AI tools it uses to recognize and classify faces, objects, and scenes in images and video footage. These include proprietary deep learning neural networks, as well as metadata ingestion, data crawling, image annotation, synthetic data generation, QA/audit, etc.
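The adversarial debiasing idea described in the Zest AI item can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Zest AI's implementation: the toy data, the feature names, the `lam` trade-off parameter, and the choice of two logistic regressions are all hypothetical. The predictor is trained to minimize its own loss while increasing the loss of an adversary that tries to recover a protected attribute `z` from the predictor's output.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def make_toy_data(n, rng):
    """Hypothetical data: z is a protected attribute, x2 is a proxy for it."""
    rows = []
    for _ in range(n):
        z = rng.randint(0, 1)
        x1 = rng.gauss(0.0, 1.0)            # legitimate signal
        x2 = rng.gauss(2.0 * z - 1.0, 1.0)  # correlated with z
        y = 1 if x1 + 0.5 * x2 + rng.gauss(0.0, 0.5) > 0 else 0
        rows.append((x1, x2, y, z))
    return rows

def train(rows, lam, epochs=200, lr=0.05):
    """Predictor: logistic regression on (x1, x2) predicting y.
    Adversary: logistic regression on the predictor's logit predicting z.
    lam = 0 disables the adversarial term (plain logistic regression)."""
    w1 = w2 = b = 0.0   # predictor parameters
    u = c = 0.0         # adversary parameters
    n = len(rows)
    for _ in range(epochs):
        g_w1 = g_w2 = g_b = g_u = g_c = 0.0
        for x1, x2, y, z in rows:
            s = w1 * x1 + w2 * x2 + b
            p = sigmoid(s)
            q = sigmoid(u * s + c)
            # Adversary: gradient of its own cross-entropy loss.
            g_u += (q - z) * s
            g_c += (q - z)
            # Predictor: gradient of (its loss - lam * adversary loss) w.r.t. s.
            ds = (p - y) - lam * (q - z) * u
            g_w1 += ds * x1
            g_w2 += ds * x2
            g_b += ds
        # Simultaneous gradient steps for both models.
        u -= lr * g_u / n
        c -= lr * g_c / n
        w1 -= lr * g_w1 / n
        w2 -= lr * g_w2 / n
        b -= lr * g_b / n
    return (w1, w2, b), (u, c)

rng = random.Random(0)
data = make_toy_data(400, rng)
plain, _ = train(data, lam=0.0)
debiased, adversary = train(data, lam=1.0)
print("weight on proxy feature x2, plain:   ", round(plain[1], 3))
print("weight on proxy feature x2, debiased:", round(debiased[1], 3))
```

Comparing the two printed weights gives a rough sense of how much the adversarial term discourages reliance on the proxy feature. Training of this kind can oscillate, so results depend on `lam`, the learning rate, and the number of epochs.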

Written by

CEO of IntoTheBlock, Chief Scientist at Invector Labs, Guest lecturer at Columbia University, Angel Investor, Author, Speaker.
