This Cool Technology Brings Natural Language Processing to Databases

Jesus Rodriguez
Sep 20, 2017

Seq2SQL is one of the coolest developments in natural language processing (NLP) for business applications in the last few years. Created by Salesforce.com, Seq2SQL translates natural language questions into SQL, allowing users to query information stored in databases without writing the queries themselves.

Using natural language to interact with databases has long been an elusive goal of information workers. However, recent advancements in NLP technologies have addressed most of the limitations that caused the first wave of technologies in the space to be notoriously ineffective. Seq2SQL now represents the initial step towards making natural language a universal vehicle to interact with databases.

The specifics of Seq2SQL were outlined in a recent research paper published by Salesforce. One of the most notable contributions of Seq2SQL is the use of reinforcement learning: generated queries are executed against the database, and the outcome serves as a reward signal that improves the behavior of the model over time.

WikiSQL is another cool addition to the Seq2SQL stack. Conceptually, WikiSQL is a dataset of natural language questions paired with SQL queries, which can be used to train systems that connect natural language to databases. As you might have guessed, WikiSQL draws its underlying tables from Wikipedia. Interestingly enough, Salesforce leveraged crowdsourcing powered by Amazon Mechanical Turk to label the Wikipedia data so that it can be used in natural language queries.
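To make the reinforcement learning idea more concrete, here is a minimal sketch of an execution-based reward in Python. It is not the Salesforce implementation: the helper name `execution_reward`, the `players` table, and the use of SQLite are all assumptions made for illustration, and the reward values (-2 for an invalid query, -1 for a wrong result, +1 for a correct result) should be read as one plausible scheme.

```python
import sqlite3


def execution_reward(conn, predicted_sql, gold_sql):
    """Score a predicted query by running it (hypothetical helper)."""
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return -2  # the prediction is not a valid query for this schema
    # Sort both results so that row order does not affect the comparison.
    same = sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)
    return 1 if same else -1


# Toy table, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Ann", "Red", 30), ("Bob", "Blue", 12), ("Cy", "Red", 7)],
)

gold = "SELECT name FROM players WHERE team = 'Red'"
reward_ok = execution_reward(conn, "SELECT name FROM players WHERE team = 'Red'", gold)
reward_wrong = execution_reward(conn, "SELECT name FROM players", gold)
reward_invalid = execution_reward(conn, "SELECT nme FROM players", gold)
```

The point of scoring by execution rather than by comparing query text is that a differently worded query that returns the right rows still earns the positive reward.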

Some Ideas for the Seq2SQL Short-Term Roadmap

The release of Seq2SQL is a very exciting development that should streamline the application of natural language in database systems. However, the integration of NLP and line-of-business systems is far from trivial from a technical standpoint. Here are a few ideas to consider for the short-term roadmap of Seq2SQL:

1 — Training Tools

Tools that allow data scientists to train Seq2SQL to interact with heterogeneous databases and line-of-business systems are a must in order to lower the entry point for the adoption of the technology. Ideally, some of the knowledge built during specific training exercises could also be reused by other instances of Seq2SQL.

2 — Ambiguity Challenge #1: Single-Query and Multiple-Answers

Ambiguity remains one of the biggest challenges in integrating natural language and database systems. One manifestation of ambiguity is that a tool like Seq2SQL can produce multiple “correct answers” in response to the same natural language query. Curation mechanisms and tools can help to address this type of behavior in Seq2SQL solutions.

3 — Ambiguity Challenge #2: Multiple-Queries and Same-Answer

Another example of ambiguity in tools like Seq2SQL is represented by multiple queries with the same intent that produce the same output. Reinforcement learning and curation tools are essential to address this behavior.
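One way a curation tool could surface both kinds of ambiguity is to execute candidate queries and group them by the rows they return. The sketch below is only an illustration under stated assumptions: `group_by_result`, the `orders` table, and the three candidate queries are hypothetical and are not part of Seq2SQL.

```python
import sqlite3
from collections import defaultdict


def group_by_result(conn, candidate_queries):
    """Group SQL strings by the rows they return (hypothetical helper)."""
    groups = defaultdict(list)
    for sql in candidate_queries:
        rows = conn.execute(sql).fetchall()
        # A sorted tuple of rows is hashable and ignores row order.
        key = tuple(sorted(rows, key=repr))
        groups[key].append(sql)
    return dict(groups)


# Toy table, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Acme", 120), ("Acme", 150), ("Zen", 200)],
)

# Hypothetical candidates for the question "Who is the biggest customer?"
candidates = [
    "SELECT customer FROM orders ORDER BY amount DESC LIMIT 1",
    "SELECT customer FROM orders WHERE amount = (SELECT MAX(amount) FROM orders)",
    "SELECT customer FROM orders GROUP BY customer ORDER BY SUM(amount) DESC LIMIT 1",
]
groups = group_by_result(conn, candidates)
```

Here the first two candidates land in the same group (different queries, same answer), while the third forms its own group because it reads "biggest" as total spend rather than largest single order (one question, multiple defensible answers). More than one group is a signal that a human curator or a clarifying question may be needed.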

4 — Support for Different Languages

Enabling translations from multiple languages to SQL is also going to become an interesting challenge for the mainstream adoption of Seq2SQL. Fortunately, advanced NLP stacks today provide great support for a large number of natural languages and dialects.

5 — Response Narrative

Today, Seq2SQL is mostly conceived as an interface that processes natural language queries and produces data outputs. In a future version, the framework should consider producing narratives based on those data outputs. That capability will be key to enabling voice interfaces that leverage Seq2SQL.
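As a rough illustration of what a response narrative could look like, the sketch below turns a query result into a sentence using simple templates. The function `narrate` and its wording are assumptions made for this example; a production system would likely rely on a proper natural language generation component rather than fixed templates.

```python
def narrate(question, columns, rows, max_rows=3):
    """Turn a tabular query result into a short sentence (hypothetical helper)."""
    if not rows:
        return f"No results were found for: {question}"
    if len(rows) == 1 and len(columns) == 1:
        # A single value can be read back directly.
        return f"The {columns[0]} is {rows[0][0]}."
    # Otherwise summarize the count and read out the first few rows.
    shown = rows[:max_rows]
    parts = [", ".join(f"{c} {v}" for c, v in zip(columns, row)) for row in shown]
    return f"Found {len(rows)} results, including: " + "; ".join(parts) + "."


single = narrate("Who scored the most points?", ["name"], [("Ann",)])
several = narrate("Who plays for Red?", ["name", "points"], [("Ann", 30), ("Cy", 7)])
empty = narrate("Who plays for Green?", ["name"], [])
```

A voice interface could pass strings like these to a text-to-speech engine instead of reading out a raw table.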
