Because they make database querying accessible to non-expert (or casual) users, natural language interfaces (NLIs) for databases have gained significant traction in both industry and academia.
One of the most important characteristics of an NLI is that users do not have to know the underlying query language. All they have to do is formulate the question in natural language. However, some NLIs restrict the input and impose constraints on how it is phrased.
For example, SODA (Search Over DAta warehouse) accepts only English keywords. The input question in this system could look as follows:
SODA = movie Brad Pitt
This means SODA depends on an “English keywords” constraint.
In contrast, ATHENA can handle full English sentences. For the same example, the question would be something like:
ATHENA = Show me all movies with the actor Brad Pitt.
Ginseng, however, strictly guides the user toward a complete, correct English sentence:
Ginseng = What are the movies with the actor Brad Pitt?
In each case, the corresponding SQL statement is generated according to the schema of the underlying database.
NLI systems developed in recent years use very different methodologies. In this blog post, we categorize these systems based on the methodology they follow.
As proposed in the paper “A comparative survey of recent natural language interfaces for databases”, we can distinguish four categories: keyword-based, pattern-based, parsing-based, and grammar-based systems.
In this first part, we will talk about the first two categories.
The distinct feature of the keyword-based approach is its simplicity. To achieve it, these systems match the keywords of the input question against inverted indexes of both the base data and the metadata.
SODA is a fitting example of this approach: it is an NLI that expects only keywords from the user. The base data in SODA consists of the relational database. SODA starts with a lookup step that checks the keywords against the inverted indexes and then returns all the nodes in the metadata graph that match the keywords.
Credit: A comparative survey of recent natural language interfaces for databases .
SODA then uses a simple heuristic to assign a score to each solution of the lookup and to identify which tables are used by each of these solutions. Using this information, it generates a reasonable, executable SQL query.
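The lookup and scoring steps described above can be sketched roughly as follows. This is a minimal illustration, not SODA's actual implementation: the two toy indexes, the function names `lookup` and `score_tables`, and the scoring weights are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy indexes: keyword -> list of (table, column) entries.
# METADATA_INDEX covers schema terms, BASE_DATA_INDEX covers cell values.
METADATA_INDEX = {
    "movie": [("movie", None)],
    "actor": [("actor", None)],
    "rating": [("movie", "rating")],
}
BASE_DATA_INDEX = {
    "brad pitt": [("actor", "name")],
}

def lookup(keywords):
    """Return, for each keyword, the (table, column, source) entries it matches."""
    matches = {}
    for kw in keywords:
        key = kw.lower()
        hits = [(t, c, "metadata") for t, c in METADATA_INDEX.get(key, [])]
        hits += [(t, c, "base") for t, c in BASE_DATA_INDEX.get(key, [])]
        matches[kw] = hits
    return matches

def score_tables(matches):
    """Toy heuristic (an assumption): a metadata hit counts 2, a base-data hit counts 1."""
    scores = defaultdict(int)
    for hits in matches.values():
        for table, _column, source in hits:
            scores[table] += 2 if source == "metadata" else 1
    return dict(scores)

found = lookup(["movie", "Brad Pitt"])
print(found)
print(score_tables(found))
```

The tables with the highest scores would then be the candidates for the FROM clause of the generated SQL query; the join and filter generation is left out of this sketch.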
One weakness of SODA is that it handles comparison operations such as “greater than” or “less than” through simple pattern recognition. For example, to retrieve all movies with a rating greater than 9, the input must be written exactly as ‘rating > 9’. SODA also uses a very strict syntax for aggregation operators. Such patterns can be useful, but they are not how people usually phrase questions in natural language.
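To see why such strict patterns are brittle, consider a small sketch of pattern-based comparison recognition. The regular expression and the function name `parse_comparison` are assumptions for illustration and are not taken from SODA.

```python
import re

# Hypothetical strict pattern: <column> <operator> <number>, e.g. "rating > 9".
COMPARISON = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<|=)\s*(\d+(?:\.\d+)?)\s*$")

def parse_comparison(text):
    """Return (column, operator, value) if the text fits the strict pattern, else None."""
    m = COMPARISON.match(text)
    if m is None:
        return None
    column, op, value = m.groups()
    return column, op, float(value)

print(parse_comparison("rating > 9"))
print(parse_comparison("rating greater than 9"))
```

The first input fits the pattern, while the natural phrasing “rating greater than 9” is rejected, which is exactly the limitation described above.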
Précis is also a keyword-based NLI for relational databases, but it has the additional strength of supporting multiple terms combined through operators such as AND, OR, and NOT.
For example, an input query like ‘Show me all drama and comedy movies.’ would be formulated as ‘“drama” OR “comedy”’.
In the end, the common weakness remains the limited expressiveness of keyword-based NLIs.
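As a rough illustration of how such a Boolean keyword query could be evaluated, here is a minimal sketch. The nested-tuple representation and the function name `matches` are assumptions for this example; they do not reflect how Précis is implemented.

```python
def matches(query, keywords):
    """Evaluate a Boolean keyword query against a set of lowercase keywords.

    A query is either a single term (str) or a tuple such as
    ("OR", "drama", "comedy"), ("AND", ...), or ("NOT", term).
    """
    if isinstance(query, str):
        return query.lower() in keywords
    op, *args = query
    if op == "AND":
        return all(matches(a, keywords) for a in args)
    if op == "OR":
        return any(matches(a, keywords) for a in args)
    if op == "NOT":
        return not matches(args[0], keywords)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical movies with their genre keywords.
movies = {
    "Movie A": {"drama", "crime"},
    "Movie B": {"comedy"},
    "Movie C": {"action"},
}
query = ("OR", "drama", "comedy")
print([title for title, genres in movies.items() if matches(query, genres)])
```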
Pattern-based systems extend the keyword-based approach with natural language patterns in order to answer more complex questions, such as those involving concepts or aggregations.
To indicate an aggregation, the system needs at least some linking phrase between the keywords. This can be a non-keyword token (a trigger word) such as ‘by’.
Most of these systems depend heavily on user input to resolve ambiguities.
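A trigger word like ‘by’ could be handled along the following lines. This is only a sketch under simple assumptions: the function name `build_aggregation`, the choice of COUNT as the aggregate, and the rule that the token right after ‘by’ names the grouping column are all made up for this example.

```python
def build_aggregation(query, table):
    """Build a GROUP BY query if the trigger word 'by' is followed by a column name.

    Toy example only: identifiers are not validated against a schema.
    """
    tokens = query.lower().split()
    if "by" not in tokens:
        return None  # no trigger word, so no aggregation is detected
    pos = tokens.index("by")
    if pos + 1 >= len(tokens):
        return None  # 'by' is the last token; nothing to group on
    group_col = tokens[pos + 1]
    return f"SELECT {group_col}, COUNT(*) FROM {table} GROUP BY {group_col}"

print(build_aggregation("movies by genre", "movie"))
print(build_aggregation("movies genre", "movie"))
```

Without the trigger word, the second input is just two keywords and no aggregation is recognized.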
For example, NLQ/A is an NLI for querying a knowledge graph. The trick in this system is that it does not depend on NLP technologies such as parsers or PoS taggers.
The reason is that errors made by these technologies are likely to cause the whole system to fail. For example, a parse tree can help with certain queries, but if the parse tree is wrong, the system will fail even on simpler questions.
Instead, NLQ/A relies on user interaction to resolve all ambiguities and uses a greedy approach for this interaction process. In NLQ/A, the phrases are expanded using a synonym dictionary and then mapped to the knowledge graph. There can be multiple candidate mappings, so in the next step the system tries to find the true meaning of the input query with the help of the user. To reduce the number of interactions, a phrase dependency graph (PDG) is proposed.
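The expansion and mapping steps can be sketched as follows. This is a simplified illustration, not NLQ/A's algorithm: the dictionaries, node names, and the functions `candidate_mappings` and `resolve` are hypothetical, and the sketch reproduces neither the greedy ordering of questions nor the phrase dependency graph.

```python
# Hypothetical synonym dictionary and knowledge-graph vocabulary.
SYNONYMS = {"film": ["movie"], "star": ["actor"]}
GRAPH_NODES = {"movie": ["Movie"], "actor": ["Actor", "VoiceActor"]}

def candidate_mappings(phrase):
    """Expand a phrase with its synonyms and collect all matching graph nodes."""
    terms = [phrase] + SYNONYMS.get(phrase, [])
    candidates = []
    for term in terms:
        for node in GRAPH_NODES.get(term, []):
            if node not in candidates:
                candidates.append(node)
    return candidates

def resolve(phrases, ask_user):
    """Map each phrase to one node, asking the user only when several candidates exist."""
    resolved = {}
    for phrase in phrases:
        candidates = candidate_mappings(phrase)
        if len(candidates) == 1:
            resolved[phrase] = candidates[0]
        elif len(candidates) > 1:
            resolved[phrase] = ask_user(phrase, candidates)
    return resolved

# Stand-in for a real dialog: always pick the first candidate.
print(resolve(["film", "star"], lambda phrase, options: options[0]))
```

In this toy run only “star” is ambiguous, so it is the only phrase that triggers a question to the user.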
Credit: A comparative survey of recent natural language interfaces for databases .
Nevertheless, one weakness of this system is that it still needs more than one user interaction to resolve ambiguities.
In this first part, we have talked about the first two categories. In the second part, we will talk about the last two categories of natural language interfaces for databases.