💬 Edge#21: Question-answering models; 300,000 natural questions in the new dataset; and the DeepPavlov framework
In this issue:
we discuss the concept of question-answering models;
we overview a paper in which the Google Research team presents a new dataset for training and evaluating question-answering systems;
we evaluate DeepPavlov, an open-source framework for advanced NLU methods, including question-answering.
Enjoy the learning!
💡 ML Concept of the Day: Question-Answering Models
Natural language understanding (NLU) is one of the deep learning disciplines that has experienced remarkable growth in the last few years. Among the NLU disciplines receiving lots of attention from the research community, question-answering models are close to the top of the list. Just from the name, it's not hard to imagine what question-answering models do. Conceptually, question-answering focuses on building models that can infer answers to questions formulated in natural language against a specific dataset. Question-answering models have been part of machine learning since its inception, but they have certainly experienced a renaissance with the emergence of deep learning.
The universe of question-answering models can be divided into two main groups: closed-domain and open-domain. Closed-domain question-answering models focus on answering a limited set of questions about a specific topic or domain. These techniques have been used to power machine reading applications in fields like telemedicine, as well as research and discovery applications in the legal field. Open-domain question-answering is a more interesting and considerably more complex challenge. The idea of open-domain question-answering is to create models that can answer questions about any topic and across an arbitrary set of documents.
Think about a question such as “What is the Aquaman actor’s next movie?”. While semantically simple, answering that question requires a level of information discovery and comprehension to first associate Jason Momoa with the already released Aquaman movie and then search for his next movie. And all that needs to be done across a series of documents that haven’t been previously labeled, which requires some level of machine comprehension.
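The two-hop reasoning behind that question can be pictured with a toy sketch. Everything below is illustrative: a real open-domain system would retrieve and read unlabeled documents rather than query hand-built dictionaries, and the knowledge-base entries (including the next-movie title) are placeholders, not real data.

```python
from typing import Optional

# Toy knowledge base for the two-hop question
# "What is the Aquaman actor's next movie?".
# Entries are illustrative placeholders, not real release data.
movie_to_actor = {"Aquaman": "Jason Momoa"}          # hop 1: movie -> lead actor
actor_to_next_movie = {"Jason Momoa": "NextFilmTitle"}  # hop 2: actor -> next movie

def answer_next_movie(movie: str) -> Optional[str]:
    """Resolve the movie's lead actor, then look up that actor's next film."""
    actor = movie_to_actor.get(movie)        # first hop of reasoning
    if actor is None:
        return None
    return actor_to_next_movie.get(actor)    # second hop of reasoning

print(answer_next_movie("Aquaman"))
```

The hard part of open-domain question-answering is that these "lookup tables" do not exist in advance; the model has to build the equivalent associations on the fly from raw, unlabeled text.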
Both closed and open-domain question-answering are active areas of research in the deep learning space, but open-domain scenarios represent a more fascinating challenge. In recent years, we have seen major breakthroughs in the open-domain question-answering space with techniques such as recurrent neural networks and transformer architectures. Some question-answering libraries are already included in major deep learning frameworks.
🔎 ML Research You Should Know: A Question-Answering Benchmark
In the paper Natural Questions: a Benchmark for Question Answering Research, Google Research presents a new dataset for training and evaluating question-answering systems.
The objective: Present a new dataset and methodology that helps data scientists build question-answering machine learning models.
Why is it so important: Question-answering research is relatively new and, as a result, there are not many high-quality datasets that can be used to train and evaluate models in the field.
Diving deeper: Open-domain question-answering is one of the most challenging tasks in the NLU space. Conceptually, open-domain question-answering focuses on emulating the way people look for information: finding answers to questions by reading and understanding entire documents. While research in open-domain question-answering has advanced significantly in the last few years, training and implementing these types of models remain challenging. In part, this is due to the limited number of training datasets currently available for open-domain question-answering. Assembling a high-quality labeled dataset for open-domain question-answering is a very labor-intensive exercise and is highly dependent on human annotators.
In its research paper, Google unveiled Natural Questions (NQ), a dataset optimized for the evaluation of open-domain question-answering models. The initial release contained over 300,000 natural questions with the corresponding human-annotated answers. Additionally, the NQ dataset includes over 16,000 examples where the answers are provided by five different annotators, which facilitates the evaluation of question-answering models.
In order to build NQ, Google leveraged a group of over 50 annotators. Each annotation task was divided into three main stages:
Question Identification: Annotators determine whether a question is factual or ambiguous.
Long Answer Identification: Annotators select paragraphs that provide a long-form answer to factual questions.
Short Answer Identification: Annotators identify a short-form answer to a factual question, when one exists.
Additionally, the answers are evaluated by experts in order to assess their correctness. The result is a dataset that can be used to train machine learning systems in almost any domain. Google maintains an open-source version of NQ that can be used to train open-domain question-answering models in modern deep learning frameworks.
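The output of those annotation stages is a record that pairs a question with long- and short-answer spans pointing into the source document. The sketch below processes one such record; the field names loosely follow NQ's released simplified format (token offsets into the document text), but the record itself is invented for illustration.

```python
import json

# Invented record in the spirit of NQ's simplified format: a question, the
# source document as whitespace-separated tokens, and annotations that point
# into it via token offsets. Field names loosely follow the released dataset;
# the content is made up.
record = json.loads("""
{
  "question_text": "who published the natural questions benchmark",
  "document_text": "Natural Questions is a benchmark published by Google Research .",
  "annotations": [
    {"long_answer": {"start_token": 0, "end_token": 10},
     "short_answers": [{"start_token": 7, "end_token": 9}]}
  ]
}
""")

tokens = record["document_text"].split()
ann = record["annotations"][0]

# Recover the long-form answer (a whole passage)...
la = ann["long_answer"]
long_answer = " ".join(tokens[la["start_token"]:la["end_token"]])

# ...and the short-form answer (a span inside that passage).
sa = ann["short_answers"][0]
short_answer = " ".join(tokens[sa["start_token"]:sa["end_token"]])

print(long_answer)
print(short_answer)
```

The long/short distinction is what lets a single dataset evaluate both passage-level reading comprehension and precise span extraction.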
🤖 ML Technology to Follow: DeepPavlov is an Open-Source Framework for Advanced NLU Methods Including Question-Answering
Why should I know about this: Even though NLU research in areas such as question-answering is rapidly advancing, there is still a lack of general availability of those new methods in mainstream deep learning frameworks. Researchers from the Moscow Institute of Physics and Technology created DeepPavlov to bridge the gap between advanced NLU research and mainstream applications.
What is it: Functionally, DeepPavlov is a framework for building advanced NLU models. The framework is built on top of TensorFlow, Keras and PyTorch, offering data scientists a certain level of flexibility when it comes to choosing their deep learning development stack. DeepPavlov has been designed by NLU researchers for NLU researchers. There are plenty of NLU frameworks and platforms in the market, but most of them treat every machine learning model as an isolated effort and struggle to rapidly incorporate the latest NLU models that interest researchers. DeepPavlov’s three main differentiators are the following:
Provide a single programming model for a variety of NLU tasks such as question-answering, text classification, entity linking, spelling correction and many others.
Provide a series of pre-trained models such as BERT that can be rapidly incorporated into NLU applications.
Include datasets that can be used in the training of new NLU models.
Functionally, DeepPavlov is based on three fundamental components:
A set of pre-trained NLU models and templates.
Tools for integrating NLU models with third-party applications.
A benchmarking environment with datasets that can be used to evaluate NLU models.
The DeepPavlov framework is based on three key architecture building blocks. The smallest unit in the framework is called Component, which is a reusable functional block. Components can be assembled into Models or Skills, which solve specific NLU tasks. Models can be any type of machine learning model, not necessarily focused on textual inputs and outputs. Skills are implementations of specific NLU tasks.
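The Component/Skill layering can be pictured with a minimal sketch in plain Python. To be clear, the class and function names below are illustrative, not DeepPavlov's actual API: the point is only that small reusable blocks compose into a pipeline that solves one NLU task end to end.

```python
from typing import Callable, List

# Illustrative sketch of DeepPavlov's layering; these names are NOT the
# framework's real API. A Component is a small reusable functional block.
Component = Callable[[str], str]

def lowercase(text: str) -> str:            # a preprocessing Component
    return text.lower()

def strip_punctuation(text: str) -> str:    # another Component
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

class Skill:
    """Chains Components into a pipeline that solves one NLU task."""
    def __init__(self, components: List[Component]):
        self.components = components

    def __call__(self, text: str) -> str:
        for component in self.components:
            text = component(text)
        return text

# Compose Components into a toy text-normalization Skill.
normalize = Skill([lowercase, strip_punctuation])
print(normalize("What is Question-Answering?"))
```

In the real framework, the same composition idea applies to much heavier blocks, such as a tokenizer Component feeding a pre-trained BERT Model inside a question-answering Skill.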
Image credit: DeepPavlov.AI
DeepPavlov includes a large set of pre-trained models and skills that allow researchers to rapidly start working on NLU tasks. Additionally, the platform provides integration with many machine learning platforms and frameworks, which enables the implementation of more sophisticated machine learning applications.
How can I use it: DeepPavlov is open source and available at https://github.com/deepmipt/DeepPavlov
🧠 The Quiz
Now, to our regular quiz. After ten quizzes, we will reward the winners. The questions are the following:
What’s the difference between closed and open-domain question-answering models?
What’s the main use case for DeepPavlov in NLU solutions?
Please use an email you are signed up with, so we can track your success.
That was fun! Thank you. See you on Thursday 😉
TheSequence is a summary of groundbreaking ML research papers, engaging explanations of ML concepts, and exploration of new ML frameworks and platforms. TheSequence keeps you up to date with the news, trends, and technology developments in the AI field.
5 minutes of your time, 3 times a week – you will steadily become knowledgeable about everything happening in the AI space. Make it a gift for those who can benefit from it.