📕 Edge#23: Machine Reading Comprehension, SQuAD 2.0 from Stanford University; and the spaCy framework
TheSequence is a unique newsletter that helps you build and reinforce your knowledge about machine learning and AI, five minutes at a time.
This is an example of our Premium newsletter TheSequence Edge. Become a paying subscriber to receive it every Tuesday and Thursday. Until September 30th, it’s only $40/year.
Let’s dive into AI knowledge.
In this issue:
we explore the concept of Machine Reading Comprehension;
we evaluate the SQuAD 2.0 dataset from Stanford University;
we discuss the spaCy framework.
Enjoy the learning!
💡 ML Concept of the Day: What is Machine Reading Comprehension?
In TheSequence Edge#21, we covered question-answering (QA) models, currently one of the most active research disciplines in natural language understanding (NLU). Today, we would like to expand on QA models by focusing on another discipline that is getting a lot of attention from the deep learning research community: machine reading comprehension (MRC). Conceptually, MRC aims to replicate humans’ cognitive ability to understand a text with little or no previous context. If we want to test someone’s understanding of a given text, we typically ask questions with varying degrees of complexity. MRC aims to recreate that ability in deep learning models.
MRC can be seen as a subset of the QA field. In some contexts, MRC problems can be formulated as a combination of QA and text-generation (TG) methods, the latter being needed to produce textual answers to questions. Arguably, the biggest difference between QA and MRC models is the ability of the latter to expand beyond fact-finding questions. For instance, given a set of documents, a typical QA model could effectively answer factual questions such as “what is this?” or “who did that?”. However, the same model might fail on questions that require a certain level of reasoning, such as “what should…?” or “how do I…?”. Those types of questions are a typical example of “knowing what you don’t know” and are the core focus of MRC models.
From a machine learning standpoint, MRC methods are supervised models trained on groups of question-answer pairs as well as a contextual representation of a problem. After that, given a question and a context, an MRC model will attempt to formulate the correct answers, many of which could be in long-text form. In recent years, MRC has become a very active area of research, triggering the creation of many datasets and benchmarks to evaluate the effectiveness of MRC models. Furthermore, many MRC methods have already been implemented in mainstream deep learning frameworks.
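To make that concrete, here is a minimal, hedged sketch of querying such a model. It assumes the Hugging Face Transformers library and the publicly available deepset/roberta-base-squad2 checkpoint (a RoBERTa model fine-tuned on SQuAD 2.0); neither is mentioned above, and any other SQuAD 2.0-style checkpoint would work the same way.

```python
# Sketch only: an MRC-style extractive QA model fine-tuned on SQuAD 2.0.
# Assumes `pip install transformers` and the deepset/roberta-base-squad2
# checkpoint; both are illustrative choices, not part of this newsletter.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = (
    "SQuAD 2.0 combines the questions of the original SQuAD dataset with "
    "over 50,000 unanswerable questions written by crowdworkers."
)

# An answerable question: the answer span is present in the context.
print(qa(question="How many unanswerable questions does SQuAD 2.0 add?",
         context=context))

# An unanswerable question: with handle_impossible_answer=True the pipeline
# is allowed to return an empty answer instead of guessing a plausible span.
print(qa(question="Who founded Stanford University?",
         context=context,
         handle_impossible_answer=True))
```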
🔎 ML Research You Should Know: SQuAD 2.0 is One of the Top Datasets for MRC Systems
In the paper Know What You Don’t Know: Unanswerable Questions for SQuAD, researchers from Stanford University presented SQuAD 2.0, a new dataset for training MRC models.
The objective: Expand the previous version of the SQuAD dataset with a series of highly contextualized questions that evaluate the proficiency of MRC models.
Why is it so important: There are only a few datasets that are viable for training MRC models. SQuAD 2.0 has been established as one of the go-to options for researchers exploring MRC methods.
Diving deeper: The first version of the Stanford Question Answering Dataset (SQuAD) consisted of question-answer pairs used to train NLU models. The architecture of SQuAD 1.0 was focused on traditional question-answering (QA) systems that can locate the right answer to a question within a given set of documents. Those models, however, often fail to find the right answer when it is not explicitly stated in the documents or, in other words, when the answer requires some “reasoning”. Addressing those scenarios was the main objective of SQuAD 2.0.
SQuAD 2.0 expands the original value proposition of SQuAD 1.0 with over 50,000 questions that require a certain level of machine reading comprehension (MRC). The SQuAD 2.0 benchmark focuses not only on reasoning through text in order to find the right answer, but also on determining when there is no answer at all.
To create SQuAD 2.0, the Stanford researchers crowdsourced the creation of “unanswerable questions” based on a previous version of the dataset. Each crowdworker performed a series of tasks that consisted of taking an article from the dataset and generating, for each paragraph, five questions that cannot be answered by just reading that paragraph, requiring a level of comprehension of the entire article. The questions had to incorporate semantic tricks such as negation, antonyms, or mutually exclusive expressions. The following figure shows some of the “unanswerable questions” and plausible but wrong answers included in the SQuAD 2.0 dataset.
Image credit: The Original Paper
SQuAD 2.0 is one of the richest and most widely used datasets for MRC scenarios. QA models trained on the SQuAD 2.0 dataset exhibit levels of reasoning far superior to those of models trained on vanilla QA datasets.
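As a hedged illustration of what the dataset looks like in practice, the sketch below assumes the Hugging Face datasets library, which hosts the benchmark under the id squad_v2 (a packaging choice not mentioned in the paper); in that packaging, unanswerable questions simply carry an empty list of answer spans.

```python
# Sketch: inspecting SQuAD 2.0 via the (assumed) Hugging Face hub id "squad_v2".
from datasets import load_dataset

squad = load_dataset("squad_v2", split="train")

# Unanswerable questions have no gold answer spans in this packaging.
unanswerable = squad.filter(lambda ex: len(ex["answers"]["text"]) == 0)

print(f"{len(squad)} training questions, {len(unanswerable)} of them unanswerable")
print("Example unanswerable question:", unanswerable[0]["question"])
```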
🤖 ML Technology to Follow: spaCy
Why should I know about this: spaCy is a widely used open-source framework for building NLU systems.
What is it: spaCy is an open-source NLU framework designed for flexibility and scale. The framework is actively used by technology giants such as Uber, Airbnb, Quora, and even the Allen Institute for AI, the creators of AllenNLP (see TheSequence Edge#22). spaCy was originally designed by the team at Explosion AI and has built an active community of contributors. The system is built in Python and Cython.
Functionally, spaCy abstracts many of the core building blocks of NLU applications, from both the linguistic and the machine learning standpoints. Some of the core capabilities of spaCy include (a short usage sketch follows the list):
Tokenization: Segmenting text into words, punctuation marks, etc.
Part-of-speech (POS) Tagging: Assigning word types to tokens, like verb or noun.
Dependency Parsing: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
Lemmatization: Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.
Sentence Boundary Detection (SBD): Finding and segmenting individual sentences.
Named Entity Recognition (NER): Labelling named “real-world” objects, like persons, companies and locations.
Entity Linking (EL): Disambiguating textual entities to unique identifiers in a Knowledge Base.
Similarity: Comparing words, text spans and documents and how similar they are to each other.
Text Classification: Assigning categories or labels to a whole document, or parts of a document.
Rule-based Matching: Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
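Here is the short usage sketch referenced above, exercising a few of these capabilities with the small English pipeline (en_core_web_sm, which has to be downloaded separately); the sample text and model choice are purely illustrative.

```python
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion. "
          "The deal was announced in London.")

# Tokenization, POS tagging, lemmatization and dependency parsing, per token.
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)

# Sentence boundary detection.
print([sent.text for sent in doc.sents])

# Named entity recognition.
print([(ent.text, ent.label_) for ent in doc.ents])
```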
The spaCy architecture centers around the Doc and Vocab constructs. A Doc abstracts a sequence of tokens and their corresponding annotations, while the Vocab links to a series of lookup tables that make common information available across documents. Other relevant components include the Tokenizer, which is responsible for building Doc objects, and the Language object, which coordinates all the other components.
Image credit: spaCy
spaCy includes several pre-trained models and tools that accelerate the development of NLU solutions. Because spaCy is built in Python, it also provides native integration with deep learning frameworks.
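The components described above can also be seen directly from code. The sketch below (again assuming the en_core_web_sm model) shows the Language object coordinating the Tokenizer and the pipeline components, building Doc objects, and sharing a single Vocab across them.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # the Language object
print(type(nlp.tokenizer))           # the Tokenizer that builds Doc objects
print(nlp.pipe_names)                # processing components, e.g. tagger, parser, ner

doc = nlp("spaCy shares one Vocab across documents.")
another = nlp("So lexeme lookup tables are not duplicated.")

# Both Docs point at the same Vocab (the shared lookup tables).
print(doc.vocab is another.vocab)    # True
```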
How can I use it: spaCy is open source and available at https://github.com/explosion/spaCy
🧠 The Quiz
Now, to our regular quiz. After ten quizzes, we will reward the most active participants. The questions are the following:
Which combination of the following natural language understanding (NLU) methods forms the basis of machine reading comprehension (MRC)?
When creating SQuAD 2.0, Stanford researchers crowdsourced the creation of “unanswerable questions”. What makes these questions unanswerable?
Please use an email you are signed up with, so we can track your success.
That was fun! Thank you. See you on Thursday 😉
You’re on the free list for the Sunday TheSequence Scope. For the full experience every Tuesday and Thursday, become a paying subscriber to TheSequence Edge.