Doug Downey/Semantic Scholar: Applying Cutting-Edge NLP at Scale
It's so inspiring to learn from practitioners. The experience of researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration. Share this interview if you find it enriching. No subscription is needed.
Quick bio / Doug Downey
Tell us a bit about yourself: your background, your current role, and how did you get started in machine learning?
Doug Downey (DD): I received my Ph.D. in Computer Science and Engineering from the University of Washington in 2008, focusing on AI, specifically information extraction from large-scale text. Since then, I have been a professor at Northwestern University, but I have been away from the university since 2019, working full-time at the Allen Institute for AI (AI2). I manage the research unit of AI2's Semantic Scholar group, which focuses on machine learning, natural language processing, and human-computer interaction in service of Semantic Scholar's mission of accelerating scientific breakthroughs with AI.
ML Work
Semantic Scholar is one of the most important projects within the AI research community. Can you tell us about the vision behind the project, its existing capabilities, and the roadmap?
DD: Semantic Scholar aims to radically improve people's ability to identify and understand relevant research. In the past couple of years, we've rolled out new capabilities like automatically generated "TLDRs" for papers, adaptive recommendation feeds for staying up to date with recent research, and improvements to core capabilities like search and author disambiguation. The tool we're most excited about today is the Semantic Reader, which aims to revolutionize reading by making it more accessible and contextual. The Reader already provides a seamless way to look up references while reading without losing your place, and it is available for thousands of papers. We'll soon make it available for hundreds of thousands of papers, and we are exploring new features like assisted skimming, on-demand symbol and term definitions, and more.
Summarization is one of the key features of Semantic Scholar, and one to which the Allen Institute for AI has actively contributed with techniques such as Longformer. How different is summarizing long-form research papers from summarizing shorter forms like news articles?
DD: For TLDRs, interestingly, the long-form input doesn't change the problem too much. Our production model only uses the abstract, intro, and conclusion as input, which is not too long and tends to be sufficient for generating good TLDRs. But TLDRs only scratch the surface of what we might want to summarize from scientific papers. For example, say you're reading a paper, and it says, "we use the same experimental setup as reference 17." Wouldn't it be great if you could click on that statement and immediately get a concise summary of the relevant part of reference 17's experimental design? This is something we're working on. For that, we may need to model whole documents using a tool like Longformer, and even model multiple documents at a time, as in the cross-document language models that we introduced this past year.
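To make the TLDR setup concrete, here is a minimal sketch of the abstract-plus-intro-plus-conclusion recipe Doug describes, assuming the Hugging Face transformers library. The checkpoint name is a generic summarization stand-in, not Semantic Scholar's production TLDR model.

```python
from transformers import pipeline

# Generic summarization checkpoint as a stand-in for the production TLDR model.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def make_tldr(abstract: str, intro: str, conclusion: str) -> str:
    # As described above, only the abstract, intro, and conclusion are
    # used as input; the full paper body is not needed.
    text = " ".join([abstract, intro, conclusion])
    result = summarizer(text, max_length=40, min_length=10, do_sample=False)
    return result[0]["summary_text"]
```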
Semantic Scholar provides some interesting capabilities in areas such as semantic search. What ML techniques have proven most efficient for extracting relevant information from long-form research papers?
DD: The Semantic Scholar website and the Reader face a difficult and expensive extraction challenge as a first step. Given a PDF, we have to pull out all of the basic paper elements: title, authors, citations, equations, figures, and so on. We recently introduced a new technique for this task called VILA (for VIsual LAyout), which uses a simple intuition that in typical scientific paper layouts, semantically related elements tend to appear in the same visual block on the page. Fairly simple models that encode this intuition are more efficient than previous work and still get high accuracy. We've released a version of those and plan to improve them further in 2022.
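As a toy illustration of that intuition (not the actual VILA code; the Token type, block grouping, and classify_block heuristic below are hypothetical stand-ins), the key idea is that a single label is predicted per visual block and shared by every token inside it:

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    block_id: int  # id of the visual block this token falls in (from a layout detector)

def group_by_block(tokens):
    """Group PDF tokens by the visual layout block they belong to."""
    blocks = {}
    for tok in tokens:
        blocks.setdefault(tok.block_id, []).append(tok)
    return blocks

def classify_block(block_tokens):
    """Hypothetical block classifier: assign one label (title, author,
    body, citation, ...) to a whole block. A real system would use a
    learned model here; this placeholder keys off capitalization."""
    text = " ".join(t.text for t in block_tokens)
    return "title" if text.istitle() else "body"

def label_tokens(tokens):
    # One prediction per block, propagated to every token inside it.
    # This group-level consistency is the core of the VILA intuition,
    # and it is what makes the approach cheap relative to per-token models.
    labeled = []
    for _, toks in group_by_block(tokens).items():
        label = classify_block(toks)
        labeled.extend((tok, label) for tok in toks)
    return labeled
```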
Few-shot models seem to be gaining traction for some very specific NLP tasks, and this is an area in which the Allen Institute for AI seems heavily invested. What are the current advantages and limitations of few-shot NLP when applied in real-world scenarios such as Semantic Scholar?
DD: Few-shot learning techniques are very relevant to Semantic Scholar in settings like feeds, where users might give just a handful of example papers and want to get high-quality recommendations right away, or in domain transfer, where we might have a model built for one scientific domain like computer science and want to quickly adapt it to bioinformatics. To help support additional few-shot research by the community, we recently established a new benchmark, FLEX, that provides a standard and realistic challenge workload for few-shot NLP. One limitation of current few-shot learning work, including FLEX, is that it tends to focus on classification, and other settings like few-shot text generation and summarization are far less studied. We've recently demonstrated a few-shot summarization technique (PRIMER), but more work is needed in this direction.
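A hedged sketch of the few-shot feed setting Doug mentions: embed a handful of example papers with AI2's SPECTER model (allenai/specter on the Hugging Face hub) and rank candidates by similarity to the mean of those examples. This illustrates the few-shot recommendation idea in general; it is not Semantic Scholar's actual feed implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

def embed(papers):
    # SPECTER expects "title [SEP] abstract" as input text.
    texts = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state[:, 0, :]  # [CLS] vector per paper

def recommend(liked_papers, candidates, k=5):
    # The few-shot "profile" is just the mean embedding of the examples.
    profile = embed(liked_papers).mean(dim=0, keepdim=True)
    scores = torch.nn.functional.cosine_similarity(embed(candidates), profile)
    top = scores.topk(min(k, len(candidates))).indices
    return [candidates[int(i)] for i in top]
```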
Research papers are all about reasoning and interpretation. Can we build forms of commonsense reasoning into NLP models? What are some of the important ML techniques that are making progress in this area?
DD: Great question. We'd like to get to a point where our systems understand science well enough to verify a claim in one paper by reading other papers, or suggest the best tools for a given task, or recommend new hypotheses for human scientists to investigate. Reaching that level of reasoning requires not just scientific knowledge but also vast commonsense knowledge. Semantic Scholar has collaborated with the MOSAIC team here at AI2 on a variety of commonsense challenge tasks, including ones that are critical for science like abductive reasoning (reasoning to the best explanation). Recent years have shown that large-scale language modeling approaches, especially when trained on large and varied commonsense datasets, can perform fairly well on the commonsense question-answering benchmarks that we devise. But it's not obvious how to convert performance on those constructed QA tasks into a successful real-life application, like a scientific assistant that suggests hypotheses. I hope we can work more with MOSAIC in this direction.
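As a heavily simplified illustration of how language models are often evaluated on the multiple-choice commonsense QA benchmarks Doug refers to, one common recipe scores each answer choice by its likelihood under a pretrained LM and picks the most probable one. GPT-2 here is purely an illustrative stand-in, not a model AI2 uses for this.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_logprob(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the loss is the mean per-token negative
        # log-likelihood, which doubles as a length normalization when
        # comparing answer choices of different lengths.
        loss = model(ids, labels=ids).loss
    return -loss.item()

question = "The ice on the pond melted because"
choices = ["the sun came out.", "it started to snow."]
best = max(choices, key=lambda c: avg_logprob(question + " " + c))
print(best)
```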
Miscellaneous: a set of rapid-fire questions
Favorite math paradox?
The "surprise test" paradox.
What book would you recommend to an aspiring ML engineer?
Weapons of Math Destruction by Cathy O'Neil, although my own views on predictive modeling are more optimistic.
Is the Turing Test still relevant? Any clever alternatives?
It has flaws, but yes, it's still relevant. In particular, the value of an interactive test is underappreciated by today's benchmark-focused AI research. To really evaluate an AI system, you have to interrogate it, not just test it on a fixed data set.
Does P equal NP?
Seems like not, and it would be so great to understand exactly why.