Some Key Facts About Anthropic’s Claude 2 Release
Every Sunday, The Sequence Scope brings you a summary of the most important research papers, technology releases, and VC funding deals in the artificial intelligence space.
Next Week in The Sequence:
Edge 309: Our series about foundation model techniques continues with a look at active prompting including an analysis of the original AP paper. We also look at the Microsoft Guidance framework for language model programming.
The Sequence Chat: A special interview with a top LLM researcher.
Edge 310: We look at how OpenAI uses GPT-4 to interpret neuron functions in other LLMs.
Go subscribe :)
📝 Editorial: Some Key Facts About Anthropic’s Claude 2 Release
In the ever-changing generative AI market, Anthropic with its Claude model and Google with PaLM are universally regarded as the closest competitors to OpenAI. Every new release of these models raises expectations and triggers inevitable comparisons to GPT-4. Well, last week it was Anthropic's turn with the release of its highly anticipated Claude 2 model. The new release builds on its predecessors, following Anthropic's Constitutional AI methodology.
A lot has been written about the Claude 2 release at a high level, but here are some key facts that you must know to avoid misconceptions:
Architecturally, Claude 2 is very similar to Claude 1.3.
Claude 2 supports 100,000 tokens (approx. 75,000 words) per prompt, which is substantially larger than alternatives. However, this context window is not new to Claude 2. The previous release, Claude 1.3, also supports 100,000 tokens.
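The 100,000-token / 75,000-word figure above implies the common rule of thumb of roughly 0.75 words per token. As a quick sanity check (the ratio is a heuristic, not a real tokenizer, and the helper names below are illustrative), you can estimate whether a document fits in the context window:

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from word count (heuristic, not a real tokenizer)."""
    return round(len(text.split()) / words_per_token)

def fits_in_context(text: str, context_window: int = 100_000) -> bool:
    """Check a document against a context window using the rough estimate."""
    return estimate_tokens(text) <= context_window

# A ~75,000-word document lands right at the 100k-token limit.
doc = "word " * 75_000
print(estimate_tokens(doc))      # 100000
print(fits_in_context(doc))      # True
```

For real usage you would count tokens with the provider's own tokenizer, since word-based estimates can be off by 20-30% for code or non-English text.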
Tests showed that Claude 2 is able to process up to 200,000 tokens and still exhibit performance gains. That's super impressive.
Claude 2 has been pretrained using data current until early 2023, making it more current than GPT-4.
In human feedback evaluations, Claude 2 improved over its predecessors in helpfulness and honesty of the answers. However, it scored similarly to Claude 1.3 when evaluated on the vulnerability of producing harmful answers.
Claude 2 substantially improved its coding skills, scoring 71% on the Codex HumanEval benchmark compared to 56% for its predecessor.
In mathematical problem solving, Claude 2's improvements were more modest: 88% on the GSM8K benchmark compared to 85% for Claude 1.3.
Even more remarkable, Claude 2 scored 76% on the Multistate Bar Examination and 68% on the US Medical Licensing Examination.
Overall, Claude 2 represents a very impressive release and a very viable alternative to GPT-4. The model is already powering mission critical applications across many industries and we expect to see more of that in the future.
🔎 ML Research
The Stepwise Nature of SSL
Researchers from the Berkeley AI Research (BAIR) lab published a paper presenting the first mathematical picture of how self-supervised learning (SSL) algorithms learn. The paper finds that SSL methods start learning embeddings relatively slowly and then progress through a stepwise learning process → Read more.
LongLLaMA
Google DeepMind and researchers from the Polish Academy of Sciences and the University of Warsaw published a paper detailing LongLLaMA, an LLM able to process 256k tokens. The model is based on the Focused Transformer technique, which uses contrastive learning to extend the effective context length → Read more.
AI Institutional Governance
DeepMind published a paper detailing the functions that international institutions should implement to manage risks and opportunities of advanced AI → Read more.
Visual Question Answering via Code Generation
Google Research published a paper discussing CodeVQA, a framework for visual question answering using program synthesis. CodeVQA generates Python functions to answer questions about specific images → Read more.
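To give a flavor of the program-synthesis idea, a generated program decomposes a question into sub-queries over the image. The primitives below are stubbed illustrations, not CodeVQA's actual API, and the fake image is just a lookup table so the sketch can run:

```python
# Stub primitive standing in for a visual sub-model (CodeVQA's real primitives differ).
def query(image, question: str) -> str:
    """Placeholder for a VQA sub-model answering a simple visual question."""
    return image.get(question, "unknown")

# A program an LLM might synthesize for the question:
# "Is the animal on the left the same color as the animal on the right?"
def execute_program(image) -> str:
    left = query(image, "What color is the animal on the left?")
    right = query(image, "What color is the animal on the right?")
    return "yes" if left == right else "no"

# Fake 'image' represented as precomputed answers, purely to run the sketch.
fake_image = {
    "What color is the animal on the left?": "brown",
    "What color is the animal on the right?": "brown",
}
print(execute_program(fake_image))  # yes
```

The point of the approach is that compositional reasoning lives in ordinary, inspectable Python rather than inside a monolithic VQA model.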
Symbol-Tuning and ICL
Google Research published a paper unveiling a technique called symbol tuning, which improves in-context learning in LLMs by emphasizing input-label mappings. Initial tests show that symbol tuning makes LLMs much more resilient to underspecified prompts → Read more.
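The core idea is easy to illustrate: in-context examples keep their input-label mapping, but natural-language labels are swapped for arbitrary symbols, forcing the model to infer the task from the examples rather than from the label names. Below is a sketch of the prompt format only (the symbol choices and function names are illustrative, not Google's training code):

```python
# Replace natural-language labels with arbitrary symbols (the essence of symbol tuning).
LABEL_TO_SYMBOL = {"positive": "foo", "negative": "bar"}  # symbols are arbitrary

def build_symbol_tuned_prompt(examples, query_text: str) -> str:
    """Format few-shot examples with symbolic labels, ending with the unlabeled query."""
    blocks = []
    for text, label in examples:
        blocks.append(f"Input: {text}\nLabel: {LABEL_TO_SYMBOL[label]}")
    blocks.append(f"Input: {query_text}\nLabel:")
    return "\n\n".join(blocks)

examples = [
    ("I loved this movie", "positive"),
    ("Terrible, a waste of time", "negative"),
]
print(build_symbol_tuned_prompt(examples, "Best film of the year"))
```

Because "foo" and "bar" carry no prior meaning, the model can only succeed by reading the input-label mapping off the examples, which is exactly the skill symbol tuning trains.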
🤖 Cool AI Tech Releases
Claude 2
Anthropic announced Claude 2 with improved conversational, math, and coding capabilities → Read more.
Data Agents
LlamaIndex introduced Data Agents, a framework for augmenting LLMs with data access capabilities → Read more.
H2OGPT and LLM Studio
H2O.ai introduced H2OGPT and LLM Studio, a suite of open source LLM tools to help companies build conversational agents → Read more.
Tongyi Wanxiang
Alibaba Cloud introduced Tongyi Wanxiang, a new generative AI model available for Chinese enterprises → Read more.
PyRCA
Salesforce Research open sourced PyRCA, an ML library for root cause analysis (RCA) in IT operations → Read more.
🛠 Real World ML
Feature Engineering at Airbnb
Airbnb discusses Chronon, a declarative framework for ML feature engineering → Read more.
📡AI Radar
Elon Musk unveiled xAI, his anticipated OpenAI competitor.
Google opened early access to NotebookLM, a note-taking application with embedded language intelligence.
The US Federal Trade Commission opened an investigation into OpenAI for potential violation of consumer protection laws.
Stability AI unveiled Stable Doodle, a sketch-to-image tool.
Sapphire Ventures announced plans to invest over $1 billion in enterprise AI startups.
Pano AI, a startup that uses computer vision techniques to detect wildfires, announced a $20 million funding round.
Prolific announced a $32 million round to stress-test AI models using a human workforce.
Prompt engineering platform Vellum.ai closed a $5 million seed round.
AI compliance startup Vendict came out of stealth mode with $9.5 million in funding.
Generative AI voice platform Resemble AI announced an $8 million funding round.
OpenAI and Associated Press announced a strategic alliance for news and content sharing.
AI-powered drug discovery platform Causaly announced a $60 million funding round.
When I read about the math scores or bar exam scores, I always wonder about the methodology. The results sound very high. Does the model memorize all the theorems and laws needed? What if you did it with RAG?