The Most Important Algorithm for Transformers
FlashAttention has a new version. Plus some important research milestones and major funding activity in AI.
Next Week in The Sequence:
Edge 413: Our series about autonomous agents continues with an exploration of semantic memory. We review Meta AI’s MM-LLM research to augment video models with memory and we dive into the Qdrant vector DB stack.
Edge 414: We dive into HUSKY, a new agent optimized for multi-step reasoning.
You can subscribe to The Sequence below:
📝 Editorial: The Most Important Algorithm for Transformers
There are few algorithms that have had as much impact on the recent generation of transformer architectures as FlashAttention. Originally developed by researchers from Princeton University, including the renowned Tri Dao, FlashAttention and its successor FlashAttention-2 improved the performance of attention mechanisms on GPUs by minimizing reads and writes to GPU memory. The technique was adopted across the new generation of transformers almost immediately after the original publication. There were few complaints about FlashAttention, but one of them was that it could not take full advantage of new hardware architectures. For instance, FlashAttention-2 achieves only about 35% of the maximum FLOPs utilization on H100 GPUs.
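To see why read/write traffic matters, consider the standard attention computation, which materializes the full seq_len x seq_len score matrix in GPU memory before the softmax. The sketch below is a minimal illustration of that baseline, not the FlashAttention kernel itself; the function name and tensor shapes are assumptions for the example.

```python
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scale = q.shape[-1] ** -0.5
    # The full (seq_len x seq_len) score matrix is written to and read back
    # from GPU memory. This is the traffic FlashAttention avoids by tiling
    # the computation so intermediate blocks stay in fast on-chip memory.
    scores = (q @ k.transpose(-2, -1)) * scale
    probs = scores.softmax(dim=-1)
    return probs @ v
```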
But now we have a new version.
Last week, a group of AI researchers from Meta, Princeton University, NVIDIA, and other AI labs published the paper and open-source code for FlashAttention-3. The new version of the method uses several techniques to speed up attention on H100 GPUs, exploiting the asynchrony of the Tensor Cores. The result is simple: FlashAttention-3 is blazing fast. The new algorithm reaches about 75% of the theoretical maximum FLOPs utilization on H100, which translates into practical 1.5-2x performance improvements. It can also operate on lower-precision numbers, which reduces the memory footprint.
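In practice, most frameworks expose fused attention kernels behind a single call. The snippet below is a minimal sketch using PyTorch's scaled_dot_product_attention, which dispatches to a FlashAttention-style fused kernel when the hardware and dtypes allow it; the tensor shapes are assumptions for illustration, and FlashAttention-3 itself ships as separate open-source kernels rather than through this API.

```python
import torch
import torch.nn.functional as F

# Assumed example shapes: batch=2, heads=16, seq_len=4096, head_dim=64
q = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)

# PyTorch selects a fused (FlashAttention-style) kernel when available,
# avoiding materialization of the full seq_len x seq_len score matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```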
FlashAttention-3 is an exciting development in generative AI algorithms. This method will almost certainly enable longer context windows in LLMs and better inference performance on modern GPU architectures. Impressive progress!
🔎 ML Research
FlashAttention-3
A group of AI researchers from Meta, Princeton University, Together AI, NVIDIA, and others published a paper unveiling the new version of the famous FlashAttention algorithm. FlashAttention-3 takes advantage of the latest GPU advancements, achieving 2x the performance of its predecessor and also excelling in long-context LLM tasks —> Read more.
Sub-Billion Parameter Models for Mobile
Meta AI published a paper introducing MobileLLM, a sub-billion parameter model optimized for on-device scenarios. MobileLLM uses a specific structure of embedding and attention layers that optimizes its efficiency relative to its size —> Read more.
Generative Teaching for Agents
Microsoft Research published a paper unveiling AgentInstruct, an agentic framework for creating synthetic data. Specifically, AgentInstruct focuses on datasets used for instruction tuning of base models —> Read more.
Evaluating Multimodal Foundation Models
Researchers from Carnegie Mellon University published a paper introducing the Holistic Evaluation of Multimodal Models (HEMM) framework. HEMM establishes primitives for systematically evaluating multimodal models across dimensions such as basic skills, information flow, and real-world use cases —> Read more.
A Unified AI Database
Microsoft Research published a paper proposing VBase, the foundation for a unified database for vector, relational, and scalar data types. The core of VBase is a property called relaxed monotonicity, which enables the unification of these different data models —> Read more.
Contamination in Code Generation Benchmarks
Researchers from Cohere published a paper providing evidence of the levels of contamination of code generation benchmarks in major LLMs. The paper also proposes Less Basic Python Problems, a new benchmark that is more resilient to contamination —> Read more.
Autoregressive Models for Text-Image Generation
The team behind the Generative AI Research Lab (GAIR) published a paper unveiling ANOLE, an autoregressive multimodal model for image and text generation. ANOLE is based on Meta AI’s Chameleon and adopts a data- and parameter-efficient fine-tuning strategy —> Read more.
🤖 Cool AI Tech Releases
Claude High Quality Prompts
Anthropic released new features for evaluating and generating high-quality prompts for Claude —> Read more.
MInference
Microsoft released some demos of its MInference method for optimizing LLM inference performance —> Read more.
AutoGen Models
Microsoft AutoGen added support for non-OpenAI models —> Read more.
🛠 Real World AI
Ad Inference at Meta
Meta shared some details about the AI inference architecture powering its ad serving system —> Read more.
📡 AI Radar
Hebbia, a platform that uses AI to analyze large documents, raised $130 million in new funding.
OpenAI and Los Alamos National Laboratory announced a strategic alliance for bioscience research.
Defense AI startup Helsing raised $487 million to expand to countries neighboring Russia.
AI video startup Captions raised a $60 million Series C.
Hayden AI, an AI vision platform for smart cities, raised $90 million in a new round.
NeuralFabric, a platform focused on micro-foundation models, unveiled a new small LLM for sustainability.
Fireworks AI raised $52 million to lead the shift to compound AI systems.
OpenAI and Arianna Huffington launched Thrive AI, a new AI health coach.
Groq unveiled new performance improvements to its fast-inference LLM engine.
Amazon released a standalone Guardrails API in its Bedrock platform.
Enterprise AI startup Writer unveiled an impressive set of capabilities.
Microsoft and Apple dropped their plans to join the OpenAI board.
Amazon announced a new challenge to advance coding LLMs.
Exein announced a $15 million Series B for robotic security.
Medal raised $13 million at a $333 million valuation to build a contextual AI assistant.
AI construction startup Buildots raised $15 million from Intel.