Textbooks Are All You Need: How Microsoft's Phi-1 Outperformed Larger Code Language Models
The secret was in the quality of the training data
Coding has been one of the most active areas of development in the foundation model space. OpenAI opened the floodgates with models like Codex, whose capabilities were eventually folded into GPT-4, and companies such as Amazon and Salesforce have also released incredibly high-quality work in this domain. The premise behind coding foundation models has been to pre-train a model on a large volume of code datasets and expect capabilities to surface across different programming languages. Quantity and size over quality has been the mantra of the first generation of coding language models.

Recently, Microsoft Research published a paper with a catchy title, "Textbooks Are All You Need," that challenged this assumption by training a small coding language model solely on textbook-quality data. The paper immediately became popular within the LLM community given its unique approach to training, which produced a model that is significantly smaller than, yet performs on par with, larger alternatives.