The student of the now ubiquitous GPT-2 does not fall short of its teacher's expectations. Obtained by distillation, DistilGPT-2 weighs 37% less and is twice as fast as its OpenAI counterpart, while keeping the same generative power. It runs smoothly on an iPhone 7. The dawn of lightweight generative transformers? 🤯
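To get a feel for how easy it is to try, here is a minimal sketch that loads the model through the transformers library under the model id distilgpt2; the prompt and generation parameters are illustrative, not from the original announcement:

```python
from transformers import pipeline

# Load DistilGPT-2 via the text-generation pipeline;
# the weights are downloaded from the model hub on first use.
generator = pipeline("text-generation", model="distilgpt2")

# Illustrative prompt and sampling settings.
outputs = generator(
    "The dawn of lightweight generative transformers",
    max_length=40,
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])
```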
From the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. The same method was applied to distill GPT-2, and a Medium blog post describes the process in detail.
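At the heart of that method is a distillation loss that trains the student to match the teacher's soft output distribution. The PyTorch sketch below shows only that temperature-scaled KL term; the function name and temperature value are illustrative, and the paper combines this term with the usual language-modeling loss and a cosine embedding loss, omitted here for brevity:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation term: KL divergence between the
    temperature-softened teacher and student distributions."""
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(
        student_log_probs, teacher_probs, reduction="batchmean"
    ) * temperature ** 2
```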