What is Quantization in Deep Learning: A Recipe for Making Deep Learning Models Faster in Production
Comprehensive Guide to Quantization Methods Used in Deep Learning
Introduction
Machine learning, and deep learning in particular, has become increasingly popular in recent years. Thanks to advancements in hardware, companies like OpenAI can now host multi-billion-parameter models that serve tens of millions of people per day. As models grow more ambitious, infrastructure advances to keep up, and vice versa.
Large organizations like OpenAI can afford multi-million-dollar infrastructure to serve their models, as the return on investment can be several times higher.
Let's set aside the need for expensive infrastructure and consider a more straightforward scenario: you have a new transformer model and want to decrease its inference time. Additionally, you plan to deploy the model on low-memory devices. Currently, the model takes 200 ms to perform inference on a GPU and requires 1.5 GB of memory.
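Before diving into methods, a quick back-of-envelope calculation shows why precision matters for the memory side of this scenario. The sketch below assumes (a hypothetical simplification) that the 1.5 GB footprint comes from fp32 weights alone; activations, optimizer state, and buffers add more in practice.

```python
def model_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Size of the parameter tensor alone, in gigabytes."""
    return num_params * bytes_per_param / 1024**3

# A 1.5 GB fp32 model (4 bytes per parameter) has roughly this many parameters:
num_params = int(1.5 * 1024**3 / 4)  # ~402 million

print(f"fp32: {model_size_gb(num_params, 4):.2f} GB")  # 1.50 GB
print(f"fp16: {model_size_gb(num_params, 2):.2f} GB")  # 0.75 GB
print(f"int8: {model_size_gb(num_params, 1):.2f} GB")  # 0.38 GB
```

Halving the bytes per weight halves the parameter memory, which is exactly the lever quantization pulls.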