What is Quantization in Deep Learning: A Recipe for Making Deep Learning Models Faster in Production

Comprehensive Guide to Quantization Methods Used in Deep Learning

Ransaka Ravihara
7 min read · May 26, 2023

Introduction

Machine learning, specifically deep learning, has become increasingly popular in recent years. Thanks to advances in hardware, companies like OpenAI can now host multi-billion-parameter models and serve over 20 million people per day. As models grow more sophisticated, the infrastructure that serves them advances as well, and vice versa.

Image source: https://pytorch.org/get-started/pytorch-2.0/

Large organizations like OpenAI can afford multi-million-dollar infrastructure to serve their models because the return on investment can be several times higher.

Let's set aside the need for expensive infrastructure and consider a more straightforward scenario: you have a new transformer model and want to decrease its inference time. Additionally, you plan to deploy the model on low-memory devices. Currently, the model takes 200ms to perform inference on a GPU and requires 1.5 GB of memory.
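Before optimizing anything, it helps to record the baseline numbers you want to improve. Below is a minimal sketch of how you might measure those two quantities, parameter memory and average inference latency, assuming PyTorch and the Hugging Face transformers library; the model name is only an illustrative placeholder, not one used in this article.

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical example model; substitute your own transformer.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# Memory footprint of the FP32 parameters.
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Parameter memory: {param_bytes / 1024**2:.1f} MB")

# Rough average inference latency on a single input.
inputs = tokenizer("Quantization makes models faster.", return_tensors="pt")
with torch.no_grad():
    model(**inputs)  # warm-up run
    start = time.perf_counter()
    for _ in range(20):
        model(**inputs)
    latency_ms = (time.perf_counter() - start) / 20 * 1000
print(f"Average latency: {latency_ms:.1f} ms")
```

Numbers like these give you a concrete before/after comparison once quantization is applied.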
