
What is Quantization in Deep Learning: A Recipe for Making Deep Learning Models Faster in Production

Comprehensive Guide to Quantization Methods Used in Deep Learning


Introduction

Machine learning, and deep learning in particular, has surged in popularity in recent years. Thanks to advances in hardware, companies like OpenAI can now host multi-billion-parameter models that serve over 20 million people per day. As models grow more ambitious, the infrastructure behind them advances to keep up, and vice versa.

Image source: https://pytorch.org/get-started/pytorch-2.0/

Large organizations like OpenAI can afford multi-million-dollar serving infrastructure because the return on investment can be several times the cost.

Let's set aside the need for expensive infrastructure and consider a more straightforward scenario: you have a new transformer model and want to decrease its inference time. You also plan to deploy the model on low-memory devices. Currently, the model takes 200 ms per inference on a GPU and requires 1.5 GB of memory.

Let's see how we can achieve this with a mechanism called quantization.
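Before diving into the details, here is a rough preview of what quantization can look like in practice, sketched with PyTorch's dynamic quantization API. The tiny feedforward model and its layer sizes below are placeholder assumptions standing in for the transformer in the scenario above:

```python
import torch
import torch.nn as nn

# A small stand-in model; in the scenario above this would be your
# transformer. The layer sizes here are arbitrary placeholders.
model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
).eval()

# Dynamic quantization: the weights of the listed module types are
# stored as int8, and activations are quantized on the fly at
# inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(8, 512)
with torch.no_grad():
    y = quantized(x)
print(y.shape)  # same output shape, roughly 4x smaller Linear weights
```

The key idea: storing weights in 8 bits instead of 32 shrinks the model and speeds up the matrix multiplications that dominate inference. The rest of this article unpacks how that works, starting with how numbers are represented at all.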

Fixed-point vs floating-point representation

Before discussing the logic of quantization, it is essential to understand how numbers are represented inside deep learning models. These models are, at bottom, a combination of numbers and operations on them: addition, multiplication, differentiation, and so on. Making those operations cheaper directly improves the speed and efficiency of the whole model.

But how? There are several ways, and one of them is to change the number representation itself. Computer science offers various ways to represent real numbers; here we will focus on two:

  1. Fixed-point representation: a fixed number of bits is allocated to the integer part and a fixed number to the fractional part, so the binary point never moves and plain integer arithmetic suffices.
  2. Floating-point representation: the number is stored as a sign, an exponent, and a mantissa, so the binary point can "float" over a wide dynamic range; formats such as float32 and float16 follow this scheme (a short sketch contrasting the two follows below).
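To make the contrast concrete, here is a small Python sketch. The Q4.4 fixed-point format below (4 integer bits, 4 fractional bits) is an illustrative assumption chosen for readability, not a standard the article prescribes:

```python
import struct

# Q4.4 fixed point: 8 bits total, 4 integer bits, 4 fractional bits.
# (The Q4.4 layout is an illustrative assumption.)
FRACTIONAL_BITS = 4
SCALE = 2 ** FRACTIONAL_BITS  # 16

def to_fixed(x: float) -> int:
    """Encode a real number as the integer round(x * 2^4)."""
    return round(x * SCALE)

def from_fixed(q: int) -> float:
    """Decode the stored integer back into a real number."""
    return q / SCALE

value = 3.14159
q = to_fixed(value)
print(q, from_fixed(q))  # 50 3.125 -- note the quantization error

# Floating point, by contrast, stores sign / exponent / mantissa,
# so the binary point "floats" and tiny and huge magnitudes share
# one format. Unpack the float32 fields of the same number:
bits = struct.unpack(">I", struct.pack(">f", value))[0]
sign = bits >> 31
exponent = ((bits >> 23) & 0xFF) - 127  # exponent is biased by 127
mantissa = bits & 0x7FFFFF              # fraction bits of 1.xxx...
print(sign, exponent, mantissa)         # 3.14159 ~= +1.5708 * 2**1
```

Notice the trade-off already visible here: the fixed-point encoding uses simple, fast integer arithmetic but rounds 3.14159 down to 3.125, while float32 spends extra bits (and hardware) to keep far more precision across a much wider range.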
