In the current deep learning landscape, optimizing models for environments with limited resources has become increasingly important. Weight quantization addresses this need by reducing the precision of model parameters, typically from 32-bit floating-point values to lower bit-width representations, yielding smaller models that run more efficiently on resource-constrained hardware. This tutorial introduces weight quantization using PyTorch's dynamic quantization technique on a pre-trained ResNet18 model. Along the way, you will learn to inspect weight distributions, apply dynamic quantization to key layers (such as fully connected layers), compare model sizes, and visualize the effects. The goal is to give you both theoretical insight and practical skills for deploying deep learning models.
import torch
import torch.nn as nn
import torch.quantization
import torchvision.models as models
import matplotlib.pyplot as plt
import numpy as np
import os
print("Torch version:", torch.__version__)
We begin by importing the necessary libraries, including PyTorch, torchvision, and matplotlib, and print the PyTorch version to confirm that all modules required for model manipulation and visualization are available.
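Since quantized inference relies on backend-specific INT8 kernels (for example, fbgemm on x86 CPUs and qnnpack on ARM), it can also be worth confirming which quantized engines your PyTorch build supports. This optional check is a minimal sketch; the exact output depends on your platform and build.
# Optional: list the INT8 backends this PyTorch build supports and which one is active.
print("Supported quantized engines:", torch.backends.quantized.supported_engines)
print("Active quantized engine:", torch.backends.quantized.engine)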
# Load a pre-trained ResNet18 in FP32. Newer torchvision releases use the weights API;
# on older versions, models.resnet18(pretrained=True) is the equivalent call.
model_fp32 = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model_fp32.eval()
print("Pretrained ResNet18 (FP32) model loaded.")
A pre-trained ResNet18 model is loaded in FP32 (floating-point) precision and set to evaluation mode, preparing it for inspection and quantization.
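As a quick sanity check before quantizing, you can push a dummy ImageNet-sized input through the model to confirm it runs in evaluation mode. This is a minimal sketch; the random tensor below merely stands in for a real preprocessed image.
# Sanity check: a random 1x3x224x224 tensor mimics a preprocessed ImageNet image.
dummy_input = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model_fp32(dummy_input)
print("Output shape:", logits.shape)  # expected: torch.Size([1, 1000])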
fc_weights_fp32 = model_fp32.fc.weight.data.cpu().numpy().flatten()
plt.figure(figsize=(8, 4))
plt.hist(fc_weights_fp32, bins=50, color='skyblue', edgecolor='black')
plt.title("FP32 - FC Layer Weight Distribution")
plt.xlabel("Weight values")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()
This block extracts and flattens the weights from the final fully connected layer of the FP32 model, then plots a histogram to visualize their distribution prior to quantization.
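Alongside the histogram, a few summary statistics help you reason about the range the INT8 quantizer will have to cover; this small addition is optional.
# The min/max range of the FP32 weights is what the INT8 quantizer must map onto 256 levels.
print(f"min: {fc_weights_fp32.min():.4f}, max: {fc_weights_fp32.max():.4f}")
print(f"mean: {fc_weights_fp32.mean():.4f}, std: {fc_weights_fp32.std():.4f}")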
quantized_model = torch.quantization.quantize_dynamic(model_fp32, {nn.Linear}, dtype=torch.qint8)
quantized_model.eval()
print("Dynamic quantization applied to the model.")
Dynamic quantization is applied to the model, targeting its Linear layers; in ResNet18 the only nn.Linear module is the final fully connected classifier (model.fc). Its weights are converted to 8-bit integers (qint8), while activations are quantized on the fly at inference time, demonstrating a key technique for reducing model size and inference latency.
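To visualize the effect on the same layer, you can dequantize the weights of the now-quantized fc module and plot them with the same histogram settings as before. The weight() accessor on dynamically quantized Linear modules is part of PyTorch's quantized module API, but its behavior can vary slightly across versions, so treat this as a sketch.
# The quantized fc module stores INT8 weights; dequantize() maps them back to float
# so they can be plotted on the same scale as the FP32 histogram.
fc_weights_int8 = quantized_model.fc.weight()
fc_weights_dequant = fc_weights_int8.dequantize().cpu().numpy().flatten()
plt.figure(figsize=(8, 4))
plt.hist(fc_weights_dequant, bins=50, color='salmon', edgecolor='black')
plt.title("INT8 (dequantized) - FC Layer Weight Distribution")
plt.xlabel("Weight values")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()
The dequantized histogram should closely track the FP32 one, with values snapped to a discrete grid determined by the layer's quantization scale.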
def get_model_size(model, filename="temp.p"):
    # Serialize the state dict to disk, report the file size in MB, then clean up.
    torch.save(model.state_dict(), filename)
    size = os.path.getsize(filename) / 1e6
    os.remove(filename)
    return size
fp32_size = get_model_size(model_fp32, "fp32_model.p")
quant_size = get_model_size(quantized_model, "quant_model.p")
print(f"FP32 Model Size: {fp32_size:.2f} MB")
print(f"Quantized Model Size: {quant_size:.2f} MB")
A helper function saves each model's state dict to disk and measures the file size, which is then used to compare the original FP32 model with the quantized model, illustrating the compression effect of quantization. Because only the fully connected layer is quantized here, the size reduction for ResNet18 is modest; architectures dominated by Linear layers (such as Transformers and LSTMs) see far larger savings.
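Beyond disk size, a rough wall-clock comparison gives a feel for the latency side of the trade-off. The helper below is a minimal sketch (single-threaded timings vary considerably with hardware, batch size, and thread settings), and the same caveat applies as for size: with only one Linear layer quantized, ResNet18 will show little to no speedup.
import time

def time_model(model, inputs, runs=20):
    # Average wall-clock time over several forward passes, after a warm-up pass.
    with torch.no_grad():
        model(inputs)  # warm-up
        start = time.perf_counter()
        for _ in range(runs):
            model(inputs)
    return (time.perf_counter() - start) / runs

dummy_input = torch.randn(1, 3, 224, 224)
print(f"FP32 avg latency:      {time_model(model_fp32, dummy_input) * 1e3:.2f} ms")
print(f"Quantized avg latency: {time_model(quantized_model, dummy_input) * 1e3:.2f} ms")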