Quantize PyTorch Models

Convert floating point models into int8 models
Published

February 1, 2021

I’ve recently read this blog post about quantizing huggingface models and I thought it would be good to test my understanding. What I want to do now is to quantize and run a resnet model. I’ve downloaded imagenet recently so I have plenty of data to test it with.

The process that I am interested in is to export the model to ONNX and then quantize that model. It should be possible to quantize the input easily, as the image is already represented as integers. This might be interesting though, as the normalization will need to be pushed into the model (there’s a sketch of what that could look like below).

Then a comparison of the performance of the original model to the new one can be made.
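
As a sketch of what pushing the normalization into the model could look like, a wrapper along these lines would let the exported graph take raw 0-255 pixel values directly. This is just an illustration using the standard ImageNet statistics, and I don’t use it in the rest of this post.

Code
import torch
from torch import nn


class NormalizedResnet(nn.Module):
    """Wrap a resnet so the ImageNet normalization happens inside the exported graph."""

    def __init__(self, resnet: nn.Module) -> None:
        super().__init__()
        self.resnet = resnet
        # standard ImageNet channel statistics, shaped to broadcast over NCHW input
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).reshape(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]).reshape(1, 3, 1, 1))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # accept raw 0-255 pixel values and normalize them as part of the model
        image = image.float() / 255.0
        image = (image - self.mean) / self.std
        return self.resnet(image)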

I can start with a pytorch resnet model. There is a nice document about exporting to onnx. Let’s try that first.

Code
from pathlib import Path
import torch
Code
model = torch.hub.load('pytorch/vision:v0.6.0', 'resnet18', pretrained=True)
model.eval() ; None
Using cache found in /home/matthew/.cache/torch/hub/pytorch_vision_v0.6.0
Code
MODEL_FILE = Path(".").resolve() / "data" / "2021-02-01-quantize-pytorch-models" / "resnet18.onnx"
MODEL_FILE.parent.mkdir(exist_ok=True, parents=True)
Code
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, MODEL_FILE, input_names=['input_image'])

So exporting it seems straightforward. I now need to show that it is equivalent to the pytorch version. To establish equivalence I should load it from onnx, ideally into another framework.

Code
import onnxruntime as ort

ort_sess = ort.InferenceSession(str(MODEL_FILE))
outputs = ort_sess.run(None, {"input_image": dummy_input.numpy()})
print(outputs)
OSError: libcudnn.so.8: cannot open shared object file: No such file or directory

That’s not great. The GPU build of onnxruntime requires CUDA 10.2, and I have CUDA 11.0. I’m going to switch to the CPU-only onnxruntime package to skip this.

If this was a serious evaluation then it would be possible to build the runtime from source to target 11.0.
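
After swapping packages, a quick sanity check is to ask onnxruntime which device and execution providers it ended up with; with the CPU-only package I’d expect to see just CPUExecutionProvider here.

Code
import onnxruntime as ort

print(ort.get_device())
print(ort.get_available_providers())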

Code
import onnxruntime as ort

ort_sess = ort.InferenceSession(str(MODEL_FILE))

onnx_outputs = ort_sess.run(None, {"input_image": dummy_input.numpy()})
onnx_output = onnx_outputs[0]
print(onnx_output.shape)
(1, 1000)
Code
with torch.no_grad():
    torch_output = model(dummy_input).numpy()
print(torch_output.shape)
(1, 1000)
Code
import numpy as np

mean_difference = np.absolute(onnx_output - torch_output).mean()
mean_onnx_output = np.absolute(onnx_output).mean()
mean_torch_output = np.absolute(torch_output).mean()

print(f"The mean absolute difference is:      {mean_difference:0.9f}")
print(f"The mean value of the onnx model is:  {mean_onnx_output:0.9f}")
print(f"The mean value of the torch model is: {mean_torch_output:0.9f}")
The mean absolute difference is:      0.000000982
The mean value of the onnx model is:  1.143334746
The mean value of the torch model is: 1.143334746

So the difference is about \(\frac{1}{1,000,000}\) per entry. This is run with the same input that the model was exported with, so it might be doing particularly well on it. Let’s try generating a different input and comparing again.

Code
new_input = torch.randn(1, 3, 224, 224)

onnx_outputs = ort_sess.run(None, {"input_image": new_input.numpy()})
onnx_output = onnx_outputs[0]

with torch.no_grad():
    torch_output = model(new_input).numpy()

mean_difference = np.absolute(onnx_output - torch_output).mean()
mean_onnx_output = np.absolute(onnx_output).mean()
mean_torch_output = np.absolute(torch_output).mean()

print(f"The mean absolute difference is:      {mean_difference:0.9f}")
print(f"The mean value of the onnx model is:  {mean_onnx_output:0.9f}")
print(f"The mean value of the torch model is: {mean_torch_output:0.9f}")
The mean absolute difference is:      0.000001021
The mean value of the onnx model is:  1.221439719
The mean value of the torch model is: 1.221439958

The difference is still about \(\frac{1}{1,000,000}\) per entry. I’m not very surprised about this as resnet has no discrete parts and makes no decisions, so the translation to ONNX format, which works by tracing the operations performed on the given input, should be faithful. It’s good to see that it has not somehow encoded the dummy input in the model though.
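
To illustrate what I mean by “makes no decisions”, here is a toy module (nothing to do with resnet) whose output depends on a data-driven branch. Tracing it for export only records the branch that the example input happens to take.

Code
import torch
from torch import nn


class Branchy(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # this decision depends on the data, so tracing only records
        # whichever branch the dummy input happens to take
        if x.sum() > 0:
            return x * 2
        return x * -1


# the export succeeds, but torch emits a TracerWarning about converting a
# tensor to a Python boolean, and the exported graph contains no branch
torch.onnx.export(Branchy(), torch.randn(1, 3), "branchy.onnx")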


So now let’s try to quantize the model. There are some simple instructions here.

Code
QUANTIZED_DYNAMIC_FILE = MODEL_FILE.parent / "resnet18.dynamic.onnx"
QUANTIZED_QAT_FILE = MODEL_FILE.parent / "resnet18.qat.onnx"
Code
from onnxruntime.quantization import quantize_dynamic, quantize_qat, QuantType

dynamic_model = quantize_dynamic(str(MODEL_FILE), str(QUANTIZED_DYNAMIC_FILE), weight_type=QuantType.QUInt8)
qat_model = quantize_qat(str(MODEL_FILE), str(QUANTIZED_QAT_FILE))
Warning: The original model opset version is 9, which does not support quantization. Please update the model to opset >= 11. Updating the model automatically to opset 11. Please verify the quantized model.
Warning: The original model opset version is 9, which does not support quantization. Please update the model to opset >= 11. Updating the model automatically to opset 11. Please verify the quantized model.
Code
ort_dynamic_sess = ort.InferenceSession(str(QUANTIZED_DYNAMIC_FILE))

dynamic_outputs = ort_dynamic_sess.run(None, {"input_image": dummy_input.numpy()})
dynamic_output = dynamic_outputs[0]
print(dynamic_output.shape)
(1, 1000)
Code
dynamic_output.dtype
dtype('float32')
Code
ort_qat_sess = ort.InferenceSession(str(QUANTIZED_QAT_FILE))

qat_outputs = ort_qat_sess.run(None, {"input_image": dummy_input.numpy()})
qat_output = qat_outputs[0]
print(qat_output.shape)
(1, 1000)
Code
qat_output.dtype
dtype('float32')
Code
with torch.no_grad():
    torch_output = model(dummy_input).numpy()

for name, output in [("dynamic", dynamic_output), ("qat", qat_output)]:
    mean_difference = np.absolute(output - torch_output).mean()
    mean_onnx_output = np.absolute(output).mean()

    print(f"The mean absolute difference is:         {mean_difference:0.9f}")
    print(f"The mean value of the {name: <7} model is:  {mean_onnx_output:0.9f}")

mean_torch_output = np.absolute(torch_output).mean()
print(f"The mean value of the torch model is:    {mean_torch_output:0.9f}")
The mean absolute difference is:         0.128909871
The mean value of the dynamic model is:  1.163000822
The mean absolute difference is:         0.128909871
The mean value of the qat     model is:  1.163000822
The mean value of the torch model is:    1.143334746

I’m really impressed with how easy it was to use the quantized models. The difference is now dramatically bigger, at about \(\frac{1}{10}\) per entry. This could be big enough to seriously affect the accuracy of the quantized model.

For reference, the quantization has cut the size down to \(\frac{1}{4}\) of the original. Resnet18 is very small though, so that’s not a big deal. For a larger model the saving would be great.
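
That figure is easy to check by comparing the files on disk:

Code
for path in [MODEL_FILE, QUANTIZED_DYNAMIC_FILE, QUANTIZED_QAT_FILE]:
    print(f"{path.name: <25} {path.stat().st_size / 1024 ** 2:0.1f} MB")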


For a final evaluation let’s see how the CPU onnx quantized model compares in speed against the GPU torch model. I should also look at how often they disagree in their outputs. Since resnet is a single-label classifier I can just take the argmax index of both outputs.

Code
%%timeit

with torch.no_grad():
    model(dummy_input)
13 ms ± 356 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I’m not sure this is on the GPU!

Code
model.cuda()
cuda_input = dummy_input.clone().cuda()
Code
%%timeit

with torch.no_grad():
    model(cuda_input)
1.89 ms ± 842 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So now we can compare this to the onnx models.

Code
numpy_input = dummy_input.numpy()
Code
%%timeit

ort_dynamic_sess.run(None, {"input_image": numpy_input})
8.79 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Code
%%timeit

ort_qat_sess.run(None, {"input_image": numpy_input})
8.47 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the quantized onnx model is about 4.5x slower than the GPU pytorch version, and about 1.5x faster than the CPU pytorch version. That seems like a reasonable result for a quick spot check.

Code
def is_equal(torch_model, onnx_session) -> bool:
    model_input = torch.randn(1, 3, 224, 224)
    # the torch model was moved to the GPU above, so the input has to follow it
    with torch.no_grad():
        torch_output = torch_model(model_input.cuda()).cpu().numpy().argmax(axis=-1)
    onnx_output = onnx_session.run(None, {"input_image": model_input.numpy()})[0].argmax(axis=-1)

    return np.array_equal(torch_output, onnx_output)
Code
dynamic_matching = np.array([
    is_equal(model, ort_dynamic_sess)
    for _ in range(1_000)
])
dynamic_matching.sum() / 1_000
0.99
Code
qat_matching = np.array([
    is_equal(model, ort_qat_sess)
    for _ in range(1_000)
])
qat_matching.sum() / 1_000
0.998

A 1% difference seems pretty acceptable to me. This evaluation makes quantization look awesome!