Quantize GPT-2

Convert GPT-2 to int8 and check performance
Published

February 21, 2021

I’ve quantized resnet18 a couple of times now. Let’s see if I can apply the same techniques to a larger model, in this case GPT-2. I happen to know that there is a problem case with this model that will be interesting to investigate.

The problem case is that the GPT-2 model can take the cached state from a previous forward pass (the “past”) so that it can extend a sequence without reprocessing all of the earlier tokens. Correctly exporting the model to ONNX will involve handling this optional input in some way.
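
To make that concrete, here is a minimal sketch of how the past works in transformers (the standard use_cache / past_key_values API); this just illustrates the mechanism and isn’t part of the export yet:

Code
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("Hugging Face is based in", return_tensors="pt")

# first call: process the whole prompt and keep the key/value cache
outputs = model(input_ids, use_cache=True)
past = outputs.past_key_values

# later calls: only the newest token is passed in, the "past" carries the
# attention keys and values for everything that came before it
next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
outputs = model(next_token, past_key_values=past, use_cache=True)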

The first thing to do is just to get the model working at all. To start with I am using the transformers example.

Code
from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
Code
sequence = f"Hugging Face is based in DUMBO, New York City, and "

input_ids = tokenizer.encode(sequence, return_tensors="pt")

# get logits of last hidden state
next_token_logits = model(input_ids).logits[:, -1, :]

# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
probs = F.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
Hugging Face is based in DUMBO, New York City, and  

So this isn’t working? The example they show generates the prediction “has”. Maybe sampling several candidates for the next token will show what’s going on.

Code
next_token = torch.multinomial(probs, num_samples=10).reshape(10, 1)
generated = torch.cat([input_ids.expand((10, input_ids.shape[1])), next_token], dim=-1)
print(
    "\n".join(
        tokenizer.decode(tokens)
        for tokens in generated.tolist()
    )
)
Hugging Face is based in DUMBO, New York City, and  
Hugging Face is based in DUMBO, New York City, and urs
Hugging Face is based in DUMBO, New York City, and iz
Hugging Face is based in DUMBO, New York City, and iph
Hugging Face is based in DUMBO, New York City, and ________
Hugging Face is based in DUMBO, New York City, and �
Hugging Face is based in DUMBO, New York City, and ik
Hugging Face is based in DUMBO, New York City, and 【
Hugging Face is based in DUMBO, New York City, and �
Hugging Face is based in DUMBO, New York City, and ia

This looks like a complete mess. I wonder what it would look like if I generated several tokens instead of just one.

Code
sequence = f"Hugging Face is based in DUMBO, New York City, and "

input_ids = tokenizer.encode(sequence, return_tensors="pt")

for _ in range(10):
    next_token_logits = model(input_ids).logits[:, -1, :]
    filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
    probs = F.softmax(filtered_next_token_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    input_ids = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(input_ids.tolist()[0])
print(resulting_string)
Hugging Face is based in DUMBO, New York City, and  is available every Saturday through Sunday.  I

That’s better. I wonder if the special tokens are interfering with the extension of the input?

Code
torch.all(
    torch.eq(
        tokenizer.encode(sequence, return_tensors="pt"),
        tokenizer.encode(sequence, return_tensors="pt", add_special_tokens=False)
    )
)
tensor(True)

So it’s not that. Very strange.
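
One extra check I could do (not in the original example) is greedy decoding: just take the argmax of the logits to see the single most likely next token. My suspicion is that the trailing space in the prompt is the real culprit, since it gets encoded as a lone “ ” token that GPT-2 rarely sees, so I would also compare against a stripped prompt. This is speculation on my part rather than anything from the transformers documentation.

Code
# greedy check: recompute the logits for the prompt and take the argmax
input_ids = tokenizer.encode(sequence, return_tensors="pt")
with torch.no_grad():
    greedy_token = model(input_ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
print(repr(tokenizer.decode(greedy_token[0])))

# same check with the trailing space stripped from the prompt
stripped_ids = tokenizer.encode(sequence.strip(), return_tensors="pt")
with torch.no_grad():
    greedy_token = model(stripped_ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
print(repr(tokenizer.decode(greedy_token[0])))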


Export to ONNX

Anyway the next thing to investigate is converting this to ONNX and then quantizing it. This involves a lot of hacking. If you want to do this properly then use the current master version of ONNX Runtime and run its benchmark_gpt2 script.

Code
model.config
GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 50257
}

The n_positions value (1024) in this configuration is the maximum sequence length (you can see this in the configuration details here). Since I want to be able to pass in tensors up to this sequence length I should make sure that the export handles at least this length of input.

Code
input_ids.shape
torch.Size([1, 26])
Code
from pathlib import Path

DATASET_FOLDER = Path(".") / "data" / "2021-02-21-quantize-gpt-2"
Code
dummy_input = (torch.rand(1, model.config.n_positions) * model.config.vocab_size).long()
torch.onnx.export(
    model,
    dummy_input,
    DATASET_FOLDER / "model.onnx",
    input_names=['input_tokens'],
    opset_version=12,
)

So this failed several times. I had to set and then increase the opset_version until it passed. I wonder if there is a way to enumerate the supported opset versions?

After investigating the source code I can find _onnx_stable_opsets = [7, 8, 9, 10, 11, 12] in torch.onnx.symbolic_helper. So I’m using the very latest version!
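
For reference, that list can be printed directly (these are private attributes of torch.onnx, so the names may move between versions):

Code
# list the ONNX opset versions this build of torch considers stable
from torch.onnx import symbolic_helper

print(symbolic_helper._onnx_stable_opsets)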

Now to run it…

Code
import onnxruntime as ort

onnx_session = ort.InferenceSession(str(DATASET_FOLDER / "model.onnx"))

onnx_output = onnx_session.run(
    None,
    {"input_tokens": dummy_input.numpy()}
)
Code
len(onnx_output)
25
Code
onnx_output[0].shape
(1, 1024, 50257)
Code
with torch.no_grad():
    torch_output = model(dummy_input).logits
torch_output.shape
torch.Size([1, 1024, 50257])
Code
(torch_output - onnx_output[0]).abs().mean(), (torch_output - onnx_output[0]).abs().max()
(tensor(1.9999e-05), tensor(0.0001))

So that’s pretty close. The next thing is to check that it can handle the variable length inputs, like the original input sentence we tried before.

Code
sequence = f"Hugging Face is based in DUMBO, New York City, and "
input_ids = tokenizer.encode(sequence, return_tensors="pt")

onnx_output = onnx_session.run(
    None,
    {"input_tokens": input_ids.numpy()}
)
InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: input_tokens for the following indices
 index: 1 Got: 26 Expected: 1024
 Please fix either the inputs or the model.

So the export of the model made that input fixed size. I wonder if that can be fixed?
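
I believe the fix would be the dynamic_axes argument to torch.onnx.export, which marks dimensions as variable instead of baking in the dummy input’s shape. A sketch of that (untested here, and writing to a separate hypothetical file so nothing gets clobbered):

Code
# re-export with the batch and sequence dimensions of the input marked as dynamic
torch.onnx.export(
    model,
    dummy_input,
    DATASET_FOLDER / "model.dynamic.onnx",  # hypothetical path, to avoid clobbering the export above
    input_names=["input_tokens"],
    dynamic_axes={"input_tokens": {0: "batch", 1: "sequence"}},
    opset_version=12,
)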


Use existing Exporter Script

As it happens I know that ONNX Runtime has some code to help with all this, so instead of investigating this problem deeply I am just going to use theirs! You can see their benchmark gpt2 code here.

Code
from transformers import AutoConfig
from onnxruntime.transformers.gpt2_helper import Gpt2Helper, MODEL_CLASSES, PRETRAINED_GPT2_MODELS

config = AutoConfig.from_pretrained("gpt2")
model = MODEL_CLASSES["GPT2LMHeadModel"][0].from_pretrained(
    "gpt2",
    config=config,
)

Gpt2Helper.export_onnx(
    model=model,
    device="cpu",
    onnx_model_path=str(DATASET_FOLDER / "model.onnx"),
    use_external_data_format=config.n_layer > 24
)
TypeError: forward() got an unexpected keyword argument 'past'

So this is interesting: it’s a breaking change introduced by transformers 4.0 (see the bug on github). The fix for this has been merged, however it’s not in the current onnxruntime release (1.6.0). So for now I guess I have to take the gpt2_helper.py from master and patch my version?

The gpt2 models are wrapped in wrapper classes defined in gpt2_helper.py. If I redefine that wrapper then I can correctly pass the past values into the model as past_key_values.

Code
from transformers import GPT2LMHeadModel

class MyGPT2LMHeadModel(GPT2LMHeadModel):
    """ Here we wrap a class for Onnx model conversion for GPT2LMHeadModel with past state.
    """
    def __init__(self, config):
        super().__init__(config)

    def forward(self, input_ids, position_ids, attention_mask, *past):
        return super().forward(input_ids,
                               position_ids=position_ids,
                               attention_mask=attention_mask,
                               past_key_values=past)
Code
from transformers import AutoConfig
from onnxruntime.transformers.gpt2_helper import Gpt2Helper, MODEL_CLASSES, PRETRAINED_GPT2_MODELS

config = AutoConfig.from_pretrained("gpt2")
model = MyGPT2LMHeadModel.from_pretrained(
    "gpt2",
    config=config,
)

Gpt2Helper.export_onnx(
    model=model,
    device="cpu",
    onnx_model_path=str(DATASET_FOLDER / "model.onnx"),
    use_external_data_format=config.n_layer > 24
)
AttributeError: 'tuple' object has no attribute 'shape'

Now it’s the gpt2_helper code that is unhappy, I guess? It’s either the model inputs or the outputs that are problematic.

Code
dummy_inputs = Gpt2Helper.get_dummy_inputs(
    batch_size=1,
    past_sequence_length=1,
    sequence_length=1,
    num_attention_heads=config.num_attention_heads,
    hidden_size=config.hidden_size,
    num_layer=config.n_layer,
    vocab_size=config.vocab_size,
    device="cpu",
    float16=False,
    has_position_ids=True,
    has_attention_mask=True
)

with torch.no_grad():
    outputs = model(*dummy_inputs.to_list())

type(outputs[1][0])
tuple

So the helper doesn’t like the return value of the model. That may have changed too.

Code
len(outputs[1])
12

Is it returning the past state for each layer of the model perhaps? It looks like the past output has an extra level of nesting compared to what the onnx helper script expects, so unwrapping one level would be appropriate.

Code
from transformers import GPT2LMHeadModel

class MyGPT2LMHeadModel(GPT2LMHeadModel):
    """ Here we wrap a class for Onnx model conversion for GPT2LMHeadModel with past state.
    """
    def __init__(self, config):
        super().__init__(config)

    def forward(self, input_ids, position_ids, attention_mask, *past):
        outputs = super().forward(input_ids,
                               position_ids=position_ids,
                               attention_mask=attention_mask,
                               past_key_values=past)
        # flatten the past_key_values
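        # (note: this keeps only the first element of each layer's past tuple,
        #  so part of the cached state is discarded here - it is a hack)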
        outputs.past_key_values = tuple(
            out[0]
            for out in outputs.past_key_values
        )
        return outputs
Code
from transformers import AutoConfig
from onnxruntime.transformers.gpt2_helper import Gpt2Helper, MODEL_CLASSES, PRETRAINED_GPT2_MODELS

config = AutoConfig.from_pretrained("gpt2")
model = MyGPT2LMHeadModel.from_pretrained(
    "gpt2",
    config=config,
)

Gpt2Helper.export_onnx(
    model=model,
    device="cpu",
    onnx_model_path=str(DATASET_FOLDER / "model.onnx"),
    use_external_data_format=config.n_layer > 24
)

So that is just a whole load of hacking around. I’m not confident that this approach is correct - the best way would be to compile onnxruntime from source and use the patched code.


Evaluate ONNX GPT-2

Now it’s time to see how closely the exported version matches the transformers model. Remember that even though the model is a wrapper defined in this notebook, the base model is still from transformers.

To prepare the data for the onnx_session I’m using code from the Gpt2Helper.onnxruntime_inference method.

Code
import onnxruntime as ort
import numpy as np

onnx_session = ort.InferenceSession(str(DATASET_FOLDER / "model.onnx"))
dummy_inputs = Gpt2Helper.get_dummy_inputs(
    batch_size=1,
    past_sequence_length=1,
    sequence_length=1,
    num_attention_heads=config.num_attention_heads,
    hidden_size=config.hidden_size,
    num_layer=config.n_layer,
    vocab_size=config.vocab_size,
    device="cpu",
    float16=False,
    has_position_ids=True,
    has_attention_mask=True
)

# generates the full input, copied from the onnxruntime code
# see it in onnxruntime_inference
# https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/gpt2_helper.py#L359
ort_inputs = {'input_ids': np.ascontiguousarray(dummy_inputs.input_ids.cpu().numpy())}

if dummy_inputs.past is not None:
    for i, past_i in enumerate(dummy_inputs.past):
        ort_inputs[f'past_{i}'] = np.ascontiguousarray(past_i.cpu().numpy())

if dummy_inputs.attention_mask is not None:
    ort_inputs['attention_mask'] = np.ascontiguousarray(dummy_inputs.attention_mask.cpu().numpy())

if dummy_inputs.position_ids is not None:
    ort_inputs['position_ids'] = np.ascontiguousarray(dummy_inputs.position_ids.cpu().numpy())

onnx_output = onnx_session.run(
    None,
    ort_inputs
)
onnx_output[0].shape
(1, 1, 50257)
Code
torch_output = model(
    dummy_inputs.input_ids,
    dummy_inputs.position_ids,
    dummy_inputs.attention_mask,
    *dummy_inputs.past
)
torch_output.logits.shape
torch.Size([1, 1, 50257])
Code
with torch.no_grad():
    print(f"mean absolute difference: {(torch_output.logits - onnx_output[0]).abs().mean()}")
    print(f"max absolute difference: {(torch_output.logits - onnx_output[0]).abs().max()}")
mean absolute difference: 2.7120806862512836e-06
max absolute difference: 1.52587890625e-05

So this is pretty encouraging? The model has exported successfully and the difference from the transformers model is very small.
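
As a small aside, that closeness can also be asserted programmatically; a minimal sketch, with the tolerance picked by eye from the numbers above:

Code
import numpy as np

# the logits agree to within ~1.5e-05, so an absolute tolerance of 1e-4 passes
assert np.allclose(torch_output.logits.detach().numpy(), onnx_output[0], atol=1e-4)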


Quantizing GPT-2

The last thing to do is to quantize GPT-2. Hopefully the quantized model can retain the close match of the ONNX version.

Code
from onnxruntime.quantization import quantize_qat

quantize_qat(str(DATASET_FOLDER / "model.onnx"), str(DATASET_FOLDER / "model.qat.onnx"))
Code
onnx_session = ort.InferenceSession(str(DATASET_FOLDER / "model.qat.onnx"))

onnx_output = onnx_session.run(
    None,
    ort_inputs
)

with torch.no_grad():
    print(f"mean absolute difference: {(torch_output.logits - onnx_output[0]).abs().mean()}")
    print(f"max absolute difference: {(torch_output.logits - onnx_output[0]).abs().max()}")
    print(f"torch mean: {torch_output.logits.mean()}, torch standard deviation: {torch_output.logits.std()}")
    print(f"onnx mean: {onnx_output[0].mean()}, onnx standard deviation: {onnx_output[0].std()}")
mean absolute difference: 1.2408978939056396
max absolute difference: 1.8664531707763672
torch mean: -18.513174057006836, torch standard deviation: 1.3360968828201294
onnx mean: -17.272275924682617, onnx standard deviation: 1.3170347213745117

So it quantized, however this difference is massive. Does it actually change the prediction?

Code
torch_output.logits.argmax(), onnx_output[0].argmax()
(tensor(13), 11)

It does. I’ve seen cases with a similarly large difference where the prediction does not change. I would have to evaluate the consistency far more to have confidence in the quantized version.
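
One caveat here: quantize_qat is (as the name suggests) aimed at models that went through quantization aware training, which this GPT-2 has not. If I were to dig further, the next thing I would try is dynamic post-training quantization. A sketch of that, assuming the onnxruntime 1.6 quantization API (untested here, with a hypothetical output path):

Code
from onnxruntime.quantization import quantize_dynamic, QuantType

# post-training dynamic quantization: weights stored as int8,
# activations quantized on the fly at inference time
quantize_dynamic(
    str(DATASET_FOLDER / "model.onnx"),
    str(DATASET_FOLDER / "model.dynamic-quant.onnx"),  # hypothetical output path
    weight_type=QuantType.QInt8,
)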

Still, this is a running version of GPT-2 that has been quantized with support for the past. That’s a pretty big achievement. This has cut the size of the model on disk from 635M to 168M which is nice. It should be pretty fast too.
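
A quick way to check those on-disk sizes from within the notebook (using the paths defined earlier):

Code
for name in ["model.onnx", "model.qat.onnx"]:
    size_mb = (DATASET_FOLDER / name).stat().st_size / 2**20
    print(f"{name}: {size_mb:.0f}M")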


Performance

Let’s see how fast the quantized version is then. This initial test is done with unrealistically small input sizes.

Code
%%timeit

onnx_output = onnx_session.run(
    None,
    ort_inputs
)
4.8 ms ± 5.61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Code
%%timeit

torch_output = model(
    dummy_inputs.input_ids,
    dummy_inputs.position_ids,
    dummy_inputs.attention_mask,
    *dummy_inputs.past
)
22.7 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Code
model.cuda()

dummy_inputs = Gpt2Helper.get_dummy_inputs(
    batch_size=1,
    past_sequence_length=1,
    sequence_length=1,
    num_attention_heads=config.num_attention_heads,
    hidden_size=config.hidden_size,
    num_layer=config.n_layer,
    vocab_size=config.vocab_size,
    device="cuda",
    float16=False,
    has_position_ids=True,
    has_attention_mask=True
)

Both of those timings were on CPU, so the quantized ONNX model is a solid ~4x speedup over PyTorch. With the model and inputs moved to the GPU above, let’s compare against the GPU version.

Code
%%timeit

torch_output = model(
    dummy_inputs.input_ids,
    dummy_inputs.position_ids,
    dummy_inputs.attention_mask,
    *dummy_inputs.past
)
9.48 ms ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It’s even faster than the GPU version? That’s neat. I guess the GPU parallelizes better at larger sizes but has more fixed overhead, so it loses out on an input this small.


Performance Revisited

So the dummy input is extremely small. I should make it bigger to get a more honest comparison.

Code
dummy_inputs = Gpt2Helper.get_dummy_inputs(
    # embiggen
    batch_size=16,
    past_sequence_length=512,
    sequence_length=512,

    num_attention_heads=config.num_attention_heads,
    hidden_size=config.hidden_size,
    num_layer=config.n_layer,
    vocab_size=config.vocab_size,
    device="cpu",
    float16=False,
    has_position_ids=True,
    has_attention_mask=True
)

ort_inputs = {'input_ids': np.ascontiguousarray(dummy_inputs.input_ids.cpu().numpy())}

if dummy_inputs.past is not None:
    for i, past_i in enumerate(dummy_inputs.past):
        ort_inputs[f'past_{i}'] = np.ascontiguousarray(past_i.cpu().numpy())

if dummy_inputs.attention_mask is not None:
    ort_inputs['attention_mask'] = np.ascontiguousarray(dummy_inputs.attention_mask.cpu().numpy())

if dummy_inputs.position_ids is not None:
    ort_inputs['position_ids'] = np.ascontiguousarray(dummy_inputs.position_ids.cpu().numpy())
Code
%%timeit

onnx_output = onnx_session.run(
    None,
    ort_inputs
)
7.1 s ± 81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Code
model.cpu()
Code
%%timeit

torch_output = model(
    dummy_inputs.input_ids,
    dummy_inputs.position_ids,
    dummy_inputs.attention_mask,
    *dummy_inputs.past
)
5.95 s ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I can’t actually run this on my GPU as the inputs are too big now. Given that the PyTorch CPU version has pulled ahead I don’t think the quantized performance is that impressive anymore.