from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
February 21, 2021
I’ve quantized resnet18 a couple of times now. Let’s see if I can apply the same techniques to a larger model, in this case GPT-2. I happen to know that there is a problem case with this that will be interesting to investigate.
The problem case is that the GPT-2 model can take the previous output of the model (the “past”) to allow it to process larger input sequences. Correctly exporting the model to ONNX will involve handling this optional input in some way.
The first thing to do is just to get the model working at all. To start with I am using the transformers example.
sequence = f"Hugging Face is based in DUMBO, New York City, and "
input_ids = tokenizer.encode(sequence, return_tensors="pt")
# get logits of last hidden state
next_token_logits = model(input_ids).logits[:, -1, :]
# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
# sample
probs = F.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated = torch.cat([input_ids, next_token], dim=-1)
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)
Hugging Face is based in DUMBO, New York City, and
So this isn’t working? The example they show generates the prediction “has”. Maybe getting more than the top token would help.
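Sampling a few alternatives is just a matter of repeating the multinomial draw; roughly a loop like this (a sketch of the sampling, the exact cell may have differed):
# sketch: draw ten alternative next tokens from the filtered distribution
for _ in range(10):
    sampled = torch.multinomial(probs, num_samples=1)
    print(tokenizer.decode(torch.cat([input_ids, sampled], dim=-1).tolist()[0]))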
Hugging Face is based in DUMBO, New York City, and
Hugging Face is based in DUMBO, New York City, and urs
Hugging Face is based in DUMBO, New York City, and iz
Hugging Face is based in DUMBO, New York City, and iph
Hugging Face is based in DUMBO, New York City, and ________
Hugging Face is based in DUMBO, New York City, and �
Hugging Face is based in DUMBO, New York City, and ik
Hugging Face is based in DUMBO, New York City, and 【
Hugging Face is based in DUMBO, New York City, and �
Hugging Face is based in DUMBO, New York City, and ia
This looks like a complete mess. I wonder what it would look like if I generated several tokens instead of just one.
sequence = f"Hugging Face is based in DUMBO, New York City, and "
input_ids = tokenizer.encode(sequence, return_tensors="pt")
for _ in range(10):
    next_token_logits = model(input_ids).logits[:, -1, :]
    filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
    probs = F.softmax(filtered_next_token_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    input_ids = torch.cat([input_ids, next_token], dim=-1)
resulting_string = tokenizer.decode(input_ids.tolist()[0])
print(resulting_string)
Hugging Face is based in DUMBO, New York City, and is available every Saturday through Sunday. I
That’s better. I wonder if the special tokens are interfering with the extension of the input?
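A quick way to check is to compare the encoding with and without special tokens; something along these lines (a sketch, where an equal result means no special tokens are being added):
# assumption: check whether encoding with and without special tokens differs
(
    tokenizer.encode(sequence, return_tensors="pt")
    == tokenizer.encode(sequence, add_special_tokens=False, return_tensors="pt")
).all()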
tensor(True)
So it’s not that. Very strange.
Anyway the next thing to investigate is converting this to ONNX and then quantizing it. This involves a lot of hacking. If you want to do this properly then use the current master version of onnxruntime and run the benchmark_gpt2 script.
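The dump below is just the model configuration; inspecting it is a one-liner:
# the GPT2Config repr shown below
model.config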
GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 50257
}
The n_positions value (1024) in this configuration is the maximum sequence length (you can see this in the configuration details here). Since I want to be able to pass in tensors up to this sequence length I should make sure that the export handles at least this length of input.
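The export itself was a plain torch.onnx.export call, roughly along these lines; the input_tokens name comes from the error message further down, DATASET_FOLDER comes from a setup cell that isn’t shown here, and the remaining arguments are my reconstruction rather than the exact cell:
# sketch of the export call; argument values are reconstructed
dummy_input = torch.randint(0, model.config.vocab_size, (1, model.config.n_positions))
torch.onnx.export(
    model,
    (dummy_input,),
    str(DATASET_FOLDER / "model.onnx"),
    opset_version=12,
    input_names=["input_tokens"],
    output_names=["output_logits"],
)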
So this failed several times. I had to define and increase the opset_version until it passed. I wonder if there is a way to enumerate the current opset versions?
After investigating the source code I can find _onnx_stable_opsets = [7, 8, 9, 10, 11, 12] in torch.onnx.symbolic_helper. So I’m using the very latest version!
Now to run it…
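Checking it means running the exported model and comparing its logits against the PyTorch model; roughly the following (the output ordering from the session is an assumption):
import onnxruntime as ort

# run the exported model and compare its logits against the pytorch model
session = ort.InferenceSession(str(DATASET_FOLDER / "model.onnx"))
onnx_logits = session.run(None, {"input_tokens": dummy_input.numpy()})[0]
with torch.no_grad():
    torch_logits = model(dummy_input).logits
print(torch_logits.shape)
diff = (torch_logits - torch.from_numpy(onnx_logits)).abs()
print((diff.mean(), diff.max()))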
torch.Size([1, 1024, 50257])
(tensor(1.9999e-05), tensor(0.0001))
So that’s pretty close. The next thing is to check that it can handle the variable length inputs, like the original input sentence we tried before.
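That just means feeding the shorter input_ids left over from the generation loop (now 26 tokens long) straight into the session, sketched here:
# sketch: the exported model only accepts the fixed-size input
session.run(None, {"input_tokens": input_ids.numpy()})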
InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: input_tokens for the following indices
index: 1 Got: 26 Expected: 1024
Please fix either the inputs or the model.
So the export of the model made that input fixed size. I wonder if that can be fixed?
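One way to fix it, which I haven’t tried here, would be to mark the batch and sequence dimensions as dynamic at export time:
# sketch only: dynamic_axes lets the exported model accept any sequence length
torch.onnx.export(
    model,
    (dummy_input,),
    str(DATASET_FOLDER / "model.onnx"),
    opset_version=12,
    input_names=["input_tokens"],
    output_names=["output_logits"],
    dynamic_axes={
        "input_tokens": {0: "batch", 1: "sequence"},
        "output_logits": {0: "batch", 1: "sequence"},
    },
)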
As it happens I know that onnxruntime has some code to help with all this, so instead of investigating this problem deeply I am just going to use theirs! You can see their benchmark gpt2 code here.
from transformers import AutoConfig
from onnxruntime.transformers.gpt2_helper import Gpt2Helper, MODEL_CLASSES, PRETRAINED_GPT2_MODELS
config = AutoConfig.from_pretrained("gpt2")
model = MODEL_CLASSES["GPT2LMHeadModel"][0].from_pretrained(
    "gpt2",
    config=config,
)
Gpt2Helper.export_onnx(
    model=model,
    device="cpu",
    onnx_model_path=str(DATASET_FOLDER / "model.onnx"),
    use_external_data_format=config.n_layer > 24
)
TypeError: forward() got an unexpected keyword argument 'past'
So this is interesting: it’s a breaking change that came with transformers 4.0, see the bug on github. The fix for this has been merged, however it’s not in the current onnxruntime release (1.6.0). So for now I guess I have to take gpt2_helper.py from master and patch my version?
The GPT-2 models are wrapped in helper classes defined in gpt2_helper.py. If I redefine that wrapper then I can correctly pass the past parameter into the model.
from transformers import GPT2LMHeadModel
class MyGPT2LMHeadModel(GPT2LMHeadModel):
    """ Here we wrap a class for Onnx model conversion for GPT2LMHeadModel with past state.
    """
    def __init__(self, config):
        super().__init__(config)

    def forward(self, input_ids, position_ids, attention_mask, *past):
        return super().forward(input_ids,
                               position_ids=position_ids,
                               attention_mask=attention_mask,
                               past_key_values=past)
from transformers import AutoConfig
from onnxruntime.transformers.gpt2_helper import Gpt2Helper, MODEL_CLASSES, PRETRAINED_GPT2_MODELS
config = AutoConfig.from_pretrained("gpt2")
model = MyGPT2LMHeadModel.from_pretrained(
    "gpt2",
    config=config,
)
Gpt2Helper.export_onnx(
    model=model,
    device="cpu",
    onnx_model_path=str(DATASET_FOLDER / "model.onnx"),
    use_external_data_format=config.n_layer > 24
)
AttributeError: 'tuple' object has no attribute 'shape'
Now it doesn’t like the gpt2_helper code I guess? It’s either the inputs or the outputs that are problematic.
dummy_inputs = Gpt2Helper.get_dummy_inputs(
    batch_size=1,
    past_sequence_length=1,
    sequence_length=1,
    num_attention_heads=config.num_attention_heads,
    hidden_size=config.hidden_size,
    num_layer=config.n_layer,
    vocab_size=config.vocab_size,
    device="cpu",
    float16=False,
    has_position_ids=True,
    has_attention_mask=True
)
with torch.no_grad():
    outputs = model(*dummy_inputs.to_list())
type(outputs[1][0])
tuple
So this doesn’t like the return value of the model. That may have changed too.
Is it returning the output of each layer of the model perhaps? It looks like each layer’s entry in past_key_values is now a tuple of separate key and value tensors rather than a single stacked tensor, so the output has one more level of nesting than the onnx helper script expects and unwrapping one level would be appropriate.
from transformers import GPT2LMHeadModel
class MyGPT2LMHeadModel(GPT2LMHeadModel):
    """ Here we wrap a class for Onnx model conversion for GPT2LMHeadModel with past state.
    """
    def __init__(self, config):
        super().__init__(config)

    def forward(self, input_ids, position_ids, attention_mask, *past):
        outputs = super().forward(input_ids,
                                  position_ids=position_ids,
                                  attention_mask=attention_mask,
                                  past_key_values=past)
        # flatten the past_key_values
        outputs.past_key_values = tuple(
            out[0]
            for out in outputs.past_key_values
        )
        return outputs
from transformers import AutoConfig
from onnxruntime.transformers.gpt2_helper import Gpt2Helper, MODEL_CLASSES, PRETRAINED_GPT2_MODELS
config = AutoConfig.from_pretrained("gpt2")
model = MyGPT2LMHeadModel.from_pretrained(
    "gpt2",
    config=config,
)
Gpt2Helper.export_onnx(
    model=model,
    device="cpu",
    onnx_model_path=str(DATASET_FOLDER / "model.onnx"),
    use_external_data_format=config.n_layer > 24
)
So that is just a whole load of hacking around. I’m not confident that this approach is correct - the best way would be to compile onnxruntime from source and use the patched code.
Now it’s time to see how closely the exported version matches the transformers model. Remember that even though the model is a wrapper defined in this notebook, the base model is still from transformers.
To prepare the data for the onnx_session I’m using code from the Gpt2Helper.onnxruntime_inference method.
import onnxruntime as ort
import numpy as np
onnx_session = ort.InferenceSession(str(DATASET_FOLDER / "model.onnx"))
dummy_inputs = Gpt2Helper.get_dummy_inputs(
    batch_size=1,
    past_sequence_length=1,
    sequence_length=1,
    num_attention_heads=config.num_attention_heads,
    hidden_size=config.hidden_size,
    num_layer=config.n_layer,
    vocab_size=config.vocab_size,
    device="cpu",
    float16=False,
    has_position_ids=True,
    has_attention_mask=True
)
# generates the full input, copied from the onnxruntime code
# see it in onnxruntime_inference
# https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/gpt2_helper.py#L359
ort_inputs = {'input_ids': np.ascontiguousarray(dummy_inputs.input_ids.cpu().numpy())}
if dummy_inputs.past is not None:
    for i, past_i in enumerate(dummy_inputs.past):
        ort_inputs[f'past_{i}'] = np.ascontiguousarray(past_i.cpu().numpy())
if dummy_inputs.attention_mask is not None:
    ort_inputs['attention_mask'] = np.ascontiguousarray(dummy_inputs.attention_mask.cpu().numpy())
if dummy_inputs.position_ids is not None:
    ort_inputs['position_ids'] = np.ascontiguousarray(dummy_inputs.position_ids.cpu().numpy())
onnx_output = onnx_session.run(
    None,
    ort_inputs
)
onnx_output[0].shape
(1, 1, 50257)
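For the comparison I run the wrapped PyTorch model on the same dummy inputs; roughly the following (a sketch; torch_output is what the later cells refer to):
# run the wrapped pytorch model on the same dummy inputs and compare
with torch.no_grad():
    torch_output = model(*dummy_inputs.to_list())
print(torch_output.logits.shape)
print(f"mean absolute difference: {(torch_output.logits - onnx_output[0]).abs().mean()}")
print(f"max absolute difference: {(torch_output.logits - onnx_output[0]).abs().max()}")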
torch.Size([1, 1, 50257])
mean absolute difference: 2.7120806862512836e-06
max absolute difference: 1.52587890625e-05
So this is pretty encouraging? The model has exported successfully and the difference from the transformers model is very small.
The last thing to do is quantizing GPT-2. Hopefully this can retain the close accuracy of the ONNX version.
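A dynamic quantization call along these lines is one way to produce a file like model.qat.onnx (a sketch; whether these exact arguments were used here is an assumption):
from onnxruntime.quantization import QuantType, quantize_dynamic

# assumption: dynamic quantization of the exported model to int8 weights
quantize_dynamic(
    str(DATASET_FOLDER / "model.onnx"),
    str(DATASET_FOLDER / "model.qat.onnx"),
    weight_type=QuantType.QInt8,
)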
onnx_session = ort.InferenceSession(str(DATASET_FOLDER / "model.qat.onnx"))
onnx_output = onnx_session.run(
    None,
    ort_inputs
)
with torch.no_grad():
    print(f"mean absolute difference: {(torch_output.logits - onnx_output[0]).abs().mean()}")
    print(f"max absolute difference: {(torch_output.logits - onnx_output[0]).abs().max()}")
    print(f"torch mean: {torch_output.logits.mean()}, torch standard deviation: {torch_output.logits.std()}")
    print(f"onnx mean: {onnx_output[0].mean()}, onnx standard deviation: {onnx_output[0].std()}")
mean absolute difference: 1.2408978939056396
max absolute difference: 1.8664531707763672
torch mean: -18.513174057006836, torch standard deviation: 1.3360968828201294
onnx mean: -17.272275924682617, onnx standard deviation: 1.3170347213745117
So it quantized, however this difference is massive. Does it actually change the prediction?
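Roughly, that check comes down to comparing which token each model ranks first (a sketch):
# sketch: compare the argmax of the final position logits from each model
print(torch_output.logits[0, -1].argmax().item())
print(onnx_output[0][0, -1].argmax())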
It does. I’ve seen cases with a similar difference where the prediction does not change. I would have to evaluate the consistency far more to have confidence in the quantized version.
Still, this is a running version of GPT-2 that has been quantized with support for the past. That’s a pretty big achievement. This has cut the size of the model on disk from 635M to 168M which is nice. It should be pretty fast too.
Let’s see how fast the quantized version is then. This initial test is done with unrealistically small input sizes.
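The timing itself is just %timeit over each model with the same tiny dummy inputs; roughly the following (a sketch, and the same pattern applies to the GPU run further down):
# quantized onnx model on cpu
%timeit onnx_session.run(None, ort_inputs)
# pytorch model on cpu, same dummy inputs
%timeit model(*dummy_inputs.to_list())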
4.8 ms ± 5.61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
22.7 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Both of these models are on CPU, so this is a solid 4x speedup.
model.cuda()
dummy_inputs = Gpt2Helper.get_dummy_inputs(
    batch_size=1,
    past_sequence_length=1,
    sequence_length=1,
    num_attention_heads=config.num_attention_heads,
    hidden_size=config.hidden_size,
    num_layer=config.n_layer,
    vocab_size=config.vocab_size,
    device="cuda",
    float16=False,
    has_position_ids=True,
    has_attention_mask=True
)
9.48 ms ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It’s even faster than the GPU version? That’s neat. I guess this is down to the GPU parallelizing better but starting off worse: at this tiny input size the fixed overhead dominates.
So the dummy input is extremely small. I should make it bigger to get a more honest comparison.
dummy_inputs = Gpt2Helper.get_dummy_inputs(
    # embiggen
    batch_size=16,
    past_sequence_length=512,
    sequence_length=512,
    num_attention_heads=config.num_attention_heads,
    hidden_size=config.hidden_size,
    num_layer=config.n_layer,
    vocab_size=config.vocab_size,
    device="cpu",
    float16=False,
    has_position_ids=True,
    has_attention_mask=True
)
ort_inputs = {'input_ids': np.ascontiguousarray(dummy_inputs.input_ids.cpu().numpy())}
if dummy_inputs.past is not None:
    for i, past_i in enumerate(dummy_inputs.past):
        ort_inputs[f'past_{i}'] = np.ascontiguousarray(past_i.cpu().numpy())
if dummy_inputs.attention_mask is not None:
    ort_inputs['attention_mask'] = np.ascontiguousarray(dummy_inputs.attention_mask.cpu().numpy())
if dummy_inputs.position_ids is not None:
    ort_inputs['position_ids'] = np.ascontiguousarray(dummy_inputs.position_ids.cpu().numpy())
7.1 s ± 81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.95 s ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I can’t actually run this on my GPU as the inputs are too big to fit now. Given that the PyTorch CPU version has pulled ahead, I don’t think the quantized performance is that incredible anymore.