from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",
    # init_device="meta",
    init_device="cuda",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

# The documentation explicitly states that this is the tokenizer that was used
# You can load the tokenizer with the name "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
Recently Microsoft released the guidance library, which allows you to template the output of a language model. The jsonformer library has also been created, which restricts the output to a JSON schema. How do these two techniques compare?
In this post I am going to run through the examples for each of these libraries and then apply them to an extractive task.
Model
Throughout this post I will be using a large language model. To make the comparison fair I want to use the same model for both libraries. The Open LLM Leaderboard shows that the mosaicml/mpt-7b model performs well for its size, so let’s try that.
We can quickly try the model out on a simple query. I’m trying it with the prompt that FastChat uses for the vicuna model, which has performed well for me before.
One thing to note is that I have to add the USER token id to the stopping tokens, as the model appears to keep generating after the end of its answer. Being able to restrict this generation using the two frameworks would be great.
Code
from transformers import GenerationConfig
import torch
with torch.inference_mode():
    inputs = tokenizer(
        """
A chat between a curious user and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: What is the capital of France?
ASSISTANT:""".strip(),
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)

    config = GenerationConfig(
        early_stopping=True,
        max_new_tokens=20,
        do_sample=True,
        temperature=0.7,
        top_p=1,
        eos_token_id=[
            tokenizer.eos_token_id,
            tokenizer("USER").input_ids[0],
        ],
        pad_token_id=tokenizer.eos_token_id,
    )
    output = model.generate(
        **inputs,
        generation_config=config,
    )
print(tokenizer.decode(output[0]))
A chat between a curious user and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: What is the capital of France?
ASSISTANT: Paris is the capital of France.
USER
It’s easy to use this model and this tiny test produces good results.
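As an aside, if you only want to see the newly generated text you can slice the prompt tokens off before decoding. This is a small convenience sketch using the variables from the cell above, not something the rest of the post depends on:

# decode only the newly generated tokens by skipping the prompt tokens
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0, prompt_length:], skip_special_tokens=True))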
Microsoft Guidance Example
Now we can try out the JSON example from the Microsoft Guidance GitHub page:
import guidance
# we use LLaMA here, but any GPT-style model will do
guidance_model = guidance.llms.Transformers(
    model=model,
    tokenizer=tokenizer,
    device="cuda",
)

# we can pre-define valid option sets
valid_weapons = ["sword", "axe", "mace", "spear", "bow", "crossbow"]

# define the prompt
character_maker = guidance(
    """The following is a character profile for an RPG game in JSON format.
```json
{
    "id": "{{id}}",
    "description": "{{description}}",
    "name": "{{gen 'name' stop='"'}}",
    "age": {{gen 'age' pattern='[0-9]+' stop=','}},
    "armor": "{{#select 'armor'}}leather{{or}}chainmail{{or}}plate{{/select}}",
    "weapon": "{{select 'weapon' options=valid_weapons}}",
    "class": "{{gen 'class' stop='"'}}",
    "mantra": "{{gen 'mantra' temperature=0.7 stop='"'}}",
    "strength": {{gen 'strength' pattern='[0-9]+' stop=','}},
    "items": [{{#geneach 'items' num_iterations=5 join=', '}}"{{gen 'this' temperature=0.7 stop='"'}}"{{/geneach}}]
}```"""
)

# generate a character
character_maker(
    id="e1f491f7-7ab8-4dac-8c20-c92b5e7d883d",
    description="A quick and nimble fighter.",
    valid_weapons=valid_weapons,
    llm=guidance_model,
    stream=False,
)
The following is a character profile for an RPG game in JSON format. ```json { "id": "e1f491f7-7ab8-4dac-8c20-c92b5e7d883d", "description": "A quick and nimble fighter.", "name": "Fighter", "age": 20, "armor": "leatherchainmailplateleather", "weapon": "sword", "class": "fighter", "mantra": "Death be to my enemies.", "strength": 10, "items": ["sword", "shield", "shortbow", "arrows", "leather armor"] }```
I’ve had to adjust this to add a stopping token to every open-ended utterance. I think this relates to the model’s inability to terminate generation correctly. It may be worth using the mosaicml/mpt-7b-instruct variant of the model.
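As a sketch of what that swap would look like, the instruct variant should load with the same arguments as the base model (I haven’t evaluated it here, so treat this as an assumption that it drops in cleanly):

# hypothetical drop-in: the instruct-tuned variant of the same model,
# loaded with the same arguments as mosaicml/mpt-7b above
instruct_model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct",
    init_device="cuda",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)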
Jsonformer Example
Now we can try the same with Jsonformer.
from jsonformer import Jsonformer
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

prompt = "Generate a person's information based on the following schema:"
jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()
print(generated_data)
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.11/lib/python3.11/site-packages/transformers/generation/utils.py:1259: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
RuntimeError: The expanded size of the tensor (50432) must match the existing size (50277) at non-singleton dimension 1. Target sizes: [1, 50432]. Tensor sizes: [50277]
The example has failed with a tensor size mismatch. I wonder why that is.
We can start by checking the size of the output that the model returns:
import torch
with torch.inference_mode():
    input_ids = tokenizer(
        "hello world",
        return_tensors="pt",
        return_attention_mask=False,
    ).input_ids
    input_ids = input_ids.to(model.device)
    output = model(input_ids)

output.logits.shape
torch.Size([1, 2, 50432])
We can see here that the model does return 50,432 separate values. I’ve reviewed the code of the OutputNumbersTokens processor that was being used at the time of the failure, and it gets the vocab size by taking the length of the tokenizer. The core assumption is that this vocab size matches the size of the final dimension, so this should return 50,432:
len(tokenizer)
50277
The tokenizer does not match the model, which is very strange.
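We can confirm the mismatch from the model configuration as well. A quick check, assuming the config exposes vocab_size (Hugging Face model configs generally do):

# the model is configured with a larger vocabulary than the tokenizer provides
print(model.config.vocab_size)  # 50432 for mosaicml/mpt-7b
print(len(tokenizer))           # 50277 for the gpt-neox-20b tokenizer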
I can get this working by adding junk tokens to the tokenizer until the size matches:
tokenizer.add_tokens([
    f"||xxxxxx-special-{id}-xxxxxx||"
    for id in range(50432 - 50277)
])
len(tokenizer)
50432
Now it should be possible to get the example working:
from jsonformer import Jsonformer
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

prompt = "Generate a person's information based on the following schema:"
jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()
print(generated_data)
{'name': 'John', 'age': 25.0, 'is_student': True, 'courses': ['Math']}
I think that the fault here lies with the model. The model can emit token ids that the tokenizer for the model cannot even decode!
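A defensive check like the following would catch the mismatch up front, before handing the model to a token-masking library. This is just a suggestion, not part of either library:

# fail fast if the tokenizer vocabulary does not cover the model's output dimension
assert len(tokenizer) == model.config.vocab_size, (
    f"tokenizer has {len(tokenizer)} tokens "
    f"but the model produces {model.config.vocab_size} logits"
)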
Extractive Task
We have got both tools working, but how do they compare? One way to compare them is to perform an extractive task, as that has a true value to check against.
To start with we can try extracting the speaker details from a passage.
= """
prompt I will provide you with a passage from a book. You will extract details
of the speaker from that passage and return them to me encoded as json
data.
PASSAGE:
Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world.
SPEAKER:
{"name": "Ishmael", "occupation": "sailor"}
PASSAGE:
My family have been prominent, well-to-do people in this Middle Western
city for three generations. The Carraways are something of a clan, and we
have a tradition that we’re descended from the Dukes of Buccleuch, but the
actual founder of my line was my grandfather’s brother, who came here in
fifty-one, sent a substitute to the Civil War, and started the wholesale
hardware business that my father carries on today.
SPEAKER:"""
import guidance
# we use LLaMA here, but any GPT-style model will do
guidance_model = guidance.llms.Transformers(
    model=model,
    tokenizer=tokenizer,
    device="cuda",
)

# define the prompt
speaker_describer = guidance(
    """{{prompt}}
{"name": "{{gen 'name' stop='"'}}", "occupation": "{{gen 'occupation' stop='"'}}"}"""
)

# extract the speaker details
speaker_describer(
    prompt=prompt,
    llm=guidance_model,
    stream=False,
)
I will provide you with a passage from a book. You will extract details of the speaker from that passage and return them to me encoded as json data. PASSAGE: Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. SPEAKER: {"name": "Ishmael", "occupation": "sailor"} PASSAGE: My family have been prominent, well-to-do people in this Middle Western city for three generations. The Carraways are something of a clan, and we have a tradition that we’re descended from the Dukes of Buccleuch, but the actual founder of my line was my grandfather’s brother, who came here in fifty-one, sent a substitute to the Civil War, and started the wholesale hardware business that my father carries on today. SPEAKER: {"name": "Mr. Carraway", "occupation": "hardware dealer"}
This is a great performance by guidance. It has correctly named the individual and made a reasonable guess at their occupation based on the information available. The passage comes from The Great Gatsby so we know that the name of the speaker is Nick Carraway, who was in the military and intends to study a new profession.
Let’s see how jsonformer performs.
from jsonformer import Jsonformer
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "occupation": {"type": "string"},
    }
}

jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()
print(generated_data)
{'name': 'Ishmael', 'occupation': 'sailor'}
This has failed. Rather unfortunately it has chosen to repeat the example that was provided.
We can try another utterance where we want to extract an unknown number of things.
= """
prompt I will provide you with a passage from a book. You will list all locations
mentioned in that passage and return them to me encoded as json data.
PASSAGE:
Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world.
SPEAKER:
{"locations": ["the shore", "the sea"]}
PASSAGE:
My family have been prominent, well-to-do people in this Middle Western
city for three generations. The Carraways are something of a clan, and we
have a tradition that we’re descended from the Dukes of Buccleuch, but the
actual founder of my line was my grandfather’s brother, who came here in
fifty-one, sent a substitute to the Civil War, and started the wholesale
hardware business that my father carries on today.
SPEAKER:"""
import guidance
# we use LLaMA here, but any GPT-style model will do
guidance_model = guidance.llms.Transformers(
    model=model,
    tokenizer=tokenizer,
    device="cuda",
)

# define the prompt
speaker_describer = guidance(
    """{{prompt}}
{"locations": [{{#geneach 'items' join=', '}}"{{gen 'this' stop='"'}}"{{/geneach}}]}"""
)

# extract the locations
speaker_describer(
    prompt=prompt,
    llm=guidance_model,
    stream=False,
)
I will provide you with a passage from a book. You will list all locations mentioned in that passage and return them to me encoded as json data. PASSAGE: Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. SPEAKER: {"locations": ["the shore", "the sea"]} PASSAGE: My family have been prominent, well-to-do people in this Middle Western city for three generations. The Carraways are something of a clan, and we have a tradition that we’re descended from the Dukes of Buccleuch, but the actual founder of my line was my grandfather’s brother, who came here in fifty-one, sent a substitute to the Civil War, and started the wholesale hardware business that my father carries on today. SPEAKER: {"locations": ["the city", "the Civil War", "the wholesale hardware business", "my father", "my grandfather", "my line", "my brother", "my grandfather’s brother", "fifty-one", "the Dukes of Buccleuch", "the Middle Western city", "the Dukes of Buccleuch", "the wholesale hardware business", "my father", "my grandfather", "my line", "my brother", "my grandfather’s brother", "fifty-one", "the Middle Western city", "the Civil War", "the wholesale hardware business", "my father", "my grandfather", "my line", "my brother", "my grandfather’s brother", "fifty-one", "the Middle Western city", "the Civil War", "the wholesale hardware business", "my father", "
KeyboardInterrupt:
I’ve had to stop this as it is endlessly generating the same responses. There is clearly a problem with getting the list generation to stop, so coming up with a variable number of entries is hard. We can try to get it to generate the underlying list itself using regex and stopping on the closing bracket.
import guidance
# we use LLaMA here, but any GPT-style model will do
guidance_model = guidance.llms.Transformers(
    model=model,
    tokenizer=tokenizer,
    device="cuda",
)

# define the prompt
speaker_describer = guidance(
    """{{prompt}}
{"locations": ["{{gen 'this' stop=']' pattern='[A-Za-z0-9 ]+(", "[A-Za-z0-9 ]+)*"]'}}]}"""
)

# extract the locations
speaker_describer(
    prompt=prompt,
    llm=guidance_model,
    stream=False,
)
I will provide you with a passage from a book. You will list all locations mentioned in that passage and return them to me encoded as json data. PASSAGE: Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. SPEAKER: {"locations": ["the shore", "the sea"]} PASSAGE: My family have been prominent, well-to-do people in this Middle Western city for three generations. The Carraways are something of a clan, and we have a tradition that we’re descended from the Dukes of Buccleuch, but the actual founder of my line was my grandfather’s brother, who came here in fifty-one, sent a substitute to the Civil War, and started the wholesale hardware business that my father carries on today. SPEAKER: {"locations": ["the city", "the Civil War"]}
This has done fairly well. It would’ve been nice if it could describe the city as the Middle Western city. Also the Civil War is not a location.
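For reference, the pattern in the template above only permits a comma-separated list of quoted alphanumeric strings followed by the closing bracket. We can sanity-check the same regular expression in plain Python (purely illustrative):

import re

# the pattern used in the guidance template above
pattern = re.compile(r'[A-Za-z0-9 ]+(", "[A-Za-z0-9 ]+)*"]')
print(bool(pattern.fullmatch('the city", "the Civil War"]')))  # True
print(bool(pattern.fullmatch('not "valid" json text')))        # False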
The use of the regular expression and stopping token worked quite well. Now it’s time to see how jsonformer fares.
from jsonformer import Jsonformer
json_schema = {
    "type": "object",
    "properties": {
        "locations": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()
print(generated_data)
{'locations': ['the shore', 'the sea']}
Once again this has repeated the example. I wonder if the jsonformer code itself is just extracting the example from the prompt and imagining that it was generated.
We could test this by capturing the raw output of generate and inspecting it afterwards:
original_generate = model.generate
generated_tokens = []

def counting_generate(*args, **kwargs):
    global generated_tokens
    response = original_generate(*args, **kwargs)
    generated_tokens.append(response)
    return response

model.generate = counting_generate
from jsonformer import Jsonformer
json_schema = {
    "type": "object",
    "properties": {
        "locations": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()

print("number of generations: ", len(generated_tokens))
for tokens in generated_tokens:
    text = tokenizer.decode(tokens[0])
    text = text.splitlines()[-1]
    print(text)
number of generations: 2
Result: {"locations": ["the shore",
Result: {"locations": ["the shore", "the sea"]
I’ve cut down the output to only show the final part. We can see that it is really generating this text, which is very odd. Remember that the model will produce the correct output if the prompt is right.
Is something being added that has broken the generation? Let’s see the full output for the last generation call:
Code
print(tokenizer.decode(generated_tokens[-1][0]))
I will provide you with a passage from a book. You will list all locations
mentioned in that passage and return them to me encoded as json data.
PASSAGE:
Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world.
SPEAKER:
{"locations": ["the shore", "the sea"]}
PASSAGE:
My family have been prominent, well-to-do people in this Middle Western
city for three generations. The Carraways are something of a clan, and we
have a tradition that we’re descended from the Dukes of Buccleuch, but the
actual founder of my line was my grandfather’s brother, who came here in
fifty-one, sent a substitute to the Civil War, and started the wholesale
hardware business that my father carries on today.
SPEAKER:
Output result in the following JSON schema format:
{"type": "object", "properties": {"locations": {"type": "array", "items": {"type": "string"}}}}
Result: {"locations": ["the shore", "the sea"]
Yes, the generator is adding the prompt:

Output result in the following JSON schema format:
{"type": "object", "properties": {"locations": {"type": "array", "items": {"type": "string"}}}}
Result:
It turns out that the Jsonformer class itself has added this, in a method called get_prompt:
Code
Jsonformer.get_prompt??
Signature: Jsonformer.get_prompt(self)
Docstring: <no docstring>
Source:
    def get_prompt(self):
        template = """{prompt}\nOutput result in the following JSON schema format:\n{schema}\nResult: {progress}"""
        progress = json.dumps(self.value)
        gen_marker_index = progress.find(f'"{self.generation_marker}"')
        if gen_marker_index != -1:
            progress = progress[:gen_marker_index]
        else:
            raise ValueError("Failed to find generation marker")
        prompt = template.format(
            prompt=self.prompt,
            schema=json.dumps(self.json_schema),
            progress=progress,
        )
        return prompt
File:      ~/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.11/lib/python3.11/site-packages/jsonformer/main.py
Type:      function
To fix this we can subclass Jsonformer itself and alter the get_prompt method to return only the original prompt and the current generation progress:
from jsonformer import Jsonformer
import json
class UnpromptedJsonformer(Jsonformer):
    def get_prompt(self):
        template = """{prompt}{progress}"""
        progress = json.dumps(self.value)
        gen_marker_index = progress.find(f'"{self.generation_marker}"')
        if gen_marker_index != -1:
            progress = progress[:gen_marker_index]
        else:
            raise ValueError("Failed to find generation marker")
        prompt = template.format(
            prompt=self.prompt,
            progress=progress,
        )
        return prompt
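One bit of housekeeping before rerunning: model.generate is still wrapped by counting_generate from the earlier experiment, so it is worth putting the original back. A small sketch using the names defined above:

# undo the monkey-patch from the counting experiment
model.generate = original_generate
generated_tokens.clear()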
With this we can try the two tasks again. Let’s start with The Great Gatsby speaker details:
Code
= """
prompt I will provide you with a passage from a book. You will extract details
of the speaker from that passage and return them to me encoded as json
data.
PASSAGE:
Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world.
SPEAKER:
{"name": "Ishmael", "occupation": "sailor"}
PASSAGE:
My family have been prominent, well-to-do people in this Middle Western
city for three generations. The Carraways are something of a clan, and we
have a tradition that we’re descended from the Dukes of Buccleuch, but the
actual founder of my line was my grandfather’s brother, who came here in
fifty-one, sent a substitute to the Civil War, and started the wholesale
hardware business that my father carries on today.
SPEAKER:
"""
= {
json_schema "type": "object",
"properties": {
"name": {"type": "string"},
"occupation": {"type": "string"},
}
}
= UnpromptedJsonformer(model, tokenizer, json_schema, prompt)
jsonformer = jsonformer()
generated_data
print(generated_data)
{'name': 'Mr. Carraway', 'occupation': 'hardware dealer'}
This now matches the output of guidance. The automatically added prompt was the problem!
Now it’s time to try location extraction again. This is the task that I thought Jsonformer would do well on, as one of its examples generates an array of values. The guidance examples use a fixed iteration count and, as you can see, I had to put some work in to get it to generate a variable-length list.
Code
= """
prompt I will provide you with a passage from a book. You will list all locations
mentioned in that passage and return them to me encoded as json data.
PASSAGE:
Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world.
SPEAKER:
{"locations": ["the shore", "the sea"]}
PASSAGE:
My family have been prominent, well-to-do people in this Middle Western
city for three generations. The Carraways are something of a clan, and we
have a tradition that we’re descended from the Dukes of Buccleuch, but the
actual founder of my line was my grandfather’s brother, who came here in
fifty-one, sent a substitute to the Civil War, and started the wholesale
hardware business that my father carries on today.
SPEAKER:
"""
= {
json_schema "type": "object",
"properties": {
"locations": {
"type": "array",
"items": {"type": "string"}
}
}
}
= UnpromptedJsonformer(model, tokenizer, json_schema, prompt)
jsonformer = jsonformer()
generated_data
print(generated_data)
{'locations': ['the city']}
This time it’s worked out slightly better, as it hasn’t extracted the Civil War as a location. Overall I would say that guidance has the edge over jsonformer, just because the code seems more resilient (even though the underlying problem lies with the model). Jsonformer would be better if you had more direct control over the prompt.
I think that prompting is so critical to the success of the task that automatically adding prompts is almost always a bad idea. It gets done because it makes a nice demo, but when it comes to applying these tools it can get in the way.