Huggingface Agents

Checking out the new composition of deep learning models by Huggingface
Published May 13, 2023

Huggingface has recently released Huggingface Agents, which provides a natural language API on top of a set of tools. The example code invokes models hosted on huggingface itself. I like playing around with this stuff, and it would be nice to get it running locally.

Preamble

The example provides several different models that can be used. It also mentions that you need to log in to huggingface. Since that requires your access token I have this SUPER SECRET way of loading my API key without showing it in this blog.

I’ve created a file at ~/.config/huggingface/auth.json which has the access token in it!
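
For reference, the file is just a tiny JSON object with the token stored under an agent key. Something like this would create it (the token value here is a placeholder, substitute a real access token from your huggingface settings):

import json
from pathlib import Path

# write a placeholder credential file; "hf_xxx..." stands in for a real token
credential_file = Path.home() / ".config" / "huggingface" / "auth.json"
credential_file.parent.mkdir(parents=True, exist_ok=True)
credential_file.write_text(json.dumps({"agent": "hf_xxxxxxxxxxxxxxxx"}))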

Code
from pathlib import Path
import json
from huggingface_hub import login

CREDENTIAL_FILE = Path.home() / ".config" / "huggingface" / "auth.json"
API_KEY = json.loads(CREDENTIAL_FILE.read_text())["agent"]

login(API_KEY)
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid.
Your token has been saved to /home/matthew/.cache/huggingface/token
Login successful

You can see that the token has been saved to a local file. I had logged in earlier, so the token was already cached and the agent requests were already working. I like these blog posts to be complete though, so I've included the login step here.
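
If you want to check whether a token is already cached without logging in again, the same HfFolder helper that the agent uses internally can read it back:

from huggingface_hub import HfFolder

# reads the cached token (the file the login saved above) without a network call
print(HfFolder().get_token() is not None)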

The other bit of setup is turning off the logging messages. CLIP is quite noisy about being updated, so a lot of messages would otherwise be shown.

Code
import logging

logging.getLogger().setLevel(logging.CRITICAL)

Example Usage

Let’s try out their sample query with one of the possible models. They specifically call out quality issues with these models:

StarCoder and OpenAssistant are free to use and perform admirably well on simple tasks. However, the checkpoints don’t hold up when handling more complex prompts. If you’re facing such an issue, we recommend trying out the OpenAI model which, while sadly not open-source, performs better at this given time.

I am not going to use OpenAI. This is an example of using the OpenAssistant model:

from transformers import HfAgent

agent = HfAgent(
    url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
)
agent.run("Draw me a picture of rivers and lakes.")
ValueError: not enough values to unpack (expected 2, got 1)

The model has apparently not produced output of the correct format. Let’s use the debugger to inspect what was actually produced.

%pdb
Automatic pdb calling has been turned ON
from transformers import HfAgent

agent = HfAgent(
    url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
)
agent.run("Draw me a picture of rivers and lakes.")
    168 def clean_code_for_run(result):
    169     result = f"I will use the following {result}"
--> 170     explanation, code = result.split("Answer:")
    171     explanation = explanation.strip()
    172     code = code.strip()

ipdb>  result

'I will use the following  tools:
`image_segmenter` to segment the image, 
`image_qa` to answer the question about the image, 
`document_captioner` to generate a caption,
`image_qa` to answer the question about the image,
`image_transform` to transform the image,
`image_segmenter` to segment the image,
`_qa` to answer,
`text_reader` to read the text,
`summarizer` to summarize the text,
`document_qa` to answer the question about the question about the image,
`image_generator` to generate an image according to caption,
`image_qa` to answer,
`text_to_downloader` to download the image,
`imageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimageimage`

ipdb>  c

You can see that the model wants to use multiple unrelated tools and then endlessly repeats the word image. This is broken. We can try the starcoder (Li et al. 2023) alternative for this query.

Li, Raymond, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, et al. 2023. “StarCoder: May the Source Be with You!” https://arxiv.org/abs/2305.06161.
from transformers import HfAgent

agent = HfAgent(
    url_endpoint="https://api-inference.huggingface.co/models/bigcode/starcoder"
)
agent.run("Draw me a picture of rivers and lakes.")
==Explanation from the agent==
I will use the following  tool: `image_segmenter` to create a segmentation mask of rivers and lakes.


==Code generated by the agent==
mask = image_segmenter(image, label="rivers and lakes")


==Result==
Evaluation of the code stopped at line 0 before the end because of the following error:
The variable `image` is not defined.

Now it has produced output of the correct shape, but the code the model wrote relies on a variable which does not yet exist. Since it can produce the correct style of output, I feel it’s worth trying again with one of the other prompts from the examples:

image = agent.run(
    "Draw me a picture of the sea then transform the picture to add an island"
)
image.save("island.jpg")
==Explanation from the agent==
I will use the following  tools: `image_generator` to generate an image, then `image_transformer` to transform the image.


==Code generated by the agent==
image = image_generator(prompt="draw me a picture of the sea")
image = image_transformer(image=image, prompt="add an island")


==Result==

the picture of the island

The quality of this is quite varied, which is true for most generated art. This version is ok. I like how the alteration is being done by a hand with a pencil, meta!

More broadly we can see how this works (or at least, how it reports that it works). The model that we interact with on huggingface is prompted to produce a selection of tools and code that uses them. The code is then executed and the result is returned.
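
Putting that together, a minimal sketch of the generate-parse-evaluate loop looks something like this. It mirrors the shape of what we have observed (the prefix, the split on Answer:, and the restricted evaluation) rather than the actual transformers implementation; the image_generator lambda here is just a stand-in:

# a minimal sketch of the generate -> parse -> evaluate loop; the real
# implementation lives in transformers.tools, this just mirrors the shape
generated = (
    " tools: `image_generator` to generate an image.\n"
    "Answer:\n"
    "image = image_generator(prompt='rivers and lakes')"
)

# the agent prefixes the generation before splitting, which is why every
# explanation above starts with "I will use the following"
explanation, code = f"I will use the following {generated}".split("Answer:")
print(explanation.strip())

# evaluation only exposes the whitelisted tools, which is why the made-up
# `image_drawer` call was rejected earlier
allowed_tools = {"image_generator": lambda prompt: f"<image of {prompt}>"}
exec(code.strip(), {"__builtins__": {}}, allowed_tools)
print(allowed_tools["image"])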

The next thing is to find the prompt that was provided to the model. We can get that if we just look at the agent:

Code
agent??
Type:        HfAgent
String form: <transformers.tools.agents.HfAgent object at 0x7f1d3c556b90>
File:        ~/.local/share/virtualenvs/blog-1tuLwbZm/lib/python3.10/site-packages/transformers/tools/agents.py
Source:     
class HfAgent(Agent):
    """
    Agent that uses and inference endpoint to generate code.
    Args:
        url_endpoint (`str`):
            The name of the url endpoint to use.
        token (`str`, *optional*):
            The token to use as HTTP bearer authorization for remote files. If unset, will use the token generated when
            running `huggingface-cli login` (stored in `~/.huggingface`).
        chat_prompt_template (`str`, *optional*):
            Pass along your own prompt if you want to override the default template for the `chat` method.
        run_prompt_template (`str`, *optional*):
            Pass along your own prompt if you want to override the default template for the `run` method.
        additional_tools ([`Tool`], list of tools or dictionary with tool values, *optional*):
            Any additional tools to include on top of the default ones. If you pass along a tool with the same name as
            one of the default tools, that default tool will be overridden.
    Example:
    ```py
    from transformers import HfAgent
    agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")
    agent.run("Is the following `text` (in Spanish) positive or negative?", text="¡Este es un API muy agradable!")
    ```
    """
    def __init__(
        self, url_endpoint, token=None, chat_prompt_template=None, run_prompt_template=None, additional_tools=None
    ):
        self.url_endpoint = url_endpoint
        if token is None:
            self.token = f"Bearer {HfFolder().get_token()}"
        elif token.startswith("Bearer") or token.startswith("Basic"):
            self.token = token
        else:
            self.token = f"Bearer {token}"
        super().__init__(
            chat_prompt_template=chat_prompt_template,
            run_prompt_template=run_prompt_template,
            additional_tools=additional_tools,
        )
    def generate_one(self, prompt, stop):
        headers = {"Authorization": self.token}
        inputs = {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 200, "return_full_text": False, "stop": stop},
        }
        response = requests.post(self.url_endpoint, json=inputs, headers=headers)
        if response.status_code == 429:
            print("Getting rate-limited, waiting a tiny bit before trying again.")
            time.sleep(1)
            return self._generate_one(prompt)
        elif response.status_code != 200:
            raise ValueError(f"Error {response.status_code}: {response.json()}")
        result = response.json()[0]["generated_text"]
        # Inference API returns the stop sequence
        for stop_seq in stop:
            if result.endswith(stop_seq):
                result = result[: -len(stop_seq)]
        return result

The model-specific part of this agent is just one method, and that takes a prompt argument. It looks like we could patch that method to capture the prompt.

Code
import textwrap
from pathlib import Path

provided_prompt = None
provided_stop = None

def patched_generate_one(prompt, stop):
    global provided_prompt, provided_stop
    provided_prompt = prompt
    provided_stop = stop
    raise Exception()

agent.generate_one = patched_generate_one

try:
    agent.run(
        "Draw me a picture of the sea then transform the picture to add an island"
    )
except:
    # we just threw to stop generation
    pass

Path("prompt.txt").write_text(
    "\n".join(
        line
        for paragraph in provided_prompt.split("\n")
        for line in textwrap.wrap(paragraph)
    )
) ; None

The prompt is quite long and you can read it here.

It is very clear and split into three sections. The first describes the overall task that the model has to perform (identify tools then use them in python code). A complete list of tools is then provided. Finally there are some examples of correct inputs and outputs.

What’s interesting to me is that this is a continuation style prompt, where the model is asked for the natural continuation of the text. This is in contrast to instruction finetuned models, where you can direct them to perform a task without including the start of the task as a primer.
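
To make that concrete, here is a heavily paraphrased skeleton of the run prompt. Only the Task / "I will use the following" / Answer: structure is taken from what we saw in the debugger and the stop sequence; the rest is placeholder text:

# an illustrative skeleton only; read the captured prompt for the real template
prompt_skeleton = '''<instructions: pick tools from the list below, then write python code that uses them>

<list of tools, one line of description each>

Task: "<an example request>"
I will use the following tools: <example explanation>
Answer:
<example code>

Task: "Draw me a picture of the sea then transform the picture to add an island"
I will use the following'''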

When I look at the model I can see that it explicitly calls out this trait:

The model was trained on GitHub code. As such it is not an instruction model and commands like “Write a function that computes the square root.” do not work well. However, by using the Tech Assistant prompt you can turn it into a capable technical assistant.

The Tech Assistant prompt mentioned is a dataset used to refine the model. It has labelled inputs that form a dialog.

What is more interesting to me is that this is a 15.5B parameter model. That’s large and would be infeasible to run in full precision on my machine. I might be able to quantize it to int8 to get it down to about 15.5G though.
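
A quick back-of-the-envelope calculation of the weight memory alone (ignoring activations and any overhead) shows where the 15.5G figure comes from:

params = 15.5e9  # starcoder parameter count

# bytes per parameter at each precision, weights only
for precision, nbytes in {"float32": 4, "float16": 2, "int8": 1}.items():
    print(f"{precision}: {params * nbytes / 1e9:.1f} GB")
# float32: 62.0 GB, float16: 31.0 GB, int8: 15.5 GB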

As an aside I do want to call out the license agreement that you have to adhere to. I think it’s excellent, and while it is worth reading in its entirety this final clause stood out:

  1. [You may not use this model] For fully automated decision making in administration of justice, law enforcement, immigration or asylum processes.

Anyway, before downloading this model it might be worth looking at both starcoderbase and santacoder (santacoder is a much smaller model trained on the Python, Java and JavaScript subsets of The Stack).

from transformers import HfAgent

agent = HfAgent(
    url_endpoint="https://api-inference.huggingface.co/models/bigcode/santacoder"
)
agent.run("Draw me a picture of rivers and lakes.")
ValueError: Error 422: {'error': 'Input validation error: `inputs` tokens + `max_new_tokens` must be <= 1512. Given: 1549 `inputs` tokens and 200 `max_new_tokens`', 'error_type': 'validation'}

The large prompt we saw before is our downfall here. Unfortunately it’s just too big for this model. The starcoder endpoint accepts up to 8k tokens, which is a lot, while this one allows only 1,512.
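
We can sanity-check this by tokenizing the prompt we captured earlier with the patched agent. The count won’t match the error exactly (the task text differs and it was captured from the starcoder agent), but it shows how little headroom there is:

from transformers import AutoTokenizer

# count the tokens in the captured run prompt with santacoder's tokenizer;
# the error above reported 1549 input tokens for the rivers-and-lakes task
tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
print(len(tokenizer(provided_prompt).input_ids))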

One way to reduce this would be to cut some of the tools that the agent can use. This works because the tools are held in a dictionary on the agent and their descriptions are interpolated into the prompt. If I delete some of them then there should be enough space to run the model.

from transformers import HfAgent

agent = HfAgent(
    url_endpoint="https://api-inference.huggingface.co/models/bigcode/santacoder"
)
for tool in [
    "document_qa",
    "image_captioner",
    "image_qa",
    "transcriber",
    "summarizer",
    "text_classifier",
    "text_qa",
    "text_reader",
    "translator",
    "text_downloader",
]:
    del agent._toolbox[tool]

agent.run("Draw me a picture of rivers and lakes.")
==Explanation from the agent==
I will use the following  tools: `image_drawer` to draw the image.


==Code generated by the agent==
image = image_drawer(image)


==Result==
Evaluation of the code stopped at line 0 before the end because of the following error:
It is not permitted to evaluate other functions than the provided tools (tried to execute image_drawer).

To get this to execute I had to delete 10 of the 14 available tools, because there is a second restriction that the input length alone must not exceed 1024 tokens. Deleting these tools means that the model doesn’t fully understand the task it is being asked to perform: the examples of correct behaviour reference the now-missing tools, which may be why the agent made up a function to call.
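
For reference, we can list the four tools that survive the cull by inspecting the same private dictionary:

# the toolbox entries that remain after deleting the ten above
print(sorted(agent._toolbox))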

from transformers import HfAgent

agent = HfAgent(
    url_endpoint="https://api-inference.huggingface.co/models/bigcode/starcoderbase"
)
image = agent.run("Draw me a picture of rivers and lakes.")
image.save("rivers-and-lakes.jpg")
==Explanation from the agent==
I will use the following  tool: `image_generator` to generate an image.


==Code generated by the agent==
image = image_generator(prompt="rivers and lakes")


==Result==

rivers and lakes

This is a nice image. It seems that the starcoderbase model was able to correctly interpret the first request. Looking at the model page I am not sure how this model differs from the full starcoder model - it’s still 15.5B parameters, and looking at the files suggests that is not a typo.

Local Agent

If I want to run this agent along with all of the tools that it might invoke then I have a problem. I can try to load the starcoder model in int8 precision, as that should work on my machine and still have a large enough input window to hold the full prompt.

Code
import torch
from transformers import Agent, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

class LocalAgent(Agent):
    def __init__(
        self,
        model_name: str,
        chat_prompt_template=None,
        run_prompt_template=None,
        additional_tools=None,
        device_map: str = "auto",
        load_in_8bit: bool = True,
        llm_int8_enable_fp32_cpu_offload: bool = True,
        torch_dtype: torch.dtype = torch.float16,
        max_new_tokens: int = 200,
        **generation_kwargs
    ):
        super().__init__(
            chat_prompt_template=chat_prompt_template,
            run_prompt_template=run_prompt_template,
            additional_tools=additional_tools,
        )
        quantization_config = BitsAndBytesConfig(
            load_in_8bit=load_in_8bit,
            llm_int8_enable_fp32_cpu_offload=llm_int8_enable_fp32_cpu_offload,
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map=device_map,
            quantization_config=quantization_config,
            torch_dtype=torch_dtype,
            resume_download=True,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.generation_kwargs = {
            "max_new_tokens": max_new_tokens,
        } | generation_kwargs
            
    @torch.inference_mode()
    def generate_one(self, prompt, stop):
        inputs = self.tokenizer(prompt, return_tensors="pt")
        inputs.to(self.model.device)

        output = self.model.generate(
            **inputs,
            **self.generation_kwargs,
        )
        generated_tokens = output[0, inputs.input_ids.shape[1]:]
        result = self.tokenizer.decode(generated_tokens)

        # the model will generate further "tasks" following the conclusion of this one
        # the stop list contains 'Task:' and we can use that to remove the subsequent tasks
        # ideally we would adjust generation to stop after the first stop_seq was issued
        for stop_seq in stop:
            result = result.split(stop_seq)[0]

        return result
agent = LocalAgent(model_name="bigcode/starcoder")
image = agent.run("Draw me a picture of an astronaut")
image.save("local-astronaut.jpg")
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
==Explanation from the agent==
I will use the following  tool: `image_generator` to generate an image.


==Code generated by the agent==
image = image_generator(prompt="astronaut")


==Result==
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.

astronaut

This works and it’s a nice point to wrap up. Loading the local model takes a few minutes and I have to be very careful about the available GPU memory. Quantizing the model once and saving it to disk would be a good way to speed this up. I can see why they built this around a remote API though.