Image Caption Generation

How well do image captioning models work?
Published

June 17, 2022

I’ve been interested in the CLIP model and its derivatives. My understanding is that DALL-E encodes the prompt into a shared embedding space, and a decoder then generates the image from that encoding. Image captioning can be thought of as the reverse of that - encode the image into a shared space and then decode it as text.
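
To make the “shared space” idea concrete, here is a minimal sketch using the CLIP model from Hugging Face. This is just an illustration and is not used anywhere else in this post; the image path is a placeholder.

Code
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# placeholder path - any image will do
image = Image.open("example.jpg").convert("RGB")
inputs = clip_processor(
    text=["a bottle of beer on a table", "a flock of birds"],
    images=image,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = clip(**inputs)

# the image and both captions are now vectors in the same embedding space,
# so their similarity is just a scaled dot product
print(outputs.image_embeds.shape, outputs.text_embeds.shape)
print(outputs.logits_per_image.softmax(dim=-1))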

I have found a couple of caption generation models on Hugging Face and thought it would be interesting to see how well they work.

Code
from itertools import islice
from pathlib import Path

IMAGE_FOLDER = Path("/data/openimages/external/train_0")
# take the first 10 jpg files from the Open Images download
IMAGES = list(islice(IMAGE_FOLDER.glob("*.jpg"), 10))

MODEL_NAME = "nlpconnect/vit-gpt2-image-captioning"
# MODEL_NAME = "ydshieh/vit-gpt2-coco-en"

MAXIMUM_SEQUENCE_LENGTH = 16
NUMBER_OF_BEAMS = 4

Of these two models the nlpconnect one seems to be the more widely used, with almost 10x the monthly downloads of the other. Both appear to use a ViT (Vision Transformer) as the image encoder, which is encouraging, as CLIP is also based on a ViT-B/32 Transformer architecture.

The following code is based on the model card.

Code
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer

# ViT encoder + GPT-2 decoder trained for image captioning
model = VisionEncoderDecoderModel.from_pretrained(MODEL_NAME)
model.cuda()
model.eval()

feature_extractor = ViTFeatureExtractor.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
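
As a quick check on that claim, the encoder and decoder of the loaded model can be inspected directly. This is a small addition of my own rather than something from the model card.

Code
print(type(model.encoder).__name__, type(model.decoder).__name__)
print(model.config.encoder.model_type, model.config.decoder.model_type)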
Code
from pathlib import Path
from PIL import Image


def generate_caption(
    path: Path,
    max_length: int = MAXIMUM_SEQUENCE_LENGTH,
    num_beams: int = NUMBER_OF_BEAMS,
) -> None:
    """Show the image alongside the caption that the model generates for it."""
    image = Image.open(path)
    if image.mode != "RGB":
        # the feature extractor expects 3-channel RGB input
        image = image.convert(mode="RGB")

    # preprocess the image into the pixel tensor the ViT encoder expects
    pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(model.device)
    # beam search over the GPT-2 decoder produces the caption tokens
    output_ids = model.generate(
        pixel_values,
        max_length=max_length,
        num_beams=num_beams,
    )

    captions = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    caption = captions[0].strip()

    # display is provided by IPython when running in a notebook
    display(resize(image, 256))
    print(caption)


def resize(image: Image.Image, max_width: int) -> Image.Image:
    width, height = image.size
    if width < max_width:
        return image
    ratio = max_width / width
    return image.resize((max_width, int(height * ratio)))

Image Captioning

The model card code has been adjusted to show the image alongside its caption. We can use this to see how well the captions match the first 10 images from my Open Images download.

Code
for image in IMAGES:
    generate_caption(image)

a bottle of beer sitting on top of a wooden table

a baseball player pitching a ball on top of a field

a flock of birds flying through the air

a crowd of people sitting around a tent

a woman standing on top of a wooden table

a person laying on the ground with a pair of sandals

a close up view of a black and white car

a man and a woman smile as they pose for a picture

a green street sign sitting on the side of a road

boats are docked in the water

Overall I think that these captions vary in quality.

Caption Review

A few of the captions have problems:

Code
# hide_input
resize(Image.open(IMAGES[2]), 256)

This image has been captioned with “a flock of birds flying through the air”, which seems to be wrong. I think a better caption would be “a tree in blossom in front of flats”.

The problem here is likely the visually noisy nature of the image.

Code
# hide_input
resize(Image.open(IMAGES[5]), 256)

This image has been captioned with “a person laying on the ground with a pair of sandals”. You cannot see the person, and given the angle of the foot they would likely be standing.

The problem here appears to be that there is not enough content in the image, so the model infers details that are not actually present.

Code
# hide_input
resize(Image.open(IMAGES[6]), 256)

This image has been captioned with “a close up view of a black and white car”. The image appears black and white because of the black tyre and steel hubcap; however, it is actually in color.

The problem here is the limited color range of the image.

Code
# hide_input
resize(Image.open(IMAGES[7]), 256)

This image has been captioned with “a man and a woman smile as they pose for a picture”. I particularly like this caption.

Overall the captions read well when they are accurate, but they are not always accurate. The text generation seems able to become detached from what is actually visible in the image.

Maybe there is a more mechanical approach: identify the objects in the image, identify the relations between them, and then use some system to describe those. The current approach reminds me of the summarization models I have seen, which can hallucinate content from the training set when the input lacks it.
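
As a very rough sketch of that more mechanical approach, the following (untested) code uses a pretrained DETR object detector from Hugging Face and strings the detected labels into a crude description. The checkpoint name and the describe_objects helper are assumptions for illustration, not part of the captioning models reviewed above, and it makes no attempt at the relation-identification step.

Code
import torch
from pathlib import Path
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

# assumed checkpoint - any object detection model would do for this sketch
detector = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
detector_processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")


def describe_objects(path: Path, threshold: float = 0.9) -> str:
    """Describe an image by listing the objects detected in it."""
    image = Image.open(path).convert("RGB")
    inputs = detector_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)
    # convert the raw logits and boxes into labelled detections above the threshold
    target_sizes = torch.tensor([image.size[::-1]])
    detections = detector_processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    labels = [detector.config.id2label[label.item()] for label in detections["labels"]]
    if not labels:
        return "nothing detected"
    # a real system would also describe the relations between the objects;
    # here the "description" is just the sorted set of labels
    return "a picture containing " + ", ".join(sorted(set(labels)))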