I’ve been interested in the CLIP model and its derivatives. I understand that DALL-E works by encoding the prompt into a space shared between text and images, and a decoder then generates the image from that encoding. Image captioning can be thought of as the reverse of that - encode the image into the shared space and then decode it as text.
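To make the shared-space idea concrete, here is a minimal sketch of scoring candidate captions against an image with CLIP, using the openai/clip-vit-base-patch32 checkpoint from transformers. This is only an illustration - the image path and the candidate captions are placeholders of my own.

Code
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path
captions = ["a bottle of beer on a table", "a flock of birds in the sky"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# image and text embeddings live in the same space, so the
# image-text similarity scores can be used to rank the captions
probabilities = outputs.logits_per_image.softmax(dim=-1)
for caption, probability in zip(captions, probabilities[0].tolist()):
    print(f"{probability:.3f}  {caption}")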
I have found a couple of models on huggingface for caption generation and I thought it would be nice to see how well they work.
Code
from pathlib import Path

IMAGE_FOLDER = Path("/data/openimages/external/train_0")
IMAGES = [
    image
    for index, image in enumerate(IMAGE_FOLDER.glob("*.jpg"))
    if index < 10
]

MODEL_NAME = "nlpconnect/vit-gpt2-image-captioning"
# MODEL_NAME = "ydshieh/vit-gpt2-coco-en"
MAXIMUM_SEQUENCE_LENGTH = 16
NUMBER_OF_BEAMS = 4
Of these two models, the nlpconnect model seems to be the more widely used; it has almost 10x the monthly downloads of the other. Both models appear to be based on some kind of ViT (vision transformer) encoder, which is encouraging, as CLIP is based on a ViT-B/32 Transformer architecture.
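One quick way to check this is to load the model configuration and look at the encoder and decoder types. A small sketch follows; the values in the comments are what I expect for this checkpoint rather than output captured here.

Code
from transformers import AutoConfig

config = AutoConfig.from_pretrained(MODEL_NAME)
# the checkpoint pairs a vision encoder with a text decoder
print(config.encoder.model_type)  # expected: vit
print(config.decoder.model_type)  # expected: gpt2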
The generate_caption helper has been adjusted to show the image alongside the caption. We can use it to see how well the captions work for the first 10 images from my openimages download.
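A minimal sketch of what such a helper can look like with the VisionEncoderDecoderModel API is below. It reuses MODEL_NAME, MAXIMUM_SEQUENCE_LENGTH and NUMBER_OF_BEAMS from the configuration above; the image-processor class and the display logic are my assumptions rather than the exact code used.

Code
from PIL import Image
from IPython.display import display
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained(MODEL_NAME)
image_processor = ViTImageProcessor.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def generate_caption(path: Path) -> None:
    image = Image.open(path)
    if image.mode != "RGB":
        image = image.convert("RGB")

    # encode the image into the ViT encoder's input space
    pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

    # decode a caption with beam search using the settings from above
    output_ids = model.generate(
        pixel_values,
        max_length=MAXIMUM_SEQUENCE_LENGTH,
        num_beams=NUMBER_OF_BEAMS,
    )
    caption = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

    display(image)
    print(caption)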
Code
for image in IMAGES:
    generate_caption(image)
a bottle of beer sitting on top of a wooden table
a baseball player pitching a ball on top of a field
a flock of birds flying through the air
a crowd of people sitting around a tent
a woman standing on top of a wooden table
a person laying on the ground with a pair of sandals
a close up view of a black and white car
a man and a woman smile as they pose for a picture
a green street sign sitting on the side of a road
boats are docked in the water
Overall I think that these captions vary in quality.
Caption Review
Several of the captions have problems worth reviewing:
Code
# hide_input
resize(Image.open(IMAGES[2]), 256)
This image has been captioned with a flock of birds flying through the air, which seems to be wrong. I think a better caption would be “a tree in blossom in front of flats”.
The problem here is likely the visually noisy nature of the image.
Code
# hide_input
resize(Image.open(IMAGES[5]), 256)
This image has been captioned with a person laying on the ground with a pair of sandals. You cannot see the person, and given the angle of the foot they would likely be standing.
The problem here appears to be insufficient content in the image: the model is inferring details that are not present.
Code
# hide_input
resize(Image.open(IMAGES[6]), 256)
This image has been captioned with a close up view of a black and white car. The image appears black and white because of the black tyre and steel hubcap; however, it is actually in color.
The problem here is the limited color range of the image.
Code
# hide_input
resize(Image.open(IMAGES[7]), 256)
This image has been captioned with a man and a woman smile as they pose for a picture. I particularly like this caption.
Overall the captions read well when they are accurate; however, they are not always accurate. The text generation seems like it can become detached from the observed elements of the image.
Maybe there is a more mechanical approach: identify the objects in the image, identify the relations between them, and then use some system to describe those. The current system reminds me of the summarizer models I have seen, which can hallucinate content from the training set when the input lacks content.
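As a rough sketch of the first step of that more mechanical approach, an off-the-shelf object detector can list the objects that are actually present, which any generated description would then have to stay grounded in. This is only an illustration; the facebook/detr-resnet-50 checkpoint, the helper name and the confidence threshold are my assumptions.

Code
from PIL import Image
from transformers import pipeline

# an off-the-shelf detector; the checkpoint choice is an assumption
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def detected_objects(path, threshold=0.9):
    """Return the labels of objects the detector is confident about."""
    detections = detector(Image.open(path))
    return sorted({
        detection["label"]
        for detection in detections
        if detection["score"] >= threshold
    })

# a caption could then be constrained to mention only these objects
for image in IMAGES:
    print(image.name, detected_objects(image))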