OpenAI CLIP Evaluation

Checking out a new model from OpenAI
Published: January 7, 2021

OpenAI have released a new model called CLIP (blog post) which combines image processing with natural language. It is a classification model that can classify images against labels it has never seen before, where the labels are potential descriptions of the image.

The model was trained on images from the internet along with their associated descriptions. It has to pick the correct description for each image from a large number of candidates, and to do that it has to learn visual concepts that correlate with the correct description.
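
At inference time this becomes zero-shot classification: embed the image and every candidate description into the same space, then pick the description whose embedding is most similar to the image embedding. As a rough sketch of the idea (just the shape of the computation, not the actual CLIP internals), with image_embedding and text_embeddings standing in for the outputs of the image and text encoders:

Code
import numpy as np

def zero_shot_probabilities(image_embedding, text_embeddings):
    # normalise so that dot products become cosine similarities
    image_embedding = image_embedding / np.linalg.norm(image_embedding)
    text_embeddings = text_embeddings / np.linalg.norm(
        text_embeddings, axis=1, keepdims=True
    )

    # one similarity score per candidate description
    similarities = text_embeddings @ image_embedding

    # softmax over the candidates turns the scores into probabilities
    scores = np.exp(similarities - similarities.max())
    return scores / scores.sum()

# made-up embeddings, just to show the shapes involved
rng = np.random.default_rng(0)
zero_shot_probabilities(rng.normal(size=512), rng.normal(size=(3, 512)))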

The blog post and results seem really impressive, so I wanted to try it out myself.

Code
from pathlib import Path
from PIL import Image

picture = Image.open(Path("cat.jpg"))
labels = ["a diagram", "a dog", "a cat"]

picture

Code
import torch
import blog.clip.clip as clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# load the pretrained vision transformer variant of CLIP along with
# the image preprocessing pipeline it expects
model, transform = clip.load("ViT-B/32", device=device)

# preprocess the image into a batch of one and tokenize the candidate labels
image = transform(picture).unsqueeze(0).to(device)
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # individual embeddings (not used below, the forward pass recomputes them)
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # similarity logits between the image and each candidate label
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

pretty_probabilities = "\n".join(
    f"{label:>16s}: {100 * probability:>5.2f}%"
    for label, probability in sorted(
        zip(labels, probs[0]),
        key=lambda pair: pair[1],
        reverse=True
    )
)
print(pretty_probabilities)
ModuleNotFoundError: No module named 'torchvision'

So it’s a great start.
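
The local copy of the CLIP code depends on torchvision, which evidently isn't installed in this environment. Presumably installing it into the notebook environment would fix that, something like:

Code
!pip install torchvision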


What I’m interested in now is how well it handles multi-label classification. It’s getting late though, so I’ll have to try that out another day.