JINA + CLIP for Image and Text Search

Jina is a content-agnostic document index. Can I use it to index images with CLIP and then search for them?
Published

October 29, 2021

Jina

Jina is a framework composed of Documents, Executors and Flows: Flows compose Executors, and Executors alter Documents. Searching is a primary function of the framework, so using it to build a search engine seems appropriate.

Given that I have recently been working with CLIP it would be fun to try to make a searchable database of images.

Documents

The Document is the smallest unit of data in Jina. These can be composed into DocumentArrays, and there are various ways to load and persist them. Let’s have a look at the basic text document first.

Text Document

This is the simplest constructor; the content passed here is inferred to be plain text. The content field is a facade over the stored data, which is either text, a buffer or a binary blob.

Code
from jina import Document

text_document = Document(content="hello world")
text_document
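The facade behaviour can be sketched roughly like this. This is a toy class of my own, not Jina's actual implementation:

```python
class ToyDocument:
    """Hypothetical sketch of the content facade, not Jina's actual code."""

    def __init__(self, text=None, blob=None, buffer=None):
        self.text = text      # plain text content
        self.blob = blob      # e.g. decoded image data
        self.buffer = buffer  # raw bytes

    @property
    def content(self):
        # Return whichever storage field is populated.
        for value in (self.text, self.blob, self.buffer):
            if value is not None:
                return value
        return None

print(ToyDocument(text="hello world").content)  # hello world
print(ToyDocument().content)                    # None
```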

Image Document

It’s also possible to create a document by providing a URI. Here I’m creating an image document in this way, and it has managed to infer the mime type of it.

Code
from jina import Document

image_document = Document(uri="/data/openimages/external/train_0/000002b66c9c498e.jpg")
image_document

Does this constructor load the data?

Code
image_document.content is None
True

It hasn’t actually loaded the image data so the inference of the mime type must be through the filename only. I can test this.

Code
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as directory:
    file = Path(directory) / "image.png"
    file.write_bytes(b"") # nothing, just create the file
    display(Document(uri=str(file)))
    file.unlink()

This empty document was inferred to be a PNG image, so the mime type really does come from the filename alone. For our purposes this should be fine, though it could be a problem for URLs that only reveal their mime type when requested (instead of including an extension).
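That behaviour matches what the standard library’s mimetypes module does with a bare filename, which is presumably (my assumption) how the inference works:

```python
import mimetypes

# The guess is made purely from the file extension; no data is read.
print(mimetypes.guess_type("image.png")[0])  # image/png
print(mimetypes.guess_type("party.jpg")[0])  # image/jpeg
print(mimetypes.guess_type("notes.txt")[0])  # text/plain
```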

The image data can be loaded using one of the helper methods. This populates the blob field with the image data (accessible through content, as mentioned earlier).

Code
from jina import Document
from PIL import Image

def show_thumbnail(document: Document) -> None:
    assert document.blob is not None
    image = Image.fromarray(document.blob)
    image.thumbnail(size=(128,128), resample=Image.ANTIALIAS)
    display(image)
Code
image_document.convert_image_uri_to_blob()
show_thumbnail(image_document)

It’s that party image that I used in the last post.

The convert_image_uri_to_blob method can’t be invoked blindly - it only works if the uri has been set on the document.

Composite Document

You can create a document that contains sub documents. In this way you can combine different data types, or split a large document into smaller chunks.

Code
composite_document = Document()
composite_document.chunks.append(image_document)
composite_document.chunks.append(text_document)
composite_document

We can also export any document as json like this:

Code
# don't want to see the full image data
composite_document.chunks[0].pop("blob")

print(composite_document.json())
{
  "chunks": [
    {
      "granularity": 1,
      "id": "f4940514-3a54-11ec-9b07-8ff42447411c",
      "mime_type": "image/jpeg",
      "parent_id": "f9f884d4-3a54-11ec-9b07-8ff42447411c",
      "uri": "/data/openimages/external/train_0/000002b66c9c498e.jpg"
    },
    {
      "granularity": 1,
      "id": "f4940513-3a54-11ec-9b07-8ff42447411c",
      "mime_type": "text/plain",
      "parent_id": "f9f884d4-3a54-11ec-9b07-8ff42447411c",
      "text": "hello world"
    }
  ],
  "id": "f9f884d4-3a54-11ec-9b07-8ff42447411c"
}

Annoyingly, there does not seem to be a nice way to traverse the parent document and all of its chunks in one loop. This makes my executors slightly more involved.
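A small recursive generator would do it. This is my own sketch, assuming only that every document exposes a chunks list; the Node class is a hypothetical stand-in for a Jina Document:

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for a Jina Document with chunks."""
    name: str
    chunks: List["Node"] = field(default_factory=list)


def depth_first(document) -> Iterator:
    """Yield the document itself, then every chunk, recursively."""
    yield document
    for chunk in document.chunks:
        yield from depth_first(chunk)


root = Node("composite", [Node("image"), Node("text")])
print([node.name for node in depth_first(root)])  # ['composite', 'image', 'text']
```

The executors below inline the same pattern as a private _recurse method instead.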

Executors

Documents are only worthwhile if we can process them. The primary bit of processing I am interested in is the creation of the embedding. This is a vector describing the document that can be used for matching.

Load URI Executor

To achieve this I am creating two executors. The first will ensure that the content is loaded, as I like to create the image documents via the uri.

Code
# from src/main/python/blog/image_text_search/executors/load_data.py
from jina import Document, DocumentArray, Executor, requests


class LoadData(Executor):
    @requests
    def load_data(self, docs: DocumentArray, **kwargs) -> None:
        for document in docs:
            self._recurse(document)

    def _recurse(self, document: Document) -> None:
        self._load_data(document)
        for child_document in document.chunks:
            self._recurse(child_document)

    @staticmethod
    def _load_data(document: Document) -> None:
        if not document.mime_type:
            return
        if document.content is not None:
            return
        if not document.uri:
            return
        if document.mime_type == "text/plain":
            document.convert_uri_to_text()
        elif document.mime_type.startswith("image/"):
            document.convert_image_uri_to_blob()

The @requests annotated method is the entrypoint of the executor. It takes quite a few different arguments and can return many different things. In this case I am altering the documents in place, which is normally the easiest thing to do (see the documentation around this here).

This is complex just because of the need to recurse through the chunks. Doing this allows the executor to work with the composite document that was created earlier. This recursion does make it harder to return an updated document array.

If your document structures are more restricted then you could get away without it. I do think that flows perform some protective copying so that mutation of the documents can be restricted; we will see that later.

CLIP Embedding Executor

The next thing is to generate the embedding. This is where CLIP comes in. I want a single executor that can handle both images and text, as well as composite documents. Having such an executor reduces the number of times I am loading the model as well as making it easier to index and search.

Code
# from src/main/python/blog/image_text_search/executors/clip_embedding.py
from typing import Optional

import clip
import numpy as np
import torch
from jina import Document, DocumentArray, Executor, requests
from PIL import Image


class ClipEmbeddings(Executor):
    def __init__(self, model: str = "ViT-B/32", device: str = "cpu", **kwargs) -> None:
        super().__init__(**kwargs)
        self.model, self.preprocess = clip.load(model, device=device)
        self.model.eval()
        self.device = device

    @requests
    def add_embeddings(self, docs: DocumentArray, **kwargs) -> None:
        for document in docs:
            self._add_embedding(document)

    def _add_embedding(self, document: Document) -> None:
        embedding = self._embedding(document)
        if embedding is not None:
            document.embedding = embedding
        for child_document in document.chunks:
            self._add_embedding(child_document)

    def _embedding(self, document: Document) -> Optional[np.ndarray]:
        if not document.mime_type:
            return None
        if document.content is None:
            return None
        if document.mime_type == "text/plain":
            return self._text_embedding(document.content)
        if document.mime_type.startswith("image/"):
            return self._image_embedding(document.content)
        return None

    @torch.no_grad()
    def _text_embedding(self, text: str) -> np.ndarray:
        tokens = clip.tokenize(text).to(self.device)
        tensor = self.model.encode_text(tokens)[0]
        return tensor.cpu().numpy()

    @torch.no_grad()
    def _image_embedding(self, blob: np.ndarray) -> np.ndarray:
        image = Image.fromarray(blob)
        preprocessed_image = self.preprocess(image).unsqueeze(0).to(self.device)
        tensor = self.model.encode_image(preprocessed_image)[0]
        return tensor.cpu().numpy()

This has the same structure as the data loading executor. The one difference is that it can generate the specific data embeddings using the CLIP model.

Flows

Executors are composed into flows. A flow is a directed acyclic graph of operations to perform on a document. You can replicate executors to provide parallelism or sharding, all of which is well documented.

In this post I’m going to keep it simple.

Generate Embedding Flow

This flow will load the document data and then generate the embedding for it.

Code
from jina import Flow

flow = (
    Flow()
        .add(uses=LoadData)
        .add(uses=ClipEmbeddings)
)
flow

It’s nice and simple. The creation of the flow is very fast because none of the executors have been created yet. When the flow is used the executors are created.

To use a flow you have to load it using the with statement. Doing this loads all the executors and you can then pass documents through the flow.

Code
with flow:
    response = flow.index(
        inputs=composite_document,
        return_results=True,
    )
response[0].docs[0]
           Flow@217463[I]:🎉 Flow is ready to use!                                         
    🔗 Protocol:         GRPC
    🏠 Local access: 0.0.0.0:40949
    🔒 Private network:  192.168.1.54:40949
    🌐 Public address:   81.2.75.20:40949

You can see that an embedding field has turned up on the two child documents. The original documents are unaltered at this point:

Code
composite_document

The original document lacks the embedding or blob data. Mutating the documents in place is a lot more justifiable given that it doesn’t alter the source.

Using return_results is intended for debugging only, because the entire set of results has to be held in memory. The correct way to handle the results from a flow is to hook into one of the three callbacks available (on_done, on_error and on_always). These let you handle documents as they exit the flow, so their memory can be released.

Code
with flow:
    flow.index(
        inputs=composite_document,
        on_done=lambda response: display(response.docs[0]),
    )

Trying to display the response causes problems because it is a recursive data structure. The response wraps the documents and can hold any errors encountered during processing.

The current structure of the flow is nice because it means I can process images from the internet directly:

Code
with flow:
    flow.index(
        inputs=Document(uri="https://upload.wikimedia.org/wikipedia/commons/8/8b/08_Chevrolet_Malibu_LT_.jpg"),
        on_done=lambda response: show_thumbnail(response.docs[0]),
    )

Jina Hub

This all looks good but it’s quite low level stuff. What I need is for someone else to do the hard work of creating all the executors. Then I could load them and make a fancy search application with minimal effort. I like minimal effort.

The place that has all the executors is called Jina Hub and it allows you to load predefined executors. It’s also capable of installing the dependencies they require (which didn’t work for me, but that may be a pip thing). You can also run the executor in a docker container.

Let’s try to combine the executors we have with something that will index and search the embeddings. This would allow us to create a full image search engine.

Simple Indexer Flow

This Flow uses the SimpleIndexer, which just holds the documents in memory. Once they are indexed you can search them by passing in more documents. It’s a brute force search, so it is not appropriate for large datasets. It can write the index to disk, however I have not been able to trigger that reliably.
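Conceptually the indexer is doing nothing more than a brute force scan over the stored embeddings. A minimal sketch of that idea follows; the class and method names here are mine, not the SimpleIndexer API:

```python
import math
from typing import Dict, List, Tuple


class BruteForceIndex:
    """Toy in-memory index: a sketch of the idea, not the SimpleIndexer API."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}

    def index(self, doc_id: str, embedding: List[float]) -> None:
        self._store[doc_id] = embedding

    def search(self, query: List[float], top_k: int = 5) -> List[Tuple[str, float]]:
        def distance(vector: List[float]) -> float:
            # Cosine distance: 1 - cosine similarity, lower is better.
            dot = sum(q * v for q, v in zip(query, vector))
            norms = math.sqrt(sum(q * q for q in query)) * math.sqrt(
                sum(v * v for v in vector)
            )
            return 1 - dot / norms

        scored = [(doc_id, distance(v)) for doc_id, v in self._store.items()]
        return sorted(scored, key=lambda pair: pair[1])[:top_k]


index = BruteForceIndex()
index.index("party", [1.0, 0.0])
index.index("building", [0.0, 1.0])
print(index.search([0.1, 1.0], top_k=1))  # building is closest
```

Every search touches every stored embedding, which is why this approach does not scale to large datasets.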

Code
from jina import Flow

flow = (
    Flow()
        .add(uses=LoadData)
        .add(uses=ClipEmbeddings)
        .add(
            uses="jinahub+docker://SimpleIndexer",
            uses_metas={'workspace': 'image-search'},
        )
)
flow

Now we can index some documents and then perform a search.

Code
from jina import Document

image_documents = [
    Document(uri=f"/data/openimages/external/train_0/{name}")
    for name in [
        "000002b66c9c498e.jpg",
        "000002b97e5471a0.jpg",
        "000002c707c9895e.jpg",
        "0000048549557964.jpg",
        "000004f4400f6ec5.jpg",
        "0000071d71a0a6f6.jpg",
        "000013ba71c12506.jpg",
        "000018acd19b4ad3.jpg",
        "00001bc2c4027449.jpg",
        "00001bcc92282a38.jpg",
    ]
]

The first step is to index these documents. SimpleIndexer will add them to an in-memory array when I invoke the flow with index. Then, when the flow is invoked with search, it will match the indexed documents against the current document.

Since the index is held in memory, both steps have to be done within the same flow context. Running this produces quite a lot of output, and SimpleIndexer doesn’t seem to be the highest quality code. Given that it is totally unsuitable for production use, that is fine.

Code
#collapse_output
from PIL import Image

with flow:
    flow.index(inputs=image_documents)
    
    results = flow.search(
        inputs=Document(text="a photo of a building"),
        return_results=True,
    )
(output collapsed: the SimpleIndexer container prints an ASCII-art Jina banner followed by its full launch configuration, then logs `Executor SimpleIndexer started`, warns that the installed Jina 2.1.13 is behind the available 2.2.1, reports that the flow is ready, and raises two UserWarnings about the executor’s module layout.)
Code
def show_matches(responses) -> None:
    for response in responses:
        for document in response.docs:
            display(document)
            for match in document.matches[:5]:
                display({key: value.value for key, value in match.scores.items()})
                show_thumbnail(match)
Code
show_matches(results)
{'cosine': 0.7256965637207031}

{'cosine': 0.7519792914390564}

{'cosine': 0.7634004354476929}

{'cosine': 0.8040722608566284}

{'cosine': 0.8224889039993286}

Here you can see the document that is returned as the result. It links to all of the matched documents in the index, which are stored in the matches attribute. By iterating through the top 5 matches we can see the images along with their scores. What surprises me is that the top result has the lowest score. The top result is a strong match for the query (a photo of a building) though, so it is working.

The document match documentation states that matches are always returned from lowest to highest score, so maybe the cosine similarity has been mapped such that a perfect match is 0 and it increases from there. The cosine function is implemented here as \(1 - \text{cosine similarity}\), which means it ranges from 0 (perfect match) to 2 (perfect mismatch).
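A quick sanity check of that range, computed directly from the formula with a plain Python helper of my own:

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # 0.0 -> perfect match
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0 -> orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0 -> perfect mismatch
```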

That does mean that these results are not incredibly strong matches. CLIP is known to be sensitive to the prompt.

Conclusion

This concludes the initial investigation of Jina. I think that it has done well so far.

To make this into a “production” system I would need a more reliable indexer and to shard the index. There are several faiss executors available in the hub, so it shouldn’t be too hard to do.

The initial results of the image search suggest that this should all work. The space and compute required are the real limitations.