```python
from jina import Document

text_document = Document(content="hello world")
text_document
```
October 29, 2021
Jina is a framework that is composed of Documents, Executors and Flows. Flows compose Executors and Executors alter Documents. Searching is a primary function of the framework, so building a search engine with it seems an appropriate way to try it out.
Given that I have recently been working with CLIP it would be fun to try to make a searchable database of images.
The Document is the smallest unit of data in Jina. These can be composed into DocumentArrays, and there are various ways to load and persist them. Let’s have a look at the basic text document first.
The snippet at the top of this post is the simplest constructor; the content of the document is inferred to be plain text. The content field is a facade over the stored data, which is either text, a buffer or a binary blob.
It’s also possible to create a document by providing a URI. Here I’m creating an image document in this way, and Jina has managed to infer its mime type.
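In code that is something like this (the filename here is just a placeholder, not the actual file I used):

```python
from jina import Document

# hypothetical path - any local image file would do
image_document = Document(uri="party.png")
image_document.mime_type  # 'image/png'
```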
Does this constructor load the data?
It hasn’t actually loaded the image data so the inference of the mime type must be through the filename only. I can test this.
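A quick check along these lines confirms it:

```python
# the mime type was inferred from the filename alone - nothing has been loaded
print(image_document.content)  # None
```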
This empty document was inferred to be a PNG image. For our purposes this should be fine. It might be a problem if URLs are used which only provide the mime type when requested (instead of including an extension).
The image data can be loaded using one of the helper methods. This populates the blob field with the image data (accessible through content, as mentioned earlier).
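That loading step looks roughly like this:

```python
# read the file and decode it into the blob field as a numpy array
image_document.convert_image_uri_to_blob()
image_document.blob.shape  # height x width x channels
```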
It’s that party image that I used in the last post.
The convert_image_uri_to_blob method can’t be invoked blindly - it only works if the uri has been set on the document.
You can create a document that contains sub documents. In this way you can combine different data types, or split a large document into smaller chunks.
We can also export any document as json like this:
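A sketch of building such a composite document and dumping it, using the same image and text that appear in the output below:

```python
from jina import Document

composite_document = Document()
composite_document.chunks.append(
    Document(uri="/data/openimages/external/train_0/000002b66c9c498e.jpg")
)
composite_document.chunks.append(Document(content="hello world"))

print(composite_document.json())
```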
```json
{
  "chunks": [
    {
      "granularity": 1,
      "id": "f4940514-3a54-11ec-9b07-8ff42447411c",
      "mime_type": "image/jpeg",
      "parent_id": "f9f884d4-3a54-11ec-9b07-8ff42447411c",
      "uri": "/data/openimages/external/train_0/000002b66c9c498e.jpg"
    },
    {
      "granularity": 1,
      "id": "f4940513-3a54-11ec-9b07-8ff42447411c",
      "mime_type": "text/plain",
      "parent_id": "f9f884d4-3a54-11ec-9b07-8ff42447411c",
      "text": "hello world"
    }
  ],
  "id": "f9f884d4-3a54-11ec-9b07-8ff42447411c"
}
```
Annoyingly it does not seem like there is a nice way to traverse the parent document and all chunks in one loop. This causes my executors to be slightly more involved.
Documents are only worthwhile if we can process them. The primary bit of processing I am interested in is the creation of the embedding. This is a vector describing the document that can be used for matching.
To achieve this I am creating two executors. The first will ensure that the content is loaded, as I like to create the image documents via the uri.
```python
# from src/main/python/blog/image_text_search/executors/load_data.py
from jina import Document, DocumentArray, Executor, requests


class LoadData(Executor):
    @requests
    def load_data(self, docs: DocumentArray, **kwargs) -> None:
        for document in docs:
            self._recurse(document)

    def _recurse(self, document: Document) -> None:
        self._load_data(document)
        for child_document in document.chunks:
            self._recurse(child_document)

    @staticmethod
    def _load_data(document: Document) -> None:
        if not document.mime_type:
            return
        if document.content is not None:
            return
        if not document.uri:
            return
        if document.mime_type == "text/plain":
            document.convert_uri_to_text()
        elif document.mime_type.startswith("image/"):
            document.convert_image_uri_to_blob()
```
The @requests annotated method is the entrypoint of the executor. It can take quite a few different arguments and can return many different things. In this case I am altering the documents in place, which is normally the easiest thing to do (you can see the documentation around this here).
This is complex just because of the need to recurse through the chunks. Doing this allows the executor to work with the composite document that was created earlier. This recursion does make it harder to return an updated document array.
If you have more restricted document structures then you could get away without it. I do think that the flows perform some protective copying so that the mutation of the documents can be restricted. We can see that later.
The next thing is to generate the embedding. This is where CLIP comes in. I want a single executor that can handle both images and text, as well as composite documents. Having such an executor reduces the number of times I am loading the model as well as making it easier to index and search.
```python
# from src/main/python/blog/image_text_search/executors/clip_embedding.py
from typing import Optional

import clip
import numpy as np
import torch
from jina import Document, DocumentArray, Executor, requests
from PIL import Image


class ClipEmbeddings(Executor):
    def __init__(self, model: str = "ViT-B/32", device: str = "cpu", **kwargs) -> None:
        super().__init__(**kwargs)
        self.model, self.preprocess = clip.load(model, device=device)
        self.model.eval()
        self.device = device

    @requests
    def add_embeddings(self, docs: DocumentArray, **kwargs) -> None:
        for document in docs:
            self._add_embedding(document)

    def _add_embedding(self, document: Document) -> None:
        embedding = self._embedding(document)
        if embedding is not None:
            document.embedding = embedding
        for child_document in document.chunks:
            self._add_embedding(child_document)

    def _embedding(self, document: Document) -> Optional[np.ndarray]:
        if not document.mime_type:
            return None
        if document.content is None:
            return None
        if document.mime_type == "text/plain":
            return self._text_embedding(document.content)
        if document.mime_type.startswith("image/"):
            return self._image_embedding(document.content)
        return None

    @torch.no_grad()
    def _text_embedding(self, text: str) -> np.ndarray:
        tokens = clip.tokenize(text).to(self.device)
        tensor = self.model.encode_text(tokens)[0]
        return tensor.cpu().numpy()

    @torch.no_grad()
    def _image_embedding(self, blob: np.ndarray) -> np.ndarray:
        image = Image.fromarray(blob)
        preprocessed_image = self.preprocess(image).unsqueeze(0).to(self.device)
        tensor = self.model.encode_image(preprocessed_image)[0]
        return tensor.cpu().numpy()
```
This has the same structure as the data loading executor. The one difference is that it can generate the specific data embeddings using the CLIP model.
Executors are composed into flows. A flow is a directed acyclic graph of operations to perform on a document. You can replicate executors to provide parallelism or sharding, all of which is well documented.
In this post I’m going to keep it simple.
This flow will load the document data and then generate the embedding for it.
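A sketch of that flow, built from the two executors defined above:

```python
from jina import Flow

# load the document data, then embed it with CLIP
flow = (
    Flow()
    .add(uses=LoadData)
    .add(uses=ClipEmbeddings)
)
```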
It’s nice and simple. The creation of the flow is very fast because none of the executors have been created yet. When the flow is used the executors are created.
To use a flow you have to load it using the with statement. Doing this loads all the executors and you can then pass documents through the flow.
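Roughly like this, pushing the composite document from earlier through the default endpoint:

```python
with flow:
    results = flow.post(on="/", inputs=[composite_document], return_results=True)
```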
Flow@217463[I]:🎉 Flow is ready to use!
🔗 Protocol: GRPC
🏠 Local access: 0.0.0.0:40949
🔒 Private network: 192.168.1.54:40949
🌐 Public address: 81.2.75.20:40949
You can see that an embedding field has turned up on the two child documents. The original documents are unaltered at this point.
The original document lacks the embedding or blob data. Mutating the documents in place is a lot more justifiable given that it doesn’t alter the source.
Using return_results is intended for debugging only. This is because the entire set of results has to be held in memory. The correct way to handle the results from a flow is to hook into one of the three methods available. They allow you to handle the documents as they exit the flow, which then means the memory they use can be released.
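Those hooks are, I believe, the on_done, on_error and on_always callbacks; a minimal sketch of the streaming style looks like this:

```python
def handle_response(response) -> None:
    # each batch of documents can be processed and then discarded,
    # so the whole result set never sits in memory
    for document in response.docs:
        for chunk in document.chunks:
            print(chunk.mime_type, chunk.embedding.shape)


with flow:
    flow.post(on="/", inputs=[composite_document], on_done=handle_response)
```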
Flow@217463[I]:🎉 Flow is ready to use!
🔗 Protocol: GRPC
🏠 Local access: 0.0.0.0:58341
🔒 Private network: 192.168.1.54:58341
🌐 Public address: 81.2.75.20:58341
Trying to display the response causes problems because it is a recursive data structure. The response wraps the documents and can hold any errors encountered during processing.
The current structure of the flow is nice because it means I can process images from the internet directly:
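For example (the URL here is a stand-in, not the image I actually used):

```python
# hypothetical remote image - the mime type is inferred from the extension
web_document = Document(uri="https://example.com/photo.jpg")

with flow:
    results = flow.post(on="/", inputs=[web_document], return_results=True)
```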
Flow@217463[I]:🎉 Flow is ready to use!
🔗 Protocol: GRPC
🏠 Local access: 0.0.0.0:35689
🔒 Private network: 192.168.1.54:35689
🌐 Public address: 81.2.75.20:35689
This all looks good but it’s quite low level stuff. What I need is for someone else to do the hard work of creating all the executors. Then I could load them and make a fancy search application with minimal effort. I like minimal effort.
The place that has all the executors is called Jina Hub and it allows you to load predefined executors. It’s also capable of installing the dependencies they require (which didn’t work for me, but that may be a pip thing). You can also run the executor in a docker container.
Let’s try to combine the executors we have with something that will index and search the embeddings. This would allow us to create a full image search engine.
This Flow uses the SimpleIndexer, which just holds the documents in memory. Once they are indexed you can search them by passing in more documents. It’s a brute force search, so it is not appropriate for large datasets. It can write the index to disk; however, I have not been able to trigger that reliably.
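The flow itself looks something like this; here I am pulling SimpleIndexer from the hub as a docker container and pointing it at a workspace (which is where it would persist the index, if that worked):

```python
from jina import Flow

flow = (
    Flow()
    .add(uses=LoadData)
    .add(uses=ClipEmbeddings)
    .add(
        uses="jinahub+docker://SimpleIndexer",
        uses_metas={"workspace": "image-search"},
    )
)
```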
Now we can index some documents and then perform a search.
```python
from jina import Document

image_documents = [
    Document(uri=f"/data/openimages/external/train_0/{name}")
    for name in [
        "000002b66c9c498e.jpg",
        "000002b97e5471a0.jpg",
        "000002c707c9895e.jpg",
        "0000048549557964.jpg",
        "000004f4400f6ec5.jpg",
        "0000071d71a0a6f6.jpg",
        "000013ba71c12506.jpg",
        "000018acd19b4ad3.jpg",
        "00001bc2c4027449.jpg",
        "00001bcc92282a38.jpg",
    ]
]
```
The first step is to index these documents. SimpleIndexer will add them to an in-memory array when I invoke the flow with index. Then when the flow is invoked with search it will match the indexed documents against the current document.
Since the index is held in memory, both steps have to be done within the same context. Running this does produce quite a lot of output, and the SimpleIndexer doesn’t seem to be the highest quality code. Given that it is totally unsuitable for production use, that is fine.
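Both calls sit inside one with block, roughly like this (the query text is a guess based on the building prompt discussed below):

```python
from jina import Document

with flow:
    # add the images to the in-memory index
    flow.post(on="/index", inputs=image_documents)

    # then search it with a text query
    results = flow.post(
        on="/search",
        inputs=[Document(content="a photo of a building")],
        return_results=True,
    )
```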
⠼ 2/4 waiting executor1 executor2 to be ready... executor2@272327[I]:
executor2@272327[I]:MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMWWWMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
executor2@272327[I]:MMMMMMMMMMMMMMMMMMMMMMMMMMMMMWNNNNNNNWMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
executor2@272327[I]:MMMMMMMMMMMMMMMMMMMMMMMMMMMMMNNNNNNNNNMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
executor2@272327[I]:MMMMMMMMMMMMMMMMMMMMMMMMMMMMMWNNNNNNNNMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
executor2@272327[I]:MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMWNNNWWMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
executor2@272327[I]:MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
executor2@272327[I]:MMMMMMMMMMMMWxxxxxxxxxOMMMMMNxxxxxxxxx0MMMMMKddddddxkKWMMMMMMMMMMMMXOxdddxONMMMM
executor2@272327[I]:MMMMMMMMMMMMXllllllllldMMMMM0lllllllllxMMMMMOllllllllllo0MMMMMMMM0olllllllllo0MM
executor2@272327[I]:MMMMMMMMMMMMXllllllllldMMMMM0lllllllllxMMMMMOlllllllllllloWMMMMMdllllllllllllldM
executor2@272327[I]:MMMMMMMMMMMMXllllllllldMMMMM0lllllllllxMMMMMOllllllllllllloMMMM0lllllllllllllllK
executor2@272327[I]:MMMMMMMMMMMMKllllllllldMMMMM0lllllllllxMMMMMOllllllllllllllKMMM0lllllllllllllllO
executor2@272327[I]:MMMMMMMMMMMMKllllllllldMMMMM0lllllllllxMMMMMOllllllllllllll0MMMMollllllllllllllO
executor2@272327[I]:MWOkkkkk0MMMKlllllllllkMMMMM0lllllllllxMMMMMOllllllllllllll0MMMMMxlllllllllllllO
executor2@272327[I]:NkkkkkkkkkMMKlllllllloMMMMMM0lllllllllxMMMMMOllllllllllllll0MMMMMMWOdolllllllllO
executor2@272327[I]:KkkkkkkkkkNMKllllllldMMMMMMMMWWWWWWWWWMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
executor2@272327[I]:MOkkkkkkk0MMKllllldXMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
executor2@272327[I]:MMWX00KXMMMMXxk0XMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
executor2@272327[I]:
executor2@272327[I]:▶️ /usr/local/bin/jina executor --uses config.yml --name executor2 --workspace /home/matthew/Programming/Blog/blog/_notebooks --identity 2c211110-2cfc-4b78-8e58-b40c851464fa --workspace-id e2847d66-ec84-4dea-909d-2d65be50e84e --zmq-identity 82825c89-76fc-459d-8ab4-85a495472241 --port-ctrl 55029 --uses-metas {"workspace": "image-search"} --port-in 45357 --port-out 52703 --hosts-in-connect --socket-in ROUTER_BIND --socket-out ROUTER_BIND --native --num-part 1 --dynamic-routing-out --dynamic-routing-in --runs-in-docker --upload-files --noblock-on-start
executor2@272327[I]:🔧️ cli = executor
executor2@272327[I]:ctrl-with-ipc = False
executor2@272327[I]:daemon = False
executor2@272327[I]:disable-remote = False
executor2@272327[I]:docker-kwargs = None
executor2@272327[I]:dump-path =
executor2@272327[I]:dynamic-routing = True
executor2@272327[I]:🔧️ dynamic-routing-in = True
executor2@272327[I]:🔧️ dynamic-routing-out = True
executor2@272327[I]:entrypoint = None
executor2@272327[I]:env = None
executor2@272327[I]:expose-public = False
executor2@272327[I]:extra-search-paths = []
executor2@272327[I]:force = False
executor2@272327[I]:gpus = None
executor2@272327[I]:grpc-data-requests = False
executor2@272327[I]:host = 0.0.0.0
executor2@272327[I]:host-in = 0.0.0.0
executor2@272327[I]:host-out = 0.0.0.0
executor2@272327[I]:🔧️ hosts-in-connect = []
executor2@272327[I]:🔧️ identity = 2c211110-2cfc-4b78-8e58-b40c85
executor2@272327[I]:install-requirements = False
executor2@272327[I]:🔧️ k8s-connection-pool = True
executor2@272327[I]:k8s-namespace = None
executor2@272327[I]:log-config = /usr/local/lib/python3.8/site-
executor2@272327[I]:memory-hwm = -1
executor2@272327[I]:🔧️ name = executor2
executor2@272327[I]:🔧️ native = True
executor2@272327[I]:🔧️ noblock-on-start = True
executor2@272327[I]:🔧️ num-part = 1
executor2@272327[I]:on-error-strategy = IGNORE
executor2@272327[I]:pea-role = SINGLETON
executor2@272327[I]:🔧️ port-ctrl = 55029
executor2@272327[I]:🔧️ port-in = 45357
executor2@272327[I]:port-jinad = 8000
executor2@272327[I]:🔧️ port-out = 52703
executor2@272327[I]:pull-latest = False
executor2@272327[I]:py-modules = None
executor2@272327[I]:quiet = False
executor2@272327[I]:quiet-error = False
executor2@272327[I]:quiet-remote-logs = False
executor2@272327[I]:replicas = 1
executor2@272327[I]:routing-table = None
executor2@272327[I]:🔧️ runs-in-docker = True
executor2@272327[I]:runtime-backend = PROCESS
executor2@272327[I]:runtime-cls = ZEDRuntime
⠴ 3/4 waiting executor1 to be ready... executor2@272327[I]:shard-id = 0
executor2@272327[I]:shards = 1
executor2@272327[I]:🔧️ socket-in = ROUTER_BIND
executor2@272327[I]:🔧️ socket-out = ROUTER_BIND
executor2@272327[I]:ssh-keyfile = None
executor2@272327[I]:ssh-password = None
executor2@272327[I]:ssh-server = None
executor2@272327[I]:static-routing-table = False
executor2@272327[I]:timeout-ctrl = 5000
executor2@272327[I]:timeout-ready = 600000
executor2@272327[I]:🔧️ upload-files = []
executor2@272327[I]:🔧️ uses = config.yml
executor2@272327[I]:🔧️ uses-metas = {'workspace': 'image-search'}
executor2@272327[I]:uses-requests = None
executor2@272327[I]:uses-with = None
executor2@272327[I]:volumes = None
executor2@272327[I]:🔧️ workspace = /home/matthew/Programming/Blog
executor2@272327[I]:🔧️ workspace-id = e2847d66-ec84-4dea-909d-2d65be
executor2@272327[I]:🔧️ zmq-identity = 82825c89-76fc-459d-8ab4-85a495
executor2@272327[I]:
executor2@272327[I]: executor2@ 1[L]: Executor SimpleIndexer started
executor2@272327[I]: JINA@ 1[W]:You are using Jina version 2.1.13, however version 2.2.1 is available. You should consider upgrading via the "pip install --upgrade jina" command.
Flow@217463[I]:🎉 Flow is ready to use!
🔗 Protocol: GRPC
🏠 Local access: 0.0.0.0:45755
🔒 Private network: 192.168.1.54:45755
🌐 Public address: 81.2.75.20:45755
executor2@272327[I]:UserWarning: It looks like you are trying to import multiple python modules using `py_modules`. When using multiple python files to define an executor, the recommended practice is to structure the files in a python package, and only import the `__init__.py` file of that package. For more details, please check out the cookbook: https://docs.jina.ai/fundamentals/executor/repository-structure/ (raised from /usr/local/lib/python3.8/site-packages/jina/jaml/helper.py:244)
executor2@272327[I]:UserWarning:
executor2@272327[I]:executor shadows one of built-in Python module name.
executor2@272327[I]:It is imported as `user_module.executor`
executor2@272327[I]:
executor2@272327[I]:Affects:
executor2@272327[I]:- Either, change your code from using `from executor import ...`
executor2@272327[I]:to `from user_module.executor import ...`
executor2@272327[I]:- Or, rename executor to another name
executor2@272327[I]: (raised from /usr/local/lib/python3.8/site-packages/jina/importer.py:111)
{'cosine': 0.7256965637207031}
{'cosine': 0.7519792914390564}
{'cosine': 0.7634004354476929}
{'cosine': 0.8040722608566284}
{'cosine': 0.8224889039993286}
Here you can see the document that is returned as the result. It has a link to all of the documents in the index, which are stored in the matches attribute. By iterating through the top 5 results we can see the images as well as the score they got. What surprises me is that the top result has the lowest score. The top result is a strong match for the query though (a photo of a building), so it is working.
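The scores above come from a loop along these lines (assuming the response exposes the documents through docs):

```python
query_document = results[0].docs[0]

# matches are already sorted by distance, lowest (best) first
for match in query_document.matches[:5]:
    print({"cosine": match.scores["cosine"].value})
```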
The document match documentation states that the match results are always returned sorted from lowest to highest, so maybe the cosine similarity has been mapped such that a perfect match is 0 and it increases from there. The cosine function is implemented here as \(1 - \text{cosine similarity}\), which means it ranges from 0 (perfect match) to 2 (perfect mismatch).
That does mean that these results are not incredibly strong matches. CLIP is known to be sensitive to the prompt.
This concludes the initial investigation of Jina. I think that it has done well so far.
To make this into a “production” system I would need a more reliable indexer and to shard the index. There are several faiss executors available in the hub so it shouldn’t be too hard to do.
The initial results of the image search suggest that this should all work. The space and compute required are the real limitations.