Software Engineering Blog by Anton Kolhun

Efficient Full Text Search With Vespa: Configure Text To Image Search With CLIP Models

25 Oct 2024

Text-to-image search lets users enter a text description and retrieve images that match it. For example, the query "couple drinking coffee" might return an image like the following:

[Image: text_to_image]

This guide shows how to set up a text-to-image search in Vespa using the CLIP model.

Goal

Configure text-to-image search with the CLIP clip-image-vit-32 model so that relevant images can be retrieved from textual input.

Prerequisites

Ensure Configure Semantic Search is completed.
Docker-compose installed.
JDK 17+ installed.
Maven 3+ installed.

Add Fields and Rank Profile

Define image_url and vit_b_32_image fields in the schema:

field image_url type string {
    indexing: summary
}

field vit_b_32_image type tensor<float>(x[512]) {
    indexing: attribute | index | summary
    attribute {
        distance-metric: euclidean
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 200
        }
    }
}

where

image_url is filled during feeding with values such as https://ak-pub-images.s3.eu-central-1.amazonaws.com/pexels-photo-2735970.jpeg.
vit_b_32_image is populated automatically by the ImageEmbedderProcessor component, which creates a tensor from the content of the image that image_url points to.

Create a new rank profile:

rank-profile vit_b_32_similarity inherits default {
    inputs {
        query(vit_b_32_text) tensor<float>(x[512])
    }

    first-phase {
        expression: closeness(label, nns)
    }
}
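To make the first-phase expression concrete: for the euclidean distance metric, Vespa documents the closeness rank feature as 1 / (1 + distance), so a smaller distance between the query tensor and the image tensor gives a score nearer to 1. A minimal Python sketch of that calculation follows. The function name and the three-dimensional toy vectors are illustrative assumptions; the real tensors have 512 dimensions.

```python
import math


def closeness(query_vec, doc_vec):
    """Mirror the documented closeness formula: 1 / (1 + euclidean distance)."""
    distance = math.dist(query_vec, doc_vec)
    return 1.0 / (1.0 + distance)


# Toy 3-dimensional vectors standing in for the 512-dimensional embeddings.
query = [1.0, 0.0, 0.0]
near_image = [1.0, 0.0, 0.0]  # identical vector: distance 0
far_image = [1.0, 3.0, 4.0]   # distance sqrt(0 + 9 + 16) = 5

print(closeness(query, near_image))  # 1.0
print(closeness(query, far_image))   # 1 / 6
```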

In this setup, the label nns is defined in the query profile shown below.

Create book_image_v1 query-profile:

<?xml version="1.0" encoding="UTF-8"?>
<query-profile id="book_image_v1">
    <field name="maxHits">100</field>
    <field name="maxOffset">100</field>
    <field name="hits">10</field>
    <field name="ranking.profile">vit_b_32_similarity</field>
    <field name="text_embedding_enabled">true</field>

    <field name="yql">select * from book where
        ({"targetHits": 100, "label": "nns"}nearestNeighbor(vit_b_32_image, vit_b_32_text))
        %{query_filter}
    </field>

    <field name="timeout">2s</field>
    <field name="rules.off">false</field>
</query-profile>

Here, the vit_b_32_text parameter is populated by the TextEmbeddingSearcher component, which computes the embedding from the search_term parameter.
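Conceptually, the nearestNeighbor operator asks for the targetHits documents whose image tensor lies closest to the text tensor. The sketch below shows that idea as an exact brute-force scan in Python; Vespa's HNSW index approximates the same result without scanning every document. The function name, document ids, and two-dimensional toy vectors are hypothetical.

```python
import heapq
import math


def nearest_neighbors(query_vec, image_vecs, target_hits):
    """Return the ids of the target_hits images closest to query_vec.

    Exact brute-force scan over all vectors, closest first.
    """
    return heapq.nsmallest(
        target_hits,
        image_vecs,
        key=lambda doc_id: math.dist(query_vec, image_vecs[doc_id]),
    )


# Hypothetical toy embeddings (2 dimensions instead of 512).
images = {
    "boat": [0.9, 0.1],
    "coffee": [0.1, 0.9],
    "ship": [0.8, 0.3],
}
text_embedding = [1.0, 0.0]  # stands in for query(vit_b_32_text)

print(nearest_neighbors(text_embedding, images, target_hits=2))
```

With these toy values the two closest ids are "boat" and "ship", in that order.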

Setting Up Models in VAP

CLIP includes two models: one for text and one for images. They are trained together on images paired with text descriptions, so both produce embeddings in the same vector space.
Download the image and text models, place them in the ./models folder, and rename them to clip_image_vit_b_32_v1.onnx and clip_text_vit_b_32_v1.onnx respectively.
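Before packaging, it can help to confirm that both files are in place under the expected names. The following is a small optional check, not part of the original setup; the helper name is hypothetical and the directory is passed in by the caller.

```python
from pathlib import Path

# File names taken from the step above.
EXPECTED_MODELS = (
    "clip_image_vit_b_32_v1.onnx",
    "clip_text_vit_b_32_v1.onnx",
)


def missing_models(models_dir):
    """Return the expected model file names that are absent from models_dir."""
    directory = Path(models_dir)
    return [name for name in EXPECTED_MODELS if not (directory / name).is_file()]
```

Calling missing_models("./models") returns an empty list when both models are present.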

Creating Components To Utilize The Models

Add the necessary codebase components:
- ImageEmbedderProcessor: Uses clip_image_vit_b_32_v1.onnx to convert image content into tensors, which it stores in a specified document field during feeding.
- TextEmbeddingSearcher: Uses clip_text_vit_b_32_v1.onnx to convert text-based search queries into tensors for ranking calculations.
- BPETokenizer: Tokenizer used by TextEmbeddingSearcher.

Add the Math-Engine application, which exposes an HTTP endpoint that downloads images from specified URLs and converts them to tensors. This functionality was adapted from the OpenAI CLIP codebase based on the following guideline. ImageEmbedderProcessor calls this endpoint to calculate the image embedding.
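For orientation, the OpenAI CLIP preprocessing for ViT-B/32 resizes the image so its shorter side is 224 pixels, takes a 224x224 center crop, and standardizes each RGB channel with fixed statistics. The sketch below covers only the crop arithmetic and the per-pixel normalization in plain Python. Whether the Math-Engine performs exactly these steps is an assumption, the function names are hypothetical, and the floor-based crop offset is a simplification.

```python
# Per-channel statistics used by the OpenAI CLIP preprocessing pipeline.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
INPUT_SIZE = 224  # ViT-B/32 takes 224x224 RGB input


def center_crop_box(width, height, size=INPUT_SIZE):
    """Return (left, top, right, bottom) of a centered size x size crop."""
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


print(center_crop_box(336, 224))       # (56, 0, 280, 224)
print(normalize_pixel((255, 255, 255)))
```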
Define the necessary changes in services.xml:

 <model-evaluation/>
 ...
<documentprocessor id="edu.component.ImageEmbedderProcessor" bundle="text-image-search">
    <config name="edu.component.image-embedder">
        <modelName>clip_image_vit_b_32_v1</modelName>
        <schemaToFieldsCfg>
            <item key="book">image_url,vit_b_32_image</item>
        </schemaToFieldsCfg>
    </config>
</documentprocessor>
...
<component id="edu.component.BPETokenizer" bundle="text-image-search">
    <config name="edu.component.bpe-tokenizer">
        <contextlength>77</contextlength>
        <vocabulary>files/bpe_simple_vocab_16e6.txt.gz</vocabulary>
    </config>
</component>
...

<searcher id="edu.component.TextEmbeddingSearcher" bundle="text-image-search">
    <config name="edu.component.text-embedder">
        <modelName>clip_text_vit_b_32_v1</modelName>
        <rankFeatureParam>query(vit_b_32_text)</rankFeatureParam>
    </config>
</searcher>

Place the bpe_simple_vocab_16e6.txt.gz vocabulary in the ./book/files folder (it is used by BPETokenizer).
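The contextlength of 77 matches the fixed input length of the CLIP text model. In the OpenAI CLIP tokenizer, the BPE token ids are wrapped in a start marker (49406) and an end marker (49407) and then zero-padded to that length. The sketch below illustrates only this wrapping and padding step; the function name is hypothetical, the BPE encoding itself is left out, and it is an assumption that the BPETokenizer component behaves the same way.

```python
CONTEXT_LENGTH = 77   # matches <contextlength> in services.xml
START_TOKEN = 49406   # <|startoftext|> in the OpenAI CLIP vocabulary
END_TOKEN = 49407     # <|endoftext|> in the OpenAI CLIP vocabulary


def to_model_input(bpe_token_ids, context_length=CONTEXT_LENGTH):
    """Wrap BPE token ids in start/end markers and zero-pad to context_length."""
    tokens = [START_TOKEN, *bpe_token_ids, END_TOKEN]
    if len(tokens) > context_length:
        raise ValueError(f"input is longer than {context_length} tokens")
    return tokens + [0] * (context_length - len(tokens))


# The ids 1, 2, 3 are placeholders, not real vocabulary entries.
padded = to_model_input([1, 2, 3])
print(len(padded))   # 77
print(padded[:6])    # [49406, 1, 2, 3, 49407, 0]
```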

Build and Package Components into VAP

Package the components with the Vespa Maven plugin:

mvn package

Copy the generated ./target/application/components/text-image-search-deploy.jar file to the ./book/components folder.

Setting Up Docker-Compose and Deploying the VAP

Start Docker Compose:

docker-compose up -d

This launches the Vespa-Engine and Math-Engine applications as defined in docker-compose.yml.

Deploy the VAP:

vespa deploy book

Feed The Documents

Populate the image_url field in the test documents with appropriate image URLs, then run:

vespa feed book/ext/docs.json
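If the test documents are stored as Vespa "put" feed operations, the image_url values can be filled in with a short script before feeding. The sketch below assumes that format; the helper name, document id, namespace, and title field are hypothetical, and the pairing of this document with the example URL is for illustration only.

```python
import json


def add_image_urls(feed_operations, urls_by_doc_id):
    """Return a copy of the feed operations with image_url set where known."""
    updated = []
    for operation in feed_operations:
        doc_id = operation["put"]
        fields = dict(operation["fields"])  # copy so the input stays unchanged
        if doc_id in urls_by_doc_id:
            fields["image_url"] = urls_by_doc_id[doc_id]
        updated.append({"put": doc_id, "fields": fields})
    return updated


# Hypothetical document id, namespace, and title field.
docs = [{"put": "id:example:book::1", "fields": {"title": "The Open Boat"}}]
urls = {
    "id:example:book::1":
        "https://ak-pub-images.s3.eu-central-1.amazonaws.com/pexels-photo-2735970.jpeg",
}

print(json.dumps(add_image_urls(docs, urls), indent=2))
```

The vit_b_32_image field is deliberately absent here, since ImageEmbedderProcessor fills it during feeding.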

Execute Search

To test the setup, perform a search for "boat":

curl --location 'http://localhost:8080/search/' \
--header 'Content-Type: application/json' \
--data '{
    "queryProfile": "book_image_v1",
    "search_term": "boat"
}'
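The same request can be issued from Python using only the standard library. This is a sketch equivalent to the curl command above, assuming Vespa is reachable at localhost:8080; the function names are illustrative.

```python
import json
import urllib.request

SEARCH_URL = "http://localhost:8080/search/"


def build_search_request(search_term, query_profile="book_image_v1"):
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"queryProfile": query_profile, "search_term": search_term})
    return urllib.request.Request(
        SEARCH_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(search_term):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_search_request(search_term)) as response:
        return json.load(response)
```

Calling search("boat") requires the running Vespa instance from the previous steps; in Vespa's default JSON result format the hits are listed under root.children.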

The search should return results where "The Open Boat" receives the highest rank.

Summary

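By following these steps, we have configured text-to-image search in Vespa: each document receives a CLIP image embedding during feeding, each query receives a CLIP text embedding at search time, and nearest-neighbor search over the vit_b_32_image field matches the two.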

Next Steps

Explore Implementing PII Search
