Efficient Full Text Search With Vespa: Configure Text To Image Search With CLIP Models

cal 25 Oct 2024

Text-to-image search allows users to enter a text description and retrieve matching images based on that description. For a example, "couple drinking coffee" query might return the image as follows:

text_to_image

This guide shows how to set up a text-to-image search in Vespa using the CLIP model.

Goal

Configure text-to-image search using CLIP clip-image-vit-32 model to enable retrieval of relevant images based on textual input.

Prerequisites

Add Fields and Rank Profile

Define image_url and vit_b_32_image fields in the schema:

field image_url type string {
    indexing:  summary
}

field vit_b_32_image type tensor<float>(x[512]) {
    indexing: attribute | index | summary
    attribute {
        distance-metric: euclidean
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 200
        }
    }
}

where

Create a new rank profile:

rank-profile vit_b_32_similarity inherits default {
    inputs {
      query(vit_b_32_text) tensor<float>(x[512])
    }

    first-phase {
      expression: closeness(label, nns)
    }
  }

In this setup, the label nns is defined in the query profile, as shown below

Create book_image_v1 query-profile:

<?xml version="1.0" encoding="UTF-8"?>
<query-profile id="book_image_v1">
    <field name="maxHits">100</field>
    <field name="maxOffset">100</field>
    <field name="hits">10</field>
    <field name="ranking.profile">vit_b_32_similarity</field>
    <field name="text_embedding_enabled">true</field>

    <field name="yql">select * from book where
        ({"targetHits": 100, "label": "nns"}nearestNeighbor(vit_b_32_image, vit_b_32_text))
        %{query_filter}
    </field>

    <field name="timeout">2s</field>
    <field name="rules.off">false</field>
</query-profile>

Here, the vit_b_32_text parameter is populated by TextEmbeddingSearcher component which computes the embedding based on the search_term parameter

Setting Up Models in VAP

CLIP includes two models: one for text and one for images. They are trained together by pairing images with text descriptions.
Download image clip_image_vit_b_32_v1.onnx and text clip_text_vit_b_32_v1.onnx models. Place them in the ./models folder and rename to clip_image_vit_b_32_v1.onnx and clip_text_vit_b_32_v1.onnx respectively.

Creating Components To Utilize The Models

  • Add the necessary codebase components:

    • ImageEmbedderProcessor: Uses clip_image_vit_b_32_v1.onnx to convert image content into tensors, which it stores in a specified document field during feeding

    • TextEmbeddingSearcher - Uses clip_text_vit_b_32_v1.onnx to convert text-based search queries into tensors for ranking calculations.

    • BPETokenizer - Tokenizer used by TextEmbeddingSearcher

  • Add Math-Engine application that exposes an HTTP endpoint to download images from specified URLs and convert them to tensors. This functionality was adapted from openai clip codebase based on the following guideline.
    ImageEmbedderProcessor calls this endpoint to calculate image embedding.

  • Define necessary changes in services.xml:

 <model-evaluation/>
 ...
<documentprocessor id="edu.component.ImageEmbedderProcessor" bundle="text-image-search">
    <config name="edu.component.image-embedder">
        <modelName>clip_image_vit_b_32_v1</modelName>
        <schemaToFieldsCfg>
            <item key="book">image_url,vit_b_32_image</item>
        </schemaToFieldsCfg>
    </config>
</documentprocessor>
...
<component id="edu.component.BPETokenizer" bundle="text-image-search">
    <config name="edu.component.bpe-tokenizer">
        <contextlength>77</contextlength>
        <vocabulary>files/bpe_simple_vocab_16e6.txt.gz</vocabulary>
    </config>
</component>
...

<searcher id='edu.component.TextEmbeddingSearcher' bundle="text-image-search">
    <config name="edu.component.text-embedder">
        <modelName>clip_text_vit_b_32_v1</modelName>
        <rankFeatureParam>query(vit_b_32_text)</rankFeatureParam>
    </config>
</searcher>

Build and Package Components into VAP

  • Package the components with the vespa maven plugin:

mvn package
  • Copy the generated ./target/application/components/text-image-search-deploy.jar file to the ./book/components folder

Setting Up Docker-Compose and Deploying the VAP

  • Start docker compose:

docker-compose up -d

This will launch the Vespa-Engine and Math-Engine applications as defined in docker-compose.yml

  • Deploy VAP:

vespa deploy book
  • Feed The Documents
    Populate image_url field in the test documents with appropriate image URLs and then run:

vespa feed book/ext/docs.json

To test the setup, perform a search for "boat":

curl --location 'http://localhost:8080/search/' \
--header 'Content-Type: application/json' \
--data '{
    "queryProfile": "book_image_v1",
    "search_term": "boat"
}'

The search should return results where "The Open Boat" receives the highest rank.

Summary

By following these steps, we have successfully configured a text-to-image search in Vespa, enabling accurate and efficient image retrieval based on textual descriptions.

Next Steps