Similarity search

Explore data with embeddings

Training datasets that are carefully visualized, curated, and debugged are the most impactful for increasing model performance. However, these practices are challenging to apply to unstructured data (e.g., images, text, videos, and PDFs) because unstructured data is not queryable with something like SQL.

Labelbox's similarity search tool is designed to help you programmatically identify similar or dissimilar data. You can use this tool to mine data and look for examples of rare assets or edge cases that will dramatically improve your model performance. This similarity search engine gives your team an advantage by helping you surface high-impact data rows in an ocean of data.

The alternative to using Labelbox's similarity search engine is to build your own similarity search tool. However, building an in-house similarity search tool that scales to hundreds of millions of data points, and that returns results instantaneously in just one click, is difficult for even the most advanced machine learning teams.

We recommend that you use the native similarity search engine within our Catalog product.

Use similarity search to programmatically surface similar data.

How similarity search works

Similarity search is powered by vector embeddings. A vector embedding is a numerical representation of a piece of data (e.g., an image, text, document, or video) that translates high-dimensional data into a lower-dimensional space.

There are many ways to create vector embeddings for data. However, generating embeddings via neural networks is the most common and effective approach. Most neural networks are designed to produce structured outputs like bounding boxes or classifications.

To produce its final prediction, though, a model passes through a series of internal states. Embeddings work by extracting one of these internal states and using it as a representation of the data row. In other words, the neural network acts as a feature extractor: it extracts an embedding vector that contains rich information about the data row.
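
To make the feature-extractor idea concrete, here is a minimal sketch of a toy two-layer network in plain Python. The weights, layer sizes, and function names are hypothetical and chosen only for illustration; they are not the models Labelbox uses. The hidden-layer activations play the role of the embedding, while the final output is the model's prediction.

```python
import math

def hidden_state(x, w_hidden):
    """Return the hidden-layer activations for input x (used as the embedding)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]

def predict(x, w_hidden, w_out):
    """Full forward pass: hidden layer, then a single linear output score."""
    h = hidden_state(x, w_hidden)
    return sum(w * hi for w, hi in zip(w_out, h))

# Hypothetical weights: 4 input features -> 3 hidden units -> 1 output.
W_HIDDEN = [
    [0.5, -0.2, 0.1, 0.0],
    [0.3, 0.8, -0.5, 0.2],
    [-0.6, 0.1, 0.4, 0.9],
]
W_OUT = [1.0, -1.0, 0.5]

x = [1.0, 0.0, 2.0, -1.0]
embedding = hidden_state(x, W_HIDDEN)  # internal state, reused as the embedding
score = predict(x, W_HIDDEN, W_OUT)    # the model's final output
```

Here `embedding` has three dimensions because the toy hidden layer has three units; real embedding models expose far larger internal states.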

The data rows that are used as input in a similarity search are called anchors. When you search for similar data rows, Labelbox surfaces and returns the data rows whose embeddings are closest to those of anchor data rows. Labelbox uses Manhattan distance (also known as L1 distance) to measure similarity.
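
As a rough illustration of ranking by Manhattan distance, the sketch below sorts candidate data rows by their L1 distance to a single anchor. The function names, data row IDs, and vectors are made up for the example. How Labelbox combines multiple anchors or indexes embeddings at scale is not described here, so this shows only the distance itself.

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(a, b))

def rank_by_anchor(anchor, candidates):
    """Return (data_row_id, distance) pairs sorted from closest to farthest."""
    scored = [
        (row_id, manhattan_distance(anchor, vector))
        for row_id, vector in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical 3-dimensional embeddings keyed by made-up data row IDs.
anchor = [1.0, 0.0, 2.0]
candidates = {
    "row_a": [1.0, 0.5, 2.0],
    "row_b": [4.0, 0.0, -1.0],
    "row_c": [1.0, 0.0, 2.25],
}
ranking = rank_by_anchor(anchor, candidates)
```

A smaller distance means the candidate's embedding is closer to the anchor's, so the first entries of `ranking` are the most similar data rows.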

Supported embeddings

Labelbox automatically computes off-the-shelf embeddings for the data types noted below. To do so, Labelbox uses neural networks trained on publicly available data. Off-the-shelf embeddings provide a useful starting point to explore your data and perform similarity searches.

| Asset type | Off-the-shelf | Custom |
| --- | --- | --- |
| Image | CLIP-ViT-B-32 (512 dimensions) | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Video | - | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Text | all-mpnet-base-v2 (768 dimensions) | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| HTML | all-mpnet-base-v2 (768 dimensions) | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Document | CLIP-ViT-B-32 (512 dimensions) | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Tiled imagery | CLIP-ViT-B-32 (512 dimensions) | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| DICOM | - | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Audio | - | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Conversational | - | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |

Custom embeddings

Adding your own custom embeddings may improve your data exploration and similarity search experience.

You can upload up to 100 different custom embeddings, on any kind of data, to Labelbox. This enables you to experiment with different embeddings to power your data selection. To learn how, see the "How to upload custom embeddings" section below.

View available embeddings on a data row

To find an exhaustive list of embeddings for a data row, go to Catalog and open the detailed view of the data row. In the detailed view under Metadata, you will find all embeddings available on the data row.

The detailed view of Catalog shows the list of embeddings available on a data row.

How to search for similar data rows

Data rows that share common characteristics are represented by vectors that are close to each other in the embedding space. There are several ways to do a similarity search within the Catalog product.

Select the initial anchor

In the gallery view of Catalog, you can find similar data rows in just one click. To do this, hover over a thumbnail and an icon will show up in the bottom-right corner of the thumbnail. Click on it to find similar data rows.

Launch a similarity search by clicking on the thumbnail.

Once you've clicked on the similarity search icon, Labelbox will automatically populate the Similar to filter and display data rows that are similar to the anchor data row in the gallery view.

A similarity search shows up as a filter in Catalog.

Select multiple initial anchors

Alternatively, you can select multiple data rows as anchor data rows. To do this, go to the gallery view of Catalog, select multiple data rows, then click Similar to selection. Labelbox will then populate the gallery with similar data rows.

Launch a similarity search using multiple anchors

After you click Similar to selection, Labelbox will automatically populate the Similar to filter and display data rows that are similar to the anchor data rows in the gallery view.

Refine the similarity search

Several approaches and tools are available to refine your initial similarity search.

Add anchors

While browsing through the results of a similarity search, Labelbox recommends refining the similarity search by adding more anchors. To do so, select the data rows of interest and click Add selection to anchors.

Add selected images as anchors to refine your similarity search.

Anchor limit

You may add up to 20 anchors per similarity search. This limit will be increased soon.

Visualize anchors

Once a similarity search filter has been populated, you can visualize the anchors associated with it by clicking on Anchors (n).

Click Anchors (n) to see the anchors.

Remove anchors

To remove one or more anchors from a similarity search, first visualize all anchors (see Visualize anchors above), then hover over the thumbnail of the anchor you want to remove and click the (-) icon.

Remove one or more anchors from a similarity search.

Customize the range of similarity scores

Similarity search surfaces the data rows whose embeddings are closest to those of anchor data rows. Closeness is measured using Manhattan distance (also known as L1 distance) and reported as a similarity score between 0 and 1. The more similar the embeddings, the higher the similarity score.

By default, Labelbox returns data rows with a similarity score between 0.85 and 1. You can customize this range by setting the minimum and maximum values of the similarity search slider.

Customize the results of the similarity search by specifying the range of similarity scores.
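
The effect of the slider can be pictured as a range filter over similarity scores. The sketch below assumes you already have a score per data row; the function name, the sample scores, and the choice of inclusive bounds are assumptions for illustration, not a description of Labelbox's internals.

```python
def filter_by_score(scores, minimum=0.85, maximum=1.0):
    """Keep data rows whose similarity score lies in [minimum, maximum]."""
    if not 0.0 <= minimum <= maximum <= 1.0:
        raise ValueError("expected 0 <= minimum <= maximum <= 1")
    return {
        row_id: score
        for row_id, score in scores.items()
        if minimum <= score <= maximum
    }

# Hypothetical similarity scores keyed by made-up data row IDs.
scores = {"row_a": 0.97, "row_b": 0.85, "row_c": 0.62}

default_hits = filter_by_score(scores)              # default range: 0.85 to 1
wider_hits = filter_by_score(scores, minimum=0.6)   # widened lower bound
```

Lowering the minimum widens the result set to include less similar data rows, which can help when hunting for edge cases.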

Specify an embedding

Similarity search relies on a choice of embedding. You can decide which embedding to use for similarity search. To do this, select an embedding from the dropdown in the Similar to filter.

Specify the embedding that powers the similarity search.

All off-the-shelf embeddings and custom embeddings can be used to power the similarity search.

Combine similarity search with other filters

You can combine similarity search with other filters in Catalog. Some filters are best used for targeting unstructured data and others are best for targeting structured data.

Combine similarity search with the following filters to target data rows by structured data:

Combine similarity search with the following filters to search unstructured data:

Similarity search can be used in conjunction with other filters to surface high-impact data.

Automate data curation with slices

After populating filters in Catalog, you can save these filters as a slice of data. Once a filter is saved as a slice, you do not need to populate the same filters over and over again. Slices are dynamic, so any incoming data rows in Catalog will show up in the relevant slices.

Read through the following resources to learn how to take action on the filtered data:

How to upload custom embeddings

You can improve your data exploration and similarity search experience by adding your own custom embeddings. Labelbox allows you to upload up to 100 different custom embeddings on any kind of data. You can experiment with different embeddings to power your data selection.

Python SDK support coming soon

In May 2023, you will be able to upload custom embeddings via the Python SDK. Meanwhile, here is a temporary solution to upload custom embeddings to Labelbox.

Step 1: Install the package

ADVLib is a basic library and command-line tool for importing custom embeddings into Labelbox. This GitHub package is built and maintained by Labelbox. Before you can upload custom embeddings, you'll need to install it.

pip3 install -q 'git+https://github.com/Labelbox/advlib.git'

Step 2: Set up the API key

To upload custom embeddings, you must have a Labelbox API key stored in the environment in one of two ways:

  • LABELBOX_API_KEY - The API key itself
  • LABELBOX_API_KEY_FILE - The path to a file containing the Labelbox API key.
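
If you want to confirm from Python that one of these variables is set before running the tool, a check along the following lines may help. The helper name `resolve_api_key` and the precedence (the key itself first, then the key file) are assumptions for this sketch; ADVLib's own lookup order is not documented here.

```python
import os

def resolve_api_key(env=None):
    """Return the API key from LABELBOX_API_KEY or LABELBOX_API_KEY_FILE."""
    env = os.environ if env is None else env
    key = env.get("LABELBOX_API_KEY")
    if key:
        return key.strip()
    key_file = env.get("LABELBOX_API_KEY_FILE")
    if key_file:
        # The file is expected to contain only the API key.
        with open(key_file, encoding="utf-8") as handle:
            return handle.read().strip()
    raise RuntimeError("Set LABELBOX_API_KEY or LABELBOX_API_KEY_FILE")
```

Calling `resolve_api_key()` with no arguments reads from the current process environment; passing a dictionary makes the helper easy to test.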

Step 3: Create a custom embedding type

Minimum 1000 custom embedding vectors

You must upload at least 1000 feature vectors for similarity search to function in Catalog.

Create a custom embedding type

Use this command to create a custom embedding type:

advtool embeddings create <NAME> <N DIMENSIONS>

| Field | Definition |
| --- | --- |
| <NAME> | The name of your custom embedding type. It can be any string. |
| <N DIMENSIONS> | The dimensionality of your custom embedding type. It must be an integer between 8 and 2048. |

This will output the ID of the newly created custom embedding type.
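
When scripting this step, it can be useful to validate the arguments before invoking the tool. The sketch below only builds the argument list for the command shown above; the helper name `build_create_command` is hypothetical, and the range check mirrors the 8 to 2048 limit stated in the table.

```python
def build_create_command(name, n_dimensions):
    """Validate inputs and return the argv list for `advtool embeddings create`."""
    if not isinstance(name, str):
        raise TypeError("name must be a string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(n_dimensions, bool) or not isinstance(n_dimensions, int):
        raise TypeError("n_dimensions must be an integer")
    if not 8 <= n_dimensions <= 2048:
        raise ValueError("n_dimensions must be between 8 and 2048")
    return ["advtool", "embeddings", "create", name, str(n_dimensions)]

command = build_create_command("my_custom_embedding", 512)
```

The resulting list can be passed to `subprocess.run(command, check=True)` once ADVLib is installed and the API key is configured.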

List existing custom embedding types

After you create your custom embedding type, use this command to check whether it exists.

advtool embeddings list

Create the payload for custom embeddings

The payload should be an .ndjson file in the following format. Each line corresponds to one custom embedding vector on one data row.

{"id": <DATA ROW ID>, "vector": [some floats]}

| Field | Description |
| --- | --- |
| <DATA ROW ID> | ID of the data row. |
| [some floats] | The custom embedding vector. It must have the number of dimensions specified in the custom embedding type (between 8 and 2048). |

Here is an example .ndjson file.

{"id": "clabk7ly90gmg076ag72l44c9", "vector": [2.58, -7.05, -4.01, -20.93, 11.36, -13.46, -0.055, 13.8]}
{"id": "clabk7lzg0ifs07b50zqs0btq", "vector": [0.05, 16.29, -16.11, -8.05, -2.67, -11.53, -4.52, -0.60]}
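
One way to produce such a file from Python is sketched below, using only the standard library. The helper names `write_payload` and `count_payload_lines` and the file name are hypothetical; the data row IDs and vectors are taken from the example above. Each record is written as one JSON object per line, with no separators between lines.

```python
import json

def write_payload(path, vectors, n_dimensions):
    """Write {data_row_id: vector} as NDJSON, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for data_row_id, vector in vectors.items():
            if len(vector) != n_dimensions:
                raise ValueError(
                    f"{data_row_id}: expected {n_dimensions} dimensions, got {len(vector)}"
                )
            record = {"id": data_row_id, "vector": [float(v) for v in vector]}
            handle.write(json.dumps(record) + "\n")

def count_payload_lines(path, n_dimensions):
    """Parse the file back and return the number of valid embedding lines."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            record = json.loads(line)  # raises if the line is not valid JSON
            if len(record["vector"]) != n_dimensions:
                raise ValueError(f"line {line_number}: wrong number of dimensions")
            count += 1
    return count

# The two 8-dimensional vectors from the example above.
example_vectors = {
    "clabk7ly90gmg076ag72l44c9": [2.58, -7.05, -4.01, -20.93, 11.36, -13.46, -0.055, 13.8],
    "clabk7lzg0ifs07b50zqs0btq": [0.05, 16.29, -16.11, -8.05, -2.67, -11.53, -4.52, -0.60],
}
```

For example, `write_payload("embeddings.ndjson", example_vectors, 8)` followed by `count_payload_lines("embeddings.ndjson", 8)` lets you confirm the line count before importing, which is handy given the 1000-vector minimum noted earlier.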

Upload the payload to Labelbox

advtool embeddings import <EMB ID> <NDJSON FILE>

| Field | Description |
| --- | --- |
| <EMB ID> | Embedding ID |
| <NDJSON FILE> | The .ndjson file that contains the payload |

Count the number of vectors uploaded

You can get a count of the number of vectors uploaded for a specific custom embedding. <EMB ID> is the embedding ID.

advtool embeddings count <EMB ID>

| Field | Description |
| --- | --- |
| <EMB ID> | Embedding ID |

Delete a custom embedding type

You can delete a custom embedding type. <EMB ID> is the embedding ID.

advtool embeddings delete <EMB ID>

Steps 1-3: End-to-end Python tutorial

Check out this end-to-end Python tutorial to see how to upload custom embeddings to Labelbox (Steps 1-7).