Labelbox recommends that you visualize, curate, and debug your training dataset before using it to train your machine learning model. Training datasets that are carefully visualized, curated, and debugged are the most effective at improving model performance. However, these practices are challenging to apply to unstructured data (e.g., images, text, videos, PDFs) because unstructured data cannot be queried with something like SQL.
Labelbox's similarity search tool is designed to help you programmatically identify similar or dissimilar data. You can use this tool to mine data looking for examples of rare assets or edge cases that will dramatically improve your model performance. This similarity search engine gives your team an edge by helping you find high-impact data rows in an ocean of data.
The alternative to using the Labelbox similarity search engine is to build your own similarity search tool. However, building an in-house tool that scales to hundreds of millions of data points and returns results instantaneously in just one click is difficult for even the most advanced machine learning teams.
We recommend that you use the native similarity search engine within our Catalog product.
Similarity search is powered by vector embeddings. A vector embedding is a numerical representation of a piece of data (e.g., an image, text, document, or video) that translates high-dimensional data into a lower-dimensional space.
There are many ways to create vector embeddings for data, but generating them with neural networks is the most common and effective approach. Most neural networks are designed to produce structured outputs like bounding boxes or classifications. To produce that final prediction, however, a model passes the input through a series of internal states. An embedding is obtained by extracting one of these internal states and using it as a representation of the data row. In other words, the neural network acts as a feature extractor: it extracts an embedding vector that contains rich information about the data row.
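The idea of reading an embedding from a model's internal state can be sketched with a toy network. This is only an illustration: the weights, layer sizes, and the `forward` function are hypothetical and have nothing to do with the models Labelbox actually runs.

```python
import math

# Hypothetical fixed weights for a tiny two-layer network (illustration only).
W_HIDDEN = [[0.5, -0.2, 0.1],
            [0.3, 0.8, -0.5]]   # 2 hidden units, 3 input features
W_OUT = [[1.0, -1.0]]           # 1 output unit, 2 hidden units


def dense(weights, inputs):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def forward(features):
    """Return (prediction, embedding) for one data row."""
    # Internal state of the network: this is what we keep as the embedding.
    hidden = [math.tanh(v) for v in dense(W_HIDDEN, features)]
    # Final structured output (here, a single probability-like score).
    logit = dense(W_OUT, hidden)[0]
    prediction = 1.0 / (1.0 + math.exp(-logit))
    return prediction, hidden


prediction, embedding = forward([1.0, 2.0, 3.0])
print("prediction:", prediction)
print("embedding:", embedding)
```

The prediction is discarded for similarity search purposes; only the hidden vector is stored and compared across data rows.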
The data rows that are used as input in a similarity search are called anchors. When you search for similar data rows, Labelbox surfaces and returns the data rows whose embeddings are closest to those of anchor data rows. Labelbox uses Manhattan distance (also known as L1 distance) to measure this.
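As a rough sketch of how anchors and Manhattan distance fit together, the snippet below ranks a handful of data rows by their distance to a set of anchor embeddings. The row IDs, the two-dimensional embeddings, and the rule for combining several anchors (distance to the closest anchor) are assumptions made for illustration, not a description of Labelbox internals.

```python
def manhattan(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_to_anchors(embedding, anchors):
    # Assumption: with several anchors, use the distance to the closest one.
    return min(manhattan(embedding, anchor) for anchor in anchors)


def most_similar(rows, anchors, k):
    """Return the IDs of the k rows whose embeddings are closest to the anchors."""
    ranked = sorted(rows, key=lambda rid: distance_to_anchors(rows[rid], anchors))
    return ranked[:k]


# Hypothetical data rows with tiny 2-D embeddings.
rows = {
    "row-a": [0.0, 0.0],
    "row-b": [0.1, 0.1],
    "row-c": [0.9, 1.0],
    "row-d": [1.0, 1.0],
}
anchors = [rows["row-a"]]
nearest = most_similar(rows, anchors, k=2)
print(nearest)
```

Note that the anchor itself is at distance zero, so it appears first in the ranking. A production system over hundreds of millions of rows would use an approximate nearest-neighbor index rather than a full sort.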
Labelbox automatically computes off-the-shelf embeddings for the data types below. To do so, Labelbox uses neural networks trained on publicly available data. Off-the-shelf embeddings provide a useful starting point to explore your data and to do similarity searches.
Adding your own custom embeddings may improve your data exploration and similarity search experience. Support for uploading custom embeddings is coming soon.
To find an exhaustive list of embeddings for a data row, go to Catalog and open the detailed view of the data row. In the detailed view under Metadata, you will find all embeddings available on the data row.
Data rows that share common characteristics are represented by vectors that are close to each other in the embedding space. There are several ways to do a similarity search within the Catalog product.
In the gallery view of Catalog, you can find similar data rows in just one click. To do this, hover over a thumbnail and an icon will show up in the bottom-right corner of the thumbnail. Click on it to find similar data rows.
Once you've clicked on the similarity search icon, Labelbox will automatically populate the Similar to filter and display data rows that are similar to the anchor data row in the gallery view.
Alternatively, you can select multiple data rows as anchor data rows. To do this, go to the gallery view of Catalog, select multiple data rows, then click Similar to selection. Labelbox will then populate the gallery with similar data rows.
After you click Similar to selection, Labelbox will automatically populate the Similar to filter and display data rows that are similar to the anchor data rows in the gallery view.
While browsing through the results of a similarity search, Labelbox recommends refining the similarity search by adding more anchors. To do so, select the data rows of interest and click Add selection to anchors.
You may add up to 4 anchors per similarity search. This limit will be increased soon.
Once a similarity search filter has been populated, you can visualize the anchors associated with it by clicking on Anchors (n).
To remove one or more anchors from a similarity search, first visualize all anchors (see the previous paragraph), then hover on the thumbnail of the anchor to remove and click on the (—) icon.
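The anchor workflow described above (add anchors, respect the limit of 4, view the count, remove anchors) can be modeled with a small container. The `AnchorSet` class and its methods are hypothetical names used for this sketch only; they are not part of any Labelbox SDK.

```python
MAX_ANCHORS = 4  # limit stated in the text above


class AnchorSet:
    """Hypothetical container for the anchors of one similarity search."""

    def __init__(self, initial=()):
        self._ids = []
        for row_id in initial:
            self.add(row_id)

    def add(self, row_id):
        """Add a data row as an anchor, enforcing the per-search limit."""
        if row_id in self._ids:
            return  # already an anchor, nothing to do
        if len(self._ids) >= MAX_ANCHORS:
            raise ValueError(f"at most {MAX_ANCHORS} anchors per similarity search")
        self._ids.append(row_id)

    def remove(self, row_id):
        """Remove an anchor; raises ValueError if it is not present."""
        self._ids.remove(row_id)

    def ids(self):
        return list(self._ids)

    def __len__(self):
        return len(self._ids)


anchors = AnchorSet(["row-1"])
for row_id in ("row-2", "row-3", "row-4"):
    anchors.add(row_id)          # refine the search with more anchors
anchors.remove("row-2")          # drop an anchor that was not helpful
label = f"Anchors ({len(anchors)})"
print(label)
```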
Similarity search surfaces the data rows whose embeddings are closest to those of the anchor data rows. Closeness is measured using Manhattan distance (also known as L1 distance) and is reported as a similarity score between 0 and 1. The more similar the embeddings, the higher the similarity score.
By default, Labelbox returns data rows with a similarity score between 0.85 and 1. You can customize this range by setting the minimum and maximum values of the similarity search slider.
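To make the score range concrete, here is a minimal sketch of filtering data rows by a similarity score. The text does not say how a Manhattan distance is turned into a score between 0 and 1, so the mapping `1 / (1 + d)` below is purely an assumption chosen for illustration; the row IDs and embeddings are also made up.

```python
def manhattan(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def similarity_score(a, b):
    # Assumption (not from the product): map distance d into (0, 1] as 1 / (1 + d),
    # so identical embeddings score 1 and distant ones approach 0.
    return 1.0 / (1.0 + manhattan(a, b))


def filter_by_score(rows, anchor, low=0.85, high=1.0):
    """Keep rows whose score against the anchor falls within [low, high]."""
    scores = {rid: similarity_score(emb, anchor) for rid, emb in rows.items()}
    return {rid: s for rid, s in scores.items() if low <= s <= high}


anchor = [0.2, 0.4]
rows = {
    "row-near": [0.25, 0.45],
    "row-mid": [0.4, 0.6],
    "row-far": [0.9, 0.1],
}
default_hits = filter_by_score(rows, anchor)           # default 0.85 to 1 range
wider_hits = filter_by_score(rows, anchor, low=0.6)    # slider minimum lowered
print(sorted(default_hits), sorted(wider_hits))
```

Lowering the minimum widens the result set, which mirrors moving the lower handle of the slider.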
Similarity search relies on a choice of embedding. You can decide which embedding to use for similarity search. To do this, select an embedding from the dropdown next to the Similar to filter.
All the off-the-shelf embeddings and the custom embeddings (coming soon) can be used to power the similarity search.
Currently, up to 4 anchors are supported per similarity search. Labelbox will be increasing this limit soon.
Some data types have no off-the-shelf embeddings (see table here). For these data types, users need to upload custom embeddings (coming soon) to conduct similarity searches.
Since similarity search is a filter in Catalog, users can combine it with other types of filters (e.g., filters on metadata, annotations, datasets, or projects).
After populating filters in Catalog, you can save these filters as a slice of data. When you save a filter as a slice, you will not need to populate the same filters over and over again. Also, slices are dynamic, so any new incoming data row in Catalog will show up in the relevant slices.
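One way to picture combined filters and dynamic slices is to treat a slice as a saved list of filter predicates that is re-evaluated whenever it is read. The `Slice` class, the filter helpers, and the field names below are hypothetical and only illustrate the behavior described above.

```python
class Slice:
    """Hypothetical saved set of filters, evaluated lazily against the catalog."""

    def __init__(self, name, filters):
        self.name = name
        self.filters = list(filters)

    def rows(self, catalog):
        # Re-run every filter on the current catalog, so new rows are picked up.
        return [row["id"] for row in catalog if all(f(row) for f in self.filters)]


def in_dataset(name):
    """Filter on the dataset a data row belongs to."""
    return lambda row: row["dataset"] == name


def min_similarity(threshold):
    """Filter on a precomputed similarity score (assumed to be stored per row)."""
    return lambda row: row["similarity"] >= threshold


catalog = [
    {"id": "row-1", "dataset": "street-scenes", "similarity": 0.93},
    {"id": "row-2", "dataset": "street-scenes", "similarity": 0.40},
    {"id": "row-3", "dataset": "indoor", "similarity": 0.95},
]

# Combine a similarity filter with a dataset filter and save them as a slice.
rare_cases = Slice("rare-cases", [in_dataset("street-scenes"), min_similarity(0.85)])
before = rare_cases.rows(catalog)

# A new data row arrives; the slice reflects it without re-entering filters.
catalog.append({"id": "row-4", "dataset": "street-scenes", "similarity": 0.90})
after = rare_cases.rows(catalog)
print(before, after)
```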
Read through the following resources to learn how to take action on the filtered data.