Natural language search

A guide for using natural language search in Catalog.

You can use Labelbox's natural language search to surface data rows that match any expression you provide. It helps your team find high-impact data rows (e.g., rare data or edge cases) in an ocean of data.

We recommend that you use the native natural language search engine within our Catalog product.

How NL search works

Natural language search is powered by vector embeddings. A vector embedding is a numerical representation of a piece of data (e.g., an image, text, document, or video) that translates the raw data into a lower-dimensional space.

Recent advances in the machine learning field enable some neural networks (e.g., CLIP by OpenAI) to recognize a wide variety of visual concepts in images and associate them with keywords.

You can now surface images in Catalog by describing them in natural language. For example, type in "a photo of birds in the sunset" to surface images of birds in the sunset.
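
The idea can be illustrated with a minimal sketch: the prompt and each data row are represented as vectors, and rows are ranked by how closely their vector points in the same direction as the prompt's. The three-dimensional vectors and file names below are made-up toy values (real CLIP embeddings have 512 dimensions), and the helper names are hypothetical, not part of any Labelbox API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_prompt(prompt_embedding, row_embeddings):
    """Return (row_id, score) pairs sorted from most to least similar."""
    scored = [
        (row_id, cosine_similarity(prompt_embedding, embedding))
        for row_id, embedding in row_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed values for illustration only).
prompt = [0.9, 0.1, 0.0]
rows = {
    "birds_sunset.jpg": [0.8, 0.2, 0.1],
    "city_street.jpg": [0.1, 0.9, 0.3],
    "bird_closeup.jpg": [0.6, 0.1, 0.7],
}

ranking = rank_by_prompt(prompt, rows)
for row_id, score in ranking:
    print(f"{row_id}: {score:.3f}")
```

In this toy data, the row whose vector is most aligned with the prompt vector comes first, which mirrors how the closest matches are surfaced at the top of the search results.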

Character limit

You can enter a maximum of 100 characters or a maximum of 10 words when using the Find text filter.

Supported embeddings

Labelbox automatically computes off-the-shelf embeddings (CLIP for visual media, all-mpnet-base-v2 for text and HTML) for the media types listed below. Custom embeddings are supported for every data modality.

| Asset type | Off-the-shelf | Custom |
| --- | --- | --- |
| Image | CLIP-ViT-B-32 (512 dimensions) | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Video | - | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Text | all-mpnet-base-v2 (768 dimensions) | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| HTML | all-mpnet-base-v2 (768 dimensions) | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Document | CLIP-ViT-B-32 (512 dimensions) | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Tiled imagery | CLIP-ViT-B-32 (512 dimensions) | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| DICOM | - | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Audio | - | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
| Conversational | - | Up to 2048 dimensions per embedding; up to 100 custom embeddings per workspace. |
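
Before uploading a custom embedding, it may help to check it against the limits in the table. The sketch below is a local pre-flight check only; the function name and the `existing_embedding_count` parameter are hypothetical and do not correspond to a Labelbox SDK call.

```python
MAX_CUSTOM_DIMENSIONS = 2048   # per-embedding limit from the table
MAX_CUSTOM_EMBEDDINGS = 100    # per-workspace limit from the table

def check_custom_embedding(vector, existing_embedding_count):
    """Return a list of problems; an empty list means the vector fits the limits.

    existing_embedding_count is the number of custom embeddings already
    defined in the workspace (a value you would have to look up yourself).
    """
    problems = []
    if len(vector) == 0:
        problems.append("embedding is empty")
    if len(vector) > MAX_CUSTOM_DIMENSIONS:
        problems.append(
            f"{len(vector)} dimensions exceeds the limit of {MAX_CUSTOM_DIMENSIONS}"
        )
    if existing_embedding_count >= MAX_CUSTOM_EMBEDDINGS:
        problems.append(
            f"workspace already has {existing_embedding_count} custom embeddings"
        )
    return problems

print(check_custom_embedding([0.0] * 512, existing_embedding_count=3))
print(check_custom_embedding([0.0] * 4096, existing_embedding_count=3))
```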

How to search data using NL

In the gallery view of Catalog, add a Natural Language filter. Then, enter a description of the data you are looking for. The description must have at least 3 characters and at most 10 words.
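
If you generate prompts programmatically, a small check like the following can catch descriptions that fall outside these limits before you paste them into the filter. This is a sketch under assumptions: words are counted by splitting on whitespace, and the 100-character maximum from the character limit note is assumed to apply here as well. The function name is hypothetical.

```python
MIN_CHARACTERS = 3
MAX_WORDS = 10
MAX_CHARACTERS = 100  # taken from the character limit note; assumed to apply here

def validate_prompt(prompt):
    """Return a list of limit violations; an empty list means the prompt is OK."""
    text = prompt.strip()
    errors = []
    if len(text) < MIN_CHARACTERS:
        errors.append(f"needs at least {MIN_CHARACTERS} characters")
    if len(text) > MAX_CHARACTERS:
        errors.append(f"exceeds {MAX_CHARACTERS} characters")
    # Assumption: a "word" is any run of non-whitespace characters.
    if len(text.split()) > MAX_WORDS:
        errors.append(f"exceeds {MAX_WORDS} words")
    return errors

print(validate_prompt("a photo of birds in the sunset"))
print(validate_prompt("ab"))
```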

Prompt engineering

Prompt engineering is the practice of trying several prompts until you find one that works well. Labelbox recommends trying several natural language descriptions (or prompts) until the natural language search surfaces the data you are looking for. Users have reported that small tweaks to the prompt can help return more relevant data.

Set the score range

Natural language search surfaces the data rows whose embeddings are closest to the embedding of the prompt. Closeness is measured using cosine similarity and reported as a natural language score between 0 and 1. The more similar the embeddings, the higher the natural language score.

By default, Labelbox returns data rows with a natural language score between 0.5 and 1. You can customize this range by setting the minimum and maximum values of the natural language search slider.
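
The effect of the slider can be sketched as a filter over rows that already have a score. The row IDs and scores below are made up, the function name is hypothetical, and treating both ends of the range as inclusive is an assumption.

```python
DEFAULT_MIN_SCORE = 0.5
DEFAULT_MAX_SCORE = 1.0

def filter_by_score(scored_rows, min_score=DEFAULT_MIN_SCORE, max_score=DEFAULT_MAX_SCORE):
    """Keep (row_id, score) pairs whose score lies in [min_score, max_score]."""
    if not 0.0 <= min_score <= max_score <= 1.0:
        raise ValueError("expected 0 <= min_score <= max_score <= 1")
    return [(row_id, score) for row_id, score in scored_rows
            if min_score <= score <= max_score]

# Toy scores (assumed values for illustration only).
scored = [("row-a", 0.92), ("row-b", 0.61), ("row-c", 0.48), ("row-d", 0.50)]

default_hits = filter_by_score(scored)                               # default range
narrow_hits = filter_by_score(scored, min_score=0.6, max_score=0.9)  # custom range
print(default_hits)
print(narrow_hits)
```

Raising the minimum trims weaker matches; lowering the maximum can be useful for inspecting borderline rows without the strongest matches crowding the view.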

Customize the results of the natural language search by specifying the range of scores

Combine natural language search & other searches

You can combine natural language search with other filters in Catalog. Some filters are best used for targeting unstructured data and others are best for targeting structured data.

Combine natural language search with the following filters to target data rows by structured data:

Combine natural language search with the following filters to search unstructured data:

Natural language search can be used in conjunction with other filters to surface high-impact data
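
Conceptually, combining filters narrows the result set to rows that satisfy every condition. The sketch below assumes filters are combined with a logical AND and uses a hypothetical metadata field (`camera`) as the structured filter; the row layout and helper name are illustrative, not Labelbox's data model.

```python
def matches_all(row, predicates):
    """True when the row satisfies every filter predicate."""
    return all(predicate(row) for predicate in predicates)

# Toy data rows (assumed values for illustration only).
rows = [
    {"id": "row-a", "nl_score": 0.83, "metadata": {"camera": "front"}},
    {"id": "row-b", "nl_score": 0.77, "metadata": {"camera": "rear"}},
    {"id": "row-c", "nl_score": 0.31, "metadata": {"camera": "front"}},
]

filters = [
    # Natural language score within the default range.
    lambda row: 0.5 <= row["nl_score"] <= 1.0,
    # Hypothetical structured filter on a metadata field.
    lambda row: row["metadata"].get("camera") == "front",
]

selected = [row["id"] for row in rows if matches_all(row, filters)]
print(selected)
```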

Automate data curation with slices

After populating filters in Catalog, you can save them as a slice of data. Once the filters are saved as a slice, you do not need to populate the same filters again. Slices are also dynamic: any new data row in Catalog that matches the saved filters shows up in the relevant slices.
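
One way to picture a dynamic slice is as a saved set of filters that is re-evaluated each time it is read, rather than a frozen list of rows. The `Slice` class below is a conceptual model written for this guide, not the Labelbox SDK object, and the rows and scores are toy values.

```python
class Slice:
    """A saved set of filters, re-evaluated against the catalog on every read."""

    def __init__(self, name, predicates):
        self.name = name
        self.predicates = list(predicates)

    def rows(self, catalog):
        """Return the rows that currently satisfy every saved filter."""
        return [row for row in catalog
                if all(predicate(row) for predicate in self.predicates)]

# Toy catalog (assumed values for illustration only).
catalog = [
    {"id": "row-a", "nl_score": 0.83},
    {"id": "row-b", "nl_score": 0.31},
]

sunset_birds = Slice("sunset-birds", [lambda row: row["nl_score"] >= 0.5])
before = [row["id"] for row in sunset_birds.rows(catalog)]

# A new data row arrives; the slice picks it up without being redefined.
catalog.append({"id": "row-c", "nl_score": 0.91})
after = [row["id"] for row in sunset_birds.rows(catalog)]

print(before)
print(after)
```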

Read through the following resources to learn how to take action on the filtered data.