Similarity search

Explore data with embeddings

Unstructured data is harder to explore and interpret than structured data. Because you cannot query it with a language like SQL, it is difficult to turn it into interpretable metrics.

With embeddings, you can easily query and explore your unstructured data. Embeddings are useful for developing a more holistic understanding of your training data.

Labelbox uses embeddings to power Similarity Search. Similarity Search allows you to leverage the inferences from your ML model to automatically generate groups of similar data. You can use Similarity Search to more efficiently select data to improve your model performance.

Labelbox automatically pre-computes embeddings for these data types:

  • Image
  • Text
  • Geospatial tiled


Model embeddings

Upload your own model embeddings for the best experience.

How embeddings work

An embedding is a numerical representation (a vector) of a piece of data, such as an image, document, or video, that maps high-dimensional data into a lower-dimensional space.

When you add embeddings to your Data Rows, Labelbox maps those Data Rows to points in an embedding space and presents clusters of semantically similar points (i.e., groups of similar Data Rows). The similarity between Data Rows is determined by the relative position (distance and direction) of their points in the embedding space. Semantically similar points lie close together and form clusters, indicating that the underlying Data Rows share common characteristics.
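
To make the idea of "position in the embedding space" concrete, here is a minimal sketch of a similarity lookup using cosine similarity, one common way to compare embedding vectors. It is not Labelbox's implementation: the source does not say which metric Similarity Search uses, and the row IDs, the 3-dimensional vectors, and the helper names `cosine_similarity` and `most_similar` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_k=2):
    """Rank every other row by cosine similarity to the query row."""
    query = embeddings[query_id]
    scored = [
        (row_id, cosine_similarity(query, vector))
        for row_id, vector in embeddings.items()
        if row_id != query_id
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings keyed by made-up row IDs (assumption, not real data).
embeddings = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "truck_1": [0.0, 0.2, 0.9],
    "truck_2": [0.1, 0.1, 0.8],
}

neighbors = most_similar("cat_1", embeddings)
for row_id, score in neighbors:
    print(f"{row_id}: {score:.3f}")
```

In this toy data the two "cat" vectors point in nearly the same direction, so `cat_2` ranks first for the query `cat_1`, while the "truck" vectors score much lower. Real embeddings usually have hundreds of dimensions, but the comparison works the same way.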

There are many ways to create embeddings for data, but generating them with neural networks is the most common approach and is generally effective.

Most neural networks are designed to produce structured outputs like bounding boxes or classifications. To produce that final prediction, however, a model passes the input through a series of internal states. An embedding is created by extracting one of these internal states and using it, rather than the final prediction, as the representation of the datum.
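
The following sketch illustrates the difference between a model's final prediction and the internal state that can serve as an embedding. It uses a tiny hand-written network with one hidden layer. The weights, layer sizes, and function names are hypothetical and chosen only for illustration; a real model would be trained and far larger.

```python
import math

# Hypothetical, hand-picked weights for a tiny 4 -> 2 -> 3 network (assumption).
W_HIDDEN = [
    [0.5, -0.2, 0.1, 0.4],
    [-0.3, 0.8, 0.2, -0.1],
]
W_OUTPUT = [
    [1.0, -1.0],
    [-0.5, 0.5],
    [0.2, 0.3],
]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward(features):
    """Return both the hidden activations and the predicted class index."""
    # Internal state: the hidden-layer activations.
    hidden = [math.tanh(v) for v in matvec(W_HIDDEN, features)]
    # Final output: one score per class; the prediction is the top score.
    logits = matvec(W_OUTPUT, hidden)
    predicted_class = max(range(len(logits)), key=logits.__getitem__)
    return hidden, predicted_class


# The hidden activations are kept as the embedding; the class index is the
# structured output the network was designed to produce.
embedding, predicted_class = forward([1.0, 0.0, 2.0, 1.0])
print("embedding:", embedding)
print("predicted class:", predicted_class)
```

The prediction collapses the input to a single label, while the hidden vector retains more information about the input, which is why it is the more useful quantity for comparing data points.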

Labelbox uses neural networks trained on publicly available data to compute embeddings for many of the data types uploaded to the platform. These off-the-shelf models provide a useful starting point for exploring your data, but the quality of exploration typically improves significantly when you upload embeddings from your own model.

Complete Python SDK tutorial
