# Embeddings/similarity (beta)

Python Tutorial | Github | Google Colab
---|---|---
Embeddings setup | |

Unlike structured data, unstructured data is inherently more challenging to explore and interpret. Because it is not queryable with SQL, it is difficult to transform it into interpretable metrics and visualizations. Labelbox allows you to add embeddings to your Data Rows so you can easily query and explore your unstructured data.

Embeddings are useful for developing a more holistic understanding of your training data. This feature allows you to leverage the inferences from your ML model to automatically generate groups of similar data. You can use this tool to more effectively select data for your Model Diagnostics workflow.

# How it works

An embedding is a numerical representation of a piece of data (e.g., an image, document, video) that serves to translate high-dimensional data into a low-dimensional space. When you add embeddings to your Data Rows, Labelbox will translate those Data Rows into points in an embedding space and present to you clusters of semantically similar points (i.e., groups of similar Data Rows). The similarity between a set of Data Rows is determined by the position (distance and direction) of their points in the embedding space. Points that are semantically similar in the vector space will form clusters, indicating that they contain some common characteristics.
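To make "distance in the embedding space" concrete, here is a minimal sketch using NumPy. The vectors and names (`cat_1`, `cat_2`, `airplane`) are made up for illustration; real embeddings typically have hundreds of dimensions.

```python
import numpy as np

# Toy 4-dimensional embeddings for three hypothetical Data Rows.
cat_1 = np.array([0.9, 0.1, 0.0, 0.2])
cat_2 = np.array([0.8, 0.2, 0.1, 0.3])
airplane = np.array([0.0, 0.9, 0.8, 0.1])

def cosine_similarity(a, b):
    """Similarity based on the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically similar Data Rows sit close together in the embedding space,
# so the two cat vectors score much higher than the cat/airplane pair.
print(cosine_similarity(cat_1, cat_2))
print(cosine_similarity(cat_1, airplane))
```

Points whose vectors score high against each other under a measure like this are what the projector renders as a cluster.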

There are many ways to create embeddings for data. However, generating embeddings via neural networks is the most common and effective approach.

Most neural networks are designed to produce structured outputs like bounding boxes or classifications. However, before producing that final prediction, a model passes through a series of internal states. Embeddings work by extracting this internal state information and using it as a representation of the datum, rather than using the final prediction.
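The idea can be sketched with a toy two-layer network in NumPy (random weights standing in for a trained model, not any actual production architecture): the hidden activation computed on the way to the classification is the embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy classifier: 8-d input -> 4-d hidden state -> 2-class output.
# Weights are random here; a real model's weights come from training.
w_hidden = rng.normal(size=(8, 4))
w_output = rng.normal(size=(4, 2))

def forward(x):
    """Return both the final prediction and the internal state."""
    hidden = np.tanh(x @ w_hidden)   # internal state of the network
    logits = hidden @ w_output       # structured output (class scores)
    return logits, hidden

x = rng.normal(size=8)               # a datum (e.g. flattened features)
logits, embedding = forward(x)

# The 4-d hidden activation, not the 2-d prediction, is kept as the embedding.
print(embedding.shape)  # (4,)
print(logits.shape)     # (2,)
```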

Labelbox uses neural networks trained on publicly available data to compute embeddings for many of the data types uploaded to the platform. While these off-the-shelf models provide a useful starting point for exploring your data, you will find that the quality of exploration significantly improves when you upload embeddings of your own.

# Embedding projector

The embedding projector is a tool for uncovering patterns in unstructured data that can be used to diagnose systemic model and labeling errors. The embedding projector works for Data Rows that have embeddings. Most embeddings have many more than 2 or 3 dimensions, making them impossible to visualize directly. The projector applies dimensionality reduction algorithms so you can explore embeddings interactively in 2D. You can navigate to the embedding projector by using the icons in the top right.

We support two algorithms for dimensionality reduction, PCA and UMAP, which you can toggle between. PCA is much faster but does not attempt to find clusters in the data; UMAP is better at surfacing regions of similar points. The Sphereize data option normalizes the data by subtracting the mean and dividing by the norm.

Method | Speed | Clustering
---|---|---
PCA: Principal Component Analysis | Fast | No
UMAP: Uniform Manifold Approximation and Projection | Slow | Yes
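As a rough sketch of the two preprocessing steps described above, here is a plain-NumPy version: "sphereize" as subtract-the-mean-then-normalize, and a minimal SVD-based PCA projecting down to 2D. This is an illustration of the math, not Labelbox's implementation, and the data is random.

```python
import numpy as np

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 64))   # 100 Data Rows, 64-d embeddings

# "Sphereize data": subtract the mean, then divide each vector by its norm.
centered = embeddings - embeddings.mean(axis=0)
sphereized = centered / np.linalg.norm(centered, axis=1, keepdims=True)

# PCA to 2D: project onto the top-2 right singular vectors of the data.
_, _, vt = np.linalg.svd(sphereized, full_matrices=False)
points_2d = sphereized @ vt[:2].T

print(points_2d.shape)  # (100, 2) -- one point per Data Row in the plot
```

UMAP would replace the projection step with a slower, neighborhood-preserving optimization (e.g. via the umap-learn package), which is why it is better at revealing local clusters.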

Use the selection tool to isolate Data Rows for further investigation. To drill further into a set of Data Rows, click the selected Data Rows button in the top right and click Filter to select. This will re-run your chosen dimensionality reduction algorithm on the subset of Data Rows to surface local clusters in the data.
