Shows how to upload custom embeddings to improve similarity search.
How to upload custom embeddings
Custom embeddings improve data exploration by improving similarity search.
You can upload up to ten (10) custom embedding types per workspace on any data type.
Use this to experiment with different embeddings to improve data selection.
Before you start
This example requires the following libraries:
# Starting from SDK version 3.69, custom embeddings are now supported.
import labelbox as lb
import numpy as np
import json
import uuid
import random
Replace API key
API_KEY = ""
client = lb.Client(API_KEY)
Select data rows
First, we need to fetch data rows from a Labelbox dataset.
To improve similarity search, you need to upload custom embeddings to at least 1,000 data rows.
dataset = client.get_dataset("<DATASET-ID>")
export_task = dataset.export()
export_task.wait_till_done()
data_rows = []
def json_stream_handler(output: lb.BufferedJsonConverterOutput):
data_row = output.json
data_rows.append(data_row)
if export_task.has_errors():
export_task.get_buffered_stream(
stream_type=lb.StreamType.ERRORS
).start(stream_handler=lambda error: print(error))
if export_task.has_result():
export_json = export_task.get_buffered_stream(
stream_type=lb.StreamType.RESULT
).start(stream_handler=json_stream_handler)
We need to extract the data row ID and the row data (asset URL)
data_row_dict = [{"data_row_id": dr["data_row"]["id"]} for dr in data_rows]
data_row_dict = data_row_dict[:1000] # keep the first 1000 examples for the sake of this demo
Create custom embedding payload
To prepare the data:
-
Generate random vectors for embeddings (max:
2048
dimensions)nb_data_rows = len(data_row_dict) print("Number of data rows: ", nb_data_rows) # Labelbox supports custom embedding vectors of up to 2048 dimensions custom_embeddings = [list(np.random.random(2048)) for _ in range(nb_data_rows)]
-
List custom embeddings in your Labelbox workspace:
embeddings = client.get_embeddings()
-
Choose an existing embedding type or create a new one
A unique custom embedding name is required as an argument for this method.# Name of the custom embedding must be unique embedding = client.create_embedding("my_custom_embedding_2048_dimensions", 2048)
-
Create payload
-
The payload should encompass the
key
(data row id or global key) and the new embedding vector data. Note that thedataset.upsert_data_rows()
operation will only update the values you pass in the payload; all other existing row data will not be modified.payload = [] for data_row_dict, custom_embedding in zip(data_row_dict,custom_embeddings): payload.append({"key": lb.UniqueId(data_row_dict['data_row_id']), "embeddings": [{"embedding_id": embedding.id, "vector": custom_embedding}]}) print('payload', len(payload),payload[:1])
Upload payload
-
Upsert data rows with custom embeddings
task = dataset.upsert_data_rows(payload) task.wait_till_done() print(task.errors) print(task.status)
-
Get the count of imported vectors for a custom embedding type
An updated count can take a few minutes, depending on the number of data rows associated with the embedding type.
count = embedding.get_imported_vector_count()
-
Delete custom embedding type.
embedding.delete()
Upload custom embeddings during data row creation
-
Create a dataset
# Create a dataset dataset_new = client.create_dataset(name="data_rows_with_embeddings")
-
Fetch an embedding type and create dummy vector data.
embedding = client.get_embedding_by_name("my_custom_embedding_2048_dimensions") vector = [random.uniform(1.0, 2.0) for _ in range(embedding.dims)]
-
Upload data rows with embeddings.
uploads = [] # Generate data rows for i in range(1,9): uploads.append({ "row_data": f"https://storage.googleapis.com/labelbox-datasets/People_Clothing_Segmentation/jpeg_images/IMAGES/img_000{i}.jpeg", "global_key": "TEST-ID-%id" % uuid.uuid1(), "embeddings": [{ "embedding_id": embedding.id, "vector": vector }] }) task1 = dataset_new.create_data_rows(uploads) task1.wait_till_done() print("ERRORS: " , task1.errors) print("RESULTS:" , task1.result)