# Custom embeddings

> Shows how to upload custom embeddings to improve similarity search.

<CardGroup cols={2}>
  <Card title="Open in Colab" icon="infinity" iconType="solid" horizontal href="https://colab.research.google.com/github/Labelbox/labelbox-notebooks/blob/main/basics/custom_embeddings.ipynb" />

  <Card title="GitHub" icon="github" iconType="solid" horizontal href="https://github.com/Labelbox/labelbox-notebooks/blob/main/basics/custom_embeddings.ipynb" />
</CardGroup>

## How to upload custom embeddings

Custom embeddings make data exploration more effective by improving [similarity search](/docs/similarity).

You can upload up to ten (10) custom embedding types per workspace, on any data type.

This lets you experiment with different embeddings to improve data selection.

## Before you start

This example requires the following libraries:

<CodeGroup>
  ```python Install theme={null}
  # Custom embeddings are supported starting with SDK version 3.69.
  import labelbox as lb
  import numpy as np
  import json
  import uuid
  import random
  ```
</CodeGroup>

## Replace API key

<CodeGroup>
  ```python Python theme={null}
  API_KEY = ""
  client = lb.Client(API_KEY)
  ```
</CodeGroup>

## Select data rows

First, we need to fetch data rows from a Labelbox dataset.

To improve similarity search, you need to upload custom embeddings to at least 1,000 data rows.

<CodeGroup>
  ```python Python theme={null}
  dataset = client.get_dataset("<DATASET-ID>")

  # Export the dataset and wait for the export task to finish
  export_task = dataset.export()
  export_task.wait_till_done()

  # Optional: stream the export through a callback to inspect each record
  def json_stream_handler(output: lb.BufferedJsonConverterOutput):
      print(output.json)

  export_task.get_buffered_stream(stream_type=lb.StreamType.RESULT).start(
      stream_handler=json_stream_handler)

  # Collect all exported data rows into a list
  data_rows = [data_row.json for data_row in export_task.get_buffered_stream()]
  ```
</CodeGroup>

Extract the data row ID of each exported data row:

<CodeGroup>
  ```python Python theme={null}
  data_row_dict = [{"data_row_id": dr["data_row"]["id"]} for dr in data_rows]
  data_row_dict = data_row_dict[:1000] # keep the first 1000 examples for the sake of this demo
  ```
</CodeGroup>
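
The extraction step above can be sketched as a small helper that skips malformed records and checks the 1,000-row threshold mentioned earlier. This is a minimal sketch, not part of the SDK: `extract_data_row_ids`, `has_enough_rows`, and the sample records are hypothetical, and the only assumption about the export format is the `data_row.id` field already used in this guide.

```python
from typing import Any, Dict, List

MIN_ROWS_FOR_SIMILARITY = 1000  # threshold stated in this guide


def extract_data_row_ids(records: List[Dict[str, Any]]) -> List[str]:
    """Return the data row ID of every export record that has one."""
    ids = []
    for record in records:
        data_row = record.get("data_row") or {}
        if "id" in data_row:
            ids.append(data_row["id"])
    return ids


def has_enough_rows(ids: List[str], minimum: int = MIN_ROWS_FOR_SIMILARITY) -> bool:
    """Check whether the number of distinct IDs reaches the minimum."""
    return len(set(ids)) >= minimum


# Hypothetical records standing in for the streamed export output
sample_records = [
    {"data_row": {"id": "row-a"}},
    {"data_row": {"id": "row-b"}},
    {"metadata": "record without a data_row entry"},
]
sample_ids = extract_data_row_ids(sample_records)
print(sample_ids)
print(has_enough_rows(sample_ids))
```

With real data, you would pass the list collected from the export stream instead of `sample_records`.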

## Create custom embedding payload

To prepare the data:

<Steps>
  <Step>
    Generate random vectors for embeddings (max: `2048` dimensions)

    <CodeGroup>
      ```python Python theme={null}
      nb_data_rows = len(data_row_dict)
      print("Number of data rows: ", nb_data_rows)
      # Labelbox supports custom embedding vectors of up to 2048 dimensions
      custom_embeddings = [list(np.random.random(2048)) for _ in range(nb_data_rows)]
      ```
    </CodeGroup>
  </Step>

  <Step>
    List custom embeddings in your Labelbox workspace:

    <CodeGroup>
      ```python Python theme={null}
      embeddings = client.get_embeddings()
      ```
    </CodeGroup>
  </Step>

  <Step>
    Choose an existing embedding type or create a new one. Creating an embedding type requires a unique name as an argument.

    <CodeGroup>
      ```python Python theme={null}
      # Name of the custom embedding must be unique
      embedding = client.create_embedding("my_custom_embedding_2048_dimensions", 2048)
      ```
    </CodeGroup>
  </Step>

  <Step>
    Create payload

    * The payload must include the `key` (data row ID or global key) and the new embedding vector. Note that `dataset.upsert_data_rows()` only updates the values you pass in the payload; all other existing row data is left unchanged.

          <CodeGroup>
            ```python Python theme={null}
            payload = []
            for row, custom_embedding in zip(data_row_dict, custom_embeddings):
                payload.append({
                    "key": lb.UniqueId(row["data_row_id"]),
                    "embeddings": [{"embedding_id": embedding.id, "vector": custom_embedding}],
                })

            # Print the payload size and the first entry as a sanity check
            print("payload", len(payload), payload[:1])
            ```
          </CodeGroup>
  </Step>
</Steps>
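
Because embedding names must be unique, rerunning the steps above with the same name may fail at the creation step. One way to handle the "choose an existing embedding type or create a new one" decision is sketched below. The helper `get_or_create_embedding` is hypothetical and not part of the SDK. It relies only on `client.get_embeddings()` and `client.create_embedding(name, dims)` from this guide, and it assumes each returned embedding exposes `name` and `dims` attributes.

```python
from typing import Any


def get_or_create_embedding(client: Any, name: str, dims: int) -> Any:
    """Reuse the embedding type called `name`, or create it if it is missing.

    Assumes embeddings returned by client.get_embeddings() have
    `name` and `dims` attributes.
    """
    for existing in client.get_embeddings():
        if existing.name == name:
            # Refuse to reuse a type whose dimensionality does not match
            if existing.dims != dims:
                raise ValueError(
                    f"Embedding '{name}' has {existing.dims} dimensions, "
                    f"expected {dims}"
                )
            return existing
    # No embedding type with this name exists yet
    return client.create_embedding(name, dims)
```

A call such as `get_or_create_embedding(client, "my_custom_embedding_2048_dimensions", 2048)` would then stand in for the direct `client.create_embedding(...)` call.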

## Upload payload

<Steps>
  <Step>
    Upsert data rows with custom embeddings

    <CodeGroup>
      ```python Python theme={null}
      task = dataset.upsert_data_rows(payload)
      task.wait_till_done()
      print(task.errors)
      print(task.status)
      ```
    </CodeGroup>
  </Step>

  <Step>
    Get the count of imported vectors for a custom embedding type

    An updated count can take a few minutes, depending on the number of data rows associated with the embedding type.

    <CodeGroup>
      ```python Python theme={null}
      count = embedding.get_imported_vector_count()
      ```
    </CodeGroup>
  </Step>

  <Step>
    Delete custom embedding type.

    <CodeGroup>
      ```python Python theme={null}
      embedding.delete()
      ```
    </CodeGroup>
  </Step>
</Steps>
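
Since the imported vector count can take a few minutes to update, a script that needs the final count may have to poll for it before deleting the embedding type. The following is a minimal sketch of such a loop. The helper `wait_for_vector_count` and its default timeout and interval are hypothetical choices, not SDK features. It accepts any zero-argument callable, such as `embedding.get_imported_vector_count`.

```python
import time
from typing import Callable


def wait_for_vector_count(
    get_count: Callable[[], int],
    expected: int,
    timeout_s: float = 600.0,
    interval_s: float = 15.0,
) -> int:
    """Poll get_count() until it reaches `expected` or the timeout elapses.

    Returns the last count observed, which may be lower than `expected`
    if the timeout was reached first.
    """
    deadline = time.monotonic() + timeout_s
    count = get_count()
    while count < expected and time.monotonic() < deadline:
        time.sleep(interval_s)
        count = get_count()
    return count
```

For example, `wait_for_vector_count(embedding.get_imported_vector_count, len(payload))` would return once the count reaches the payload size or after ten minutes, whichever comes first.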

## Upload custom embeddings during data row creation

<Steps>
  <Step>
    Create a dataset

    <CodeGroup>
      ```python Python theme={null}
      # Create a dataset
      dataset_new = client.create_dataset(name="data_rows_with_embeddings")
      ```
    </CodeGroup>
  </Step>

  <Step>
    Fetch an embedding type and create dummy vector data.

    <CodeGroup>
      ```python Python theme={null}
      embedding = client.get_embedding_by_name("my_custom_embedding_2048_dimensions")
      vector = [random.uniform(1.0, 2.0) for _ in range(embedding.dims)]
      ```
    </CodeGroup>
  </Step>

  <Step>
    Upload data rows with embeddings.

    <CodeGroup>
      ```python Python theme={null}
      uploads = []
      # Generate data rows
      for i in range(1,9):
          uploads.append({
              "row_data":  f"https://storage.googleapis.com/labelbox-datasets/People_Clothing_Segmentation/jpeg_images/IMAGES/img_000{i}.jpeg",
              "global_key": f"TEST-ID-{uuid.uuid1()}",
              "embeddings": [{
                          "embedding_id": embedding.id,
                          "vector": vector
                      }]
          })

      task1 = dataset_new.create_data_rows(uploads)
      task1.wait_till_done()
      print("ERRORS: " , task1.errors)
      print("RESULTS:" , task1.result)
      ```
    </CodeGroup>
  </Step>
</Steps>
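
The example above attaches the same dummy vector to every data row. With real embeddings, each row would carry its own vector. A minimal sketch of building the upload list that way is shown below. The helper `build_uploads`, the example URLs, and the placeholder embedding ID are hypothetical, and the random vectors only stand in for real model output.

```python
import random
import uuid
from typing import Any, Dict, List


def build_uploads(
    asset_urls: List[str], embedding_id: str, dims: int
) -> List[Dict[str, Any]]:
    """Build one upload dict per asset, each with its own placeholder vector."""
    uploads = []
    for url in asset_urls:
        # Placeholder vector; replace with the embedding computed for this asset
        vector = [random.uniform(1.0, 2.0) for _ in range(dims)]
        uploads.append({
            "row_data": url,
            "global_key": f"TEST-ID-{uuid.uuid4()}",
            "embeddings": [{"embedding_id": embedding_id, "vector": vector}],
        })
    return uploads


# Hypothetical values for illustration only
demo = build_uploads(
    ["https://example.com/a.jpeg", "https://example.com/b.jpeg"],
    "embedding-id",
    8,
)
print(len(demo), len(demo[0]["embeddings"][0]["vector"]))
```

In the workflow above, the result of `build_uploads(urls, embedding.id, embedding.dims)` could be passed to `dataset_new.create_data_rows(...)`.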
