Similarity functions
Programmatically label data to curate datasets
Similarity Functions are used to programmatically identify similar or dissimilar data to label. While functions are typically less accurate than models or human annotators, they can quickly enrich data to help with curation and exploration.
Create a similarity function
Similarity Functions allow you to label data using embeddings. You can use Similarity Functions to find rare classes or patterns not described by metadata.
- In Catalog, select one or more Data Rows
- Select View similar data rows
- Refine the results by fine-tuning your Data Row selection and pressing Recompute similarity
- Select Create function
This will start a background task to process all of your Data Rows in Catalog.


Creating a new function from Catalog
Note
Once created, functions automatically process Data Rows newly added to Catalog.
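This page does not describe how scores are derived from embeddings. As a rough illustration only, the sketch below assumes a score based on the mean cosine similarity between a candidate embedding and the embeddings of the selected Data Rows, rescaled so that values fall between 0 and 1. The function names and toy vectors are hypothetical and are not part of any product API.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_score(candidate, anchors):
    """Hypothetical score: mean cosine similarity to the anchors, rescaled to [0, 1]."""
    mean = sum(cosine(candidate, a) for a in anchors) / len(anchors)
    return (mean + 1.0) / 2.0

# Toy 2-D embeddings standing in for the selected Data Rows (illustrative values).
anchors = [[1.0, 0.0], [0.9, 0.1]]

near = similarity_score([1.0, 0.05], anchors)  # points the same way as the anchors
far = similarity_score([-1.0, 0.0], anchors)   # points the opposite way
```

In this sketch, `near` lands close to 1 and `far` close to 0. Averaging over several anchors is one reason a varied selection can matter: each selected Data Row pulls the score toward its own context.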
Using the function filter
You can filter Data Rows by the similarity score computed by your function. Scores closer to 1 indicate greater similarity, whereas scores closer to 0 indicate greater dissimilarity.
You can chain multiple similarity function filters together with AND, so that a Data Row must satisfy every filter to appear in the results. You can also combine similarity function filters with all other filter types in Catalog.
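The filtering logic described above can be pictured with a small sketch. It assumes each Data Row carries one precomputed score per function. The row layout, the `filter_rows` helper, and the "boats" function are hypothetical and used only for illustration. Filtering itself is done in the Catalog UI.

```python
def filter_rows(rows, ranges):
    """Keep rows whose score for every named function falls in its range (AND)."""
    return [
        row for row in rows
        if all(low <= row["scores"][name] <= high
               for name, (low, high) in ranges.items())
    ]

# Illustrative rows with made-up scores for two hypothetical functions.
rows = [
    {"id": "a", "scores": {"coastal images": 0.92, "boats": 0.80}},
    {"id": "b", "scores": {"coastal images": 0.88, "boats": 0.10}},
    {"id": "c", "scores": {"coastal images": 0.05, "boats": 0.75}},
]

# A high score range surfaces similar rows; a low range surfaces dissimilar ones.
similar = filter_rows(rows, {"coastal images": (0.8, 1.0)})
dissimilar = filter_rows(rows, {"coastal images": (0.0, 0.2)})

# Two function filters combined with AND: a row must pass both.
both = filter_rows(rows, {"coastal images": (0.8, 1.0), "boats": (0.7, 1.0)})
```

Here `similar` holds rows "a" and "b", `dissimilar` holds row "c", and `both` holds only row "a".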
More similar data rows


A "coastal images" function score range is set to show similar images. In this case, we see images containing coastline.
More dissimilar data rows


A "coastal images" function score range is set to show dissimilar images. In this case, we see deep sea or inland images.
Best practices
- Create functions using Data Rows that show the feature in different contexts or variations.
- More data examples do not necessarily improve results. Seek to balance the variations or contexts of the feature.
- Combine function filters with metadata to better guide sampling.
- Use lower thresholds in combination with random sampling to avoid overfitting and bias.
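As a loose illustration of the last two practices, the sketch below narrows rows by a metadata value, applies a deliberately low score threshold, and then draws a random sample instead of taking only the top-scoring rows. It assumes scores have already been exported alongside metadata. The `sample_candidates` helper, the `source` field, and the sample data are hypothetical.

```python
import random

def sample_candidates(rows, threshold, k, source=None, seed=0):
    """Randomly sample up to k row ids scoring at or above the threshold.

    `source` is a hypothetical metadata field used to narrow the pool first.
    """
    pool = [
        row["id"] for row in rows
        if row["score"] >= threshold
        and (source is None or row["metadata"].get("source") == source)
    ]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(pool, min(k, len(pool)))

# Illustrative rows with made-up scores and metadata.
rows = [
    {"id": "r1", "score": 0.95, "metadata": {"source": "drone"}},
    {"id": "r2", "score": 0.72, "metadata": {"source": "drone"}},
    {"id": "r3", "score": 0.61, "metadata": {"source": "satellite"}},
    {"id": "r4", "score": 0.55, "metadata": {"source": "drone"}},
    {"id": "r5", "score": 0.20, "metadata": {"source": "drone"}},
]

# Low threshold plus random sampling, restricted to one metadata value.
picked = sample_candidates(rows, threshold=0.5, k=2, source="drone")
```

With a threshold of 0.5, mid-scoring rows such as "r4" stay eligible, so the sample is not limited to near-duplicates of the original selection.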