Similarity Functions programmatically identify data that is similar or dissimilar to a set of example Data Rows, so you can decide what to label. While functions are typically less accurate than models or human annotators, they can quickly enrich data to help with curation and exploration.
Similarity Functions allow you to label data using embeddings. You can use Similarity Functions to find rare classes or patterns not described by metadata.
- In Catalog, select one or more Data Rows
- Select View similar data rows
- Refine the results by fine-tuning your Data Row selection and pressing Recompute similarity
- Select Create function
Creating a function starts a background task that processes all of your Data Rows in Catalog.
Once created, functions automatically process Data Rows that are newly added to Catalog.
You can filter Data Rows by the similarity score computed by your function. Scores closer to 1 indicate greater similarity, whereas scores closer to 0 indicate greater dissimilarity.
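To make the 0 to 1 score range concrete, here is a minimal sketch of one common way to turn embeddings into such a score: cosine similarity against the example rows, rescaled from [-1, 1] to [0, 1]. This is an illustrative assumption, not a description of how the product computes its scores. The names `cosine_similarity`, `similarity_score`, and `filter_by_score` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as neither similar nor dissimilar
    return dot / (norm_a * norm_b)


def similarity_score(anchors, candidate):
    """Mean cosine similarity to the anchor embeddings, rescaled to [0, 1].

    Assumption: the rescaling (cos + 1) / 2 is chosen for illustration only.
    """
    mean_cos = sum(cosine_similarity(a, candidate) for a in anchors) / len(anchors)
    return (mean_cos + 1.0) / 2.0


def filter_by_score(scores, minimum=0.0, maximum=1.0):
    """Keep only the rows whose score falls inside [minimum, maximum]."""
    return {row_id: s for row_id, s in scores.items() if minimum <= s <= maximum}


# Hypothetical embeddings for two selected example rows and three candidates.
anchors = [[1.0, 0.0], [0.9, 0.1]]
candidates = {"row-a": [1.0, 0.05], "row-b": [0.0, 1.0], "row-c": [-1.0, 0.0]}

scores = {row_id: similarity_score(anchors, emb) for row_id, emb in candidates.items()}
print(filter_by_score(scores, minimum=0.8))
```

Setting `minimum` surfaces the most similar rows, while setting `maximum` instead surfaces the most dissimilar ones.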
You can apply multiple similarity functions together; their filters are combined with AND. You can also combine a similarity function with any other filter type in Catalog.
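The AND behavior can be pictured as a list of conditions that a Data Row must satisfy all at once. The sketch below models this in plain Python. `DataRow`, `score_at_least`, `metadata_equals`, and `filter_all` are hypothetical names used for illustration and are not part of any SDK.

```python
from dataclasses import dataclass, field


@dataclass
class DataRow:
    row_id: str
    scores: dict  # similarity function name -> score in [0, 1]
    metadata: dict = field(default_factory=dict)


def score_at_least(function_name, threshold):
    """Condition: the named function scored this row at or above threshold."""
    def predicate(row):
        # Rows the function has not scored yet are treated as score 0.
        return row.scores.get(function_name, 0.0) >= threshold
    return predicate


def metadata_equals(key, value):
    """Condition: the row carries the given metadata value."""
    def predicate(row):
        return row.metadata.get(key) == value
    return predicate


def filter_all(rows, predicates):
    """Keep rows that satisfy every condition (logical AND)."""
    return [row for row in rows if all(p(row) for p in predicates)]


rows = [
    DataRow("row-1", {"fog": 0.92, "night": 0.81}, {"camera": "front"}),
    DataRow("row-2", {"fog": 0.95, "night": 0.30}, {"camera": "front"}),
    DataRow("row-3", {"fog": 0.90, "night": 0.85}, {"camera": "rear"}),
]

matches = filter_all(
    rows,
    [score_at_least("fog", 0.9), score_at_least("night", 0.8), metadata_equals("camera", "front")],
)
print([row.row_id for row in matches])
```

Because every condition must hold, adding a filter can only narrow the result set, never widen it.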
More similar data rows
More dissimilar data rows
- Create functions using Data Rows that show the feature in different contexts or variations.
- More data examples do not necessarily improve results. Seek to balance the variations or contexts of the feature.
- Combine function filters with metadata to better guide sampling.
- Use lower thresholds in combination with random sampling to avoid overfitting and bias.
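The last tip can be sketched as follows: rather than taking only the top-scoring rows, apply a looser threshold and draw a random sample from everything that passes, so that borderline examples also reach labeling. `sample_above_threshold` is a hypothetical helper, and the threshold and sample size shown are arbitrary example values.

```python
import random


def sample_above_threshold(scores, threshold, k, seed=None):
    """Randomly pick up to k row IDs whose score is at least threshold."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    candidates = sorted(row_id for row_id, s in scores.items() if s >= threshold)
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical similarity scores for ten Data Rows.
example_scores = {f"row-{i}": i / 10 for i in range(1, 11)}

# A looser threshold (0.5) admits more varied rows than a strict one (0.9).
batch = sample_above_threshold(example_scores, threshold=0.5, k=3, seed=7)
print(batch)
```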