Active learning overview

Machine learning (ML) teams typically have orders of magnitude more unlabeled data than labeled data. Without a strategy for selecting the right data to label, creating training data can become an expensive bottleneck. To improve model performance while keeping labeling costs down, you need a clear approach for deciding which pieces of data to label. Fortunately, some best practices for data selection exist.

Active learning is the practice of targeting the areas of your ML model that need the most improvement by giving it training data that contains only the classifications it struggles most to predict. When you use active learning to select which data to label next, you can cut your labeling time and resources significantly.
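One common way to find the data a model struggles with is uncertainty sampling: score each unlabeled item by how unsure the model is, then label the most uncertain items first. The sketch below is illustrative only. It assumes you already have class probabilities for each unlabeled item, and the names `least_confidence`, `select_for_labeling`, and the item IDs are hypothetical, not part of any product API.

```python
from typing import Dict, List


def least_confidence(probs: List[float]) -> float:
    """Uncertainty score: 1 minus the highest predicted class probability."""
    return 1.0 - max(probs)


def select_for_labeling(predictions: Dict[str, List[float]], budget: int) -> List[str]:
    """Return the IDs of the `budget` most uncertain unlabeled items."""
    ranked = sorted(
        predictions,
        key=lambda item_id: least_confidence(predictions[item_id]),
        reverse=True,  # most uncertain first
    )
    return ranked[:budget]


# Hypothetical model outputs: item ID -> predicted class probabilities.
predictions = {
    "img_001": [0.98, 0.01, 0.01],  # confident, low labeling value
    "img_002": [0.40, 0.35, 0.25],  # very uncertain
    "img_003": [0.55, 0.44, 0.01],  # uncertain between two classes
    "img_004": [0.90, 0.05, 0.05],
}

to_label = select_for_labeling(predictions, budget=2)
print(to_label)
```

Least confidence is only one possible scoring rule. Margin-based or entropy-based scores follow the same pattern and can be swapped in by replacing the scoring function.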

You can also leverage metadata to select data intelligently. Metadata can include information such as the capture date, capture location, the make or model of the sensor, or even the weather at the location.
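As a rough illustration of metadata-based selection, the sketch below filters a small in-memory collection by weather and capture date. The `DataRow` structure, its field names, and `filter_by_metadata` are assumptions made for this example and do not reflect any specific tool's schema or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DataRow:
    """Hypothetical record pairing a data item with its metadata."""
    row_id: str
    capture_date: date
    location: str
    sensor_model: str
    weather: str


def filter_by_metadata(
    rows: List[DataRow],
    weather: Optional[str] = None,
    sensor_model: Optional[str] = None,
    captured_after: Optional[date] = None,
) -> List[DataRow]:
    """Keep rows that match every filter that was provided."""
    selected = []
    for row in rows:
        if weather is not None and row.weather != weather:
            continue
        if sensor_model is not None and row.sensor_model != sensor_model:
            continue
        if captured_after is not None and row.capture_date <= captured_after:
            continue
        selected.append(row)
    return selected


rows = [
    DataRow("row_1", date(2023, 1, 10), "Austin", "cam_a", "sunny"),
    DataRow("row_2", date(2023, 2, 3), "Seattle", "cam_b", "rain"),
    DataRow("row_3", date(2023, 3, 15), "Seattle", "cam_a", "rain"),
]

# Select rainy captures taken after the end of February 2023.
rainy = filter_by_metadata(rows, weather="rain", captured_after=date(2023, 2, 28))
print([r.row_id for r in rainy])
```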

There are two approaches to determining which data to label:

Approach | When to use it
Class distribution | I don’t have enough examples of X and I know my model will fail.
Model performance | I trained a model and the model is lower performing on X or in X situations.
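Both approaches can be checked with simple counts. The sketch below is a minimal example under stated assumptions: labels and validation results are plain Python lists, and the thresholds (`min_count=3`, accuracy below 0.75) are arbitrary values chosen for illustration. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def underrepresented_classes(labels: List[str], min_count: int) -> List[str]:
    """Class distribution: classes with fewer than `min_count` labeled examples.

    Only classes that appear at least once in `labels` can be reported.
    """
    counts = Counter(labels)
    return sorted(c for c, n in counts.items() if n < min_count)


def per_class_accuracy(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Model performance: fraction of correct predictions per true class.

    `pairs` holds (true_label, predicted_label) tuples from a validation set.
    """
    total = Counter(true for true, _ in pairs)
    correct = Counter(true for true, pred in pairs if true == pred)
    return {c: correct[c] / total[c] for c in total}


# Hypothetical labeled set: "bicycle" is rare.
labels = ["car"] * 5 + ["truck"] * 4 + ["bicycle"] * 1
rare = underrepresented_classes(labels, min_count=3)

# Hypothetical validation results as (true, predicted) pairs.
validation = [
    ("car", "car"),
    ("car", "car"),
    ("truck", "car"),
    ("truck", "truck"),
    ("bicycle", "car"),
]
accuracy = per_class_accuracy(validation)
weak = sorted(c for c, a in accuracy.items() if a < 0.75)

print(rare)
print(weak)
```

Classes that show up in either list are candidates for the next round of labeling.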


What’s Next

Learn about Batch Queues and Catalog
