Curate data splits

Overview

It is best practice to split the labeled data set into three sets: training, validation, and test. Holding out the validation and test sets makes overfitting much easier to detect, because the model is evaluated on Data Rows it never saw during training. The visualization below roughly shows the recommended allocation of Data Rows to each set.

Use the validation set to evaluate the model trained on the training set. Then, once the model has "passed" the validation set, use the test set to double-check your evaluation. The following figure illustrates this workflow.
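To make the three-way split concrete, here is a minimal sketch in plain Python. It is not the Labelbox SDK: the function name `split_ids`, the `row-N` identifiers, and the fixed seed are all assumptions made for illustration. The 80/10/10 fractions mirror the default allocation described later on this page.

```python
import random

def split_ids(ids, train=0.8, validation=0.1, seed=0):
    """Shuffle ids reproducibly and cut them into three disjoint lists.

    Hypothetical helper: whatever is left after the train and validation
    cuts goes to the test split.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split stable across runs
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }

# Illustrative IDs only; real data row IDs would come from your data set.
splits = split_ids([f"row-{i}" for i in range(100)])
sizes = {name: len(rows) for name, rows in splits.items()}
print(sizes)
```

Shuffling before cutting avoids any ordering bias in the source list, and the three slices cannot overlap, so no Data Row ends up in two splits.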

Set and re-use data splits

A model run created inside a model has access to all the data rows selected for training in the Create a model step. When you create the model run, you can configure the train, validation, and test splits.

  1. The default data split is 80% training, 10% validation, and 10% testing. You can adjust the data splits by using the slider or by typing a value in the input field. If the model already contains a previous model run, you can choose to load the split configuration from that run. You also need to name the model run, for example "model iteration 1".

Next, click Create model run.

  2. You should now see the annotations for the train, validation, and test splits. The UI may take a few seconds to load all data rows, so you might need to wait and refresh. From here, you can view the annotations in each data split.

If you want to move some data rows from one split to another, you can select those data rows, click N selected, and click Send to to move them to a different split.
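Conceptually, sending data rows to another split just reassigns them while keeping the splits disjoint. The sketch below models that in plain Python; it is not the Labelbox SDK, and `move_rows` and the single-letter IDs are hypothetical names chosen for illustration.

```python
def move_rows(splits, row_ids, target):
    """Return a new assignment with row_ids moved into the target split.

    Hypothetical helper: IDs that are not in any split are ignored, and the
    input dictionary is left unmodified.
    """
    if target not in splits:
        raise KeyError(f"unknown split: {target}")
    moving = set(row_ids)
    # Collect the rows that actually exist, in their current order.
    present = [r for rows in splits.values() for r in rows if r in moving]
    # Drop them from every split, then append them to the target.
    updated = {name: [r for r in rows if r not in moving]
               for name, rows in splits.items()}
    updated[target] = updated[target] + present
    return updated

before = {"train": ["a", "b", "c"], "validation": ["d"], "test": ["e"]}
after = move_rows(before, ["b", "d"], "test")
print(after)
```

Removing the rows from every split before appending them to the target guarantees that a row never appears in two splits at once.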

You can define a new split distribution during model run creation or re-use the data split distribution of a previous model run.

Curate data splits

You can move data between the train, validation, and test splits.

Once you are happy with the data rows and data splits you selected, you are ready to train a machine learning model. You can either train the model in your own custom ML environment outside of Labelbox (Link to doc), or train it via the one-click model training integration inside Labelbox (Link to doc).

Visualize the distribution of your data splits

You can visualize the distribution of the data in each data split in Projector view. You can color data rows by class to see whether clusters form and how separable the classes are. You can also check whether your data splits share a similar distribution.

Click on the Projector view icon. Select the data split you want to view. You can pick a class in that data split to color by clicking the color palette icon and selecting the class name.
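As a rough numeric complement to the visual check, you can compare how often each class appears in each split. The sketch below is an assumption-laden illustration in plain Python, not a Labelbox feature: it compares class proportions only (not the embeddings the Projector view displays), the function names and sample labels are hypothetical, and every split is assumed to be non-empty.

```python
from collections import Counter

def class_proportions(labels):
    """Share of each class within one split (labels must be non-empty)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in sorted(counts)}

def max_proportion_gap(split_labels):
    """Largest difference in any single class's share between two splits."""
    props = {name: class_proportions(labels)
             for name, labels in split_labels.items()}
    classes = {cls for p in props.values() for cls in p}
    return max(
        max(p.get(cls, 0.0) for p in props.values())
        - min(p.get(cls, 0.0) for p in props.values())
        for cls in classes
    )

# Made-up labels: the test split is deliberately skewed toward "cat".
split_labels = {
    "train": ["cat"] * 40 + ["dog"] * 40,
    "validation": ["cat"] * 5 + ["dog"] * 5,
    "test": ["cat"] * 8 + ["dog"] * 2,
}
gap = max_proportion_gap(split_labels)
print(f"largest class-share gap: {gap:.2f}")
```

A gap near zero suggests the splits have similar class balance; a large gap, as in this made-up example, is a hint to move some data rows between splits.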

