Curate data splits

Overview

It is best practice to split a labeled dataset into three sets: training, validation, and test. Doing so helps you detect overfitting, because the model is evaluated on data it did not see during training. The figure below shows roughly the recommended allocation of data rows to each set.

[Figure: recommended allocation of data rows to the training, validation, and test sets]

Use the validation set to evaluate the model trained on the training set. Then use the test set to double-check that evaluation after the model has "passed" the validation set. The following figure illustrates this workflow.

[Figure: training, validation, and test workflow]
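The workflow described above can be sketched in a few lines of plain Python. Everything here is an illustrative assumption rather than part of Labelbox: the toy data, the threshold "models", and the helper names `accuracy` and `pick_best_threshold`. The point is only that candidates are compared on the validation set, and the test set is scored once at the end.

```python
# Hypothetical toy example: each sample is a (feature, label) pair.
def accuracy(threshold, samples):
    """Fraction of samples where (feature >= threshold) matches the label."""
    correct = sum(1 for x, y in samples if (x >= threshold) == bool(y))
    return correct / len(samples)


def pick_best_threshold(candidates, validation):
    """Choose the candidate with the highest validation accuracy."""
    return max(candidates, key=lambda t: accuracy(t, validation))


validation = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
test = [(0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1)]

# In practice, each candidate would be a model fit on the training set.
candidates = [0.05, 0.5, 0.95]

best = pick_best_threshold(candidates, validation)
val_acc = accuracy(best, validation)
test_acc = accuracy(best, test)  # computed once, after selection
print(best, val_acc, test_acc)
```

If the test score is much lower than the validation score, that suggests the selection step has overfit to the validation set.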

Configure data splits via the SDK

You can use the SDK to customize how data rows are split for a given model run. If you have not specified a split for a data row in the SDK or the UI, it defaults to UNASSIGNED. Unassigned data rows still appear in the Model Run view, under the “All” split.

# List of data row IDs you want to add to the model run
# (model_run is assumed to be an existing model run object)
data_row_ids = [...]  # your data row IDs for a model run
model_run.upsert_data_rows(data_row_ids)

# Example split sizes: 80% training, 10% validation, remainder test.
# You can use whatever split logic you want, or assign individual data row IDs to a particular split.
num_train = int(0.8 * len(data_row_ids))
num_val = int(0.1 * len(data_row_ids))
train_split = data_row_ids[:num_train]
val_split = data_row_ids[num_train:num_train + num_val]
test_split = data_row_ids[num_train + num_val:]

model_run.assign_data_rows_to_split(train_split, "TRAINING")
model_run.assign_data_rows_to_split(val_split, "VALIDATION")
model_run.assign_data_rows_to_split(test_split, "TEST")
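Slicing IDs in their original order can bias the splits if the list happens to be sorted (for example, by upload date). A minimal sketch of a shuffled split follows. `split_ids` and its parameters are hypothetical helper names, not part of the Labelbox SDK, and the 80/10/10 fractions simply mirror the default used in the UI.

```python
import random


def split_ids(ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a copy of ids and slice it into train/val/test lists.

    The test split receives whatever remains after train and val.
    """
    if not 0 <= train_frac + val_frac <= 1:
        raise ValueError("train_frac + val_frac must be between 0 and 1")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    num_train = int(len(shuffled) * train_frac)
    num_val = int(len(shuffled) * val_frac)
    train = shuffled[:num_train]
    val = shuffled[num_train:num_train + num_val]
    test = shuffled[num_train + num_val:]
    return train, val, test


example_ids = [f"row-{i}" for i in range(20)]  # placeholder IDs
train_split, val_split, test_split = split_ids(example_ids)
print(len(train_split), len(val_split), len(test_split))
```

The three resulting lists could then be passed to `model_run.assign_data_rows_to_split` in the same way as in the snippet above.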

Configure data splits via the UI

After you create a model run inside a model, the model run has access to all the data rows selected for training in the Create a model step. From here, you can configure the training, validation, and test splits.

  1. The default data split is 80% training, 10% validation, and 10% test. You can adjust the split by using the slider or by typing in the input field. If you have a previous model run within the model directory, you can choose to load the split from the previous config. You also need to name the model run, for example “model iteration 1”.

Next, click Create model run.

  2. You should now see the annotations in the training, validation, and test splits (you might need to wait a few seconds and refresh for the UI to finish loading all data rows). From here, you can view the annotations in each data split.

If you want to move data rows from one split to another, select those data rows, click “N selected”, and then click “Send to” to move them to a different split.


You can define a new split distribution during model run creation or reuse the data split distribution from a previous model run.
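To estimate how many data rows a given percentage setting yields, the percentages can be converted to whole-row counts. The sketch below uses largest-remainder rounding so that the counts always sum to the total. This rounding rule and the `split_counts` name are assumptions made for illustration; the UI may round differently.

```python
def split_counts(total, percentages):
    """Convert integer percentages (summing to 100) into row counts.

    Uses largest-remainder rounding so the counts add up to `total`.
    """
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {name: total * pct // 100 for name, pct in percentages.items()}
    remainders = {name: total * pct % 100 for name, pct in percentages.items()}
    leftover = total - sum(counts.values())
    # Hand out leftover rows to the splits with the largest remainders.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts


counts = split_counts(1003, {"TRAINING": 80, "VALIDATION": 10, "TEST": 10})
print(counts)
```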

Modify data splits in the UI


You can move data rows between the training, validation, and test splits.

Once you are happy with the data rows and data splits you selected, you are ready to train a machine learning model. You can either train a model in your own custom ML environment (see Export data for model training outside of Labelbox) or train a model via the one-click model training integration.

Visualize the distribution of your data splits

You can visualize the distribution of the data in each data split in the Projector view. You can color data rows by class to get a sense of whether clusters form and how separable the classes are. You can also check visually whether your data splits share a similar distribution.

Click the Projector view icon, and then select the data split you want to view. To color data rows by a class in that split, click the color palette icon and select the class name.
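Alongside the visual check, a quick numeric comparison can indicate whether splits have similar class distributions. The sketch below assumes you already have a list of class names for each split; the `labels_by_split` structure and the function names are hypothetical. It computes the total variation distance between the class proportions of two splits, where 0 means identical proportions and 1 means no overlap.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class_name: fraction of labels with that class}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two class-proportion dicts."""
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


# Hypothetical per-split class labels.
labels_by_split = {
    "TRAINING": ["cat"] * 6 + ["dog"] * 2,
    "VALIDATION": ["cat"] * 3 + ["dog"] * 1,
    "TEST": ["cat"] * 1 + ["dog"] * 3,
}

props = {
    split: class_proportions(labels)
    for split, labels in labels_by_split.items()
}
train_vs_val = total_variation(props["TRAINING"], props["VALIDATION"])
train_vs_test = total_variation(props["TRAINING"], props["TEST"])
print(train_vs_val, train_vs_test)
```

In this toy data the training and validation splits have the same class mix, while the test split is skewed toward the other class, which is the kind of mismatch worth fixing by moving data rows between splits.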

