Improving accuracy through data curation

This article outlines experimental techniques for manually curating training data for image classification with neural networks. The general process is to build a model (typically starting with a small group of random examples), then test it and add new examples, typically edge cases, to the training set to increase accuracy. This is written for the context of a production system that acquires new examples over time, not for a fixed academic dataset.

The training data should be representative of the data that will be submitted to the model. To improve a model's accuracy and generalization, consider three factors when choosing training examples: label balance, quantity, and variety.

Label balance is the ratio between label counts in the training data. A rule of thumb is to keep this ratio below 2:1 (e.g. if you have 100 "no hotdog" examples, you need at least 50 "hotdog" examples and no more than 200).
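As a rough illustration, a check like the following can flag labels that violate the 2:1 rule of thumb; the label names and counts here are hypothetical.

    from collections import Counter

    MAX_RATIO = 2.0   # rule-of-thumb ceiling for label imbalance

    # Hypothetical ground-truth labels for a training set
    labels = ["hotdog"] * 40 + ["no hotdog"] * 100

    counts = Counter(labels)
    largest = max(counts.values())
    for label, count in counts.items():
        if largest / count > MAX_RATIO:
            print(f"'{label}' is under-represented: {count} vs {largest}")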

A rule of thumb for quantity is a minimum of 50 images per label; for context, many well-known datasets have hundreds or thousands of examples per label. There is no known upper bound on quantity, though in practice it is limited by the number and variety of available examples. Too little data risks poor accuracy on unseen edge cases; too much risks adding many similar examples that lower variety and dilute the edge cases. Accuracy is highest when a variety of edge cases is well represented. There is no clear metric for variety, but it can often be recognized by visual inspection, and feature attribution techniques can assist with those inspections.
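One rough, automated proxy for variety is to flag training examples whose feature embeddings are nearly identical. The sketch below assumes you can produce an embedding per image (for example, penultimate-layer activations from a pretrained network); the random vectors here stand in for real embeddings.

    import numpy as np

    def near_duplicates(embeddings, threshold=0.98):
        """Return index pairs whose cosine similarity exceeds the threshold."""
        normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        similarities = normed @ normed.T
        pairs = []
        for i in range(len(similarities)):
            for j in range(i + 1, len(similarities)):
                if similarities[i, j] > threshold:
                    pairs.append((i, j))
        return pairs

    # Random vectors stand in for real image embeddings in this sketch
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(10, 128))
    embeddings[3] = embeddings[7]         # plant a duplicate to show the output
    print(near_duplicates(embeddings))    # prints [(3, 7)], the planted duplicate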

What follows are experimental techniques for curating new training examples. Adding "most wrong" images to the training data will often improve model accuracy. "Most wrong" examples are the highest-confidence errors (followed by the lowest-confidence correct predictions) among the test or production examples. These images are frequently edge cases.
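Ranking "most wrong" examples only needs the predicted label, its confidence, and the ground truth. A minimal sketch, with assumed field names:

    def most_wrong(predictions):
        """Rank examples so the highest-confidence errors come first,
        followed by the lowest-confidence correct predictions.

        predictions is a list of dicts with hypothetical keys "image",
        "predicted", "confidence", and "truth".
        """
        errors = [p for p in predictions if p["predicted"] != p["truth"]]
        correct = [p for p in predictions if p["predicted"] == p["truth"]]
        errors.sort(key=lambda p: p["confidence"], reverse=True)
        correct.sort(key=lambda p: p["confidence"])
        return errors + correct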

One simple data curation method that has worked well is to label these "most wrong" edge cases and choose them randomly for training data.
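That selection step can be as simple as random sampling from the labeled pool; the file names below are placeholders.

    import random

    # Hypothetical pool of labeled "most wrong" edge cases
    edge_cases = ["img_0143.jpg", "img_0291.jpg", "img_0877.jpg", "img_1024.jpg"]

    # Randomly pick a couple to add to the next training run
    new_training_examples = random.sample(edge_cases, k=2)
    print(new_training_examples)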

Another approach is to start by adding only the single "most wrong" image to the training data. Sometimes a single image will improve the model; this is frequently the case when many of the "most wrong" examples are similar to one another. The downside to this method is that it takes many training iterations to build a large dataset, so adding a single image per iteration is a trade-off against training time.
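A sketch of that loop, reusing the most_wrong() helper above; train_fn and eval_fn stand in for whatever training and evaluation code you already have, and the example lists are hypothetical.

    def curate_iteratively(training_set, pool, train_fn, eval_fn, iterations=10):
        """Grow the training set one "most wrong" example per iteration.

        train_fn(training_set) -> model and eval_fn(model, pool) -> predictions
        are placeholders for your own training and evaluation code; most_wrong()
        is the ranking helper sketched earlier.
        """
        for _ in range(iterations):
            model = train_fn(training_set)
            ranked = most_wrong(eval_fn(model, pool))
            errors = [p for p in ranked if p["predicted"] != p["truth"]]
            if not errors:
                break                              # no mistakes left in this pool
            worst = errors[0]["image"]             # the single highest-confidence error
            training_set.append(worst)
            pool.remove(worst)
        return training_set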

Adding "most wrong" images to the training data can lower accuracy if it breaks training on arbitrary characteristics. However, this increases generalization. One solution is to exclude arbitrary characteristics in training examples. For instance, remove timestamps from images to avoid training on the numbers. In some cases, it may not be possible to separate the arbitrary characteristics from the image (eg.differences in day and night, seasons, brightness, exposure). In those cases, the suggestion is to include a variety of examples (often positive and negative cases) with the arbitrary characteristics.

Therefore, after adding a "most wrong" image, consider adding a similar image without the tag (e.g. if you add an example of a hotdog on a bun labeled "hotdog", also add a similar image of a bun alone labeled "no hotdog"). This should minimize the risk of training on arbitrary characteristics and maintain label balance.

Once you choose a "most wrong" image and a similar negative counterpart to add to the training data, do not choose another similar "most wrong" image. Choose the next "most wrong" image that is sufficiently different from the examples already added.
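A sketch of that rule, again assuming a feature embedding per image and cosine distance as the notion of similarity; the threshold is an assumption to tune by inspection.

    import numpy as np

    def pick_diverse(candidates, already_added, min_distance=0.2):
        """Return the first candidate that differs enough from everything
        already added; each item is a (name, embedding) pair.

        The embeddings and the distance threshold are assumptions for
        illustration; candidates should already be ranked "most wrong" first.
        """
        def cosine_distance(a, b):
            return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        for name, vector in candidates:
            if all(cosine_distance(vector, v) >= min_distance for _, v in already_added):
                return name, vector
        return None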

Cases that still test poorly even when represented in the training data are candidates for oversampling or for sourcing more similar examples from production systems.
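Oversampling can be as simple as repeating those examples in the training list so the model sees them more often per epoch; the lists below are placeholders (with PyTorch, a WeightedRandomSampler achieves a similar effect).

    # Duplicate hard examples so the model sees them more often per epoch.
    # Both lists are hypothetical image paths.
    OVERSAMPLE_FACTOR = 3

    training_set = ["a.jpg", "b.jpg", "c.jpg"]
    hard_examples = ["c.jpg"]

    training_set = training_set + hard_examples * (OVERSAMPLE_FACTOR - 1)
    print(training_set)   # "c.jpg" now appears three times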

Observation AI has features that enable these data curation techniques, such as the ability to search for high-confidence errors, and two sets of tags: one for ground truth and another to mark edge cases or training images.

This work is influenced by experiments attempting to build high-accuracy models and by sources such as Andrej Karpathy and Leslie Smith.