What to do when your training and testing data come from different distributions



credit: www.chessbazaar.com/blog/game-c…

To build a well-performing machine learning (ML) model, it is essential to train the model on and test it against data that come from the same target distribution.

However, sometimes only a limited amount of data from the target distribution can be collected. It may not be sufficient to build the needed train/dev/test sets.

Yet similar data from other distributions might be readily available. What should you do in such a case? Let us discuss some ideas!

Some background knowledge

To better follow the discussion here, you can read up on the following basic ML concepts, if you are not familiar with them already:

  • Train, dev (development), and test sets: Note that the dev set is also called the validation or the hold-out set. This post is a good short introduction to the topic.
  • Bias (underfitting) and variance (overfitting) errors: This is a great simple explanation of these errors.
  • How the train/dev/test split is correctly made: You may refer to this post that I have written before for a short background on this topic.

Scenario

Say you are building a dog-image classifier application that determines if an image is of a dog or not.

The application is intended for users in rural areas, who take pictures of animals with their mobile devices and have the application classify the animals for them.

Studying the target data distribution, you found that the images are mostly blurry and low-resolution, similar to the following:

Left: Dog (Volpino Italiano breed), Right: Arctic fox.

You were only able to collect 8,000 such images, which is not enough to build the train/dev/test sets. Let us assume you have determined you’ll need at least 100,000 images.

You wondered if you could use images from another dataset, in addition to the 8,000 images you collected, to build the train/dev/test sets.

You realized you can easily scrape the web to build a dataset of 100,000 images or more, with dog-image vs. non-dog-image frequencies similar to those required.
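One way to check the "similar frequencies" part of that claim is to compare the label proportions of the two datasets directly. The following is a minimal sketch, not a prescribed method: the label names, the per-class counts, and the helper functions class_frequencies and max_frequency_gap are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def class_frequencies(labels):
    """Return the fraction of examples that carry each class label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def max_frequency_gap(labels_a, labels_b):
    """Largest absolute difference in class frequency between two label lists."""
    freq_a = class_frequencies(labels_a)
    freq_b = class_frequencies(labels_b)
    # A class missing from one dataset counts as frequency 0.0 there.
    classes = set(freq_a) | set(freq_b)
    return max(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in classes)


# Hypothetical label lists: the per-class counts are made up for illustration.
target_labels = ["dog"] * 3200 + ["not_dog"] * 4800   # 8,000 collected images
web_labels = ["dog"] * 41000 + ["not_dog"] * 59000    # 100,000 scraped images

gap = max_frequency_gap(target_labels, web_labels)
print(f"largest class-frequency gap: {gap:.3f}")
```

A small gap only says the two datasets agree on how often each label occurs. It says nothing about whether the images themselves look alike.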

But clearly, this web dataset comes from a different distribution, with clear, high-resolution images such as the following: