Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI

4 Diversity Sampling


This chapter covers

  • Understanding diversity in the context of Machine Learning, so that you can discover your model’s “unknown unknowns”
  • Using Model-based Outliers, Cluster-based Sampling, Representative Sampling, and Sampling for Real-World Diversity to increase the diversity of data selected for Active Learning (a sketch of Cluster-based Sampling follows this introduction)
  • Using Diversity Sampling in different types of Machine Learning models so that you can apply the technique to any Machine Learning architecture
  • Evaluating the success of Diversity Sampling so that you can more accurately assess your model’s performance across diverse data
  • Deciding on the right number of items to put in front of humans per iteration cycle to optimize the Human-in-the-Loop process

In the last chapter, you learned how to identify where your model is uncertain: what your model “knows it doesn’t know”. In this chapter, you will learn how to identify what is missing from your model: what your model “doesn’t know that it doesn’t know”, that is, the “unknown unknowns”. This is a hard problem, made even harder because what your model needs to know is often a moving target in a constantly changing world. Just as humans learn new words, new objects, and new behaviors every day in response to a changing environment, most Machine Learning algorithms are deployed in changing environments, too.
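
Before diving in, here is a minimal sketch of one of the strategies listed above, Cluster-based Sampling, to make the idea concrete. It is not the book’s own code: it assumes you have already extracted a feature vector for every unlabeled item (for example, from a hidden layer of your model, as covered in section 4.1.3), it uses scikit-learn’s KMeans, and the helper name sample_by_clusters and its parameters (n_clusters, per_cluster) are illustrative. L2-normalizing the vectors first is a common approximation to the K-Means-with-cosine-similarity approach discussed in section 4.3.3.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def sample_by_clusters(features, n_clusters=10, per_cluster=3, random_state=0):
    """Return indices of unlabeled items to route to human annotators."""
    # L2-normalize so that Euclidean K-Means approximates clustering
    # by cosine similarity (all vectors have unit length).
    X = normalize(np.asarray(features))
    km = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)
    labels = km.fit_predict(X)

    selected = set()
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        if members.size == 0:
            continue
        # Distance of every cluster member from its centroid
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        # Most central member: a representative example of this cluster
        selected.add(int(members[np.argmin(dists)]))
        # Farthest members: within-cluster outliers, the most likely
        # place to surface "unknown unknowns"
        for idx in members[np.argsort(dists)[::-1][:per_cluster]]:
            selected.add(int(idx))
    return sorted(selected)

# Example usage: 1,000 unlabeled items, each a 64-dimensional feature vector
unlabeled_features = np.random.rand(1000, 64)
to_annotate = sample_by_clusters(unlabeled_features, n_clusters=10, per_cluster=3)
print(len(to_annotate), "items selected for human review")

Sampling both the most central member and the farthest members of each cluster mirrors the distinction drawn in section 4.3.1: items near a centroid are representative of one part of your feature space, while within-cluster outliers are where unexpected or unfamiliar data is most likely to appear.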

4.1   Knowing what you don’t know: identifying where your model is blind

4.1.1   Example data for Diversity Sampling

4.1.2   Interpreting neural models for Diversity Sampling

4.1.3   Getting information from hidden layers in PyTorch

4.2   Model-based outlier sampling

4.2.1   Use validation data to rank activations

4.2.2   Which layers should I use to calculate model-based outliers?

4.2.3   The limitations of model-based outliers

4.3   Cluster-based sampling

4.3.1   Cluster members, centroids and outliers

4.3.2   Any clustering algorithm in the universe

4.3.3   K-Means clustering with cosine similarity

4.3.4   Reduced feature dimensions via embeddings or PCA

4.3.5   Other clustering algorithms

4.4   Representative Sampling

4.4.1   Representative Sampling is rarely used in isolation

4.4.2   Simple Representative Sampling 

4.4.3   Adaptive Representative Sampling 

4.5   Sampling for real-world diversity

4.5.1   Common problems in training data diversity

4.5.2   Stratified sampling to ensure diversity of demographics

4.5.3   Represented and Representative: which matters?

4.5.4   Limitations of sampling for real-world diversity

4.6   Diversity Sampling with different types of models

4.6.1   Model-based outliers with different types of models

4.6.2   Clustering with different types of models

4.6.3   Representative Sampling with different types of models
