A Guide for Model Selection During Extrapolation

What to expect when you don’t know what to expect.

Thomas Giavatto
IBM Data Science in Practice

--

Written by Thomas Giavatto and Daniel Fleck

Regression is one of the most common techniques used in machine learning. Predicting future rainfall in flood or drought regions. Analyzing heart rate changes for vulnerable patients. Forecasting peak energy consumption on a grid that powers essential public services. What these examples all have in common is that the maximum or minimum predicted values are of greatest importance. For an environmental policymaker, being able to predict periods of average rainfall is of little benefit; anticipating periods of drought or flood saves lives and property. Forecasting peak heart rates can help doctors treat or more closely monitor at-risk patients. And ensuring energy production matches energy demand is a critical public service, especially during extreme events.

Something else these examples all have in common is that these extreme events do not occur often, and are therefore hard to predict. Modeling these extreme cases requires extrapolation. Extrapolation is “a prediction from a model that is a projection, extension, or expansion of an estimated model (e.g., regression equation, or Bayesian hierarchical model) beyond the range of the data set used to fit that model” (Bartley et al., 2019). For example, how can a public utility company predict energy usage higher than peak historical consumption levels? Extrapolation is more common than one may think and can occur outside of extreme use cases. A paper from Facebook’s AI Research and New York University, including Yann LeCun, notes that in high dimensions (>100 features), “the behavior of a model within a training set’s convex hull barely impacts that model’s generalization performance since new samples lie almost surely outside of that convex hull” (Balestriero et al., 2021). This means that, with high-dimensional data, feature values for new samples in our test set do not commonly lie within the space of the training set, and therefore extrapolation commonly occurs during testing.

A scatter plot with a line of best fit. There are blue dots with a blue line representing interpolation, and red dots with a red line representing extrapolation.
Figure 0.1

However, extrapolation can produce misleading results, since it is hard to extend predictions to events not present in our data. Although extrapolation can be risky, it is hard to avoid. In a paper on identifying extrapolation in ecological modeling, the authors point out, “While ecologists and other scientists know the risks associated with extrapolating beyond the range of their data, they are often tasked with making predictions beyond the range of the available data in efforts to understand processes at broad scales” (Bartley et al., 2019).

This article will address the scenario when a model is used to predict target values that are outside of the historical training data range.

When extrapolating, the most important thing to understand is where a model fails. In the analysis below, we discuss examples of extrapolation and examine which modeling techniques perform best on an extrapolation set relative to their results on previously seen data. Our goal is to give you a starting point when you face a problem that could involve extrapolation.

Framework

Below we outline three regression use cases provided by sample datasets in UC Irvine’s Machine Learning Repository. For each use case, four different popular machine learning models will be constructed. These four models are: Linear Regression, Random Forest Regressor, Decision Tree Regressor, and Histogram-based Gradient Boosting Regressor. For each of the use cases, we will test the models’ performance on a training and a test set, and compare that to performance in an extrapolation set. Root Mean Square Error (RMSE) is the performance metric used to compare predictions per model. In the table below, we provide additional detail highlighting the benefits and drawbacks of each model.
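For reference, here is a minimal sketch of these four models and the RMSE metric in scikit-learn. The hyperparameters shown are defaults, not the tuned settings used in our experiments (those are in the associated code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# The four modeling techniques compared throughout this post.
models = {
    "LR": LinearRegression(),
    "DTR": DecisionTreeRegressor(random_state=42),
    "RFR": RandomForestRegressor(random_state=42),
    "HGBR": HistGradientBoostingRegressor(random_state=42),
}

def rmse(y_true, y_pred):
    """Root Mean Square Error, the performance metric used to compare models."""
    return np.sqrt(mean_squared_error(y_true, y_pred))
```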

A table with four rows, each row representing an algorithm. The columns describe the algorithm and discuss its benefits and drawbacks.
Figure 0.2

Extrapolation Set

As noted above, extrapolation is necessary when a model must make predictions on data outside the known or previously seen distribution. In these three use cases, the full dataset is available up front, so we need to ensure the models don’t see all the data until it comes time to experiment with extrapolation. To achieve this, we use common data science train/test/validation-style splits (with a twist!) to generate an extrapolation region. To test each modeling technique’s accuracy in extrapolation, we first train and test our models on the portion of the distribution that is not in the extrapolation set, and only then evaluate them on the extrapolation set. The models were iteratively tuned so as not to overfit the training set.
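In code, the split looks something like this. This is a minimal sketch: `extrapolation_split` is a hypothetical helper, and the 90th-percentile cutoff matches the 90/10 split described below:

```python
import numpy as np

def extrapolation_split(df, target_col, pct=90):
    """Hold out the top (100 - pct)% of target values as an extrapolation set."""
    cutoff = np.percentile(df[target_col], pct)
    train_test = df[df[target_col] <= cutoff]      # used for training and testing
    extrapolation = df[df[target_col] > cutoff]    # never seen until extrapolation time
    return train_test, extrapolation
```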

The violin plot below shows the distribution of the target variable for one of the example use cases, as well as the distributions of the train/test and extrapolation regions created from it. The top plot shows the total distribution of the data, with red dots denoting outliers. The middle plot shows the training and testing set created from this data, with all target values at or below the 90th percentile of the distribution. The bottom plot shows the extrapolation region, created from the top 10% of target values¹. Note that this example dataset does have many large outliers, and thus the extrapolation region represents examples of extreme extrapolation.

Three blue violin plots with red dots to represent outliers. Each plot is horizontal and the images are stacked on top of each other.
Figure 0.3

For model training and testing, we use a random 75%/25% split of the training and testing set created in the middle plot above. This post covers only the extrapolation region described above. Additionally, three other experiments were run to test the effect of using different proportions of data for train/test and extrapolation. The results and conclusions for those experiments can be found in the associated code. Check it out! https://github.com/Tgiavatto/Extrapolation_model_experiments.git
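The random split itself is standard. As a sketch, with `X` and `y` taken from the train/test region created above (the random seed here is illustrative):

```python
from sklearn.model_selection import train_test_split

# Random 75%/25% split of the non-extrapolation region.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```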

Hypothesis

Based on the characteristics of the models we are experimenting with, we suspect that linear regression will show the smallest drop in performance when predicting the extrapolation region relative to its performance in the train and test region. Linear regression produces a model that is technically unbounded: its predictions extend beyond the training range (even though the model was never validated there). Tree-based models, by contrast, can only predict values they saw in training; because each leaf predicts an average of training targets, their predictions are bounded by the maximum target value in the training set. Linear regression may not always be the best performing model in the train and test region, but it may be the most sensible choice for avoiding large errors in the extrapolation region, trading some in-distribution accuracy for robustness.
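A quick toy example illustrates this boundedness (synthetic data, not one of our use cases):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy data: y = 3x on x in [0, 10).
X = np.arange(0, 10, 0.1).reshape(-1, 1)
y = 3 * X.ravel()

lr = LinearRegression().fit(X, y)
dt = DecisionTreeRegressor(random_state=42).fit(X, y)

x_new = np.array([[20.0]])   # far outside the training range
print(lr.predict(x_new))     # ~[60.]  -- the line keeps extending
print(dt.predict(x_new))     # ~[29.7] -- capped at the largest target seen in training
```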

Hypothesis: Linear regression will provide the lowest decrease in performance when predicting the extrapolation region compared to that of the train and test region.

In the following sections we provide outlines and analysis of each use case. We also report which modeling technique performs best when extrapolating.

Use Case 1: Traffic Data

The goal of the first use case is to predict hourly traffic volume for Interstate 94 in Minneapolis-St. Paul, Minnesota (Hogue). The dataset includes both traffic data provided by the Minnesota Department of Transportation and weather data from OpenWeatherMap for the period 2012–2018. This dataset includes over 48,000 records with both categorical and numeric features. The categorical data describes US holidays and general weather conditions. The numeric data describes measured weather attributes such as temperature and precipitation.

Why did we choose this use case?

This use case demonstrates extrapolation when the feature set contains a mix of categorical and numeric data.

As explained above, we created our extrapolation set by withholding the greatest 10% of target values, and used the remaining 90% to train and test our models. The violin plot below shows the distribution of the entire dataset and the 10% split used for extrapolation¹. The target variable for this dataset does not have any large outliers, and therefore our extrapolation set will not include extreme records that fall far outside the training and testing set.

Three blue violin plots. Each plot is horizontal and the images are stacked on top of each other.
Figure 1.1

Results

Before fitting the models, the numeric features were standardized and the categorical features were one-hot encoded. The feature importance plots below show that the most influential features across all models include categorical weather features and days of the week.
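A sketch of that preprocessing as a scikit-learn pipeline follows. The column names here are illustrative stand-ins for the traffic dataset’s features, not necessarily the exact set we used (see the repo for that):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Illustrative column names; assumes X_train is a DataFrame with these columns.
numeric_cols = ["temp", "rain_1h", "snow_1h", "clouds_all"]
categorical_cols = ["holiday", "weather_main", "weekday"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("rfr", RandomForestRegressor(random_state=42)),
])
pipe.fit(X_train, y_train)

# Impurity-based importances for the tree models; for linear regression,
# the standardized coefficients (model.coef_) play the analogous role.
importances = pipe.named_steps["rfr"].feature_importances_
```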

A horizontal bar chart describing the feature importance for Linear Regression.
Figure 1.2
A horizontal bar chart describing the feature importance for Random Forest.
Figure 1.3
A horizontal bar chart describing the feature importance for Decision Tree.
Figure 1.4

Figure 1.5, displayed below, shows the RMSE values for the training, test, and extrapolation sets. This chart shows a noticeable difference in performance across the set of models. Training and test RMSE values are all similar for the random forest regressor (RFR), decision tree regressor (DTR), and histogram-based gradient boosting regressor (HGBR), while linear regression (LR) is the worst performing model.

A vertical paired bar chart for each of the four models’ RMSE values. There are bars for training, testing, and extrapolation RMSE’s.
Figure 1.5

When predicting the extrapolation region, the same holds true: RFR, DTR, and HGBR all perform very similarly, while linear regression remains the worst performing model. However, in relative terms, linear regression’s drop in performance is the smallest of the models analyzed. Figure 1.6, displayed below, shows the relative performance decline by model on the extrapolation set. Linear regression’s extrapolation RMSE is less than 2.0 times its test RMSE, while every other model’s extrapolation RMSE is at least 2.6 times its test RMSE. So while linear regression has the highest absolute RMSE in both the test and extrapolation regions, in relative terms it has the smallest drop in performance during extrapolation.
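This relative-drop metric is simple to compute. A sketch, reusing the `rmse()` helper and `models` dictionary from earlier, where `(X_test, y_test)` and `(X_extra, y_extra)` are the test and extrapolation splits of a given use case:

```python
# Extrapolation RMSE expressed as a multiple of test RMSE (as in Figure 1.6).
for name, model in models.items():
    ratio = rmse(y_extra, model.predict(X_extra)) / rmse(y_test, model.predict(X_test))
    print(f"{name}: extrapolation RMSE is {ratio:.1f}x its test RMSE")
```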

A vertical bar chart describing the relative performance drop (RMSE increase) for each model.
Figure 1.6
A table with four rows, one for each algorithm examined. Columns display the RMSE values for train, test, and extrapolation sets. A final column displays the relative performance drop.
Figure 1.7

Takeaways

This use case includes numeric and categorical features. The target variable has a multimodal distribution and is not heavily skewed. Predictions were made on an extrapolation set of data where all values were larger than the greatest target value seen in training. In this case, when looking at overall RMSE, tree-based models vastly outperformed a linear model.

However, when examining the relative drop in performance on the extrapolation set, linear regression was the best model. Although the tree-based models still perform best overall, this use case supports our hypothesis: linear regression trades some peak performance for smaller error growth when predicting on unknown data.

Use Case 2: Bike Data

The goal of the next use case is to predict the hourly count of rental bikes rented through the Capital Bikeshare program in Washington, DC (Fanaee-T). The dataset includes ~18,000 records of weather and seasonal information from 2011–2012 and the corresponding hourly count of rental bikes. This dataset also includes both categorical and numeric features. The categorical features describe seasons, holidays, and weather categories. The numeric features describe the count of registered users along with weather metrics like temperature, humidity, etc.

Why did we choose this use case?

This use case is another example of how extrapolation works when your dataset has a mix of categorical and numeric features. The difference from the first use case is that this dataset includes many high outliers, whereas the first did not. This use case therefore involves extrapolating to more extreme outliers, relative to the training and testing set, than we saw in the first use case.

In Figure 2.1 below, you can see the distribution of the entire dataset and the 10% split used for extrapolation. This dataset as a whole has a high number of outliers, denoted by the red points in the top graph¹. This is a great example of a use case where we are trying to predict anomalous events using a training and testing set whose target values fall far below those events.

Three blue violin plots with red dots to represent outliers. Each plot is horizontal and the images are stacked on top of each other.
Figure 2.1

Results

Before fitting the models, the numeric features were standardized and the categorical features were one-hot encoded. In this use case, the most influential features change slightly across modeling techniques. We created historical lag features for casual and registered users one, two, and three days in the past to see whether recent ridership is a strong indicator of future demand. Linear regression included the numbers of casual and registered users from one day ago as two of its most influential features; it also found weather conditions important (Fig 2.2).
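As a sketch, the lag features can be built with simple shifts. This assumes an hourly DataFrame `df` sorted by time, with `casual` and `registered` rider counts (column names from the UCI bike sharing dataset) and a complete hourly index; with gaps in the record, merging on date would be safer:

```python
# Shift by 24 rows per day of lag to create one-, two-, and three-day lags.
for col in ["casual", "registered"]:
    for days in (1, 2, 3):
        df[f"{col}_lag_{days}d"] = df[col].shift(24 * days)
df = df.dropna()  # the first three days have no lag history
```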

The historical lag features for casual and registered users were also important features in tree-based models. They were both among the top 6 features, but were not as valuable as they were for linear regression. Random forest and decision tree determined seasonal and weather features to be more important (Figs 2.3 and 2.4).

In other words, the linear model strongly prefers the numeric features, whereas the tree-based models can use the numeric and categorical columns equally well. This is the expected behavior.

A horizontal bar chart describing the feature importance for Linear Regression.
Figure 2.2
A horizontal bar chart describing the feature importance for Random Forest.
Figure 2.3
A horizontal bar chart describing the feature importance for Decision Tree.
Figure 2.4

In Figure 2.5 below, you can see that the tree-based models are more accurate on our training and testing set. We again notice that linear regression is the worst performing modeling technique in training and testing, with an RMSE over 1.5 times that of the three other techniques.

A vertical paired bar chart for each of the four models’ RMSE values. There are bars for training, testing, and extrapolation RMSE’s.
Figure 2.5

Comparing our test results to our extrapolation results, we again see a large jump in RMSE. On the extrapolation set, the tree-based models remain the best performing techniques. However, in relative terms, linear regression’s drop in performance when predicting the extrapolation region is once again the smallest of the models, and in this use case by a large margin. Figure 2.6 below shows that linear regression’s extrapolation RMSE is about 4.6 times its test RMSE, while all tree-based models see increases of at least 7.5 times. So while linear regression is the worst performing model in absolute terms in both the test and extrapolation regions, in relative terms it once again has the smallest drop in performance during extrapolation.

A vertical bar chart describing the relative performance drop (RMSE increase) for each model.
Figure 2.6
A table with four rows, one for each algorithm examined. Columns display the RMSE values for train, test, and extrapolation sets. A final column displays the relative performance drop.
Figure 2.7

Takeaways

When examining model performance on the extrapolation set, all models perform similarly. Although the Traffic (Use Case 1) and Bike (Use Case 2) use cases are structured similarly, the linear model extrapolates much better relative to the tree-based models here than it did in the Traffic use case. One possible explanation is that the distribution of target values is more extreme: as mentioned above, the data has a high number of extreme outliers, so the target values in the extrapolation set lie farther from the known training distribution than in the previous use case. This matches our intuition that linear models should extrapolate better than tree-based models when the underlying relationship is linear. One interesting experiment to test this theory would be to use less data during training/testing and then evaluate model performance on the extrapolation region. (Spoiler alert: we did this, and the results are in the shared Git repo.)

In relative terms, linear regression does better in extrapolation, compared to its training and testing results, than the other models. Therefore, when we can assume a linear relationship and our use case requires predicting very extreme outliers, the results again support our hypothesis that linear regression performs best in relative terms during extrapolation.

Use Case 3: Superconductivity

The last use case uses measured superconductor metrics and each superconductor’s chemical formula to predict its critical temperature (Hamidieh). The dataset is provided by the Superconducting Materials database and includes 81 features from ~22,000 superconductors. Numeric features include atomic radius, electron affinity, atomic mass, etc. The number of elements and a few other count features were converted to categorical features in our modeling.
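That conversion is a one-liner in pandas. A sketch, assuming the UCI superconductivity file and its `number_of_elements` column name (adjust as needed):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # file name from the UCI superconductivity archive
# Treat the element count as categorical so it is one-hot encoded with the
# other categorical features instead of being standardized as numeric.
df["number_of_elements"] = df["number_of_elements"].astype("category")
```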

Why did we choose this use case?

This use case is an example of how extrapolation works on a modeling set with a high number of numerical features. There are upwards of 80 numeric features in this dataset of approximately 22,000 records.

Analyzing the total dataset before splitting off an extrapolation set¹, we notice only a few outlier values in the total distribution. The outliers in the violin plots below are denoted by red points in the distribution plot at the top. Our extrapolation set therefore will not include many extreme outliers compared to the rest of the distribution.

Three blue violin plots with red dots to represent outliers. Each plot is horizontal and the images are stacked on top of each other.
Figure 3.1

Results

This use case differs from the previous two in that it has many numeric features and only two encoded categorical features. As before, the feature transformations standardize the numeric features and one-hot encode the categorical ones. The important features vary slightly across modeling techniques, but they primarily consist of variations of atomic radius and mass features. All modeling techniques also placed the categorical features describing a superconductor’s number of elements among the top 10 most important features. Electron affinity and thermal conductivity features had different levels of importance across techniques: linear regression viewed them as relevant, most notably finding electron affinity features to have a strong negative effect on critical temperature (Fig 3.2), while the random forest and decision tree models found electron affinity features to have relatively little impact (Figs 3.3 and 3.4). All modeling techniques found the numeric features most beneficial.

A horizontal bar chart describing the feature importance for Linear Regression.
Figure 3.2
A horizontal bar chart describing the feature importance for Random Forest.
Figure 3.3
A horizontal bar chart describing the feature importance for Decision Tree.
Figure 3.4

In Figure 3.5 below, you can see that linear regression is once again the worst performing technique for this use case. The random forest and histogram-based gradient boosting regressors have the lowest RMSE in both training and testing.

A vertical paired bar chart for each of the four models’ RMSE values. There are bars for training, testing, and extrapolation RMSE’s.
Figure 3.5

Comparing test set performance to extrapolation set performance, the random forest and histogram-based gradient boosting regressors continue to perform best. In relative terms, we again notice that linear regression’s drop in performance when predicting the extrapolation region is the smallest of the models. Figure 3.6 below shows that linear regression’s extrapolation RMSE is approximately 2.9 times its test RMSE, while all tree-based models see increases of at least 3.5 times. Although this dataset differs from the previous use case, which had a high number of extreme outliers, we continue to see linear regression perform best on the extrapolation region in relative terms.

A vertical bar chart describing the relative performance drop (RMSE increase) for each model.
Figure 3.6
A table with four rows, one for each algorithm examined. Columns display the RMSE values for train, test, and extrapolation sets. A final column displays the relative performance drop.
Figure 3.7

Takeaways

The final use case examines a dataset with over 80 columns, almost all of which are numeric, and a target distribution with only a small number of outliers. In those regards this use case is unlike the previously discussed ones, and yet we still see similar results in relative performance during extrapolation. Comparing model performance metrics on the extrapolation set, the results are again consistent with our hypothesis that linear regression has the lowest relative drop in performance when predicting extrapolation sets.

Final Takeaways

As we mentioned in our hypothesis, linear regression can provide a sensible model when considering the trade-off between optimal model performance and maintaining relatively accurate predictions at outlier values. Also, because these linear regression models are only first-order polynomials, they cannot produce the wild swings outside the training range that more flexible, overfit models can. This gives us confidence that the model won’t behave erratically outside the training data range. All our examples have shown that when using a model to predict values outside of the known training data range, linear regression provides the smallest relative decrease in performance.

Tree-based algorithms are known to be more powerful and flexible. But when a model makes predictions outside the training data range, there is little confidence the predictions will be reliable, particularly if the model has overfit the training data. This is evident in our use case examples.

Linear regression may not be the most accurate model, but its relative performance during extrapolation in use cases with linear relationships makes it worth considering whenever a modeling problem may require extrapolation.

Footnotes

1) The violin plot images are fitting a distribution to the data, making it appear as if there is some overlap between the training/testing data and the extrapolation data. Note that these two sets are completely disjoint, and all values are ≥ 0.

References

Bartley, M. L., Hanks, E. M., Schliep, E. M., Soranno, P. A., & Wagner, T. (2019). Identifying and characterizing extrapolation in multivariate response data. PLOS ONE, 14(12). https://doi.org/10.1371/journal.pone.0225715

Balestriero, R., Pesenti, J., & LeCun, Y. (2021, October 29). Learning in High Dimension Always Amounts to Extrapolation. Retrieved April 1, 2022, from https://arxiv.org/pdf/2110.09485.pdf

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Hogue, J. (n.d.). UCI Machine Learning Repository: Metro Interstate Traffic Volume Data Set. Retrieved April 1, 2022, from https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

Fanaee-T, H. (n.d.). UCI Machine Learning Repository: Bike sharing dataset data set. Retrieved April 1, 2022, from https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Hamidieh, K. (n.d.). UCI Machine Learning Repository: Superconductivity Data Data set. Retrieved April 1, 2022, from https://archive.ics.uci.edu/ml/datasets/superconductivty+data

Additional Acknowledgements

This article and the associated code were co-authored by Thomas Giavatto and Daniel Fleck. We would like to give a special thanks to Robert Uleman for his thoughtful revisions, feedback, and guidance.
