How Low is Too Low for R2
Why This Article?
Over the last few months, I have been keenly following the rigor with which the healthcare community is conducting clinical studies during the current pandemic. I have been contrasting this with the more freestyle approach that we in the industrial data science community take in proving out our models and deploying them. While the stakes are arguably lower, the impact of even a simple mistake can be similar: the performance of your AI model in the lab will not be reproducible in the field. This article series is an overview of common mistakes that I have noticed practitioners in the industrial sector commit while building machine learning models, and how to avoid them. I hope you will find this article useful if you are:
- A subject matter expert in the industrial sector looking to move toward a career in data science
- An executive charged with generating ROI from digital programs
- An industrial data science expert looking to validate your own experience
Over the course of this series, I plan to touch on topics such as:
- Management of machine learning models throughout their lifecycle
- Building end-to-end data integration pipelines from raw data ingestion to analytic output delivery
- Optimization of your models without impacting their generalizability
- Rinse-and-repeat design patterns that scale to all of your data across varying modalities, volumes, and velocities
Along the way, I will address questions such as:
- How do you and your business leaders decide if your model is worth taking to the field?
- Will your model live up to its lab performance in the field?
- How do you combine your model with legacy reasoning tools or human intelligence to provide the best possible outcomes to your business?
Issue One: R2 – How low is too low?
If you have data science colleagues in the digital world, you will have noticed with envy how easy it is for them to prove the impact their models generate. For them, graduating a model from the lab to an A/B testing environment and on to production once it is proven is a run-of-the-mill operation. For example, if you believe you have a better recommendation engine, you deploy it on a portion of your web traffic and compare its performance to the previous model. If more people click on your recommendations, your model will be in production in no time. Simple.

Unfortunately, our non-digital world is much more complex. Before you get approval to take your model from the lab to the field, where it will start impacting the business, you will be asked the question: is your model field worthy? Let us focus on this question using the machine learning performance metric R2. R2 is the standard metric for machine learning regression problems and is defined as "the proportion of the variance in the dependent variable that is predictable from the independent variable(s)." The higher the R2, the better the model. Ideally, you would want all of your output variability captured by your model, for an R2 of 1.

However, in industrial data science, you will very often be limited by the quality of your datasets. The phenomenon you are trying to model may have too many external dependencies that are not captured by your data. The high-frequency data sources may have been down-sampled significantly before they were written to storage. The current sensing systems on your equipment may not have been designed for advanced analytic use cases. Often, you will end up with a model that provides only a partial explanation of your outputs, and you will end up debating with your business users whether it improves the status quo at all.

An obvious question arises. What is a good enough R2 to warrant deploying the model in production? Is there an obvious value of R2 below which the model should be discarded? I have noticed many data scientists interpret R2 as if it were a statistical test, where a model is rejected when an R2 threshold is not met. This is possibly inspired by the analogue of statistical testing, where you reject the null hypothesis if the p-value is less than the significance level. However, the correct answer is: it depends. Whether the model is useful is determined by the business problem you are solving.
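As a quick refresher on what the metric itself measures, here is a minimal sketch of how R2 is computed for a fitted regression model. The toy data, coefficients, and noise level below are my own placeholders and are not taken from any dataset discussed in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy data: the output is partly explained by the input, partly by noise.
rng = np.random.default_rng(0)
x = rng.uniform(50, 120, 500).reshape(-1, 1)   # e.g. a single sensor reading
y = 0.8 * x.ravel() + rng.normal(0, 20, 500)   # unexplained variance pulls R2 below 1

model = LinearRegression().fit(x, y)
y_hat = model.predict(x)

# R2 = 1 - SS_res / SS_tot: the share of output variance the model explains.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print("manual  R2:", 1 - ss_res / ss_tot)
print("sklearn R2:", r2_score(y, y_hat))
```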
A Small Contrived Example
Let me explain with a contrived game of chance. While the example focuses on R2, the same ideas apply to other model performance metrics such as accuracy, F2 score, and adjusted R2. You are playing a game similar to roulette: you bet on where a ball will land on a spinning wheel that has 100 numbers, partitioned into three categories with 45 numbers in category 1, 10 in category 2, and 45 in category 3. The ball is equally likely to land on any of the numbers, so your chance of winning is 45% when you bet on category 1 or 3 and 10% when you bet on category 2. On each game, or ball spin, you can bet $10. You double your money if the ball falls in your chosen category and lose your stake if it does not. If you play long enough, you will lose all your money: whichever category you bet on, you are more likely to lose than to win, and even the best blind bet (category 1 or 3) loses $1 per game in expectation.

This is where it gets interesting. Because you are a data scientist, you notice a pattern. The roulette machine does not seem to be functioning correctly, and you have a hunch that the category the ball falls in has some dependency on the weight of the person spinning the wheel. You decide to observe 1,000 games and build a model to predict the game outcome from the spinner's weight. Using this model, you plan to play the next 10,000 games in the hope of making a lot of money.

Consider six different scenarios. The training dataset for each scenario is available at https://assets.deepiq.com/docs/training.csv. The only difference between the six scenarios is the strength of the dependency between the game outcome and the spinner's weight. In the first dataset, your hunch proves accurate: the game result has a strong dependency on the spinner's weight. In the later datasets, I progressively weakened the relationship, and in scenario 6 your hunch is completely wrong: there is no relationship between weight and game outcome. For each dataset, let us fit a simple regression model that predicts the game result from the spinner's weight and use it to play the 10,000 simulated games at https://assets.deepiq.com/docs/simulatedgames.csv. The regression models and the training data are shown in Table 1.

As Figure 1 shows, your return from using the model decreases as R2 decreases. However, even with an R2 of 0.2 you make a profit of $9,150; much lower than the $85,690 you were making with an R2 of 0.9, but much better than playing without a model and incurring an expected loss of $10,000. At an R2 of 0.1, though, you start incurring losses, and you would have been better off not playing the game with this model. So if your intention is to make a positive return from this gambling game, every model with an R2 of 0.2 or higher would have been useful.

Long story short, whether a model is good or not depends on the problem you are trying to solve and the returns you want from it. You can still make money, and generate a positive ROI, using "poor quality" models. Having some model that generalizes well can be significantly better than having no model at all. In this simulated game, a "poor R2" model gave you a sufficient advantage over your opponent, the spinner. In real-world business cases, having even a little less uncertainty about your expected outcomes than your competitors can still provide a differentiating advantage.
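If you want to reproduce the spirit of this experiment without downloading the linked files, the sketch below is a self-contained simulation. The column meanings, the 45/10/45 category split, the coefficients, and the noise levels are my own assumptions, so the exact figures will differ from Table 1 and Figure 1, but the pattern is the same: as the weight-to-outcome dependency weakens, R2 drops, and at some point the model's betting advantage disappears.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

def category(number):
    # Hypothetical split consistent with the stated win probabilities:
    # 45 numbers in category 1, 10 in category 2, 45 in category 3.
    return np.where(number <= 45, 1, np.where(number <= 55, 2, 3))

def play(noise_sd, n_train=1_000, n_play=10_000, bet=10):
    """Fit a regression on observed games, then bet on the predicted category."""
    def games(n):
        weight = rng.uniform(50, 120, n).reshape(-1, 1)         # spinner's weight
        number = np.clip(0.8 * weight.ravel()                   # faulty wheel: outcome tracks weight,
                         + rng.normal(0, noise_sd, n), 1, 100)  # plus noise that weakens the link
        return weight, number

    X_train, y_train = games(n_train)
    model = LinearRegression().fit(X_train, y_train)
    r2 = model.score(X_train, y_train)

    X_play, y_play = games(n_play)
    bet_on = category(model.predict(X_play))
    outcome = category(y_play)
    profit = int(np.where(bet_on == outcome, bet, -bet).sum())
    return r2, profit

# Larger noise = weaker weight-to-outcome dependency = lower R2.
for noise_sd in (5, 10, 15, 20, 30, 60):
    r2, profit = play(noise_sd)
    print(f"R2 = {r2:.2f}   profit over 10,000 games: ${profit:,}")
```

Betting on the category of the predicted number is one simple way of turning a regression output into a decision; the regression models in Table 1 play the same role for the article's datasets.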
Your Problems are not Constant
Businesses typically treat their problems as a given and data science projects as tools that either succeed or fail at solving this prescribed, constant problem. By doing so, they leave value on the table. Making optimal use of your analytic models might require you to adapt your business strategy and, in some cases, completely change your business model.
A Small Contrived Example, Revisited
In the roulette game, we started losing money when our model had an R2 of 0.1 or less. Let us change our strategy a bit. Say you notice that when the model predicts category 2, it is highly likely to be wrong. So you adopt a new rule: when the model predicts category 2, you do not bet on that game. Figure 2 compares the returns of the old and new strategies across all six scenarios. Now you make money even at an R2 of 0.1. By adapting your strategy, you have converted a "bad model" into a useful one. In fact, for every scenario except the first, the new strategy nets you more money than the old one, so except in that one case your business benefits from changing how it plays the game.

This scenario illustrates the interplay between business strategy and model performance. In many industrial data problems, your data is fixed and collecting more is not an option, so it may simply not be possible to improve on your best model. The only thing left to adapt is your strategy. Once you know the best you can do with your data, the focus should shift to the best you can do with that information: can your business strategy evolve, using the additional information these models provide, to give you a competitive edge? Decision science and risk analysis are mature fields that deal with exactly this question, yet I see many businesses still struggle with this way of thinking.
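Under the same synthetic setup as the earlier sketch (again my own assumptions, so the numbers will differ from Figure 2), the modified strategy is a one-line change: sit out whenever the model predicts category 2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

def category(number):
    # Same hypothetical 45/10/45 split as before.
    return np.where(number <= 45, 1, np.where(number <= 55, 2, 3))

def games(n, noise_sd):
    weight = rng.uniform(50, 120, n).reshape(-1, 1)
    number = np.clip(0.8 * weight.ravel() + rng.normal(0, noise_sd, n), 1, 100)
    return weight, number

# Larger noise = weaker dependency = lower R2.
for noise_sd in (5, 10, 15, 20, 30, 60):
    X_train, y_train = games(1_000, noise_sd)
    model = LinearRegression().fit(X_train, y_train)

    X_play, y_play = games(10_000, noise_sd)
    predicted = category(model.predict(X_play))
    actual = category(y_play)

    stake = np.where(predicted == actual, 10, -10)              # win or lose $10 per game
    always_bet = int(stake.sum())                               # old strategy: bet every game
    skip_cat2 = int(np.where(predicted == 2, 0, stake).sum())   # new strategy: no bet on category 2
    print(f"R2 = {model.score(X_train, y_train):.2f}   "
          f"always bet: ${always_bet:,}   skip category 2: ${skip_cat2:,}")
```

When the model is strong, sitting out on category 2 forgoes some winning bets; when it is weak, it avoids exactly the bets the model is most likely to get wrong, which is the kind of strategy change this section argues for.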
Because businesses struggle to think this way, the focus stays on squeezing out a higher R2, and that turns into a rat race. Data scientists feel tremendous pressure to produce higher-quality models without any real change in the fundamentals of the problem, namely better data or a more flexible problem definition. Faced with the alternative of the entire program being cancelled, they try multiple tactics to produce "impressive" models, including:
- Complex models – Maybe you should add more layers to your network model?
- Feature engineering – Maybe you should start normalizing the differential pressure by the input pressure and retraining?
- More creative data partitioning – How naïve of you to expect the model to capture a failure mode it never saw in the training data. Maybe you should swap that portion into your training set?
The more productive question, though, is about business strategy: my model is not good enough to shift completely to a predictive maintenance strategy, but is it good enough to reduce the frequency of our planned maintenance?