Machine learning (ML) models are sensitive pieces of software; using them successfully requires careful monitoring to make sure they are working correctly.
This is especially true when business decisions are made automatically based on model outputs: a faulty model will very often have a real impact on the end-customer experience.
Models are only as good as the data they consume, so monitoring the input data (and the outputs) is critical for the model to fulfill its real objective: to be useful to drive good decisions and help the business reach its objectives.
Here are a couple of actionable, framework-agnostic tips you can use to have a more robust monitoring strategy when using machine learning models in production.
(Many tips overlap with each other – this is because they are meant to be used as part of an integrated strategy, not as one-off fixes.)
Averages don’t tell the full story
You monitor average values for numerical features in the models you use. You do this because you want to detect data problems, understand when/if feature and label distributions change, etc.
Average value monitoring doesn’t tell you the full story because it carries a couple of assumptions that don’t necessarily match up to reality. For example:
- It ignores missing data: most numerical tools drop missing entries and calculate averages over the remaining (non-null) data only.
- It assumes that data problems will be large enough to move the average significantly. Conversely, feature changes may move the averages a lot but not affect higher percentiles – which is often where model decisions are made.
- It assumes that changes in model scores have a linear relationship with the actions taken based on them.
In short, a problem may affect your data in severe ways while the average values for the features don't move at all, which is why you should monitor other angles in addition to averages.
- Monitor the percentiles for numerical feature values – e.g. 99th, 95th, 90th and 10th, 5th, 1st percentiles too. This way you can detect cases where tail examples change even when the average feature value didn’t change. This is especially useful for cases where the data distribution is skewed and/or imbalanced.
- Monitor missing value rate for all features. This needs to be monitored separately because a large percentage of missing values is a big problem – even when the mean of non-missing values hasn’t changed much.
- Split monitoring into subpopulations to detect issues that only affect a subset of the total examples scored.
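The first two points above can be sketched with pandas. This is a minimal example (the `income` feature and its values are made up): it reports tail percentiles alongside the mean, and tracks the missing-value rate separately, since pandas silently drops NaNs when computing means and quantiles.

```python
import pandas as pd
import numpy as np

def feature_summary(df: pd.DataFrame, feature: str) -> dict:
    """Summarize a numerical feature beyond the average:
    tail percentiles plus the missing-value rate."""
    values = df[feature]
    percentiles = [0.01, 0.05, 0.10, 0.90, 0.95, 0.99]
    # quantile() and mean() ignore NaNs, so missing data is invisible here...
    summary = {f"p{round(p * 100):02d}": values.quantile(p) for p in percentiles}
    summary["mean"] = values.mean()
    # ...which is why the missing rate must be tracked as its own metric.
    summary["missing_rate"] = values.isna().mean()
    return summary

# Hypothetical daily scoring log with some missing values.
scores = pd.DataFrame({"income": [1000, 1200, np.nan, 900, 50000, np.nan]})
print(feature_summary(scores, "income"))
```

Note how the single outlier (50000) barely shows up in the mean but dominates the 99th percentile, and how a third of the rows being missing would go unnoticed if only the mean were monitored.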
Policy/Decision Layers require additional monitoring
The model is used to make decisions (deny/approve loans, display/not display an advertisement, etc) through some sort of policy.
You monitor models from a technical perspective (feature values, precision, accuracy, etc) but it isn't obvious from those metrics what decisions are being made based on the model (this is the policy/decision layer).
It isn’t enough to monitor models from a technical perspective because this doesn’t make it clear to other stakeholders how the business is being impacted.
You should also monitor the decisions made using the model so that you can make sure the model is delivering the expected business value.
- Monitor the decisions made using the model. For example: how many people got loans approved by the risk model on each day? How many people had their accounts blocked by the fraud model on each day? It’s often useful to monitor both absolute and relative values here.
- Be careful to adjust the level of granularity depending on the target audience: if your model scores a given customer multiple times, your target audience may be more interested in metrics aggregated by customers than individual entities scored.
- If you are running a realtime model, you should also monitor how many wrong decisions were made due to train/serve skew.
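As a sketch of the first point, decision-layer metrics can be computed directly from a decision log. The loan-approval log below is hypothetical; the point is reporting both the absolute count and the relative rate per day:

```python
import pandas as pd

# Hypothetical decision log: one row per scored loan application.
decisions = pd.DataFrame({
    "date": ["2023-05-01", "2023-05-01", "2023-05-01", "2023-05-02", "2023-05-02"],
    "approved": [True, False, True, False, False],
})

daily = decisions.groupby("date")["approved"].agg(
    approved_count="sum",   # absolute: how many approvals per day
    approval_rate="mean",   # relative: share of applications approved
)
print(daily)
```

Monitoring both columns matters: a stable approval rate can hide a collapse in volume, and a stable count can hide a drifting rate.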
Break monitoring into subpopulations for better insights
You are responsible for maintaining heavily used machine learning models that score many individual examples daily.
You monitor features/scores via dashboards and there are several “interesting” patterns you want to investigate, but it usually takes a lot of time to track down the reasons for these issues.
One way to make it easier to understand data and/or model problems is to split monitoring data into subpopulations (subsets of the data being scored by the model) and monitor those separately.
The reason for this is that many data problems have a critical impact on some subsets of examples, but they may “disappear” because their absolute impact is not enough to be felt when you look at aggregate values over the full dataset.
- Instead of looking at aggregate feature/score values for whole datasets, break that into subpopulations and monitor those instead.
- For example: if you have a fraud model, it may be worthwhile to split monitoring by the type of device (web, mobile, etc) used in each example scored.
- Also monitor the raw counts for separate populations (how many examples of each were scored each day, what was the percentage, etc)
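A subpopulation split is usually one `groupby` away. In this sketch (device types and scores are made up, following the fraud example above), each subpopulation gets its own count, mean score, and share of the total:

```python
import pandas as pd

# Hypothetical scoring log with a device-type subpopulation column.
log = pd.DataFrame({
    "device": ["web", "mobile", "mobile", "web", "mobile"],
    "score":  [0.10, 0.80, 0.90, 0.20, 0.85],
})

per_device = log.groupby("device")["score"].agg(["count", "mean"])
# Raw counts and the share of examples in each subpopulation,
# so shifts in the population mix are visible too.
per_device["share"] = per_device["count"] / per_device["count"].sum()
print(per_device)
```

Here the aggregate mean score (0.57) says little, while the split makes it obvious that mobile traffic scores far higher than web traffic.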
Appropriate feature encoding makes monitoring easier
Features used in models are often preprocessed or encoded to enable their use in some classifiers.
This is sometimes a problem because it’s hard to visually or programmatically monitor complex, carefully engineered features that aren’t immediately obvious upon a first look.
By encoding (or decoding) features carefully, you can make monitoring easier. This is because most monitoring frameworks are better suited for numerical values and categorical values. If you use different types of features (e.g. word embeddings, geolocation coordinates), you may want to decode them (e.g. into strings and city names, respectively) so that you can more easily analyze those in reports and plots.
In addition, you probably want to monitor the original (non-preprocessed, non-encoded) values, because this makes it easier to communicate with other teams and troubleshoot problems when they arise.
- Monitor input values (i.e. not necessarily features themselves but information used to build features) in addition to the features themselves. This is useful when you apply several numeric transformations to those.
- Whenever possible, encode boolean feature values as floats (1.0, 0.0 and null), so that it’s easier to monitor them like a regular numerical variable (extract means and other numerical properties, etc) and reuse all tooling (e.g. plots) made for those.
- For categorical features encoded with strategies such as one-hot-encoding or target-encoding, you probably want to decode them back into their original values so that you can monitor the actual classes, not the encoded categories.
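The boolean-as-float and one-hot-decoding tips can be sketched as follows (the `has_history` and `device_*` names are illustrative only):

```python
import pandas as pd

# Encode a boolean feature as floats (1.0 / 0.0 / missing) so all the
# numerical monitoring tooling (means, percentiles, plots) applies unchanged.
has_history = pd.Series([True, False, None, True])
as_float = has_history.map({True: 1.0, False: 0.0})  # None becomes NaN
print(as_float.mean())          # share of True among non-missing values
print(as_float.isna().mean())   # missing rate, tracked separately

# Decode one-hot-encoded categorical columns back to class labels
# before monitoring, so reports show actual categories.
onehot = pd.DataFrame({"device_web": [1, 0, 0], "device_mobile": [0, 1, 1]})
decoded = onehot.idxmax(axis=1).str.replace("device_", "", regex=False)
print(decoded.tolist())  # ['web', 'mobile', 'mobile']
```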
Consistency reduces the mental burden of monitoring
You are responsible for maintaining/operating one or more machine learning models, each of them having several features, being used in distinct ways, etc.
You have several dashboards and reports being generated, but the sheer effort required to go through them is too high and it takes a lot of time.
It’s possible to reduce the cognitive burden and the time needed to go through dashboards and monitoring reports.
One way to do this is to promote consistency and standardization, so that context switching costs are minimized and your team can be more efficient/effective.
- Use a single tool: if at all possible, use a single tool/vendor for all model monitoring. This makes it easier to share configuration and usage patterns among several models.
- Order things consistently. For example: order feature plots by feature importance so that you can quickly see whether there are serious problems to investigate (or just order them alphabetically).
- Name things consistently: if you need to name files, datasets, dashboards, tables, etc, make sure you follow some sort of pattern (like <team-name>-<model-name>-<date>) so that it's easier to automate and configure those for the whole team.
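A tiny helper can enforce such a naming pattern so every artifact name is predictable to generate and parse (the team/model names below are hypothetical):

```python
from datetime import date

def artifact_name(team: str, model: str, run_date: date) -> str:
    """Build standardized names following the <team-name>-<model-name>-<date>
    pattern, so dashboards, tables and files are predictable to automate."""
    return f"{team}-{model}-{run_date.isoformat()}"

print(artifact_name("credit", "loan-risk", date(2023, 5, 1)))
# → credit-loan-risk-2023-05-01
```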
Monitor monitoring jobs/routines themselves (meta-monitoring)
You use helper routines, batch jobs or ad-hoc scripts to process model log data. You use these routines to analyze the model features and scores and output aggregate values. You also use these tools to generate alerts under certain conditions.
Model monitoring batch jobs/routines are just another piece of software and they usually stop working from time to time (someone changed the name of a table and the script breaks, your credentials expired, etc).
If you count on monitoring jobs/routines/scripts to run and signal problem conditions, the absence of alerts may lead you into thinking things are OK when in reality the monitoring jobs just didn’t run or there was some problem with them.
You need to monitor the monitoring jobs themselves to guard against this (meta-monitoring).
- Monitor the execution times for the job. Steadily increasing execution times may indicate you’ll soon have to change strategies. Execution times that are too short may indicate that there was some other problem in the job.
- Use heartbeat-style alerts. You can add a step at the end of every job/script to send a ping to some other system. Heartbeat alerts go off when something has not happened, for example if the backend hasn’t received a ping for more than 24 hours.
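Both tips can be combined in a small wrapper around the job, sketched below. The heartbeat endpoint URL is a placeholder and the job body is a stub; the key idea is that the ping is the very last step, so silence means something failed:

```python
import time
import urllib.request

# Hypothetical heartbeat endpoint; hosted services that implement this
# ping model exist, or a simple internal backend can record the pings.
HEARTBEAT_URL = "https://heartbeats.example.com/ping/model-monitoring-daily"

def compute_feature_aggregates():
    """Placeholder for the actual monitoring work."""
    time.sleep(0.1)

def run_job_with_heartbeat(send_ping=lambda: urllib.request.urlopen(HEARTBEAT_URL, timeout=10)):
    start = time.monotonic()
    compute_feature_aggregates()
    elapsed = time.monotonic() - start
    # Execution time is itself a monitored metric: steadily growing times
    # hint at a coming capacity problem; suspiciously short ones suggest
    # the job did nothing.
    # Last step: ping the heartbeat backend. If it stops receiving pings
    # for more than 24h, *it* raises the alert.
    send_ping()
    return elapsed
```

Injecting `send_ping` as a parameter also makes the wrapper easy to test without network access.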
Patterns for batch monitoring
You have batch jobs that analyze model log data and calculate aggregates on those (average feature value over each day, average scores, etc) but someone needs to actually go look at the data to see if everything’s OK.
It’s easy to create monitoring reports that go ignored because nobody has had the time to actively go to the dashboard/notebooks and look at the results. Here are some ways to make them more useful and efficient.
- Have the jobs actively send the results (charts, tables, etc) as a message to your email or slack channel at the end of each run. In the message body, only include the most important information.
- The idea here is to have just enough information so that you can quickly see if there are any problems.
- In the message body, add a link to the full dashboard/report data so that people can look at the whole thing if they need to.
- Write code that can handle historical data by default: this makes it easier to reuse code both for historical analysis and for incremental (e.g. daily) monitoring.
- Use business language whenever possible so that all stakeholders (not just technical people) can understand the impact of the models in business decisions.
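A minimal message builder illustrating these points (the model name, alert text, and dashboard URL are all hypothetical): the body carries only the headline and the checks that need attention, plus a link to the full report.

```python
def build_report_message(model, day, alerts, dashboard_url):
    """Compose a short batch-monitoring message: only the most important
    information in the body, plus a link to the full dashboard."""
    if alerts:
        headline = f"{model} ({day}): {len(alerts)} check(s) need attention"
        details = "\n".join(f"- {a}" for a in alerts)
    else:
        headline = f"{model} ({day}): all checks passed"
        details = ""
    return "\n".join(filter(None, [headline, details, f"Full report: {dashboard_url}"]))

print(build_report_message(
    "fraud-model", "2023-05-01",
    ["missing rate for device type jumped from 1% to 12%"],
    "https://dashboards.example.com/fraud-model",  # hypothetical link
))
```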
Realtime models only: Train/Serve Skew monitoring
Models used to make inference at realtime are usually trained on historical data retrieved from a database, in a batch fashion.
This creates the risk that the data path used for training doesn’t exactly match the data path used for inference (usually HTTP calls to external services to obtain features). This is called train/serve skew.
Train/serve skew is a major risk you should consider when deploying realtime models. This must be monitored continually, as long as the model is in use.
The most common causes for mismatches here are changes in external services the model depends on to fetch feature data at realtime.
- Monitor the rate of exact matches between the batch and realtime flows (e.g. record a 1 when both flows produced exactly the same value and a 0 otherwise) and monitor this value like you monitor other features. This way you can see how much skew you have per feature.
- Monitor the magnitude of the deviations: for the cases where there was a mismatch between the batch and the realtime data paths, how bad was it? Was it just a small difference or a big one?
- Monitor counts too: monitor how many examples went through each data path on a given day. This is important because unknown changes may cause more examples to be scored by realtime models than you expected.
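The first two checks can be sketched by joining batch-recomputed features with the values that were actually served, keyed by example id (the data below is made up):

```python
import pandas as pd

# Hypothetical logs: the same feature computed by the batch path and
# captured from the realtime (serving) path, joined per example.
batch = pd.DataFrame({"id": [1, 2, 3], "income": [1000.0, 1200.0, 900.0]})
served = pd.DataFrame({"id": [1, 2, 3], "income": [1000.0, 1250.0, 900.0]})

joined = batch.merge(served, on="id", suffixes=("_batch", "_served"))
# Exact-match indicator per feature, monitored like any other feature.
joined["income_match"] = (joined["income_batch"] == joined["income_served"]).astype(float)
match_rate = joined["income_match"].mean()
# Magnitude of the deviation for the mismatched cases only.
mismatch = joined[joined["income_match"] == 0.0]
deviation = (mismatch["income_served"] - mismatch["income_batch"]).abs()
print(match_rate, deviation.tolist())
```

Comparing the row counts of `batch` and `served` before the join covers the third check: examples present in one path but not the other are themselves a skew signal.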
Realtime models only: Patterns for alerting
You have created some real-time alerts (e-mails, slack messages, mobile push notifications, etc) to alert you when the model is behaving in unexpected ways, such as weird feature values, missing features, scores being too high/too low, etc.
It is pretty easy to end up with alerts that are either too noisy (go off very often and people don’t take them seriously anymore) or not sensitive at all (they never go off, even when they should).
You should try to keep alerts relevant and easy to act upon (include enough information for people to quickly tell if the alert signals an actual problem).
- Watch out for unusual time periods like early mornings and weekends. Since far fewer examples may be scored by models at these times, alerts may go off for the simple reason that your sample size is too small.
- Always include the time frame and the specific data points behind the alert so people can evaluate whether it's a false positive. Bad: "Feature X in model Y is too high". Good: "Average value of Feature X in model Y for the past 15 minutes was too high (expected between 0.4 and 0.5, but got 100.0 instead)".
- If possible, include a link to the full dashboard or somewhere you can look at more complete data and decide if you should investigate it further.
- If possible, have some sort of troubleshooting guide at hand so that new team members can easily act on alerts.
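A small formatter can make the "good" alert shape above the default rather than a matter of discipline (the model name and dashboard link in the example are hypothetical):

```python
def format_alert(model, feature, window, observed, lo, hi, dashboard_url):
    """Actionable alert text: includes the time frame, the observed value,
    the expected range, and a link to the full data, so readers can judge
    false positives quickly."""
    return (
        f"Average value of {feature} in model {model} for the past {window} "
        f"was out of range (expected between {lo} and {hi}, but got {observed} instead). "
        f"Dashboard: {dashboard_url}"
    )

print(format_alert("Y", "Feature X", "15 minutes", 100.0, 0.4, 0.5,
                   "https://dashboards.example.com/model-y"))  # hypothetical link
```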
These are some of the tips we have found useful for monitoring several ML models here at Nubank.
They are used in a variety of business contexts (credit, fraud, CX, Operations, etc) and we believe they are general enough to be applicable in other companies too.
Stay tuned for the next posts and check our open positions!