This post was reviewed by: Luis Moneda, Gabriela Mourao, and Cristiano Breuel.
Machine learning (ML) models are very sensitive pieces of software; using them successfully requires careful monitoring to make sure they are working correctly.
This is especially true when business decisions are made automatically from model outputs, which means that faulty models will very often have a real impact on the end-customer experience.
Models are only as good as the data they consume, so monitoring the input data (and the outputs) is critical for the model to fulfill its real objective: to be useful to drive good decisions and help the business reach its objectives.
Here are some actionable, framework-agnostic tips you can use to build a more robust monitoring strategy when using machine learning models in production.
(Many of these tips overlap with one another – that's because they are meant to be used as part of an integrated strategy, not as one-off fixes.)
Averages don’t tell the full story
Context
You monitor average values for numerical features in the models you use. You do this because you want to detect data problems, understand when/if feature and label distributions change, etc.
Claim
Average value monitoring doesn’t tell you the full story because it carries assumptions that don’t necessarily hold in reality. For example, it assumes that problems will show up as a shift in the mean – but a feature can break for a small subset of customers, or drift in opposite directions for different segments, and the average will barely move.
In short, there may be a problem that affects your data in severe ways while the average values for the features don’t move at all, which is why you should monitor other angles in addition to averages.
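As a quick illustration, here is a minimal sketch of tracking percentiles and null rates alongside the mean, so that distribution shifts that leave the average untouched still become visible. It assumes a pandas DataFrame of daily model logs; the column names are hypothetical.

```python
import pandas as pd

def feature_summary(df: pd.DataFrame, feature: str) -> pd.Series:
    """Summarize a numerical feature beyond its average.

    Percentiles, null rate and standard deviation can move even
    when the mean stays flat (e.g. opposite shifts in two segments).
    """
    values = df[feature]
    return pd.Series({
        "mean": values.mean(),
        "std": values.std(),
        "p05": values.quantile(0.05),
        "p50": values.quantile(0.50),
        "p95": values.quantile(0.95),
        "null_rate": values.isna().mean(),
    })

# Example: one summary row per day of model logs
# daily_summaries = logs.groupby("score_date").apply(
#     lambda day: feature_summary(day, "monthly_income")
# )
```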
Suggestions
Policy/Decision Layers require additional monitoring
Context
The model is used to make decisions (deny/approve loans, display/not display an advertisement piece, etc) using some sort of policy.
You monitor models from a technical perspective (feature values, precision, accuracy, etc.), but it isn’t obvious what decisions are made from those outputs (this is the policy/decision layer).
Claim
It isn’t enough to monitor models from a technical perspective because this doesn’t make it clear to other stakeholders how the business is being impacted.
You should also monitor the decisions made using the model so that you can make sure the model is delivering the expected business value.
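As one possible illustration (not a prescribed setup), the sketch below assumes the decision taken by the policy layer is logged next to the model score, and tracks the daily approval rate – a number a business stakeholder can read directly. Column names are hypothetical.

```python
import pandas as pd

def daily_decision_rates(decisions: pd.DataFrame) -> pd.DataFrame:
    """Aggregate the policy layer's output, not just the model score.

    Assumes a log with one row per decision and hypothetical columns:
    'decision_date', 'decision' ('approve' / 'deny') and 'score'.
    """
    return (
        decisions
        .assign(approved=decisions["decision"].eq("approve"))
        .groupby("decision_date")
        .agg(
            n_decisions=("decision", "size"),
            approval_rate=("approved", "mean"),
            avg_score=("score", "mean"),
        )
    )

# A sudden drop in approval_rate with a stable avg_score hints that the
# policy thresholds (not the model itself) changed behaviour.
```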
Suggestions
Break monitoring into subpopulations for better insights
Context
You are responsible for maintaining heavily-used machine learning models that are used to score many individual examples daily.
You monitor features/scores via dashboards and there are several “interesting” patterns you want to investigate, but it usually takes a lot of time to track down the reasons for these issues.
Claim
One way to make it easier to understand data and/or model problems is to split monitoring data into subpopulations (subsets of the data being scored by the model) and monitor those separately.
The reason for this is that many data problems have a critical impact on some subsets of examples, but they may “disappear” because their absolute impact is not enough to be felt when you look at aggregate values over the full dataset.
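For instance, here is a minimal sketch of computing the same monitoring aggregates per subpopulation instead of only over the whole dataset. It assumes the scoring logs carry a segment column; all column names are hypothetical.

```python
import pandas as pd

def monitor_by_subpopulation(logs: pd.DataFrame) -> pd.DataFrame:
    """Compute monitoring aggregates per subpopulation.

    A data problem confined to one segment (e.g. a single acquisition
    channel) can be invisible in the global average but obvious here.
    Column names ('score_date', 'channel', 'score', 'monthly_income')
    are hypothetical.
    """
    return (
        logs
        .groupby(["score_date", "channel"])
        .agg(
            n_examples=("score", "size"),
            avg_score=("score", "mean"),
            income_null_rate=("monthly_income", lambda s: s.isna().mean()),
        )
    )
```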
Suggestions
Appropriate feature encoding makes monitoring easier
Context
Features used in models are often preprocessed or encoded to enable their use in some classifiers.
This is sometimes a problem because complex, carefully engineered features are hard to monitor visually or programmatically: their values aren’t meaningful at first glance.
Claim
By encoding (or decoding) features carefully, you can make monitoring easier. This is because most monitoring frameworks are better suited to numerical and categorical values. If you use other types of features (e.g. word embeddings, geolocation coordinates), you may want to decode them (e.g. into strings and city names, respectively) so that you can more easily analyze them in reports and plots.
In addition, you probably want to monitor the original (non-preprocessed, non-encoded) values, because this makes it easier to communicate with other teams and troubleshoot problems when they arise.
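As a sketch of this idea (column names, the lookup table and the assumed log-encoding are hypothetical), the model can keep consuming the encoded feature while the monitoring log stores a human-readable version next to it:

```python
import pandas as pd

# Hypothetical lookup from geo codes to city names,
# used only for monitoring, not by the model itself.
CITY_BY_GEO_CODE = {"6gyf4": "Sao Paulo", "6vjyn": "Recife"}

def build_monitoring_log(scored: pd.DataFrame) -> pd.DataFrame:
    """Log raw / decoded feature values alongside the encoded ones.

    Reports and plots then work on readable strings and raw numbers,
    which are also easier to discuss with other teams.
    """
    return scored.assign(
        city_name=scored["geo_code"].map(CITY_BY_GEO_CODE).fillna("unknown"),
        raw_income=scored["income_log"].rpow(10) - 1,  # undo an assumed log10(x + 1) encoding
    )
```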
Suggestions
Consistency reduces the mental burden of monitoring
Context
You are responsible for maintaining/operating one or more machine learning models, each of them having several features, being used in distinct ways, etc.
You have several dashboards and reports being generated, but the sheer effort required to go through them is too high and it takes a lot of time.
Claim
It’s possible to reduce the cognitive burden and the time needed to go through dashboards and monitoring reports.
One way to do this is to promote consistency and standardization, so that context switching costs are minimized and your team can be more efficient/effective.
Suggestions
Use consistent naming conventions for monitoring jobs and dashboards (e.g. <team-name>-<model-name>-<date>) so that it’s easier to automate and configure them for the whole team.
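As a tiny sketch of what such a convention buys you (the pattern comes from the suggestion above; everything else is hypothetical), a single helper can name every monitoring job the same way, which makes them easy to list, configure and automate:

```python
from datetime import date

def monitoring_job_name(team: str, model: str, run_date: date) -> str:
    """Build a standardized job name: <team-name>-<model-name>-<date>."""
    return f"{team}-{model}-{run_date.isoformat()}"

# e.g. monitoring_job_name("credit", "limit-model", date(2021, 5, 3))
# -> "credit-limit-model-2021-05-03"
```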
Monitor monitoring jobs/routines themselves (meta-monitoring)
Context
You use helper routines, batch jobs or ad-hoc scripts to process model log data. You use these routines to analyze the model features and scores and output aggregate values. You also use these tools to generate alerts under certain conditions.
Claim
Model monitoring batch jobs/routines are just another piece of software, and they will stop working from time to time (someone changed the name of a table and the script breaks, your credentials expired, etc.).
If you count on monitoring jobs/routines/scripts to run and signal problem conditions, the absence of alerts may lead you into thinking things are OK when in reality the monitoring jobs just didn’t run or there was some problem with them.
You need to monitor the monitoring jobs themselves to guard against this (meta-monitoring).
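One common way to do this is a heartbeat (or “dead man’s switch”) check: each monitoring job records that it ran, and a separate, simpler job alerts when that record goes stale. A minimal sketch, with a hypothetical path and an assumed send_alert helper:

```python
import time
from pathlib import Path

HEARTBEAT_FILE = Path("/var/run/model-monitoring/last_success")  # hypothetical path
MAX_AGE_SECONDS = 26 * 60 * 60  # alert if the daily job is more than 26h late

def record_heartbeat() -> None:
    """Called at the end of every successful monitoring run."""
    HEARTBEAT_FILE.parent.mkdir(parents=True, exist_ok=True)
    HEARTBEAT_FILE.write_text(str(time.time()))

def check_heartbeat(send_alert) -> None:
    """Run from a separate scheduler: alert if monitoring stopped running."""
    if not HEARTBEAT_FILE.exists():
        send_alert("Model monitoring has never reported a successful run.")
        return
    age = time.time() - float(HEARTBEAT_FILE.read_text())
    if age > MAX_AGE_SECONDS:
        send_alert(f"Model monitoring last succeeded {age / 3600:.1f}h ago.")
```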
Suggestions
Patterns for batch monitoring
Context
You have batch jobs that analyze model log data and calculate aggregates on those (average feature value over each day, average scores, etc) but someone needs to actually go look at the data to see if everything’s OK.
Claim
It’s easy to create monitoring reports that go ignored because nobody has had the time to actively go to the dashboard/notebooks and look at the results. Here are some ways to make them more useful and efficient.
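One pattern that helps here (a sketch under assumptions, not a prescribed setup) is to have the batch job itself compare each aggregate against a threshold and push a short message to a channel people already read, instead of waiting for someone to open a dashboard. The metric names and limits below are hypothetical.

```python
# Hypothetical thresholds per monitored aggregate.
CHECKS = {
    "income_null_rate": {"max": 0.05},
    "avg_score": {"min": 0.10, "max": 0.90},
}

def violations(aggregates: dict) -> list[str]:
    """Return human-readable messages for every threshold that was breached."""
    messages = []
    for name, limits in CHECKS.items():
        value = aggregates.get(name)
        if value is None:
            messages.append(f"{name}: missing from today's aggregates")
        elif "max" in limits and value > limits["max"]:
            messages.append(f"{name}: {value:.3f} above max {limits['max']}")
        elif "min" in limits and value < limits["min"]:
            messages.append(f"{name}: {value:.3f} below min {limits['min']}")
    return messages

# The batch job then posts '\n'.join(violations(today)) to the team's
# channel only when the list is non-empty.
```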
Suggestions
Realtime models only: Train/Serve Skew monitoring
Context
Models used to make inferences in real time are usually trained on historical data retrieved from a database, in a batch fashion.
This creates the risk that the data path used for training doesn’t exactly match the data path used for inference (usually HTTP calls to external services to obtain features). This is called train/serve skew.
Claim
Train/serve skew is a major risk you should consider when deploying realtime models. This must be monitored continually, as long as the model is in use.
The most common causes of mismatches here are changes in the external services the model depends on to fetch feature data in real time.
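A minimal sketch of one way to catch this, assuming serving-time feature values are logged and the same features can be recomputed later through the batch/training path (all names are hypothetical): join the two sources and track how often they disagree.

```python
import pandas as pd

def train_serve_skew(served: pd.DataFrame, recomputed: pd.DataFrame,
                     feature: str, tolerance: float = 1e-6) -> float:
    """Fraction of examples where the serving-time feature value differs
    from the value recomputed through the batch (training) data path.

    Both frames are assumed to have an 'example_id' column and the feature.
    """
    merged = served.merge(recomputed, on="example_id",
                          suffixes=("_served", "_batch"))
    diff = (merged[f"{feature}_served"] - merged[f"{feature}_batch"]).abs()
    return (diff > tolerance).mean()

# A skew rate persistently above zero for a feature usually points to a
# change in the external service that provides it at serving time.
```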
Suggestions
Realtime models only: Patterns for alerting
We now have a dedicated post on this topic: Best Practices for Real-time Machine Learning: Alerting
Context
You have created some real-time alerts (e-mails, slack messages, mobile push notifications, etc) to alert you when the model is behaving in unexpected ways, such as weird feature values, missing features, scores being too high/too low, etc.
Claim
It is pretty easy to end up with alerts that are either too noisy (go off very often and people don’t take them seriously anymore) or not sensitive at all (they never go off, even when they should).
You should try to keep alerts relevant and easy to act upon (include enough information for people to quickly tell if the alert signals an actual problem).
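As a small sketch of what “easy to act upon” can look like in practice (the message format, helper names and URL are assumptions, not a prescribed template), the alert carries the value, the threshold, the affected model and a pointer to where to investigate:

```python
def build_alert(model: str, metric: str, value: float,
                threshold: float, dashboard_url: str) -> str:
    """Build an alert message with enough context to be actionable."""
    return (
        f"[{model}] {metric} = {value:.3f} crossed threshold {threshold:.3f}.\n"
        f"Check recent deploys of upstream services and the dashboard: {dashboard_url}"
    )

# e.g. send_to_slack(build_alert("credit-limit", "missing_income_rate",
#                                0.31, 0.05, "https://dashboards.example/credit-limit"))
```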
Suggestions
Conclusion
These are some of the tips we have found useful for monitoring several ML models here at Nubank.
They are used in a variety of business contexts (credit, fraud, CX, Operations, etc) and we believe they are general enough to be applicable in other companies too.
Check our job opportunities