Written by Felipe Almeida
With contributions from Caique Lima and Luiz Felix
Real-time Machine Learning refers to integrating Machine Learning into systems that operate continuously – this often means models that output scores and predictions on demand, as they are requested.
As with any piece of software, things can and often do go wrong in a variety of ways. These systems are well-oiled machines in which one failing piece usually has negative impacts downstream, such as:
- Response time issues due to sudden load increase
- Crashes due to upstream problems
- Crashes due to bad deploys
In addition to the above, there are many other types of failures that apply specifically to ML models:
- Missing/broken features causing wrong predictions
- Sudden changes in population distribution causing wrong predictions
The main difference between regular and ML-enabled software is that ML models may fail silently.
That is, ML systems may be producing wrong predictions even though no explicit exceptions or error messages are raised.
In the next sections we analyze lessons learned and best practices assembled from years of applying ML to real-life problems at Nubank.
Alerting vs Monitoring
Model monitoring refers to understanding and addressing latent behavior, whereas alerting usually refers to detecting urgent problems that must be dealt with immediately.
As such, the focus of monitoring is usually on detecting issues with an eye to understanding and investigating what is going on, whereas the focus of alerting is on getting the system back to normal as quickly as possible.
There is, however, a close relationship between the two – the first action taken by someone addressing an alert may be precisely to open monitoring dashboards and compare short-term with medium-term data.
| Alerting | Monitoring |
| --- | --- |
| Focus on speedy action | Focus on thorough understanding and investigation |
| Short-term (hours, minutes) | Medium-term (days, weeks, months) |
| Passive consumption (you get alerted) | Active consumption (you choose to look at dashboards) |
Operational monitoring still applies
ML-enabled systems are software too! This means that all the usual problems and issues that may happen to any other system can and will happen to ML-driven systems too.
Here are some points from regular software monitoring that also apply to ML systems:
Operational system health
As with any other piece of real-time, production software, you'll want alerts for regular metrics and health checks (a minimal sketch of how to expose them follows this list):
- System Errors
- Response times
- Scaling problems (CPU, RAM, etc)
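Below is a minimal sketch of how such metrics might be exposed from a model-serving service, assuming a Prometheus-style setup; the metric names, port and stub model are illustrative, not a prescription.

```python
# Sketch of exposing operational metrics from a model-serving service,
# assuming a Prometheus-based setup. Names, port and model are illustrative.
from prometheus_client import Counter, Histogram, start_http_server
import time

class _StubModel:                      # placeholder for the real model
    def predict(self, features):
        return 0.5

model = _StubModel()

PREDICTION_ERRORS = Counter(
    "model_prediction_errors_total", "Failed prediction requests")
RESPONSE_TIME = Histogram(
    "model_response_time_seconds", "Latency of prediction requests")

def handle_prediction_request(features):
    start = time.monotonic()
    try:
        return model.predict(features)
    except Exception:
        PREDICTION_ERRORS.inc()        # feeds the "system errors" alert
        raise
    finally:
        RESPONSE_TIME.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)            # metrics scraped from :8000/metrics
    handle_prediction_request({"age": 30})
```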
Logging/Tracing
You’ll need the usual logging and distributed tracing tooling to centralize and enable analysis on log data.
This is essential for alerting because true alerts generally trigger some sort of investigation – and that is where a solid logging infrastructure comes in.
Some common tools in this space are Splunk, Datadog and New Relic.
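As a rough illustration, emitting one structured JSON log line per prediction makes it straightforward for a log aggregator to index, search and correlate events; the field names and logger setup below are illustrative.

```python
# Sketch of structured, per-request logging so a tool such as Splunk or
# Datadog can index and correlate events. Field names are illustrative.
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("model-service")

def log_prediction(model_name, score, latency_ms, trace_id=None):
    # one JSON object per line so the log aggregator can index each field
    logger.info(json.dumps({
        "event": "prediction",
        "model": model_name,
        "score": score,
        "latency_ms": latency_ms,
        "trace_id": trace_id or str(uuid.uuid4()),  # follow one request across services
    }))

log_prediction("credit-limit-v3", score=0.87, latency_ms=42)
```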
On-call schedules
It’s standard practice in general engineering teams to have on-call schedules such that someone is always available to address urgent matters.
Alerts must be standardized where possible
Standardization helps drive efficiency and helps you scale your processes.
It also enables you to view a collection of things as different versions of the same thing, thereby decreasing the cognitive load when interacting with large systems.
Alerts are no different; here are some examples of what can and should be standardized in this respect (a minimal template sketch follows this list):
- All alerts should be communicated via the same tools where possible (Opsgenie, Slack, ad-hoc emails, etc)
- All alerts should (where possible) be formatted the same way: standard text, standard colors, standard styles
- All alerts should use similar metrics to convey information (e.g. averages, percentiles, min, max, counts, etc)
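One hypothetical way to enforce this is a single alert template that every alert goes through, so the wording, metrics and action link are always rendered the same way; all names and the URL below are illustrative.

```python
# Sketch of a standardized alert template: every alert carries the same
# fields and is rendered identically. Names and URL are illustrative.
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    observed: float
    expected_low: float
    expected_high: float
    window: str          # e.g. "last 30 minutes"
    action_url: str      # playbook, dashboard or config link

    def message(self) -> str:
        # every alert is rendered from the same template, in the same order
        return (f"Alert: expected {self.metric} to be within "
                f"{self.expected_low} and {self.expected_high}, got "
                f"{self.observed} in the {self.window}. "
                f"Click here to act: {self.action_url}")

print(Alert("timeout rate (%)", 50, 1, 5, "last 5 minutes",
            "https://wiki.example.com/playbooks/timeouts").message())
```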
Include expected behavior
When writing the text for an alert, don't just say what is wrong: say what was expected, and always include the time frame evaluated.
This helps people understand how critical the alert is and how fast they need to act, increasing efficiency and reducing the chance of false positives.
| Good | Bad |
| --- | --- |
| “Alert: Expected metric X to be within 100 and 150, got 250 in the last 30 minutes” | “Alert: Current value for metric X is 250” |
| “Alert: Value for metric X was 500 in the last hour. Expected 100 (stddev=25) based on historical data.” | “Alert: Value for metric X is above expected value: 500” |
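One possible way to produce such messages is to derive the expected range from historical values – here mean ± 3 standard deviations, which is just one convention – so the alert text always carries the expectation, the observation and the time window together.

```python
# Sketch of deriving the "expected" range from historical data
# (mean +/- 3 stddev is one possible convention, not a prescription).
import statistics

def check_against_history(observed, history, window):
    """Return an alert message if `observed` is outside mean +/- 3 stddev."""
    mean = statistics.mean(history)
    stddev = statistics.pstdev(history)
    if mean - 3 * stddev <= observed <= mean + 3 * stddev:
        return None  # within expectations, no alert
    return (f"Alert: Value for metric X was {observed} in the {window}. "
            f"Expected {mean:.0f} (stddev={stddev:.0f}) based on historical data.")

print(check_against_history(500, [90, 110, 100, 95, 105], "last hour"))
```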
Alerts should be actionable
Whenever possible, add a link or a clear course of action to help responders act in response to alerts.
This is helpful both to seasoned engineers and to newcomers who may have never faced a specific issue before.
An even better way to do this is to have standard playbooks with guides on how to address the most common problems, where to go for help, etc. This ensures standardized processes and decreases the risks of human error.
Ask yourself when creating an alert: “What is the first piece of information the responder will need to look for when addressing the alert? How can I make it easier for them?”
| Good | Bad |
| --- | --- |
| “Alert: Expected zero deadletters in the last 30 minutes, got 1,000. Click here to open DLQ and retry configs” | “Alert: 1,000 deadletters in the DLQ” |
| “Alert: Model X has not responded to healthchecks for 5 minutes. Click here to view the playbook for common problems and fixes.” | “Alert: Model X is unresponsive” |
| “Alert: Average response time for model X in the last 30 minutes is 500ms (expected 300ms). Click here to edit scaling configuration” | “Alert: Average response time for Model X is 500ms” |
| “Alert: 50% of events scored by model X have received high scores in the last 30 minutes (expected 1%). Click here to edit this feature flag or reach out to engineers in #some-slack-channel for help.” | “Alert: 50% of events scored by model X have received high scores in the last 30 minutes” |
Alerts should be easily configurable
Alerts for machine learning models will eventually become obsolete or stale with time.
This may happen for a multitude of reasons: the underlying data distribution changed over time, business or engineering requirements have changed, or another alert was released that already encompasses the current one.
Alerts stop working in one of two ways:
- Over-sensitive: They get too sensitive and start going off too often (leading to what is usually called alert fatigue)
- Under-sensitive: They get too blunt and never go off again
In other words, the precision/recall tradeoff may need re-tuning.
Make it so that anyone can easily:
- Edit alert configuration (change thresholds to calibrate the signal/noise ratio, etc)
- Snooze the alert for some time
- Disable the alert altogether
- Acknowledge the alert (more on that later).
| Good | Bad |
| --- | --- |
| “Alert: <…alert text…> Click here to edit alert configuration” | “Alert: <…alert text…>” |
| “Alert: <…alert text…> Click here to edit alert configuration. Click here to snooze this alert for 6 hours.” | “Alert: <…alert text…>” |
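One way to get there – a sketch under the assumption that alert definitions live in plain, version-controlled configuration – is to make thresholds, snoozing and enabling/disabling a one-line change; the fields and values below are illustrative.

```python
# Sketch of alert definitions as editable configuration: changing the
# threshold, snoozing or disabling is a one-line edit. Values illustrative.
from datetime import datetime, timezone

ALERTS = {
    "high_score_rate": {
        "enabled": True,        # flip to False to disable the alert entirely
        "threshold": 0.20,      # raise/lower to tune the signal/noise ratio
        "snoozed_until": None,  # e.g. a timestamp six hours from now
        "runbook_url": "https://wiki.example.com/playbooks/high-score-rate",
    },
}

def should_fire(name: str, value: float) -> bool:
    cfg = ALERTS[name]
    if not cfg["enabled"]:
        return False
    if cfg["snoozed_until"] and datetime.now(timezone.utc) < cfg["snoozed_until"]:
        return False
    return value > cfg["threshold"]
```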
Take your audience into consideration
Alerts should be written with the desired audience in mind – this will ensure that the message you want to convey is actually picked up on the other side.
Different people play a role in delivering a functioning ML-enabled system to production. These include, for instance: engineering folks, DS/ML practitioners and product/business folks.
Depending on who your target audience is, you will want to adapt:
- The language used
- The metrics used (engineering metrics for engineering, statistical metrics for DS/ML folks, business metrics for product/business)
- The action to be taken (engineering folks will want to look at low-level system metrics, while business folks are really only interested in the impact on the business)
| Engineer Audience | DS/ML Practitioner Audience | Product/Business Audience |
| --- | --- | --- |
| “Alert: Timeout rate for realtime model Y is at 50% for the last 5 minutes (expected between 1-5%). Click here to view pod health and scaling settings.” | “Alert: Feature X used by model Y is taking on average 500ms to be retrieved in the last 5 minutes (expected 50-100ms). Click here to view feature retrieval dashboard.” | “Alert: Fewer customers than usual are being given loans in the last 5 minutes (expected 100, actual is 1). Click here to view the business dashboard. For more information go to #some-channel on Slack.” |
Alerts must be acknowledgeable and trackable
Alerts should by definition be “rare” and the triggering of an alert is, by necessity, a somewhat chaotic and messy event.
You need at the very least a solid way to signal that an alert is being dealt with.
Being able to acknowledge (or “ACK” for old-timers) helps your team ensure that there’s at least one person actively investigating the present alert. It also prevents multiple people from interfering with each other. Most alerting tools support this (e.g. OpsGenie).
In addition to being acknowledgeable, alerts should ideally be trackable – that is to say, there should be a log of the timeline of the alert, for example:
- When did the alert go off?
- Who was involved in dealing with the alert?
- How did we verify the alert was real?
- Was it a false positive?
- How was it mitigated?
- What was done to avoid similar problems in the future?
Such logs help future engineers find information and will probably make future incidents easier and faster to deal with.
They will also enable you to analyze alert data and discover, for example, common system culprits and alert patterns.
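As a sketch, the record kept for each alert might look something like the following; the fields are illustrative, and dedicated tools such as Opsgenie already track much of this for you.

```python
# Sketch of a per-alert incident record covering acknowledgement and the
# timeline questions above. Fields are illustrative.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class AlertIncident:
    alert_name: str
    triggered_at: datetime
    acknowledged_by: Optional[str] = None      # who is actively investigating
    acknowledged_at: Optional[datetime] = None
    false_positive: Optional[bool] = None
    mitigation: Optional[str] = None           # how it was mitigated
    follow_ups: list = field(default_factory=list)  # actions to avoid recurrence

    def ack(self, responder: str) -> None:
        self.acknowledged_by = responder
        self.acknowledged_at = datetime.now(timezone.utc)
```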
Other Tips
Alerts for the absence of events
We usually use counts, averages and sums to detect abnormal behavior.
However, if a particular service has stopped working altogether, there may be no logs at all – which means there will be no averages, counts or sums either.
One way to address this is to have heartbeat alerts, whereby your service/system must periodically ping some external API to signal that it is healthy.
Such heartbeat alerts are usually configured with a time period; if your service/system does not send a ping within that period, an alert is triggered.
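A rough sketch of the service side of such a heartbeat, assuming a hypothetical external heartbeat endpoint and an illustrative interval:

```python
# Sketch of the service side of a heartbeat alert: the service pings an
# external endpoint on a schedule; the monitoring side raises an alert if
# no ping arrives within the configured period. The URL is hypothetical.
import time
import urllib.request

HEARTBEAT_URL = "https://heartbeats.example.com/ping/model-x"  # hypothetical endpoint
INTERVAL_SECONDS = 60

def heartbeat_loop():
    while True:
        try:
            urllib.request.urlopen(HEARTBEAT_URL, timeout=5)
        except Exception:
            # a failed ping is fine to swallow here: the external monitor
            # raises the alert when no ping arrives within its period
            pass
        time.sleep(INTERVAL_SECONDS)
```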
Test your alerts beforehand
Just like any piece of code, you must test that the alert does trigger when it should.
One way to test alerts is to make trigger thresholds artificially low, so the alert becomes more sensitive and easier to trigger, and then check the following (a test sketch follows this list):
- Are the calculations correct?
- Are the appropriate people being notified (Opsgenie, Slack, etc)?
- Are the supporting features (ack’ing, tracking, etc) working as expected?
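A sketch of what such a test might look like, using hypothetical stand-ins (`evaluate_alert`, `FakeNotifier`) for the real alert evaluation and notification code:

```python
# Sketch of a pytest-style test: lower the threshold so the alert is
# guaranteed to fire, then check the calculation and the notification path.
class FakeNotifier:
    """Collects messages instead of sending them, for tests."""
    def __init__(self):
        self.messages = []
    def notify(self, message):
        self.messages.append(message)

def evaluate_alert(value, threshold, notifier):
    if value > threshold:
        notifier.notify(f"Alert: value {value} is above threshold {threshold}")

def test_alert_fires_with_artificially_low_threshold():
    notifier = FakeNotifier()
    evaluate_alert(value=0.01, threshold=0.0, notifier=notifier)  # threshold lowered for the test
    assert notifier.messages, "expected the alert to fire and notify someone"
```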
Beware of seasonality
Feature data for real-time systems is usually representative of customer data. As such, it’s prone to natural cycles such as day/night, weekday/weekend, etc.
This may hinder alerting flows because the definition of “normal behavior” usually depends on the time of the day, the day of the week, etc.
One way to address this is to include a minimum sample size threshold to make sure some alerts (e.g. rate of actions) only get triggered if there is enough data.
For example: trigger the alert if the rate of scores above 0.9 is over 20%, but only if the sample size is at least 10,000 events.
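A minimal sketch of that example, with illustrative thresholds:

```python
# Sketch of the example above: only fire the "high score rate" alert when
# there is enough data in the window. Thresholds are illustrative.
def high_score_rate_alert(scores, rate_threshold=0.20, min_sample_size=10_000):
    """Fire only if there is enough data in the window AND the rate is too high."""
    if len(scores) < min_sample_size:
        return False              # e.g. overnight traffic: too few events to judge
    high = sum(1 for s in scores if s > 0.9)
    return high / len(scores) > rate_threshold
```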
At scale, cost will be an issue
When you operate at a high enough scale (think millions of requests to a real time model per day), cost will become an issue.
Alerting usually needs to be done in real time (although there are uses for batch alerts as well), so you will need robust and expensive infrastructure and tools to handle all of those events.
One simple way to deal with this is to use sampled data for alerting instead of the full data.
In other words, you could select a random sample of, say, 10% of your data and calculate alerts on that instead of using the whole dataset – most statistical metrics will be roughly the same, at a fraction of the cost. Recall, however, that sampling only yields sound results if the resulting sample is still large enough.
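A rough sketch of this idea: hash each event id so the 10% sample is deterministic and unbiased (assuming ids are not correlated with the metric being monitored); the event structure is illustrative.

```python
# Sketch of alerting on a 10% sample of events instead of the full stream.
import hashlib

SAMPLE_RATE = 0.10

def in_sample(event_id: str, rate: float = SAMPLE_RATE) -> bool:
    # hashing the id keeps the sample deterministic and reproducible
    digest = int(hashlib.md5(event_id.encode()).hexdigest(), 16)
    return (digest % 10_000) / 10_000 < rate

events = [{"id": f"evt-{i}", "score": i / 1_000} for i in range(1_000)]  # toy data
sampled = [e["score"] for e in events if in_sample(e["id"])]
print(f"computing alert metrics on {len(sampled)} of {len(events)} events")
```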