De-risking Real-time ML Projects: Addressing Common Failure Modes

In this post we list practices to substantially reduce the risk of failure in real-time machine learning projects.


Written by Felipe Almeida and reviewed by Rubens Bolgheroni


Machine learning (ML) is powerful—but expensive and risky. It’s powerful because it’s often the only scalable solution to complex problems. It’s expensive because the resources (specialized professionals and computing power) are expensive. And it’s risky because models break in weird and sometimes silent ways: data drift or broken features will not trigger exceptions, even though model predictions will most likely be garbage.

An ML project is the process of taking an idea all the way to a working software system powered by an ML model.

We wrote about the uses of real-time ML in the past. But before you can think about using ML to help your organization, you need a project to take it from an idea into a functioning piece of software, delivering results. And you do not want the project to fail, as it’s money and time that could have been spent somewhere else in the business. This, of course, assumes you do need an ML model in the first place.

Now, even if an ML project does fail, it’s better if it fails quickly—before you have invested too much time and money into it. Figure 1 below shows the different outcomes for a project, depending on whether it delivered value to the business and the time it took to complete.

Figure 1: Project outcomes are rarely binary, but a continuum of success “levels”. Success depends on how much business impact was delivered, but also on how long and costly the project was.

Now why do ML projects fail?

There are many reasons why: data problems and misaligned expectations between clients and the modeling team, to name a few.

Real-time ML projects, however, elevate the stakes: they have even more failure modes than regular (non-real-time) projects. In addition to all the risks inherent to ML, real-time models are software systems that need to integrate with other real-time systems via APIs, they are often subject to strict response times and need to keep parity with the training environment.

In this post we will show what a typical real-time ML project looks like and then go over the most common ways these projects can fail. Finally, we will provide practical guidance on how to address, or de-risk, each of those points.

Typical project timeline

As mentioned previously, a project is the process to take an idea and make it reality. In the case of real-time ML, this includes choosing a business problem, understanding if and how it can be solved with ML and, finally, creating a model and integrating it with the underlying business IT infrastructure.

The stages of a typical real-time ML project can be summarized as follows:

  1. Ideation & understanding the use-case: Understand how and by whom the model will be used. At times, the verdict might be that ML isn’t the correct solution.
  2. Data analysis & Modeling: Once the use-case has been established, organize the data and perform data exploration. Then, model the problem: feature selection, model training and evaluation.
  3. Decision layer definition: Post-training, understand how the model’s outputs will translate to business choices. This may include optimization techniques (e.g. what is the optimal model cutoff score such that a given business metric is maximized? See the sketch right after this list).
  4. Implementation / Real-time integration: Integrate the model into the underlying IT infrastructure, connecting it to other services to fetch features and deliver the decisions to the callers.
  5. Set up monitoring: Set up data logging and configure tools like Splunk or Grafana to track model metrics and features.
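
To make the cutoff optimization mentioned in step 3 concrete, here is a minimal sketch that sweeps thresholds on a test set and picks the one maximizing expected profit. The labels, scores and business values below are all made-up assumptions for illustration.

```python
# Sketch: choose the model cutoff that maximizes a (hypothetical) business metric.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1_000)                          # toy test-set labels
y_score = np.clip(0.3 * y_true + 0.7 * rng.random(1_000), 0, 1)  # toy model scores

GAIN_PER_TRUE_POSITIVE = 50.0   # assumed business gain per correct flag
COST_PER_FALSE_POSITIVE = 5.0   # assumed business cost per wrong flag

def expected_profit(cutoff: float) -> float:
    flagged = y_score >= cutoff
    true_positives = np.sum(flagged & (y_true == 1))
    false_positives = np.sum(flagged & (y_true == 0))
    return true_positives * GAIN_PER_TRUE_POSITIVE - false_positives * COST_PER_FALSE_POSITIVE

cutoffs = np.linspace(0.0, 1.0, 101)
best_cutoff = max(cutoffs, key=expected_profit)
print(f"best cutoff: {best_cutoff:.2f} (expected profit: {expected_profit(best_cutoff):,.0f})")
```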

Bear in mind that the steps above are not necessarily linear. That is, they need not happen one after the other. 

For example, one may need to go back to the ideation stage during the project, to revisit assumptions that turned out to be false. Also, data scientists may need to retrain the model if there’s a problem with a feature which now needs to be dropped. Figure 2 below presents a visual summary:

Figure 2: Typical real-time ML project. Note that sometimes we may need to “go back” to previous stages depending on unforeseen circumstances, such as inconsistent data and the need to drop features from the model, necessitating a retrain. Also, some steps may be parallelizable.

There are two types of real-time ML projects:

  • Introducing a new model: Creating a new model from scratch and integrating it into a business flow that does not currently use models.
  • Updating an existing model: Updating or replacing an existing model with an enhanced version, with more features, more training data, a new algorithm, etc.

This distinction is useful because updating an existing model is less risky than introducing a new model into a business flow. If the way the model is used is the same, you can skip the use-case validation stage—which is where much of the risk lies.

Regardless of whether we are introducing a new model or updating an existing one, all stages of the project pose risks. Let’s now see some of the ways in which real-time ML projects fail.

Real-time ML Failure modes

As mentioned in the introduction, there are levels to failure in ML projects. Having a project fail because the model performance is lacking is bad. But having a project fail after investing significant time into it is much, much worse.

Each stage in an ML project has its own vulnerabilities. This is especially true of real-time projects, as model deployment and service integration present an additional layer of complexity. Successful real-time ML projects acknowledge the risks—and mitigate them when necessary.

“All successful ML projects are alike; each failed ML project is unsuccessful in its own way.” – Leo Tolstoy, probably

Table 1 lists some of the failure modes for real-time ML projects. Some of them reflect miscommunication and business alignment issues; some reflect modeling problems and others relate to engineering and implementation mishaps. Many of these are also relevant for batch (i.e. non-real-time) ML projects.

Table 1: Non-exhaustive list of failure modes for Real-time ML projects

Failure mode / Risk | Example / Description
Model performance isn’t good enough | A fraud detection model had such a low precision at training time that it couldn’t possibly be used due to large false-positive rates.
Model response time is too high | After the model was deployed, the team realized that the model response time (fetching features and scoring) was unacceptably high, so the model was never used.
Model performance at inference time is different from training time | The model performed unexpectedly badly in production, so it had to be discontinued. Possible causes: train-serve skew, data drift, data leakage.
Use-case doesn’t support probabilistic decision-making | Even though the model accuracy is good, the business flow doesn’t support a “probably true” outcome, perhaps due to legal responsibility. Every model decision needs human confirmation, so a fully model-based solution isn’t feasible.
Chosen features aren’t available at inference time | During implementation, the team found out that some features the model was trained with aren’t available in real time. Those features had to be removed, severely degrading the model’s performance and rendering it useless. Possible causes: train-serve skew and data leakage.
Model is not economically viable | The cost of training, implementing and operating the model is so high that it offsets any financial gain its use might bring.
Feature creep | Too much time was spent trying to optimize the model’s performance—it took so long to enter production that it lost business timing.

Now let us go through some practical steps you—ML practitioner or project manager—can take to address these. They are not listed in any particular order, and all of them have proven useful for us at Nubank.

De-risk: Educate clients where possible

Risks Addressed: All of them

Many people still see ML as “witchcraft”, and have unrealistic beliefs of what it can do. You can—and should—help clients understand, at least at a high-level, how ML works, so they can calibrate their expectations to more realistic levels, and help you do your job.

People from sectors such as banking are more used to working with models, but even there it is not wise to assume that they understand, even at a high level, what ML is.

Here are three key points all non-technical users should understand about ML:

  • Models are probabilistic. They may perform well on average but will get individual predictions wrong. Sometimes very wrong.
  • Models need a decision layer. As important as the model itself is the logic that decides what to do with a given model prediction. This is sometimes called a policy layer.
  • Models decay over time. They need maintenance and monitoring after they are deployed to production. And they will eventually need to be retrained.

One good way to help end-users build intuition about ML is to show feature importance plots, such as the beeswarm plot (seen below in Figure 3). This drives interesting discussions with non-technical users and helps them understand how a model score is calculated from its features.

Figure 3: Sample beeswarm plot from the SHAP library. Working through feature importance plots is a great way to “teach” non-technical users the basics of how a model works. This helps them build intuition about ML and it will make interaction more efficient. The force plot is also useful to explain a single prediction. Source
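
For reference, here is a minimal sketch of how such plots can be generated with the SHAP library; the dataset and model below are toy placeholders, not a real Nubank model.

```python
# Minimal sketch: a toy tree-based model plus SHAP beeswarm and force plots.
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1_000, n_features=8, random_state=42)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer(X)

# Global view: how each feature pushes scores up or down across the dataset
shap.plots.beeswarm(shap_values)

# Local view: why a single example received its particular score
shap.plots.force(shap_values[0], matplotlib=True)
```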

De-risk: Make sure the data is good

Risks Addressed: Model performance isn’t good enough | Model performance at inference time is different from training time.

The saying “garbage in, garbage out” encapsulates the essence of ML: even the best algorithm cannot compensate for bad data.

It’s your job, as an ML practitioner, to understand if the available data is good enough for modeling—both at training and at inference time.

As always, real-time ML adds an extra dimension to this problem, so not only do you need to worry about the training data—whether there is enough of it, whether it’s good quality—but also about how to retrieve data at inference time.

We suggest a three-pronged de-risking approach here: (a) asking the correct questions about the data, (b) doing extensive data analysis and (c) establishing a relationship with the real-time data team:

Ask the correct questions about the data

At the start of the project, you want to ask many questions about what data is available and what it looks like. Here are some suggestions:

  • How much data do we have? (i.e. how many rows, customers, data points)
  • How far into the past does the data go?
  • Is the training data similar to the inference-time data?
  • Is the past data overwritten in any way? Can we be sure past data is “immutable” and that we can go back to that point in time when training the model?

Do extensive data analysis

When you have access to the data, you need to conduct the usual EDA routines to check for data quality, and you need to double-check everything you were told about the data, to make sure you aren’t misled into making wrong decisions.
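
As a starting point, here is a minimal data-quality pass in pandas; the file path, column names and key are hypothetical placeholders.

```python
# A minimal EDA sketch; all names below are illustrative assumptions.
import pandas as pd

df = pd.read_parquet("training_data.parquet")            # hypothetical dataset

print(df.shape)                                           # how much data do we have?
print(df["event_date"].min(), df["event_date"].max())     # how far into the past does it go?
print(df.isna().mean().sort_values(ascending=False))      # null rate per column
print(df.duplicated(subset=["customer_id"]).sum())        # unexpected duplicate customers?
print(df.describe(include="all").T)                       # ranges and cardinalities at a glance
```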

Establish a relationship with the real-time data team as early as possible

The teams responsible for making data available in real time are often not the same as those responsible for data “at rest”.

Find out who these people are and keep in touch with them, so you can make sure the information you train the model with will also be available when you need to make real-time predictions. Make sure you understand how fast or slow such data retrieval is.

De-risk: Understand how the model predictions will be used

Risks Addressed: Model performance isn’t good enough | Model response time is too high | Use-case doesn’t support probabilistic decision-making.

Imagine crafting a perfectly good model, with great performance, only to see it gather dust. The worst thing that can happen to an ML project is to produce a model that is never actually used to make business decisions.

You must know beforehand how the model predictions will be used—and by whom. 

One of the reasons this happens is that the modeling team didn’t take the time to clearly understand how the model is to be used, and by whom.

A key part of understanding the model use is to discuss the decision layer—the code or business process that will take in the model predictions and decide what action to take based on the predictions. Talking about the decision layer forces the modeling and client teams to discuss how exactly the model predictions will be used—and sort out any misunderstandings in the process.

Figure 4 shows an example of what a decision layer could look like for a simple credit-risk model.

Figure 4: A typical credit underwriting system with a simple decision layer. A “decision layer” for a model may be as simple as a single if/else condition based on the model prediction: if the risk score is below some threshold, grant the loan; otherwise deny it.
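
In code, a decision layer this simple could be just a thresholded branch; the cutoff and labels below are illustrative, not an actual underwriting policy.

```python
# A minimal decision layer: map a model risk score to a business action.
RISK_CUTOFF = 0.35  # hypothetical threshold chosen during decision layer definition

def decide(risk_score: float) -> str:
    if risk_score < RISK_CUTOFF:
        return "GRANT_LOAN"
    return "DENY_LOAN"

print(decide(0.12))  # GRANT_LOAN
print(decide(0.80))  # DENY_LOAN
```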

In addition to discussing the decision layer, the more questions you ask about the model use-case the higher the chance that you will uncover implicit assumptions that can cause problems later on.

Some of the questions you should ask yourself:

  • Engineering
    • Which team or service will use the model?
    • How will the caller make requests to the model? HTTP calls? Async calls?
    • Where will features and model predictions be saved or logged?
    • Where will the actual targets be saved? (so that you can monitor performance and retrain the model later)
    • Are there response time constraints for the model?
  • Business
    • What is the impact to the business if the model predictions are wrong?
    • What should the fallback be if the model is out of service?
  • Modeling
    • Will we need to explain the model decisions individually?
    • Do model predictions need to be calibrated probabilities?
    • Will we need an A/B test to measure the model impact?

De-risk: Conduct a pre-mortem

Risks Addressed: All of them

A post-mortem meeting takes place after a project has failed. Its objective is to understand the causes of failure so that they can be avoided in the future.

A pre-mortem is similar, but it takes place at the start of a project. It’s a brainstorm-like meeting and its objective is not to understand why a project failed—but to prevent it from failing. It works by having people pretend the project failed and asking them: “Why did it fail?”

A pre-mortem is a brainstorm-like meeting that takes place at the start of the project. It tries to answer the question: “Let’s pretend the project has failed. What made it fail?”

There aren’t many rules as to how the meeting should be conducted, as long as it does take place. 

At the end of a good pre-mortem session, you should have not only a much better grasp of the risks you were already aware of but also knowledge of hitherto unknown risks you can now assess.

This is why it’s important for the meeting to include people from different backgrounds, such as business experts, other ML practitioners and—most importantly for real-time ML projects—software engineers, to signal potential integration and data problems.

De-risk: Calculate project valuation

Risks Addressed: Model is not economically viable

Every applied ML project should have a tangible impact on the organization. Such impact is often measured in monetary units (USD or equivalent) or other business metrics, such as conversion or engagement, among others.

Calculating a project’s valuation means understanding the project’s impact in terms of these metrics. It helps de-risk the project because you can find out as early as possible whether the project is economically viable—and adjust course if it isn’t.

Additional benefits of calculating a project’s valuation include being able to quantitatively rank-order projects and mitigating the undue influence of power politics (HiPPOs).

Wondering when to calculate valuation? We suggest doing it in two cycles:

  • First Valuation cycle (project start): Before any modeling is done, come up with a rough estimate of the valuation, to see if this project is worth pursuing at all. It is OK to make assumptions about the model performance. The objective is to detect existential threats to the project, as per the examples below:
    • Example 1: According to the first estimate, using the model would be a net positive even if its precision is as low as 20%. Sounds reasonable.
    • Example 2: The model would only make economic sense if its performance is above 99%. Too risky, we should abort.
  • Second Valuation cycle (decision layer definition): During the decision layer definition stage (see Figure 2), we should have the model test set, which is a good proxy for how the model will perform at inference time. With the test set you can update the valuation estimate produced in the first cycle.
    • Example 1: With the test set, projections indicate a 50% increase in email conversions, translating to an additional US$ 1 million annually. The project is, therefore, economically justified.
    • Example 2: With the test set, we now estimate a total gain in US$50,000 in one year. But the team compensation alone is over US$100,000 in that period. It doesn’t make economic sense to move forward with the project, as its costs outweigh the benefits.
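
As a concrete illustration of the first cycle, here is a back-of-the-envelope calculation; every number is a placeholder assumption, not a real figure.

```python
# First-cycle valuation sketch with made-up numbers.
flagged_cases_per_year = 100_000   # cases the model is expected to flag (assumed)
value_per_true_positive = 50.0     # USD gained when a flagged case is correct (assumed)
cost_per_false_positive = 5.0      # USD lost per wrong flag (assumed)
assumed_precision = 0.20           # deliberately pessimistic performance assumption

expected_annual_gain = flagged_cases_per_year * (
    assumed_precision * value_per_true_positive
    - (1 - assumed_precision) * cost_per_false_positive
)
annual_cost = 300_000.0            # team + infrastructure, rough estimate (assumed)

print(f"Expected annual net impact: US$ {expected_annual_gain - annual_cost:,.0f}")
# Even at 20% precision the expected gain (US$ 600,000) exceeds the cost,
# which is exactly the kind of existential check the first cycle is for.
```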

De-risk: Select features based on importance and implementation effort

Risks Addressed: Feature creep | Chosen features aren’t available at inference time

In batch (i.e. non-real-time) ML models, features are roughly equal in terms of the effort needed to implement them. Sure, some may require a slightly more involved SQL query, some extra joins, but rarely more than that.

Real-time features, however, vary wildly in terms of how much engineering effort they take. Some features may be fetched with a simple HTTP call at inference time. Others may require that you build a service and an endpoint so that the model can use it at inference time.

In real-time ML projects, you must take implementation effort into account when selecting features.

During feature selection, you must also take into account the work needed to implement a given feature—assuming it is available in the real-time infrastructure in the first place! In Figure 5 below we see a 4-way classification of features along the two dimensions that matter for selecting them: predictive power and implementation effort.

Figure 5: Not all features in real-time ML are born equal: In addition to weighing a feature’s importance (e.g. SHAP), we must also take into account how much engineering effort is needed to use a given feature at inference time. The most cost-effective features are those that have large predictive power while being simple from the point of view of implementation (top left).
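
A sketch of how this quadrant view could be applied in code is shown below; the feature names, importance values and effort scores are all illustrative assumptions.

```python
# Classify candidate features by predictive power vs. implementation effort.
# Importance could come from mean |SHAP| values; effort from engineering estimates.
features = {
    # name: (predictive_power, implementation_effort), both normalized to 0-1 (assumed)
    "days_since_last_purchase": (0.9, 0.1),
    "avg_transaction_amount":   (0.8, 0.8),
    "device_os_version":        (0.2, 0.1),
    "external_bureau_flag":     (0.1, 0.9),
}

def quadrant(power: float, effort: float) -> str:
    high_power, low_effort = power >= 0.5, effort < 0.5
    if high_power and low_effort:
        return "prioritize: cheap and powerful"
    if high_power:
        return "weigh cost vs. benefit"
    if low_effort:
        return "include only if truly cheap"
    return "drop"

for name, (power, effort) in features.items():
    print(f"{name:>26}: {quadrant(power, effort)}")
```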

As for how to rank features in terms of implementation effort, we suggest starting with the scenarios in Table 2. As usual, the easiest features are those available from a feature store (with the added bonus that you don’t need to worry about train-serve skew).

Table 2: Ease of implementation for real-time features, assuming a synchronous microservice architecture

Description | Implementation effort
The feature is readily available from a feature store | LOW
Feature is not in a feature store but it’s already in use by another model, so we can easily repurpose it. | LOW
Feature is readily available from an external HTTP endpoint. | LOW
Feature is available via a single HTTP endpoint, but it needs some preprocessing. | LOW / MEDIUM
Feature needs information from multiple HTTP endpoints. | MEDIUM
Feature exists in a production database but we need to create an endpoint to retrieve it. | MEDIUM
Feature exists in a production database that is not accessible. A new service needs to be built to access it. | LARGE
It is not clear how to access the feature at inference time, if at all possible. Further discovery is needed to scope the problem. | LARGE + NEEDS FURTHER DISCOVERY
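
To make the lower-effort rows concrete, here is a sketch of fetching a single feature from an HTTP endpoint at inference time; the endpoint URL, response schema and timeout are hypothetical.

```python
# Sketch: fetch one real-time feature over HTTP, with a timeout and a fallback.
import requests

FEATURE_ENDPOINT = "https://features.internal/customers/{id}/balance"  # hypothetical

def fetch_balance(customer_id: str, timeout_s: float = 0.1):
    """Return the feature value, or None so the decision layer can apply a fallback."""
    try:
        resp = requests.get(FEATURE_ENDPOINT.format(id=customer_id), timeout=timeout_s)
        resp.raise_for_status()
        return resp.json()["balance"]    # assumed response schema
    except requests.RequestException:
        return None
```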

Once a reasonable set of features is decided upon, the project lead should declare a feature freeze, such that only the selected ones will be included in the project. This decreases the risk of feature creep—a situation where new features keep being added to the model and the project never ends.

De-risk: Address highest-risk features first

Risks Addressed:  Model performance at inference time is different from training time | Chosen features aren’t available at inference time | Model performance isn’t good enough

If you followed the advice in the previous section and selected features based on importance and implementation effort, you’ll have a rough estimate of the engineering effort needed to implement each feature. Engineering effort estimates, however, are known to be imprecise and hard to get right—somewhat akin to witchcraft.

Usually (but not always) features that take the most effort will also be the riskiest ones to implement. By risky we mean that they can jeopardize the project’s success.

Start implementation with these to address that risk early on. In other words, shift implementation of risky features left in the project timeline.

Addressing the most uncertain parts of a project first, to expose hidden risks early on, is sometimes called a “shift-left” strategy.

Why? Shifting these risky tasks left allows you to discover potential showstoppers early on: External endpoints that are too slow for your needs. Services that cannot handle the scale of requests being made. Features that turned out to be flat out unavailable.

The earlier the modeling team is aware of feature problems, the earlier they can adapt—using proxies instead or even dropping them altogether.

Discovering showstoppers early on in the project helps prevent the worst possible failure mode—finding out the project has failed after a lot of effort was put into it.

As soon as features have been selected, you can start work on implementation discovery (brainstorming, designing implementation strategies, discussing with other engineers, etc.). The earlier the better.

Shifting risk left is good advice in all parts of any project, but it matters especially during feature implementation in real-time ML, because problems discovered late cascade into new modeling rounds and decision layer refitting (see Figure 2).

De-risk: Deploy the model in shadow mode as early as possible 

Risks Addressed: Model response time is too high | Model performance at inference time is different from training time | Chosen features aren’t available at inference time  

The final leg of the implementation—where you connect the caller service to the real-time model—is the riskiest part of the project from an engineering perspective. Many unforeseen issues appear when the time comes to connect regular services to real-time ML models.

But it doesn’t need to be that way. By adopting a shadow-mode deployment from the outset, you can exercise the end-to-end flow without waiting for the project to be finished.

Shadow-mode—standard deployment with a twist: model outputs are ignored by the caller services.

Shadow-mode deployment is the name for an applied ML pattern whereby you implement a real-time ML model, but ignore the response at the very last moment, for example, using a feature flag. Shadow-mode deployments are great to de-risk projects because you are able to observe how the model works in a “real-life” scenario without exposing yourself to any business risk. Figure 6 provides a visual representation.

Figure 6: Regular vs Shadow-mode Deployment of real-time ML models: Everything is the same except for the fact that the model scores are discarded, and the caller service carries on as if nothing happened. A shadow-mode deployment de-risks a project because you are able to detect many production-time issues such as performance under load, integration with feature services, schema problems, etc.
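
Below is a minimal sketch of the shadow-mode pattern; the flag, helper functions and threshold are placeholders rather than an actual production setup.

```python
# Shadow-mode sketch: the model is exercised end to end, but its score is discarded.
import random

SHADOW_MODE = True  # in practice, read from a feature-flag service

def fetch_realtime_features(customer_id: str) -> dict:
    return {"days_since_last_purchase": 12}      # placeholder feature fetching

def model_predict(features: dict) -> float:
    return random.random()                        # placeholder for the model call

def legacy_decision(customer_id: str) -> str:
    return "GRANT_LOAN"                           # whatever the caller did before the model

def handle_loan_request(customer_id: str) -> str:
    features = fetch_realtime_features(customer_id)
    score = model_predict(features)               # feature fetching + scoring run for real
    print(f"shadow log: {customer_id=} {features=} {score=:.3f}")  # feeds monitoring

    if SHADOW_MODE:
        # Discard the score: the caller behaves exactly as it did before the model.
        return legacy_decision(customer_id)
    return "GRANT_LOAN" if score < 0.35 else "DENY_LOAN"

print(handle_loan_request("customer-123"))
```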

Deploying a model in shadow mode is useful in and of itself. But doing it at the start of the project—even before all the features have been implemented—is even better:

  • Feature-fetching code and the model will be stress-tested under production circumstances, exposing response time issues and other problems, as they are implemented.
  • Monitoring tasks are unblocked when you have a shadow-mode deployment. Even if the model isn’t in actual use, you can already build the monitoring infrastructure (logging functions, dashboards, reports, etc.), as in the sketch below.
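
For the monitoring point, here is a sketch of a structured prediction logger that dashboards could consume; the schema and logger name are assumptions, not a specific tool’s required format.

```python
# Log one structured record per prediction so dashboards can track scores and features.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_monitoring")

def log_prediction(model_version: str, features: dict, score: float) -> None:
    logger.info(json.dumps({
        "ts": time.time(),
        "model_version": model_version,
        "features": features,
        "score": score,
    }))

log_prediction("fraud-model-v1", {"days_since_last_purchase": 12}, 0.87)
```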

Figure 7 below shows this: enabling shadow-mode at the start de-risks the project and makes it more efficient.

Figure 7: The sooner you enable shadow-mode, the better: Deploying a new model in shadow mode right from the start de-risks the project and makes it more efficient due to parallelization opportunities. Obviously, model scores will not be useful if most features are NULL, but you will be able to expose engineering risks as soon as possible and set up monitoring right away.

Shadow mode does demand some upfront setup work, but it pays for itself by making the integration of real-time models smoother and safer.

Conclusion

The truth is that real-time ML is hard because it involves ML models and real-time services—each of which is a complex system with many parts that must work perfectly.

We listed many of the failure modes of real-time ML, but note that many of those are related to ML in general, with some particularities for the real-time setting.

We also suggested several practices to help you avoid those failure modes. Even if you follow all of them, it’s still possible that the project will fail—life happens—but your chances of success will be higher.

And remember, everything in this post is about getting the model out of the door in the first place. After the first deployment, it’s still a tough road to keep it operating: you’ll need careful monitoring (especially for train-serve skew) and alerting to make sure everything keeps working well.