Written by Felipe Almeida
Reviewed by: Tiago Magalhães


In this post we’ll explain what train-serve skew is and why we need to understand it when working with real-time machine learning models.

We’ll provide examples of how to monitor and debug train-serve mismatches—along with lessons we learned over years of applying real-time ML to business problems at Nubank.

What is train-serve skew?

To understand train-serve skew you need to remember that real-time ML models are trained in a different environment from the one they are used (served) in. Also, these models are usually trained and productionized by different people.

Training is normally done in an analytical environment (interactive notebooks or data lakes, for example) but serving happens in the production environment (microservices, edge devices, etc). This is what we picture in Figure 1 below:

Figure 1: Training an ML model usually takes place in an analytical environment (database tables, notebooks, etc) whereas actual use of a model occurs in the production environment (real-time services, edge devices, APIs, etc). Unexpected differences between these cause train-serve skew.

Everything in ML depends upon a very important premise: the training set must reflect the actual data the model will score at serving time.

Train-serve skew refers to the situation where the symmetry between the training and serving environments is broken—due to operational problems or faulty logic. In other words, we have skew when there are differences between the way the model was trained and the way it is served.

Train-serve skew refers to differences between the environment where a model is trained and where it’s served (or put to use).

Some examples of train-serve skew:

  • Different date filters: A data scientist coded a feature that counts the number of purchases a customer made in the last 30 days. The serving-time implementation was wrongly coded: it only counts the last 15 days.
  • Null vs Zeros: At training time we chose to use NULL to represent when a feature is missing for a given customer. However, the real-time systems use 0 (zero) to signal missing data instead.
  • Different data sources: At training time, we build a categorical feature using data from a static snapshot purchased from a third-party data provider. At serving time, we retrieve feature values from an API—and it turns out they aren’t exactly the same.

You can access many more examples in the section “Understand the type of mismatch” in this article.


Why do we care about train-serve skew?

We need to fix (or at least be aware of) train-serve skew because it can have severe impacts on a model’s predictions—and consequently on the business processes that depend on it.

A clear example is credit underwriting: banks often use real-time credit-risk models to decide who receives a loan. Undetected train-serve skew may cause high-risk customers to receive loans when they shouldn’t, causing financial losses to the banks, not to mention possible systemic risk to the banking system.

Why does train-serve skew happen?

The main reason why train-serve skew happens is miscommunication during the pre-deployment phase of a model. 

Real-time models are usually trained and deployed by different people (data scientists and machine learning engineers, respectively), so communication mistakes are bound to happen and what gets implemented—or served—is not the same as what was trained.

Train-serve skew is caused by miscommunication between modeling and engineering teams, or by unexpected changes in the services that features are fetched from.

But even after a model has been in operation for some time, new train-serve mismatches may appear. This is due to upstream changes or errors when fetching real-time features.

Modern companies operate within a microservices architecture, so real-time models need information from other services, owned by other teams—and these services may change or break without notice, impacting the features used in the model.

What types of skew are there?

In our experience, there are two different types of train-serve skew, depending on which stage of the model lifecycle they happen in:

  • Pre-deployment skew happens when there are mistakes as the model is first deployed to production;
  • Post-deployment skew may occur at any time during model operation, due to unexpected changes or failures in upstream services where features are fetched from.

These can be seen below in Figure 2:

Figure 2: Simplified ML Model lifecycle. The first deployment needs a train-serve skew check to make sure that the model features are implemented properly in the first place. But after it’s been successfully deployed we still need to keep monitoring, detecting and possibly fixing train-serve mismatches as they appear.

In the following sections we will go over some of the strategies to avoid—or at least mitigate the effects of—train-serve skew in our real-time models at Nubank.

Avoiding and mitigating train-serve skew

Train-serve skew will never be totally eliminated. There will always be cases where a real-time feature will break due to temporary failure in supporting services, for example. A few cases of bad features here and there shouldn’t cause problems—especially if you use robust classifiers such as tree-based models.

The cases we want to guard against are those where a large part of model predictions is affected by bad features, as these could have severe impact on business processes that use model predictions.

To avoid—or at least mitigate—train-serve skew, the first thing you need is to collect feature data from both data paths (train and serve). We cover this in the section Prerequisites: data collection.

Once feature data is being collected in both data paths, the task of addressing skew is reduced to:

  • Comparing and monitoring feature values;
  • Detecting mismatches; 
  • Debugging and fixing problems as needed.

In a microservices architecture, real-time models are just another service. Models depend on upstream services for feature data, but it’s impossible to control what other teams will do—and neither should we, as that would hurt their autonomy and agility.

This is why we believe that monitoring is the only scalable way to defend against train-serve skew—so you can detect and react to problems, without becoming a bottleneck for other teams.

Using a feature store also helps avoid train-serve skew. Feature stores are usually operated by a centralized team, which frees the modeling team from having to deal with the problem. If you can, use a feature store.

In the next subsections we go over the full approach to dealing with train-serve skew at Nubank: we’ll talk about the prerequisites (making sure you are collecting the data you need) and then we’ll cover what and how to monitor train-serve skew, how to interpret it to detect mismatches and, finally, we’ll cover some strategies on how to actually debug and fix the issues.

Prerequisites: data collection

The first thing you need to be able to compare data from the two different data paths is to make sure you are collecting that data.

This means you need (a) a programmatic way to generate training data on demand and (b) a record or log of the features used in every real-time execution of the model.

To address point (a), you need a repeatable, programmatic way to generate training data for arbitrary dates. You need some kind of function or routine that takes a pair of dates and outputs the training data for that time period. See Figure 3 for an example:

Figure 3: Programmatic training data generation for a given date period is essential to being able to deal with train-serve skew, because this is what we will compare serving data against. Note that we don’t need to include the target variable in the data, only features.
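As a minimal sketch of what such a routine could look like (the column names, schema and local parquet snapshot are all hypothetical; in a real setting this would query the analytical environment):

```python
import pandas as pd


def build_training_features(start_date: str, end_date: str,
                            path: str = "training_features.parquet") -> pd.DataFrame:
    """Return one row per instance (ID plus feature values, no target) for
    every instance whose reference date falls within [start_date, end_date].

    Reading a local parquet snapshot here is just for illustration; in
    practice this would query the data lake or warehouse.
    """
    df = pd.read_parquet(path)
    in_period = df["reference_date"].between(start_date, end_date)
    return df.loc[in_period, ["instance_id", "feature_a", "feature_b", "feature_c"]]


# Usage: regenerate the (feature-only) training data for a given period.
# train_df = build_training_features("2022-01-01", "2022-01-31")
```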

On point (b) (recording feature data from real-time execution) you need some way to log the scoring identifier (ID) and the features used at serving time. This can be easily done by saving feature data to a database or to a logging service such as Splunk.
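A minimal sketch of point (b), assuming a plain Python logger as the sink (in practice this could be a database write or a call to a logging service such as Splunk); the function name and payload shape are illustrative:

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("model_features")


def log_serving_features(instance_id: str, features: dict, prediction: float) -> None:
    """Record the exact feature values (and prediction) used in one real-time
    scoring, keyed by the instance ID, so they can later be joined against the
    programmatically generated training data."""
    logger.info(json.dumps({
        "instance_id": instance_id,
        "scored_at": datetime.now(timezone.utc).isoformat(),
        "features": features,
        "prediction": prediction,
    }))


# log_serving_features("0001", {"feature_a": 5.0, "feature_b": 1}, prediction=0.12)
```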

When you have both (a) and (b) covered, detecting train-serve skew is much easier—you just need to join together the features for a given instance and compare them. This is shown in Figure 4 in the next section.

Monitoring train-serve mismatches

As explained above, you need both training- and serving-time data to monitor train-serve skew. Once you have a robust way to generate those continuously, it’s a simple matter of using any dashboard tool to visualize the data.

Monitoring can be used both for pre-deployment and for post-deployment stages. You can monitor a model before it’s in use by having a so-called shadow-mode deployment: deploy a real-time model but ignore its predictions.

You can monitor features in use by a real-time model even before it’s in production—deploying the model in shadow-mode is a common pattern.

The way you can generate monitoring data is as follows: you take train-time and serving-time data and apply a join between them, using the instance ID as the joining key. This is shown below in Figure 4:

Figure 4: Building a monitoring dataset from train and serving data: when you have data from both data paths you can just join them to build a temporary dataset with both values for each feature. In the example shown, we have a mismatch for feature x: its value should be 5.0 but we got 10.0 instead.
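A minimal pandas sketch of that join (column names are illustrative). Note the outer join, which also keeps instances that appear in only one data path; this will matter later when we discuss examples that shouldn’t have been scored at all:

```python
import pandas as pd


def build_monitoring_dataset(train_df: pd.DataFrame,
                             serve_df: pd.DataFrame) -> pd.DataFrame:
    """Join train-time and serving-time feature values on the instance ID.

    Both inputs are expected to have an 'instance_id' column plus one column
    per feature; the output has '<feature>_train' and '<feature>_serve'
    columns side by side, as in Figure 4.
    """
    return train_df.merge(
        serve_df,
        on="instance_id",
        how="outer",  # keep instances that appear in only one data path
        suffixes=("_train", "_serve"),
    )


# monitoring_df = build_monitoring_dataset(train_df, serve_df)
# monitoring_df[["instance_id", "x_train", "x_serve"]]
```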

When you have a monitoring dataset such as the one shown in Figure 4, you can proceed to the actual monitoring, as we explain below.

Types of train-serve skew monitoring

The so-called “monitoring dataset” is all we need to generate all monitoring plots shown in the next sections.

Although there may be other ways to visualize train-serve skew, we’ll cover the ones we find the most important and which we use in our day-to-day work:

  • Percentage of exact matches;
  • Mean differences in feature values;
  • Percentile differences in feature values.

Each of these plays a role in helping practitioners keep ML systems healthy. We’ll explain each in detail in the next subsections.

Percentage of exact matches per day, per feature

You can monitor train-serve skew by plotting the percentage of exact matches, per feature, over time.

In Figure 5 below we can see one such plot, for “Feature X”. On the Y-axis we have the percentage of exact matches and on the X-axis, time.

It’s easy to see that we had around 90% match rate on Jan 2, 2022 and around 60% on Jan 5, 2022. This means that 10% and 40%, respectively, of instances scored had wrong values for Feature X at serving time.

Figure 5: Plotting percentage of exact matches for a feature, per day. It’s very easy to see there was a minor problem on 2022-01-02 and a major problem on 2022-01-05. A monitoring dashboard should include multiple plots like this–one for each feature in the model.
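A sketch of how this plot could be computed from the monitoring dataset, assuming it has a scored_date column plus <feature>_train and <feature>_serve columns (names are illustrative):

```python
import pandas as pd


def daily_exact_match_rate(monitoring_df: pd.DataFrame, feature: str) -> pd.Series:
    """Percentage of instances, per day, whose train-time and serving-time
    values for `feature` match exactly (NULL == NULL counts as a match)."""
    train_col = monitoring_df[f"{feature}_train"]
    serve_col = monitoring_df[f"{feature}_serve"]
    matches = (train_col == serve_col) | (train_col.isna() & serve_col.isna())
    return matches.groupby(monitoring_df["scored_date"]).mean() * 100


# daily_exact_match_rate(monitoring_df, "feature_x").plot()  # Figure 5-style plot
```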

Note that this plot does not give any information about the magnitude of the differences. Read on for what to do next.

Average difference between training and serving, per feature

Monitoring the percentage of exact matches (as above) is a good start, but not enough. You need to understand the magnitude of the mismatch to know how serious it is and whether you should investigate it further.

You can plot the difference between the train-time and serving-time values of features, as can be seen in the example below.

Figure 6: Plotting the average numerical difference between train-time and serving-time values for a given feature. We can see that the magnitude of the skew on 2022-01-02 is much higher than the one on 2022-01-05. The magnitude of the skew (together with the exact match percentage as seen in Figure 5) will inform you about how urgent the problem is.
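Under the same assumptions about the monitoring dataset, a sketch of the per-day average difference:

```python
import pandas as pd


def daily_mean_difference(monitoring_df: pd.DataFrame, feature: str) -> pd.Series:
    """Average (serving - training) difference per day for a numerical feature.
    Values far from zero indicate a large-magnitude mismatch."""
    diff = monitoring_df[f"{feature}_serve"] - monitoring_df[f"{feature}_train"]
    return diff.groupby(monitoring_df["scored_date"]).mean()


# daily_mean_difference(monitoring_df, "feature_x").plot()  # Figure 6-style plot
```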

Looking at averages of differences, instead of simple match percentages, is a big step forward but averages can be misleading—and trick you into false conclusions.

Percentiles of differences between training and serving, per feature

If you already monitor the rate of exact matches and the average magnitude, you have a good level of protection against train-serve skew. But you may be tricked by outliers and other extreme values, as these have a large impact on averages: one or two outliers may move the average value by a lot.

Understanding the behavior of features in the extremes (P99) is important for cases where you are only interested in the largest values—tails—of predictions, such as in credit underwriting, fraud detection, etc.

In order to guard against outliers we can monitor the percentile differences between train-time and serving-time features values, as seen in Figure 7 below. The skew on 2022-01-02 seems to have been strictly on the larger percentiles: it hardly had an impact on lower percentiles. The skew on 2022-01-05 affected all percentiles more or less equally.

Figure 7: Plotting the percentiles of numerical differences between train-time and serving-time values for a given feature. Mismatches affecting only the top percentiles (as on 2022-01-02) may indicate a couple of bad scores due to temporary issues, but no widespread problems. Mismatches that affect all percentiles (as in 2022-01-05) are usually indicative of logic-based skew and may be more serious.
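A sketch of the percentile version, again assuming the same monitoring-dataset layout:

```python
import pandas as pd


def daily_difference_percentiles(monitoring_df: pd.DataFrame,
                                 feature: str,
                                 percentiles=(0.50, 0.90, 0.99)) -> pd.DataFrame:
    """Per-day percentiles of the absolute (serving - training) difference,
    so outliers and tail behavior are visible separately from the average."""
    diff = (monitoring_df[f"{feature}_serve"]
            - monitoring_df[f"{feature}_train"]).abs()
    return (
        diff.groupby(monitoring_df["scored_date"])
            .quantile(list(percentiles))
            .unstack()  # one column per percentile (P50, P90, P99)
    )


# daily_difference_percentiles(monitoring_df, "feature_x").plot()  # Figure 7-style plot
```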

We saw how to properly monitor feature mismatches; let’s now look at how to interpret these plots to detect when we have train-serve skew.

Detecting when skew is taking place

In our experience, most post-deployment mismatches happen because of upstream changes in services we fetch real-time features from. This assumes you’re working within a microservices architecture.

Post-deployment train-serve skew usually shows up as a sudden, sticky change in the monitoring plots. See an example in Figure 8.

Figure 8: Sample plot showing a “broken” feature. A sudden drop that doesn’t go back to normal levels is a clear sign that something fundamental changed.
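One way to turn this visual pattern into an automated alert is to flag features whose match rate stays below a threshold for several consecutive days; a sketch (the threshold and window are illustrative, not a recommendation):

```python
import pandas as pd


def sticky_drop_alert(daily_match_rate: pd.Series,
                      threshold_pct: float = 95.0,
                      min_days: int = 3) -> bool:
    """Return True when the daily exact-match rate (in %) has stayed below
    `threshold_pct` for the last `min_days` days: a drop that does not recover
    is the typical signature of a broken upstream service."""
    recent = daily_match_rate.sort_index().tail(min_days)
    return len(recent) == min_days and bool((recent < threshold_pct).all())


# if sticky_drop_alert(daily_exact_match_rate(monitoring_df, "feature_x")):
#     ...  # notify the team that owns the model or the upstream service
```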

And how do we detect pre-deployment skew? We just need to monitor data from a shadow-mode deployment, as explained in the section Monitoring Train-serve mismatches.

Understand the magnitude of the mismatch

To know how bad the mismatches are, we have to look at plots that show the differences in feature values. You’ll probably not bother trying to investigate every little mismatch, especially when the magnitude is very small, as it will likely have very little impact on the model prediction and, by extension, on the business.

Consider the impact of the feature mismatch on the business

Even if you have a feature with a lot of skew, there are cases where you’ll choose not to investigate it if the impact on the business is very small. This may happen, for instance, when the skew is present in a low-importance feature, so it isn’t causing any real impact on the business.

We can think of two ways to gauge the business impact of skewed features:

  • Monitor the skew in the prediction too, to see if feature skews have caused a skew in the model prediction as well. For example: do the feature skews cause the model prediction to be mistakenly higher or lower than it should be?
  • Monitor business decisions made using the model. This is what we call decision layer monitoring. If that indicates a problem on the same days you’ve had skew, it’s likely that the skew is impacting downstream business decisions so it’s more serious.

Debugging and fixing problems

Ok, so you detected your model is indeed suffering from train-serve skew, you understood the magnitude of the mismatch, and you saw that it is having enough of an impact on the business to justify working on it. You now need to debug and fix the mismatch.

Detecting train-serve skew can be done by looking at dashboards and plots. When debugging, however, you need to access the raw comparison data to understand the nature of the skew.

Debugging and fixing train-serve mismatches involves comparing feature values used at train and serving time for individual instances, which is why we suggest you create a monitoring dataset where you can see the feature values for both data paths.

As mentioned previously, collecting data is a prerequisite for dealing with train-serve skew—if you don’t collect training and serving data, you can’t monitor or fix train-serve mismatches.

In Figure 9 you can see another example of what a “monitoring dataset” should look like for a real-time model with 3 features: “A”, “B”, and “C”:

Figure 9: Debugging train-serve mismatches usually requires you to look at the raw “monitoring dataset” and compare feature values from both data paths. The monitoring dataset is just a join between train-time and serving-time feature data. 

Let’s now see how to prioritize investigations and see examples of common types of mismatches. 

Focus on high-importance features first

If you have to deal with mismatches in many features, you need to choose which ones to focus on first, especially when dealing with pre-deployment skew—it’s common to have several skewed features as you first deploy your real-time model (hopefully in shadow mode so no damage is done!).

Use the feature importance (e.g. SHAP values) to decide which features to investigate first.
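A sketch of how that prioritization could be computed with the shap library, assuming a tree-based model (for classifiers where shap_values returns one matrix per class, pick the class of interest first):

```python
import numpy as np
import pandas as pd
import shap


def rank_features_by_importance(model, X: pd.DataFrame) -> pd.Series:
    """Rank features by mean absolute SHAP value (highest first), so that
    skew investigations can start with the features that matter most.

    Assumes a tree-based model supported by shap.TreeExplainer; X is a sample
    of the training data.
    """
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    if isinstance(shap_values, list):  # some classifiers return one matrix per class
        shap_values = shap_values[-1]
    importance = np.abs(shap_values).mean(axis=0)
    return pd.Series(importance, index=X.columns).sort_values(ascending=False)


# rank_features_by_importance(model, X_sample).head(10)
```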

Understand the type of mismatch 

We’ll now list common types of mismatches, along with the most common reasons and fixes.

They’re explained in the next subsections and summarized in the table below:

Type of mismatch | When it usually occurs | Possible fixes
Null values at serving-time | Pre-deployment and post-deployment | Depends on the reason; may need a fix or a model retrain if it’s due to data leakage
Null vs 0 values | Usually pre-deployment | Debug and fix serving-time feature logic (usually adding a fillna(0))
Serving-time value is usually higher | Pre-deployment and post-deployment | Debug and fix serving-time feature logic
Serving-time value is usually lower | Pre-deployment and post-deployment | Debug and fix serving-time feature logic
Examples not being scored at serving-time | Pre-deployment and post-deployment | Depends on the reason

Null values at serving-time

You assumed (at training time) some information would be available at inference time, but when you first deploy the real-time model all feature values are NULL. 

This may be a form of data leakage—you used future information to build features. If this happens you will see NULL values at inference time, while you had Non-null values during training. You may need to remove these features and retrain the model. See an example below in Figure 10:

Figure 10: If you only have NULL values for a feature at serving time, it may mean that there was data leakage during training. If that is the case, you may need to drop this feature and retrain the model.

You should only suspect data leakage if all values for a feature are NULL at serving time; if only some values are NULL, the problem may be caused by runtime exceptions or timeouts.

Null vs 0 values

It’s common that NULL values get mixed up with 0 (zero) during feature implementation. This happens with features based on counts, sums and averages. A common fix is to use fillna(0) to replace NULLs by zeros.

Example: Modeling was done with an R package that represents the size of an empty list as NULL. But at serving time features are fetched with Java code, where the semantics are different—the count of an empty list is represented as 0 (zero) instead of NULL. See Figure 11 below:

Figure 11: This is a classic example of Null vs 0 skew; note that we only have mismatches in the examples where the actual value of the feature is 0—there are no train-serve mismatches in the other cases.
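A one-line sketch of the usual fix, applied to whichever data path still carries NULLs (the column name below is hypothetical):

```python
import pandas as pd

# Count-like feature where one data path represents "no items" as NULL and the
# other as 0; filling NULLs with 0 makes the two data paths agree.
features = pd.DataFrame({"num_purchases_30d": [3.0, None, 0.0, None]})
features["num_purchases_30d"] = features["num_purchases_30d"].fillna(0)
```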

Serving-time value is usually higher

When the values for a feature are consistently higher than they should be, we probably have a bug in filters and/or date-ranges used at serving time. See Figure 12 below.

Example: One of the features in a fraud model is the number of settled credit-card purchases a customer made in the last month. However, the feature has been wrongly implemented at serving time—it’s using all purchases instead (settled and otherwise), so the numbers are sometimes higher than they should have been.

Figure 12: When feature values are consistently higher than they should be, this can indicate that the serving-time feature implementation is using overly loose filters and including more information than it should.

Serving-time value is usually lower

This is analogous to the previous mismatch type—with values lower, rather than higher, than expected. See Figure 13 below for what this looks like.

Example: A real-time credit-scoring model has a feature called “num_transfers_last_day”, which contains the number of transfers a customer made in the last 24 hours. However, the engineer tasked with implementing it at serving time thought that it meant the number of transfers in the current day (i.e. starting from 00:00 until the present moment).

Figure 13: In this case, we see that many instances have lower values than they should. Again, this may be due to a wrongly implemented filter, off-by-one errors, etc.

Examples not being scored at serving time

Sometimes real-time models end up scoring event distributions that did not exist in the training data. This is dangerous because we cannot trust the predictions given to instances sampled from a different distribution than the model was trained with.

Example 1: A bank trained a model to score the credit default risk for the first loan ever given to a customer. However, engineers mistakenly deployed the model to also score subsequent loans.

Example 2: A fraud model was trained to detect fraud attempts in credit-card purchases made online. However, an engineering team mistakenly deployed the model to score in-person purchases as well.

This type of skew is different from the previous ones because it doesn’t refer to feature mismatches, but to cases where the whole example shouldn’t be scored at all. In Figure 14 we see how this kind of problem would show up in the monitoring dataset: all feature values from the training data path will be NULL.

Figure 14: Here we see a case where two examples (IDs 0004 and 0005) got scored by the real time model, but they did not appear in the programmatically generated training dataset. It’s very important to use an OUTER join to join both datasets, so that examples in either data path show up in the monitoring dataset.

Note that this is different from data drift—the distribution didn’t change due to the simple passage of time, but due to differences in how the model is used.

General tips

Here are some other general tips and suggestions that may help you deal with train-serve skew.

If you can, use a feature store

With a feature store, the task of calculating features is delegated to a specialized system.

Modern feature stores support batch and real-time calculation, obviating the need to worry about train-serve skew. Such systems usually support write-once semantics, such that features are defined at a higher layer of abstraction, rather than being re-coded in production systems.

Floating-point precision differences

Sometimes the training and the serving data are different only by a few decimal places.

One of the situations where this happens is when you use different technologies (for example, Python and Java) to build train- and serving-time features. The floating-point precision may differ between the two, and you’ll end up with mismatches such as the ones shown below in Figure 15:

Figure 15: Floating-point precision differences such as those shown are usually not indicative of a real problem—they reflect the way different technologies deal with floating-point numbers.

These minute differences don’t usually count as real mismatches, as they have little to no impact on the model outputs. Apply some tolerance when comparing floating-point values to avoid wasting time on them.
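A sketch of an exact-match check with tolerance, built on numpy.isclose (the relative tolerance below is arbitrary; tune it to your features):

```python
import numpy as np
import pandas as pd


def approx_match_rate(monitoring_df: pd.DataFrame,
                      feature: str,
                      rel_tol: float = 1e-6) -> float:
    """Exact-match percentage for a numerical feature, tolerating tiny
    floating-point differences between train- and serving-time values."""
    close = np.isclose(
        monitoring_df[f"{feature}_train"],
        monitoring_df[f"{feature}_serve"],
        rtol=rel_tol,
        equal_nan=True,  # treat NULL == NULL as a match
    )
    return float(close.mean()) * 100
```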

Name features accurately

Well written feature names help avoid misunderstandings between modeling and engineering teams. The table below shows examples of good and bad feature names.

GOOD | BAD | Reason
num_transfers_last_24h | num_transfers_last_day | “last_day” is ambiguous: does it mean the last 24 hours, or the current day (00:00 until now)?
num_settled_purchases_120d | num_purchases_120d | If the feature only includes settled purchases, that should be clear from the feature name.

Tight communication between data scientists and machine learning engineers

As we mentioned earlier, data scientists (DSs) and machine learning engineers (MLEs) are key to taking a machine learning model to production, taking care of modeling and real-time implementation, respectively.

They should be part of a single team—such as a squad. If different teams are responsible for modeling and implementing models, chances are higher that there will be misunderstandings that cause train-serve skew.

You can probably use sampled data for monitoring

Monitoring ML models is time-consuming and expensive. You don’t need to use all of your data for train-serve skew monitoring.

If you do use sampled data, make sure it’s a deterministic sample, so that the same instances are included in both data paths (training and serving). Hash-based sampling is one way to do this.
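A sketch of hash-based sampling: because the decision is a pure function of the instance ID, the same instances end up in the monitoring sample on both data paths (the 10% rate is just an example):

```python
import hashlib


def in_monitoring_sample(instance_id: str, sample_pct: float = 10.0) -> bool:
    """Deterministically decide whether an instance belongs to the monitoring
    sample, based only on its ID (no random state involved)."""
    digest = hashlib.md5(instance_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < sample_pct


# Apply the same predicate when generating training data and when logging
# serving-time features, so both data paths keep exactly the same instances.
```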

Shadow-mode deployments

Shadow-mode deployment refers to fully deploying a real-time model, except that its predictions are not used to actually make decisions. This can be done with a feature flag (or a simple if-statement).

You can use shadow-mode deployments to test for train-serve mismatches—without causing any damage to the business, because the predictions will not be used.
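A minimal sketch of the feature-flag guard (the flag, the score threshold, and the fallback rule are all made up for illustration):

```python
SHADOW_MODE = True  # in practice this would come from a feature-flag service


def handle_request(instance_id: str, features: dict, model_score: float) -> dict:
    """Log everything needed for skew monitoring, but only let the model score
    drive the business decision once the model is out of shadow mode."""
    # Stand-in for the feature/prediction logging described earlier.
    print({"instance_id": instance_id, "features": features, "score": model_score})

    if SHADOW_MODE:
        # Fall back to the pre-existing (non-ML) decision logic.
        decision = "review" if features.get("num_chargebacks", 0) > 0 else "approve"
    else:
        decision = "review" if model_score >= 0.5 else "approve"
    return {"decision": decision}


# Example call with made-up values:
# handle_request("0001", {"num_chargebacks": 0}, model_score=0.12)
```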

Conclusion

In summary, train-serve skew is a significant issue in real-time machine learning models, arising from differences between the training and serving environments. This can be due to miscommunication between teams or unexpected changes in data sources.

To mitigate this, it’s vital to monitor and debug mismatches, with a focus on high-importance features. This involves collecting and comparing feature data from both training and serving paths, and addressing issues as they arise.

Using a feature store can simplify this process by outsourcing feature calculation to other teams. Also, maintaining good communication between data scientists and machine learning engineers can prevent misunderstandings leading to skew.

Deploying models in shadow-mode—where predictions are not used for decisions—can help test for train-serve mismatches without business impact. By addressing train-serve skew, businesses can ensure their real-time ML models are more accurate and reliable.
