Not so long ago, machine learning applications started blossoming across various industries. Companies swiftly adapted their existing infrastructure so they could ship machine learning models that generate predictions in batches.
That works well enough. Now everyone seems to be talking about how real-time machine learning is the future. But is it for real? Should we really go through the extra effort of putting (and maintaining) real-time models in production?
Are you interested in a mini roadmap to help you reason about when and how to build real-time machine learning models? Keep reading this article!
What are real-time models?
There isn’t a single definition of what a real-time model is.
When we talk about real-time models, we might be thinking about two distinct processes that can happen in real time: inference and learning.
The first thing that likely comes to mind is real-time learning, where a model continually receives new training data and updates its parameters. This is certainly exciting, but still rare in practice. Real-time can also mean real-time inference, where a trained model is able to receive requests at any time and return predictions synchronously. This is much more common in real life.
In this article, we present strategies for building models that make predictions in real time.
Why use real-time models?
Sometimes you have to use a real-time model simply because the problem you are tackling requires instant decision making.
Suppose we were asked to come up with an ML-based solution to help scale and improve customer service. Every time a user opens the chat in our app and writes a message, we should automatically identify what they are talking about and act accordingly (e.g. redirect them to a chat with a human specialist).
We might frame this problem as a multi-class classification problem, in which classes could be different products (such as credit card, savings account, investments, and so on), and build a classifier that receives the text written by the user and returns the product they are most likely to be talking about.
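For illustration, a minimal sketch of such a classifier using scikit-learn; the example texts, labels, and class names below are made up, and a real model would be trained on a large set of labeled chat messages:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for real labeled chat messages.
texts = ["my credit card was declined", "how do I open a savings account"]
labels = ["credit_card", "savings_account"]

# Bag-of-words features feeding a simple multi-class classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

model.predict(["problem with my card limit"])  # -> the most likely product class
```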
This use case requires a real-time model because the model feeds off freshly generated data, and also because the user expects a quick answer. From that specific use case, we can try and come up with some rules of thumb to help us decide when using a real-time model is ideal (or required):
- Improved user experience. Several situations are similar to our use case, where a synchronous response is expected. In other cases, the model needs to be embedded on a mobile device and therefore make predictions in real time, for instance when it must generate predictions even without an internet connection.
- Use of fresher data. Batch models make predictions from data that are at least a few hours old. Real-time models, on the other hand, are able to make predictions from data that are only seconds old, such as the text the user has just written or their current location.
- Unknown (or very large) set of inputs. Batch models make predictions from a predefined set of inputs. For instance, we may have a fraud model that runs in batches and generates one fraud score for each user; in this case, the set of inputs corresponds to the set of users. Sometimes it is not feasible to know beforehand what all the possible inputs are. In our use case, the set of inputs would be the set of all texts that users can write, which are mostly unknown.
- Efficient use of resources. Batch models generate predictions for all possible inputs, even when most of those predictions are not being used for decision making. Real-time models, on the other hand, generate predictions for one input at a time, only when they are needed.
Alright. There are tons of good reasons for building real-time models. Now it’s just a matter of deploying them. Should be quick and easy, right? We could do that in less than 10 lines of code:
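Something along these lines, say, assuming Flask and a pickled copy of our classifier (the model file name and route below are just placeholders):

```python
# A naive serving sketch: load a pickled classifier and expose a /predict endpoint.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open("chat_classifier.pkl", "rb"))

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    return jsonify({"product": model.predict([text])[0]})

app.run()
```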
Not so fast.
How to build real-time models?
Since infrastructure constraints are likely to impact modeling decisions, building a real-time model requires very close collaboration between the data scientist and the machine learning engineer.
We will talk about two requirements that we need to keep in mind from the start of the development of a real-time model: real-time pipeline and fast inference.
Real-time Pipeline
A real-time pipeline should gather and prepare all inputs required by the model. Data may be collected from different sources:
- Request payload. In the simplest scenario of our use case, we could have the text written by the user as our only input. Right after the user executes an action, such as writing a message and pressing the send button, data are added to a payload and directly sent to the model via a request.
- Streaming events. If we want to enrich our model by adding more information on the recent behavior of the user, we could create a feature that includes the last screens the user has seen before opening the chat. These data would not be available in the context of our chat model microservice, nor in the context of an analytical environment. Because they are very fresh data, we would need to fetch them from streaming events generated by other microservices.
- Feature store. We might want to further enhance our model by also adding information on the history of the user. We could create a feature that tells us the products the user has used the most in the last 60 days or so. This feature would be generated by aggregating historical data. It could be time-consuming to generate it in real time, so ideally it would be pre-generated in batches, and thus we would fetch it from a feature store.
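Putting the three sources together, gathering all inputs for a single request might look roughly like the sketch below. The client objects and feature names are hypothetical stand-ins for whatever streaming consumer and feature-store client a company has in-house:

```python
def gather_inputs(payload, streaming_client, feature_store):
    # Assemble all model inputs for a single request.
    user_id = payload["user_id"]
    return {
        "text": payload["text"],                                             # request payload
        "last_screens": streaming_client.last_screens(user_id),              # streaming events
        "top_products_60d": feature_store.get(user_id, "top_products_60d"),  # feature store
    }
```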
After gathering the data, we still need to preprocess them. Historical data coming from the feature store are already preprocessed, whereas fresh data coming from the request payload or from streaming events arrive in their rawest form. We can now clearly see that we have two separate pipelines: a batch pipeline and a real-time pipeline.
We want to make sure that the preprocessing function applied to data in the batch pipeline during training is exactly the same function applied to data in the real-time pipeline during inference. A mismatch between the two is known as train-serve skew.
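A simple way to reduce this risk is to keep the preprocessing in a single function (or module) that both pipelines import, as in the sketch below; real preprocessing would of course be richer than this:

```python
# shared_preprocessing.py: imported by both the batch (training) pipeline and the
# real-time (inference) service, so the exact same transformation runs in both.
def preprocess(text: str) -> str:
    # A deliberately tiny example of a text-cleaning step.
    return " ".join(text.lower().split())

# Batch pipeline (training):      X_train = [preprocess(t) for t in historical_texts]
# Real-time pipeline (inference): features = preprocess(payload["text"])
```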
Fast Inference
Recall that super cutting-edge neural network you built that achieved something like 99% accuracy for all classes? If you measure its prediction time, you might be surprised to find that a single prediction takes a few seconds. Even though that may sound fast for such a big neural network, it actually isn't.
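If you want to check for yourself, a rough way to estimate latency is to time repeated calls to the prediction function, as in this sketch (a proper benchmark would also warm up the model and look at percentiles rather than the average):

```python
import time

def average_latency_ms(predict_fn, sample, n=100):
    # Average wall-clock time per prediction, in milliseconds.
    start = time.perf_counter()
    for _ in range(n):
        predict_fn(sample)
    return (time.perf_counter() - start) / n * 1000

# e.g. average_latency_ms(lambda text: model.predict([text]), "some user message")
```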
A response that is considered fast usually takes milliseconds. Think about how long the user would be willing to wait before they retry an action or simply leave the app.
Real-time models need to be fast. There are basically two ways of making them faster: using more powerful hardware, or building lighter models.
Using more powerful hardware (such as GPUs) seems like a reasonable quick fix, but it might be harder to maintain in the long term, since it would likely be a non-standardized solution and require closer monitoring. Moreover, the overall response time might still not be fast enough: a heavy model running inference on a GPU also pays a considerable communication overhead between CPU and GPU.
On the other hand, building lighter models is more cost-efficient and easier to maintain. If we were using light models, we would be able to scale machine learning services horizontally just like regular microservices, possibly using already existing in-company tools.
Heavy models can be compressed using various techniques, such as:
- Pruning. Finding redundant weights in a tree or neural network and setting them to zero, that is, cutting some connections between nodes in the model. It is based on the assumption that a complex model contains several submodels, and thus pruning tries to find an optimal submodel.
- Knowledge Distillation. A compact model is trained to mimic the behavior of a big model. It is typically used in the context of neural networks. We may either train a distilled model from scratch or use pre-trained distilled models already available in libraries. One such model is DistilBERT, a distilled version of BERT that is only 60% of the size of the original.
- Quantization. It consists of lowering the precision of the floating-point numbers that represent the weights of the model. It is usually done after training, since trying to optimize weights using low-precision values might lead to unstable training or even divergence. As an example, we may lower the precision of weights from 32-bit floating point numbers to 8-bit integers, which would result in a model that is 4x smaller.
It is worth noting that pruning and quantization are available both in TensorFlow and in PyTorch, so it should be quick and easy to run experiments combining different techniques.
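As an illustration, post-training dynamic quantization in PyTorch takes only a few lines; the toy model below is a stand-in for whatever trained network we want to compress:

```python
import torch

# A toy model standing in for the trained network we want to compress.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 8),
)

# Post-training dynamic quantization: Linear weights are stored as 8-bit integers
# and dequantized on the fly during inference.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```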
Besides compressing models, we may also evaluate the effectiveness of using caching to store some predictions. In our use case, after preprocessing the text input, we might end up with an input that frequently repeats. In that case, we would call the model only the first time that input is seen; then in subsequent calls, we’d fetch the prediction directly from cache.
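A minimal sketch of such a cache, assuming the trained `model` and the shared `preprocess` function from the earlier sketches; an in-process LRU cache is the simplest option, though a shared external cache is also common in practice:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_predict(clean_text: str) -> str:
    # The model is only called on a cache miss; repeated preprocessed inputs
    # are served directly from the cache.
    return model.predict([clean_text])[0]

# cached_predict(preprocess("my credit card was declined"))
```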
But… is this real life?
It sure is! Most companies begin their machine learning journeys by experimenting with batch models, since they are perceived as an easier and safer approach. However, as machine learning experts and business stakeholders work together to discover new areas where machine learning could be applied to maximize value, problems that require real-time models (such as the chat model we’ve talked about) inevitably arise.
Tons of companies are already shipping real-time machine learning models in a safe and scalable way, including Nubank. If you’re curious about what is possible to do with real-time machine learning systems, come join us.
To learn more about real-time models, check out the recording of Ana’s talk at the Building Nu Meetup:
Written by Ana Martinazzo
Reviewed by Felipe Almeida