Introducing fklearn: Nubank’s machine learning library (Part I)

In April 2019, Nubank open-sourced fklearn, our machine learning Python library.

Read Part II of this story here.

At Nubank, we rely heavily on machine learning to make scalable data-driven decisions. While there are many other ML libraries out there (we use XGBoost, LightGBM, and scikit-learn extensively, for example), we felt the need for a higher-level abstraction that would help us apply these libraries more easily to the problems we face. Fklearn wraps these libraries in a common format that makes their use in production more effective.

Fklearn currently powers a broad set of machine learning models at Nubank, solving problems that range from credit scoring to automated customer support chat responses. We built it with the following goals in mind:

  1. Validation should reflect real-life situations
  2. Production models should match validated models
  3. Models should be production-ready with few extra steps
  4. Reproducibility and in-depth analysis of model results should be easy to achieve

Early on, we decided that functional programming would be a powerful ally in trying to achieve these goals.

F is for Functional

Here at Nubank, we’re big fans of functional programming, and that isn’t limited to the Engineering chapter. But how does functional programming help Data Scientists?

Machine Learning is frequently done using object-oriented Python code, and that’s the way we used to do it at Nubank as well. Back then, the process of building machine learning models and putting them into production was tiresome and often full of bugs. We’d deploy a model only to find that predictions made in production didn’t match the ones seen during validation. What’s more, validation was often impossible to reproduce, as it was frequently done in stateful Jupyter Notebooks.

Functional programming helps fix these issues by:

  • making it easy to build pipelines where the data transformations that happen during training match the models in production;
  • allowing for safer iteration in interactive environments (e.g., Jupyter Notebooks), preventing mistakes caused by stateful code, and making research more reproducible;
  • allowing us to write very generic validation, tuning, and feature-selection code that works across model types and applications, making us more efficient overall.

Let’s go through an example to see how functional programming does this in practice. Say we’re trying to predict how much someone will spend on their credit card based on two variables: monthly income and previous bill amount. As the output of this model will be used for sensitive decision-making, we’d like to make sure it is robust to outliers in the input variables, which is why we decide to:

  1. Cap monthly income to 50,000, since income is self-reported and sometimes exaggerated.
  2. Limit the output range of the model to the [0, 20,000] interval.

And then use a simple linear regression model. Here’s what the code looks like:
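(The snippet below is a sketch built on fklearn’s public API — capper, linear_regression_learner, prediction_ranger, and build_pipeline; the column names monthly_income and bill_amount, the spend target, and the tiny train_data DataFrame are illustrative stand-ins.)

```python
import pandas as pd

from fklearn.training.pipeline import build_pipeline
from fklearn.training.regression import linear_regression_learner
from fklearn.training.transformation import capper, prediction_ranger

# A tiny illustrative training set (a stand-in for real data)
train_data = pd.DataFrame({
    "monthly_income": [3000.0, 9000.0, 120000.0, 5500.0],
    "bill_amount": [400.0, 1500.0, 8000.0, 900.0],
    "spend": [500.0, 2000.0, 15000.0, 1100.0],
})

# Initialize the three learner functions (each is curried: we pass its
# parameters now and the training data later)
capper_fn = capper(columns_to_cap=["monthly_income"],
                   precomputed_caps={"monthly_income": 50000})
regression_fn = linear_regression_learner(features=["monthly_income", "bill_amount"],
                                          target="spend")
ranger_fn = prediction_ranger(prediction_min=0.0, prediction_max=20000.0)

# Compose the three learners into a single pipeline and train it
learner = build_pipeline(capper_fn, regression_fn, ranger_fn)
predict_fn, training_predictions, logs = learner(train_data)
```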

Don’t be alarmed! We’ll go through the code step by step, explaining some important fklearn concepts.

Learner functions

While in scikit-learn the primary abstraction for a model is a class with fit and transform methods, in fklearn we use what we call a learner function. A learner function takes in some training data (plus other parameters), learns something from it, and returns three things: a prediction function, the transformed training data, and a log. In our example, the first three fklearn calls initialize three learner functions: capper, linear_regression_learner, and prediction_ranger.

To better illustrate it, here’s a simplified definition of the linear_regression_learner:
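(A sketch — fklearn’s real implementation adds more options and richer logging, but the overall shape is the same.)

```python
from typing import List

import pandas as pd
from sklearn.linear_model import LinearRegression
from toolz import curry

from fklearn.types import LearnerReturnType


@curry
def linear_regression_learner(df: pd.DataFrame,
                              features: List[str],
                              target: str) -> LearnerReturnType:
    # learn something from the training data: fit the regression
    model = LinearRegression()
    model.fit(df[features], df[target])

    # the prediction function closes over the fitted model; it takes any
    # DataFrame with the required columns and adds a prediction column
    def predict_fn(new_df: pd.DataFrame) -> pd.DataFrame:
        return new_df.assign(prediction=model.predict(new_df[features]))

    # the log holds anything useful for inspection and debugging
    log = {"linear_regression_learner": {
        "features": features,
        "target": target,
        "coefficients": model.coef_.tolist(),
    }}

    # prediction function, transformed training data, and log
    return predict_fn, predict_fn(df), log
```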

Notice the use of type hints! They help make functional programming in Python less awkward, along with the immensely useful toolz library.

As we mentioned, a learner function returns three things (a function, a DataFrame, and a dictionary), as described by the LearnerReturnType definition:
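(Roughly the following — a sketch of the aliases in fklearn.types; the exact alias names there may differ slightly.)

```python
from typing import Any, Callable, Dict, Tuple

import pandas as pd

# a prediction function: DataFrame in, DataFrame out
PredictFnType = Callable[[pd.DataFrame], pd.DataFrame]

# a learner returns (prediction function, transformed training data, log)
LearnerReturnType = Tuple[PredictFnType, pd.DataFrame, Dict[str, Any]]
```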

  • The prediction function always has the same signature: it takes in a DataFrame and returns a DataFrame (we use Pandas). It should be able to take in any new DataFrame (as long as it contains the required columns) and transform it (it is equivalent to the transform method of a scikit-learn object). In this case, the prediction function simply creates a new column with the predictions of the linear regression model that was trained.
  • The transformed training data is usually just the prediction function applied to the training data. It is useful when you want predictions on your training set, or for building pipelines, as we’ll see later.
  • The log is a dictionary and can include any information that is relevant for inspecting or debugging the learner (e.g., what features were used, how many samples there were in the training set, feature importance, or coefficients).
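With the simplified learner sketched above, for example, the log is a plain dictionary that can be inspected right after training (train_data is the illustrative DataFrame from the first sketch):

```python
predict_fn, training_predictions, log = linear_regression_learner(
    train_data, features=["monthly_income", "bill_amount"], target="spend")

print(log["linear_regression_learner"]["features"])      # ['monthly_income', 'bill_amount']
print(log["linear_regression_learner"]["coefficients"])  # the fitted regression weights
```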

Learner functions show some common functional programming properties:

  • They are pure functions, meaning they always return the same result given the same input, and they have no side-effects. In practice, this means you can call the learner as many times as you want without worrying about getting inconsistent results. This is not always the case when calling fit on a scikit-learn object, for example, as objects may mutate.
  • They are higher-order functions, as they return another function (the prediction function). As the prediction function is defined within the learner itself, it can access variables in the learner function’s scope via its closure.
  • By having consistent signatures, learner functions (and prediction functions) are composable. It means building entire pipelines out of them is straightforward, as we’ll see soon.
  • They can be curried, meaning you can initialize them in steps, passing just a few arguments at a time (this is what’s actually happening when the three learners in our example are initialized). This is useful when defining pipelines and when applying a single model to different datasets while getting consistent results, as the sketch below shows.
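For instance, currying lets us fix a learner’s parameters first and supply the data later; being pure, the same initialized learner can then be applied to several datasets without one run affecting another (the dataset names here are illustrative):

```python
import pandas as pd

from fklearn.training.regression import linear_regression_learner

train_data_2019 = pd.DataFrame({"monthly_income": [3000.0, 9000.0, 5500.0],
                                "bill_amount": [400.0, 1500.0, 900.0],
                                "spend": [500.0, 2000.0, 1100.0]})
train_data_2020 = train_data_2019.assign(spend=[700.0, 2100.0, 1300.0])

# Step 1: fix the parameters, with no data in sight
regression_fn = linear_regression_learner(features=["monthly_income", "bill_amount"],
                                          target="spend")

# Step 2: supply the data later; the same initialized learner can be
# applied to different datasets, and repeated calls always give
# consistent results
predict_fn_2019, _, _ = regression_fn(train_data_2019)
predict_fn_2020, _, _ = regression_fn(train_data_2020)
```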

It may take some time to wrap your head around all this, but don’t worry, you don’t need to be an expert in functional programming to use fklearn effectively. The key is understanding that models (and other data transformations) can be defined as functions following the learner abstraction.

Pipelines

Machine Learning models rarely exist on their own, however. By focusing only on the model, Data Scientists tend to forget which transformations the data goes through before and after the ML step. These transformations often need to be exactly the same when training and deploying models, and Data Scientists might try to manually recreate their training pre- and post-processing steps in production, which leads to code duplication that is hard to maintain.

Learner functions are composable, meaning two or more learners combined can be seen as just a new, more complex learner. This means that no matter how many steps you have in your pipeline, your final model will behave just the same as a single one, and making predictions is as simple as calling the final prediction function on new data. Having all the steps in your modeling pipeline contained in a single, pure function also helps with validation and tuning, as we can pass it around to other functions without fear of side effects.

In our example, our pipeline consists of three steps: capping the income variable, running the regression, then constraining the regression output to the [0, 20,000] range. After each learner is initialized, we build the pipeline and apply it to the training set using these two lines of code:
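(These are the last two lines of the sketch above.)

```python
learner = build_pipeline(capper_fn, regression_fn, ranger_fn)
predict_fn, training_predictions, logs = learner(train_data)
```

Making predictions on new data is then a single call — e.g. predict_fn(test_data), where test_data is any DataFrame with the required columns — which applies the capping, the regression, and the output clipping in sequence.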

The learner variable now contains the pipeline resulting from composing the three learner functions and is applied to the training data to yield the final prediction function. This function will apply all the equivalent steps in the pipeline to the test data, as the image below illustrates:

[Figure: how data flows through a pipeline when training, and through a prediction function when predicting. The prediction function itself is returned by the pipeline; it is the composition of the three prediction functions generated by each learner when the pipeline was first called on the training data. The logs are a combination of the logs coming from all learner functions in the pipeline.]

What’s next?

We’ve seen how models and data transformation steps can be written as learner functions, and how functional pipelines in fklearn help us ensure that transformations done during training and validation match those done in production.

In Part II of this blog post, we talk about model validation and analysis, and the tools fklearn provides to make those steps more effective.

In the meantime, we invite you to try fklearn for yourself! We don’t expect fklearn to replace the current standards in ML, but we hope it starts exciting conversations about the benefits of functional programming for Machine Learning.
