Introducing fklearn: Nubank’s machine learning library (Part II)

Benefits from doing analysis and validation of models

Read Part I of this series here.

In the first part of this series, we talked about fklearn’s principles, and how it can be useful to build machine learning pipelines using pure functions. However, the most significant benefits of using fklearn come when doing analysis and validation of models, as we’ll see next.

Extensible Validation

For models that go into production, validation often requires a lot more than just finding out the value of some metric on a holdout set, or comparing a metric to a benchmark. We often want to answer questions such as:

  • How do we expect the model to perform over time, say six months after it has been trained?
  • Would more data help improve performance?
  • How often should I retrain?
  • Is it better to use more data or only recent data?
  • Does it perform equally well across user segments?

These may all sound like very different questions, but they can be answered by the same generalized model validation algorithm.

Fklearn’s generalized validation algorithm: Data are first split into chunks, a training function is applied to some subset of data, predictions are made on another subset, then an evaluation function is applied to the predictions.

In fklearn, this algorithm is implemented by the validator function:

The validator function in fklearn (inside fklearn.validation.validator) implements the generalized validation algorithm described here.

  • The train_fn is a learner function, exactly as we defined it in Part I.
  • The split_fn controls how the data will be split into training and testing sets. It receives a DataFrame and returns lists of training and testing indexes.
  • The eval_fn defines which metrics will be computed on the testing sets. More on evaluation functions later.

The validator returns a set of logs containing the results of all evaluations made on the testing sets. Analysis can then be done by extracting the data from these logs (fklearn provides helper functions for doing so).
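To make the flow concrete, here is a minimal sketch of the generalized algorithm in plain Python. It is not fklearn's implementation: lists of dicts stand in for DataFrames, and toy_validator, mean_learner, holdout_split and mse_eval are hypothetical names made up for this illustration. The fold format (training indexes plus a list of test index lists) is also an assumption.

```python
from statistics import mean

def mean_learner(train_rows):
    """Toy learner: always predicts the mean target seen in training."""
    avg = mean(row["y"] for row in train_rows)

    def predict(rows):
        # Build new rows instead of mutating the inputs (pure function).
        return [{**row, "prediction": avg} for row in rows]

    log = {"learner": "mean_learner", "train_size": len(train_rows)}
    return predict, predict(train_rows), log

def holdout_split(rows):
    """Toy split_fn: the first 75% of rows train, the rest test."""
    cut = int(len(rows) * 0.75)
    idx = list(range(len(rows)))
    return [(idx[:cut], [idx[cut:]])]

def mse_eval(rows):
    """Toy eval_fn: mean squared error of the predictions."""
    return {"mse": mean((r["y"] - r["prediction"]) ** 2 for r in rows)}

def toy_validator(rows, split_fn, train_fn, eval_fn):
    """Split, train, predict and evaluate; return one log per fold."""
    logs = []
    for fold_num, (train_idx, test_idx_list) in enumerate(split_fn(rows)):
        predict, _, train_log = train_fn([rows[i] for i in train_idx])
        eval_results = [eval_fn(predict([rows[i] for i in test_idx]))
                        for test_idx in test_idx_list]
        logs.append({"fold_num": fold_num, "train_log": train_log,
                     "eval_results": eval_results})
    return logs

data = [{"x": i, "y": float(i % 4)} for i in range(8)]
logs = toy_validator(data, holdout_split, mean_learner, mse_eval)
```

Swapping holdout_split or mse_eval for another function changes the scenario being simulated without touching the learner.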

By merely swapping the splitting and evaluation functions, we can simulate and evaluate many different real-life scenarios, which helps us answer those questions.

Let’s look at an example, using the learner we defined in Part I, in which we try to answer these two questions:

  1. We have two different user segments, A and B. How different will model performance be between the two segments?
  2. We expect this model to keep running for a few months after it is trained. How much do we expect model performance to degrade 3, 6, or 9 months after training?

A full example of training and evaluation.

Let’s go over this example in more detail.

Splitting Functions

In the example above, after we define the learner, we define two splitting functions:

  1. A K-fold cross-validation splitting function. It splits data randomly into k separate sets and is used for performing standard cross-validation. Cross-validation is useful for getting a basic estimate of how well the model generalizes within the same time period it was trained on.
  2. A stability validation splitting function. It splits data based on the time_column parameter. The model is trained on data up to training_time_limit and validated on the data after that limit, split by month. Stability evaluation is useful for estimating how the model will generalize over time, beyond the training period.
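The two splitting strategies can be sketched in plain Python as follows. This is only an illustration of the idea, not fklearn's splitters: toy_k_fold_split and toy_stability_split are hypothetical helpers, rows are dicts rather than DataFrame rows, and treating training_time_limit as an exclusive bound is an assumption made here.

```python
import random
from datetime import date

def toy_k_fold_split(rows, n_splits, seed=0):
    """Randomly partition row indexes into n_splits folds.

    Each fold serves as the test set once; the rest is training data.
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    chunks = [idx[i::n_splits] for i in range(n_splits)]
    folds = []
    for k, test_idx in enumerate(chunks):
        train_idx = [i for j, chunk in enumerate(chunks) if j != k
                     for i in chunk]
        folds.append((sorted(train_idx), [sorted(test_idx)]))
    return folds

def toy_stability_split(rows, time_column, training_time_limit):
    """Train before the limit; test on each later calendar month."""
    train_idx = [i for i, r in enumerate(rows)
                 if r[time_column] < training_time_limit]
    by_month = {}
    for i, r in enumerate(rows):
        if r[time_column] >= training_time_limit:
            key = (r[time_column].year, r[time_column].month)
            by_month.setdefault(key, []).append(i)
    # One test set per month, in chronological order.
    test_idx_list = [by_month[m] for m in sorted(by_month)]
    return [(train_idx, test_idx_list)]

# Two made-up rows per month, January to June 2019.
rows = [{"date": date(2019, month, day), "y": month}
        for month in range(1, 7) for day in (5, 20)]
k_folds = toy_k_fold_split(rows, n_splits=3)
stability_folds = toy_stability_split(rows, "date", date(2019, 4, 1))
```

The K-fold version mixes all months into every fold, while the stability version never lets the model see anything from April onwards during training, which is closer to how a production model is actually used.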

Fklearn comes pre-packaged with many splitting functions for common use cases (you can find a full list and descriptions of when they are useful here). Most splitters are designed to simulate real-life situations, where models are usually trained on data from one time period but applied in the future. Simple cross-validation is frequently insufficient for real models.

Customizable Evaluation

Single, global metrics often don’t tell the full story in model evaluation. We might want to isolate the effects of model rank ordering and model calibration, or look into performance in specific subgroups of our population. We might also be interested in the evolution of a particular metric over time, rather than just a point-in-time estimate.

For this, we need to simultaneously evaluate multiple metrics, split across dimensions (e.g., time, customer segment). Fklearn enables this by allowing us to define “evaluation trees”, combining individual evaluation functions. Looking back at our example, here’s how we defined the evaluation function:

Example of defining an evaluation tree.

This code snippet leads to the evaluation tree shown below:

Tree representation of the evaluation function defined above.

Once final_eval_fn is applied to data, the entire tree of evaluators runs, and all results are returned in the log. This means both R2 and Spearman correlation will first be computed for everyone, then separately for each user segment. These “evaluation trees” can be very powerful, allowing data scientists to automate recurring analyses.
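The idea of composing evaluators into a tree can be sketched in plain Python. This is a simplified illustration and not fklearn's evaluators: combine and split_by are hypothetical helpers, the column names y, prediction and segment are assumptions, and the Spearman formula used here is only valid when there are no tied values.

```python
def r2(rows):
    """Coefficient of determination between 'y' and 'prediction'."""
    ys = [r["y"] for r in rows]
    y_mean = sum(ys) / len(ys)
    ss_res = sum((r["y"] - r["prediction"]) ** 2 for r in rows)
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return {"r2": 1 - ss_res / ss_tot}

def _ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def spearman(rows):
    """Spearman rank correlation, using the no-ties formula."""
    n = len(rows)
    ry = _ranks([r["y"] for r in rows])
    rp = _ranks([r["prediction"] for r in rows])
    d2 = sum((a - b) ** 2 for a, b in zip(ry, rp))
    return {"spearman": 1 - 6 * d2 / (n * (n ** 2 - 1))}

def combine(evaluators):
    """Tree node: run several evaluators on the same rows, merge results."""
    def combined(rows):
        out = {}
        for evaluator in evaluators:
            out.update(evaluator(rows))
        return out
    return combined

def split_by(column, eval_fn):
    """Tree node: run eval_fn once per distinct value of `column`."""
    def split(rows):
        values = sorted({r[column] for r in rows})
        return {f"split_{column}_{v}":
                eval_fn([r for r in rows if r[column] == v])
                for v in values}
    return split

base_eval = combine([r2, spearman])
tree_eval = combine([base_eval, split_by("segment", base_eval)])

# Made-up scored rows: perfect on segment A, reversed order on segment B.
scored = [
    {"segment": "A", "y": 1.0, "prediction": 1.0},
    {"segment": "A", "y": 2.0, "prediction": 2.0},
    {"segment": "A", "y": 3.0, "prediction": 3.0},
    {"segment": "B", "y": 4.0, "prediction": 6.0},
    {"segment": "B", "y": 5.0, "prediction": 5.0},
    {"segment": "B", "y": 6.0, "prediction": 4.0},
]
results = tree_eval(scored)
```

In this toy data the global metrics look acceptable, while the per-segment results show that the ranking is perfect for segment A and fully inverted for segment B, which is exactly the kind of detail a single global number hides.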

Analyzing results

As you may have noticed, most operations in fklearn return logs. These logs collect valuable information, be it model parameters, dataset metadata, or validation results. Fklearn can be very verbose with logging, as we’ve often found ourselves regretting not having saved some piece of information.

Fklearn also provides helper functions to extract data from these logs (which can become quite large), and it is easy to create evaluation plots using only the logs. This allows us to build generic evaluation code that receives logs from training runs and generates dashboards with model performance, speeding up the iteration process. Here’s an example of extracting data from the logs:

Example of using fklearn’s extractors to get results from logs.
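Independently of fklearn's own extractors, the general idea of turning nested logs into flat, plot-ready rows can be sketched in plain Python. The log layout below (fold_num, eval_results, split_segment_* keys) is an assumption made for this illustration, the metric values are made up, and flatten_eval and extract_rows are hypothetical helpers.

```python
def flatten_eval(result, path=()):
    """Yield one flat row per leaf metric in a nested evaluation result."""
    for key, value in result.items():
        if isinstance(value, dict):
            # Nested dicts are sub-trees (e.g. one per segment).
            yield from flatten_eval(value, path + (key,))
        else:
            yield {"group": "/".join(path) or "all",
                   "metric": key, "value": value}

def extract_rows(logs):
    """Flatten every fold and test period in the logs into rows."""
    rows = []
    for log in logs:
        for period, result in enumerate(log["eval_results"]):
            for row in flatten_eval(result):
                rows.append({"fold_num": log["fold_num"],
                             "period": period, **row})
    return rows

# Made-up log with two test periods, only to show the shape of the data.
toy_logs = [{
    "fold_num": 0,
    "eval_results": [
        {"spearman": 0.9,
         "split_segment_A": {"spearman": 0.95},
         "split_segment_B": {"spearman": 0.8}},
        {"spearman": 0.85,
         "split_segment_A": {"spearman": 0.9},
         "split_segment_B": {"spearman": 0.7}},
    ],
}]
table = extract_rows(toy_logs)

# Series for one curve of a stability plot: segment B over time.
curve_b = [r["value"] for r in table
           if r["group"] == "split_segment_B" and r["metric"] == "spearman"]
```

Because the rows are flat, the same table can feed any plotting tool, and adding a new metric to the evaluation tree requires no change to the extraction code.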

Sample stability curve plot, after extracting the data from the logs. It shows model performance (Spearman correlation between prediction and target) over time, split by segment. It would also be possible to plot R2 from the same data.

Functional bliss

As a final note on validation, notice that the learner we defined will be used again and again inside these validator calls, training several models. We get the peace of mind of knowing it is a pure function – so nothing that happens inside validation can change our model definition – and that, when validating our model, all the steps in our pipeline are being applied consistently to all the data folds. Ultimately, this means that our final model, which goes to production, accurately matches the models going through all these validation scenarios.

The same goes for tuning or feature selection. For both, fklearn provides functions that are similar in spirit to the validator: you define how to split data, how to evaluate model performance and reuse your learner function.

What’s next?

This post concludes our brief introduction to fklearn. For more examples of fklearn’s model validation capabilities and other powerful tools, check the documentation here. We also hope this gets you excited about trying fklearn yourself.
