Nubank is a data-driven company that has Data Science as one of its four pillars, together with Technology, Design and Customer Experience. In practice, data science models have been making automated decisions since the company’s very beginning, and a state-of-the-art data infrastructure and platform allow any Nubanker to manipulate data.
Growing product offerings translated into an increase in collected data, which made finding the right data more and more time-intensive. So on top of easy and secure access to data, Nubankers need ready-to-use, curated, quality-checked data, so they can focus on the analysis and application of their data rather than on looking for the right data.
Let’s take a simple example. Say you need the number of customers for your analysis. You could use the number of people who have registered, those who have made a first transaction or transfer, those who were active last month, those who haven’t churned or were not blocked for fraud, or who did not default.
How do you know all the options, and which ones make your numbers comparable with others’ analyses?
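To make the ambiguity concrete, here is a minimal sketch (the records and field names are invented for illustration, not Nubank’s actual schema) showing how each definition of “customer” is just a different filter over the same data, each yielding a different count:

```python
from datetime import date

# Hypothetical customer records; field names are illustrative only.
customers = [
    {"id": 1, "registered": True, "first_transaction": date(2020, 1, 5),
     "active_last_month": True, "churned": False, "blocked_for_fraud": False},
    {"id": 2, "registered": True, "first_transaction": None,
     "active_last_month": False, "churned": False, "blocked_for_fraud": False},
    {"id": 3, "registered": True, "first_transaction": date(2019, 6, 1),
     "active_last_month": False, "churned": True, "blocked_for_fraud": False},
]

# Each "number of customers" is a different predicate over the same records.
definitions = {
    "registered": lambda c: c["registered"],
    "transacting": lambda c: c["first_transaction"] is not None,
    "active_last_month": lambda c: c["active_last_month"],
    "not_churned_nor_blocked": lambda c: not c["churned"] and not c["blocked_for_fraud"],
}

counts = {name: sum(1 for c in customers if pred(c))
          for name, pred in definitions.items()}
print(counts)  # {'registered': 3, 'transacting': 2, 'active_last_month': 1, 'not_churned_nor_blocked': 2}
```

Four reasonable definitions, four different answers to the same question.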
In order to tackle data quality and ultimately increase Nubankers’ efficiency in getting data and making the best data-driven decisions, we recently created a new layer of quality-checked data in our data warehouse, with a new central team to coordinate its adoption and sustainability.
In this text, we will:
- give some context about Nubank’s previously decentralized approach to data quality;
- explain how it led to the challenge of achieving both efficiency and data quality;
- present the “core data” layer and framework we introduced in January 2020 together with a central team to increase both efficiency and quality.
Part 1: Nubank is a data-driven, fast-paced, decentralized organization, with a self-service data platform
We work in a company that has experienced tremendous growth: 7 years after launching, Nubank boasts over 25 million customers, more than any other fintech outside Asia.
It has achieved this impressive growth with a very agile organization, both in terms of people (about 100 autonomous teams) and engineering architecture (about 500 microservices). This means cross-functional teams (squads), each focused on one specific project or goal of the company, together with some horizontal central teams.
The Horizontal Data Team had been focused on democratizing access to data by implementing data flow (ETL) infrastructure, integrating with querying / BI tools and enforcing the protection of data privacy. However, as each squad was free to create its own tables, analyses and data science models, data quality varied widely.
The execution of the ETL (Extract, Transform, Load: the process that transfers data from our production microservices to our Data Lake and Data Warehouse) is the responsibility of the Horizontal Team; however, any squad can add a new data source or a new table to the ETL just by using the self-service tools developed internally.
With a strong quantitative and engineering mindset among Business Analysts, adoption of the data tools was quick and broad. One impressive thing at Nubank is that hundreds of analysts know how to write SQL and Scala, use git, and write tests for their transformations.
In many companies, adding a transformation to the automated ETL is a bottleneck, so the number of automatically running tables is limited. At Nubank, however, this process has been democratized so that everyone can contribute with very little friction.
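As a rough sketch of what such frictionless contribution can look like (Nubank’s internal tooling is Scala-based and not public, so every name below is invented for illustration), a squad declares a transformation and the scheduler picks it up automatically:

```python
# Hypothetical self-service dataset registration, sketched in Python.
DATASETS = {}

def dataset(name, inputs=()):
    """Register a transformation so a hypothetical ETL scheduler can run it."""
    def register(fn):
        DATASETS[name] = {"inputs": list(inputs), "compute": fn}
        return fn
    return register

@dataset("monthly_active_customers", inputs=["raw.transactions"])
def monthly_active_customers(transactions):
    # Toy transformation: distinct customers seen in the input rows.
    return sorted({row["customer_id"] for row in transactions})

rows = [{"customer_id": 7}, {"customer_id": 3}, {"customer_id": 7}]
print(DATASETS["monthly_active_customers"]["compute"](rows))  # [3, 7]
```

With a declarative pattern like this, adding a table is one merged commit away from running in production, with no central gatekeeper in the loop.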
Our cutting-edge self-service data tools were “victims of their own success”: in a few years we had not only thousands of raw tables but also thousands of transformed data tables – and growing!
Part 2: Decentralized analytics incurred data consistency and analytical productivity challenges
Quality checks vs. speed of execution
All Nubankers can use our self-service data platform to manipulate data, and each is responsible for checking the quality of the data they use and publish.
However, checking quality of data became more and more time-intensive because of the increase in both raw and transformed data tables:
- raw data (20K tables), as our financial product offering diversified from a credit card to reward points, savings accounts and loans;
- transformed tables (10K) in our analytics environment, with a growing number of contributors (~500, among them analytics engineers, analysts, data engineers and data scientists).
Keeping concepts and metric definitions consistent across the company also became challenging as the number of teams grew to a hundred, all needing to coordinate and collaborate in the best way.
Each data usage required repeated quality-check tasks: making sure the data consumed is understood (readability), correct (accuracy) and coherent (consistency). Ultimately, the speed of delivery (productivity) of data analytics suffered.
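Those three dimensions can be sketched as simple programmatic checks. This is a minimal illustration only, assuming a toy table held as a list of dicts; the rule contents are invented examples, not Nubank’s actual checks:

```python
# One toy check per quality dimension: readability, accuracy, consistency.
def check_readability(table, documented_columns):
    # Readability: every column in the table is documented somewhere.
    cols = set().union(*(row.keys() for row in table))
    return cols <= set(documented_columns)

def check_accuracy(table):
    # Accuracy (example rule): no null primary keys, no negative balances.
    return all(row["customer_id"] is not None and row["balance"] >= 0
               for row in table)

def check_consistency(table):
    # Consistency (example rule): column names are lowercase, no spaces.
    cols = set().union(*(row.keys() for row in table))
    return all(c == c.lower() and " " not in c for c in cols)

table = [{"customer_id": 1, "balance": 10.0},
         {"customer_id": 2, "balance": 0.0}]
print(all([check_readability(table, ["customer_id", "balance"]),
           check_accuracy(table),
           check_consistency(table)]))  # True
```

The point of the core layer is that checks like these run once, centrally, instead of being re-done by every consumer of every table.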
There was a tradeoff between speed of delivery and data quality depending on the use case: quality was prioritized for regulatory reports, high-exposure data science models and strategic decisions, while speed was prioritized for ad-hoc analyses and internal-only reports.
Even when we had thoroughly-checked tables, it was not easy to distinguish them from the others. Moreover, quality-checked data can become obsolete over time if not maintained, and some quality tables were not coherent with the rest in terms of naming conventions.
Identifying quality and efficiency gains
It became clear that there were huge opportunities for efficiency gains: many of those quality-check tasks, repeated across hundreds of teams, could be done once centrally, leaving analysts and data scientists more time to focus on analytics and models rather than on searching for the right data.
As we realized that we could significantly increase the efficiency, consistency, accuracy and understandability of data analytics by coordinating data quality in a central team, the Analytics Productivity team was created at Nubank in January 2020.
Our goal was to bring this efficiency while preserving the analytical freedom of analysts across the company. Just as the autonomous goals of hundreds of teams should be coherent with the overall company goals, we envisioned that metrics, models and reports should be consistent across the company despite being generated by hundreds of independent microservices.
Part 3: Introducing a company-wide “core data” layer with a central team to coordinate its launch and sustainability
As the new Analytics Productivity squad took ownership of data quality for the company, it reorganized the data warehouse and introduced a set of rules (naming conventions, documentation requirements), processes (contribution and maintenance) and communication initiatives.
A new “core” layer in the data warehouse with some principles
The team created one “core” folder (i.e. layer, schema, container or namespace) to clearly differentiate quality-checked tables from the other open-access (read and write) folders.
We applied stricter rules to the “core” folder only, while keeping the rest of the data warehouse as a self-service, low-constraint read-write environment. This was critical so that teams could keep the freedom of data usage that has been a catalyst for Nubank’s growth.
We also acknowledge that different layers of data need to coexist, as not all tables are meant to be used by more than one use case or team. We aim for core data tables to be the quality-checked, reusable data source feeding more business-specific layers.
The main principles of the “core” layer are:
- The data should be accurate (or as accurate as possible) and maintained to stay accurate.
- The metrics should have coherent and consistent naming conventions and definitions across all tables in that layer.
- Any Nubanker should be able to find and use source-of-truth, “master” data for the main business objects and business processes.
- Tables are calculated only from raw data or from other core datasets, so that accuracy is maintained.
- Up-to-date documentation accompanies every table, so that consumers can quickly understand its content.
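Principles like these can be expressed as a machine-checkable contract on each table’s metadata. The sketch below is hypothetical (the spec fields and prefixes are invented for illustration), but it shows the idea of validating documentation, lineage and naming in one pass:

```python
# Hypothetical contract validation for a "core" table specification.
CORE_INPUT_PREFIXES = ("raw.", "core.")  # principle: built only from raw or core data

def validate_core_table(spec):
    """Return a list of principle violations; an empty list means the spec passes."""
    errors = []
    if not spec.get("description"):
        errors.append("missing up-to-date documentation")
    if not all(i.startswith(CORE_INPUT_PREFIXES) for i in spec["inputs"]):
        errors.append("inputs must be raw or core datasets")
    if not all(c == c.lower() and " " not in c for c in spec["columns"]):
        errors.append("column names must follow the naming convention")
    return errors

spec = {
    "name": "core.customers",
    "description": "One row per customer with the latest snapshot.",
    "inputs": ["raw.accounts", "core.transactions"],
    "columns": ["customer_id", "registered_at", "is_active"],
}
print(validate_core_table(spec))  # [] -> the spec satisfies the principles
```

Running such a validator on every proposed change keeps the principles from depending purely on reviewer discipline.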
Rules and processes to make it a sustainable reality
To make sure the principles were followed, we defined rules and enforced them technically: a dataset in the core layer can only be added, modified or deleted from a specific folder in our code repository, which has stricter contribution rules.
We then set up maintenance processes that monitor potential issues in the tables of the core folder, with a responsible owner to fix them.
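A minimal sketch of that maintenance monitoring, assuming each core table has an assigned owner and a freshness rule (the mapping, table names and thresholds here are all invented for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical ownership mapping: every core table has a responsible team.
OWNERS = {"core.customers": "analytics-productivity"}

def stale_tables(last_loaded, now, max_age=timedelta(days=1)):
    """Return (table, owner) pairs whose last successful load is too old."""
    return [(table, OWNERS.get(table, "unowned"))
            for table, loaded_at in last_loaded.items()
            if now - loaded_at > max_age]

now = datetime(2020, 6, 2)
last_loaded = {"core.customers": datetime(2020, 5, 30)}
print(stale_tables(last_loaded, now))  # [('core.customers', 'analytics-productivity')]
```

The key design choice is pairing every detected issue with an owner, so a stale or broken core table is never anonymous.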
Communication to involve the whole company
As we are tackling a growing need from our data consumers, we expect them to use our core datasets. We also expect teams to contribute when a dataset falls in their domain of expertise, or when they own tables that already fulfill our criteria and can be transferred to the core folder, while the central team verifies that the contribution principles are met.
With communication and training, we aim to boost contribution and consumption throughout the company, as well as to stay close to the needs of our data consumers.
We started the team with Analytics Engineers (AEs) and Business Analysts (BAs).
The AEs focused more on implementation: leveraging dimensional modeling concepts to design a robust architecture and tooling, execution performance, monitoring, and technical rule enforcement in our repository. They were also key in clarifying some intricate calculations to make rigorously sure they were correct.
The BAs focused more on staying close to business needs: understanding which data to add first, helping align on metric definitions when there was no clear consensus, and documenting the processes that generate this data.
They also integrated this core data concept with our data visualization tool, not only in the way data was organized in it but also by providing dashboards and insights based on those company-wide metrics; the considerable traction of those dashboards gave visibility to our project.
The main impact is on the productivity of getting and using data. Measuring the increase in productivity is a challenge in itself, however, as the exact time spent looking for data cannot be measured directly. Our approach so far is to use surveys of data consumers as a proxy.
A second impact is having metrics and definitions accepted as the “source of truth” by the whole company. We measure that by looking at how many teams use our core tables, and how many queries are run on our core tables versus their “competitors”, i.e. similar tables that are not quality-checked.
Finally, we also measure the performance of our tables, both in terms of execution (time to load) and in terms of issues that had to be fixed reactively.
Remember the example at the beginning of this post? We wanted the number of customers. One of the core datasets we created is a table at customer grain, with a snapshot of the most recent information about each customer. It has one column for each of the most common definitions, and documentation explaining exactly what each definition takes into account.
You can easily compare with other reports that use the same column, and if you are going to use that data for a recurring report, you can rely on it, since it is monitored and maintained.
This is much more straightforward and quicker than before, when you would first have had to ask around or compare multiple different calculations from other reports.
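As a sketch of that customer-grain table (the column names and input fields below are invented for illustration, not Nubank’s actual schema), each common definition becomes an explicit, documented boolean column, so analysts pick a column instead of re-deriving a filter:

```python
# Hypothetical builder for one row of a customer-grain snapshot table.
def customer_snapshot(events):
    """One row per customer; one column per common "customer" definition."""
    return {
        "customer_id": events["customer_id"],
        "is_registered": True,  # everyone in the table has registered
        "has_transacted": events["n_transactions"] > 0,
        "is_active_last_month": events["days_since_last_activity"] <= 30,
        "is_in_good_standing": (not events["blocked_for_fraud"]
                                and not events["in_default"]),
    }

row = customer_snapshot({"customer_id": 42, "n_transactions": 0,
                         "days_since_last_activity": 400,
                         "blocked_for_fraud": False, "in_default": False})
print(row)
```

Counting “customers” then reduces to summing the column that matches your definition, and two analysts who pick the same column are guaranteed comparable numbers.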
As with every platform that lets users consume and contribute with as few constraints as possible, success leads to such a large amount and variety of content that curating it becomes a key challenge in keeping the platform relevant and user-friendly.
The core data framework is a way to separate out one category of tables that follow a strict set of rules, and to implement processes making sure they are maintained over time and that contributions keep pace with our new data needs. Communicating across the company is also key to making sure the initiative is understood and known.
Some of the challenges we had while implementing were about:
- What are the steps to get to a comprehensive “core data” folder? How do we choose and prioritize tables to add to the folder? What is the critical mass of tables that should be a milestone?
- How do we assign and onboard maintenance owners for each table added to that folder? How can we detect new issues in a core dataset?
- How do we decide which columns should be shared across the company?
- What KPIs do we monitor? What does that entail for our data visualization tools?
We hope to share our answers to some of those questions in some upcoming articles!