As data analytics becomes more impactful, data teams grow larger and datasets more numerous. This brings new challenges of collaboration and efficiency, as well as quality and discoverability.
At Nubank, the largest digital bank in the world with 40 million clients, data is at the core of our business, from making automated credit underwriting decisions and sending personalized communications to generating regulatory reports. We have a self-service data platform where everyone can query any non-restricted data (such as non-personally-identifiable data), create tables and add them to our pipelines. In 8 years we grew to around 800 contributors who added 40K tables to our data pipelines.
In order to keep using data in an agile and productive way as data users and usages grow, we have found two factors to be key: team organization, and engineering standards that enforce good collaboration practices.
In terms of team organization, we created the Analytics Engineering role, a dedicated function for data strategy and management, in terms of quality, privacy, reliability, architecture and costs. The Analytics Engineers are distributed in most product and cross-functional teams. They work closely with a data platform team focused on infrastructure and tooling.
In terms of engineering standards, we have built technical processes and tools that enable us to make the organization thrive, empowering data contributors to work autonomously and productively, leveraging the work of others over time.
What are the engineering standards that enable us to work with data more efficiently and collaboratively?
In this article, we’ll walk you through how we use software engineering standards to:
- Part 1: Enable autonomous and structured contributions
- Part 2: Leverage team work and collaboration
- Part 3: Simplify the data lifecycle ownership
Part 1: Enabling autonomous and structured contributions
In order to scale our contribution process to the data pipelines, we tried to find the right balance between total freedom and a tedious framework. Today any trained Nubanker can create a table and add it to the pipeline in a couple of hours or less, nearly fully autonomously, with a streamlined peer review process. It enables us to iterate quickly on data work.
Standardised structured objects
Our whole data system is based on many standardised components in Scala. Every table is created via a specific standardised object with structured attributes.
A data table object in our system is currently composed of three main types of attributes:
- The query declaring the transformation
- The list of table inputs used in the query
- Structured metadata such as the name of the resulting dataset, the description of the dataset and of each column, the dataset owner, the clearances for data protection, which layer of quality this dataset belongs to, etc.
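The three types of attributes above can be sketched as a structured Scala object. This is an illustrative reconstruction, not Nubank's actual API; all names and fields here are hypothetical:

```scala
// Hypothetical sketch of a structured table object; names are illustrative.
final case class ColumnDoc(name: String, description: String)

final case class DatasetMetadata(
  name: String,            // name of the resulting dataset
  description: String,     // what the dataset contains
  owner: String,           // team or person responsible
  clearance: String,       // data-protection access level
  qualityLayer: String,    // e.g. "core" or "experimental"
  columns: Seq[ColumnDoc]  // per-column documentation
)

final case class DatasetDefinition(
  query: String,           // the transformation to run
  inputs: Seq[String],     // tables read by the query
  metadata: DatasetMetadata
)
```

Declaring tables as data like this is what lets tooling later validate metadata, configure access, and wire the pipeline automatically.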
The list of inputs allows us to automatically place the table at the right position in the pipeline’s dependency graph.
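Placing tables from their declared inputs amounts to a topological sort of the dependency graph. A minimal sketch, assuming the declared inputs form a DAG (no cycles) and using hypothetical names:

```scala
// Given each table's declared inputs, compute a valid execution order:
// every table runs only after all of its inputs have run.
def executionOrder(inputsOf: Map[String, Seq[String]]): Seq[String] = {
  val visited = scala.collection.mutable.LinkedHashSet.empty[String]
  def visit(table: String): Unit =
    if (!visited.contains(table)) {
      inputsOf.getOrElse(table, Nil).foreach(visit) // dependencies first
      visited += table
    }
  inputsOf.keys.foreach(visit)
  visited.toSeq
}
```

Because every dependency is declared in the table object itself, contributors never schedule anything by hand.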
As for the metadata, depending on the quality layer of the table, all or only part of the metadata attributes are required. This ensures good documentation for high-quality datasets, such as core datasets, while limiting friction when creating experimental ones. The required clearance metadata also lets us easily configure granular access management.
Naming conventions, metrics reusability and files organization
In order to facilitate data discovery and encourage metrics coherence and consistency, we implemented naming conventions not only for the tables themselves but also for the metrics.
We also created frameworks that enable us to define a metric once (both its calculation code and metadata) and re-use it in multiple datasets.
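A define-once metric framework along these lines can be sketched as follows; this is a simplified illustration with hypothetical names, not the actual internal framework:

```scala
// A metric bundles its calculation and its metadata in one place,
// so multiple datasets can reuse the same definition.
final case class Metric[A](
  name: String,
  description: String,
  compute: Seq[A] => Double
)

final case class Purchase(amountCents: Long, approved: Boolean)

// Defined once, reused by every dataset that reports approval rates.
val approvalRate = Metric[Purchase](
  name = "approval_rate",
  description = "Share of purchases that were approved",
  compute = rows =>
    if (rows.isEmpty) 0.0
    else rows.count(_.approved).toDouble / rows.size
)
```

With the calculation and metadata centralized, a change to a metric's definition propagates to every dataset that uses it.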
Finally, we defined a specific file structure organization in our repository to make sure that code discovery is easy, that all the code that needs to be executed by our data pipeline is at the right place, and that we have the right accesses and meet the requirements for change management.
Quality Testing Process
In order to enable anyone to write any query and add it to the pipeline, we need to make sure quality checks are in place.
We test quality at two levels: the transformations of the query itself (unit tests) and the general fit with the rest of the system (integration tests).
First, data contributors write multiple small unit tests checking that the transformations produce the expected results.
We write our queries in Scala, which enables us to divide our transformations into small functions that can (and should) be unit tested. In order to test Spark transformations, we use Holden Karau’s spark-testing-base library, which gives us the base classes necessary to run Spark queries in tests. A unit test usually looks like this:
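As a simplified sketch (in the real tests the transformation operates on Spark DataFrames and the suite extends spark-testing-base's base classes; here the same idea is shown on plain Scala collections, with illustrative names):

```scala
// The row type and transformation under test.
final case class Account(id: String, balanceCents: Long)

// A small, pure function: keep only accounts with a positive balance.
def positiveBalances(accounts: Seq[Account]): Seq[Account] =
  accounts.filter(_.balanceCents > 0)

// A unit test pairs a small hand-written input with the expected output.
def testPositiveBalances(): Unit = {
  val input    = Seq(Account("a", 100), Account("b", -50), Account("c", 0))
  val expected = Seq(Account("a", 100))
  assert(positiveBalances(input) == expected)
}
```

Breaking transformations into small pure functions like this is what makes them testable without standing up the whole pipeline.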
Second, we need to check that if we add this table to the data pipeline, it fits well with the rest of the system and nothing else is affected negatively. These are our integration tests. To publish a new table or change an existing one, all the integration tests must pass. They check that all the inputs exist, whether another table or a column of a table, and that no inconsistencies surface at compile time. These integration tests are written by the central platform team and don’t need to be rewritten for each new table or change.
Part 2: Leveraging team work and collaboration
Knowledge sharing in code reviews
The peer review finalizes the process of contributing to our data pipeline. It helps make sure the code is of good quality and follows best practices, gives the contributor practical asynchronous feedback, and trains the reviewer to explain their feedback. Overall, knowledge spreads smoothly through these reviews.
There are two reviews to add a new object to our data repository: one to check the business logic, and one to check the Scala, Spark and testing best practices.
Analytics Engineers and other senior contributors take turns splitting the technical review responsibilities. A good part of the review focuses on making sure the code is clear and readable, broken down into small functions with clear unit tests. We are usually able to review every pull request within one day.
Reusing code with libraries
Another standard for working on large engineering projects is to make sure contributors build upon what others have already done, so they can be more productive and focus on the new additions.
One of the reasons we chose to have our data codebase in Scala is to empower data users to create and build upon libraries containing functions developed by others in the past. Teams use, create and contribute to libraries in order to reuse the same structures and definitions across multiple datasets, as well as to generate series of similar datasets programmatically and add them to the ETL pipeline.
For example, our Experimentation Platform automatically generates datasets with many metrics for each A/B test so decisions can be made really quickly without an analytics bottleneck.
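Programmatic generation of this kind can be sketched as mapping a shared template over a list of experiments. The names and query shape below are hypothetical, for illustration only:

```scala
// Hypothetical sketch: one generated dataset definition per A/B test,
// produced from a shared template instead of being written by hand.
final case class GeneratedDataset(name: String, query: String)

def experimentDatasets(experimentIds: Seq[String]): Seq[GeneratedDataset] =
  experimentIds.map { id =>
    GeneratedDataset(
      name  = s"experiment_metrics_$id",
      query = s"SELECT * FROM experiment_events WHERE experiment_id = '$id'"
    )
  }
```

A new experiment then gets its metrics dataset automatically, with no per-experiment analytics work.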
This increases the velocity of delivering new datasets and also simplifies maintainability and consistency, since centralized logic is maintained once for multiple datasets. Overall, it increases the efficiency and quality of analytics.
Documentation and discoverability
A key part of collaborating effectively is enabling anyone to find the right information easily. We work on having documentation available at different levels: on the contribution process and our data platform, on the goal and owners of datasets, and on all the columns and metrics of each table.
We are building internal tools and dashboards to make this documentation and metadata easily discoverable and browsable so that Nubankers can be most effective with data work. One example is Compass, our internal data search engine that gets information from the metadata we generate.
Part 3: Simplifying data lifecycle ownership
Monitoring using metadata
We need to make sure the system works as a whole. Monitoring allows us to set data governance plans and OKRs, and to make sure everything stays consistent and of good quality over time.
Overall, we monitor the quality of the data generated by each query, the optimization of the query, the user experience, and governance aspects around data ownership, lineage dependencies and access restrictions. Our monitoring is based on the metadata declared in our structured objects as well as automatically calculated metrics such as row count, the average value of a column, the time a table became available, or the time to execute a query.
We surface this information on dashboards, and alerts triggered by criticality rules send a message on Slack directly to the responsible people.
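A criticality rule of this kind can be sketched as a check of a computed metric against an expected range; this is a toy illustration with invented names, not the production alerting system:

```scala
// Illustrative monitoring sketch: compare an automatically computed metric
// (here, row count) against an expected minimum and alert the owner if it fails.
final case class HealthCheck(dataset: String, owner: String,
                             rowCount: Long, minExpected: Long)

def alerts(checks: Seq[HealthCheck]): Seq[String] =
  checks.collect {
    case c if c.rowCount < c.minExpected =>
      s"@${c.owner}: ${c.dataset} has ${c.rowCount} rows, expected at least ${c.minExpected}"
  }
```

Routing each alert to the declared owner is possible precisely because ownership is part of every table's structured metadata.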
Maintaining datasets over time
Users usually expect tables in the data pipeline to stay correct over time. However, iterative product innovation often comes with changes in the data collected or its structure. Maintenance needs can come up proactively, because of new product developments and service refactoring, or reactively, through alerts from our monitoring.
We need to make sure that when products evolve, we have the right tools to maintain the tables impacted downstream.
Like in software engineering, documentation, unit tests, integration tests and monitoring checks are guardrails that ensure a dataset’s calculation can be updated by anyone without causing issues or changing the main goal defined by the table’s initial creator. Even though we have owners for all our data objects, this system gives us operational efficiency: anyone can take turns fixing data issues, relying on the safety provided by the guardrails.
Thanks to metadata used to classify tables into different categories (core datasets, critical datasets, model inputs, etc.), we have different expectations regarding how fast an issue should be solved and who should validate the changes.
User-defined data transformations as well as shared libraries are managed in a centralized git repository. Since we are using git, we gain versioning capabilities as well as simple interfaces for peer reviews, which means we have a traceable history of changes in a dataset definition. With this history, we can roll back changes if we need to, understand how the data was generated in the past, or even run an older version of our data transformations. This versioning capability has proven to be one of the most important parts of our system, both for rolling back easily when things go wrong and for auditing purposes.
More and more companies are investing in self-service platforms in order to remove the bottleneck of a single data team. While these platforms definitely foster a company-wide data culture, they also bring many governance challenges in practice. A people organization with clear responsibilities and accountabilities is key to tackling those challenges, and data teams can be all the more impactful with tools and practices that enable them to collaborate efficiently.
Creating a modular platform with structured objects, naming conventions, libraries and testing practices; enforcing peer review and documentation; and making sure we could monitor, maintain and roll back changes were all crucial to scaling our data team and seizing opportunities.
Data has been playing a key role in Nubank’s growth, and we probably wouldn’t have been able to achieve so much so fast without this system.
Of course, investing in such tooling comes at a cost: time invested at the start to design and implement the system, more time for new joiners to ramp up, and a marginal increase in time spent on some extra requirements (testing, documentation, review) when contributing. We found this investment was quickly worth it as we grew our teams very fast (from 0 to 800 data platform contributors in 8 years). We would not have been able to absorb so many individual contributions so rapidly without breaking the system, and it would have been much more complicated to monitor quality.
Empowering users to innovate both autonomously and collaboratively is key for the success of data teams. Finding the right amount of tooling and requirements in order to give the maximum flexibility to users while having governance guardrails is critical to scale analytics efficiently and confidently, and there is a lot to learn from software engineering best practices.