What does a Data Analyst do, exactly?
This is a common question for us, and our answer frequently surprises people. At some companies, Data Analysts build Business Intelligence (BI) dashboards. At other companies, Data Analysts pull data and ‘run queries’ for others. That’s not what we do.
We are Nubank’s Data Analyst chapter, and we’ve come to realize that our name poorly communicates what we do, both internally and externally. Our goal is to make the whole company more productive with data by applying established software engineering principles and data modeling techniques to all of Nubank’s business domains.
For Nubank, data is crucial. Unsurprisingly for a modern fintech company, we use data everywhere. For example, one important use case is automatic decision making for underwriting and credit limits using machine learning (ML) models.
Not only do we use data for decision making; we also have regulatory reporting obligations to authorities like central banks and anti-money laundering entities, all while respecting data protection laws in every market where we operate.
We as Data Analysts are all allocated to squads, Nubank’s autonomous and multidisciplinary teams. As of April 2020, we have nearly a hundred squads, such as Acquisition, Billing, Lending, etc. A Data Analyst works to make other members of their squad more productive with data.
In addition to that, we free the rest of the squad from working on heavy data engineering tasks so they can focus on their specialties. Besides the work with the squad, all Data Analysts spend Fridays together to work on horizontal projects, one example of which is enabling cross-company data integration.
To understand better how we do that, we’ll first explain the data landscape pre-Data Analyst chapter. Second, we’ll introduce the Data Analyst chapter and show the focus and scopes of the chapter, comparing it to chapters that already existed. At last, we’ll evaluate the appropriateness of our chapter name both inside Nubank and outside, and consider alternatives.
Data landscape at Nubank pre-2019
The backend database you’ll typically find at Nubank is called Datomic, a rather uncommon technology. Nubank chose to use Datomic very early on because we believe this database has superpowers. One of these superpowers is particularly valuable for analytical purposes: the fact that Datomic is accumulate-only. That means that, under normal conditions, the database only accumulates new data and never forgets (deletes or modifies) old data.
This is different from how people generally use SQL databases: in Datomic you only INSERT new rows — you can’t UPDATE or DELETE. As a result, Nubank’s analysts and ML models have access to a wealth of historical data.
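To make the accumulate-only idea concrete, here is a minimal Python sketch of the concept — not Datomic’s actual API, and the entity and attribute names are made up. New facts supersede old ones without erasing them, so both the current value and the full history stay queryable:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    entity: str     # which entity the fact is about
    attribute: str  # e.g. "credit-limit"
    value: int
    tx: int         # monotonically increasing transaction id

# The log only grows: a limit change is a NEW fact, not an UPDATE.
log = [
    Fact("customer-1", "credit-limit", 1000, tx=1),
    Fact("customer-1", "credit-limit", 2500, tx=7),
]

def value_as_of(log, entity, attribute, tx):
    """Return the latest value at or before transaction `tx`."""
    matching = [f for f in log
                if f.entity == entity and f.attribute == attribute and f.tx <= tx]
    return max(matching, key=lambda f: f.tx).value if matching else None

# History is never lost: old and new values are both recoverable.
assert value_as_of(log, "customer-1", "credit-limit", tx=3) == 1000
assert value_as_of(log, "customer-1", "credit-limit", tx=9) == 2500
```

This is exactly the property that makes the database valuable for analytics: a question like "what was this customer’s limit last quarter?" is answerable without any extra history-keeping machinery.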
For a long time, from its beginning in 2014, Nubank managed to function well without a separate analytical system or a specialized data team. In other words, Business Analysts, Data Scientists, and other data consumers pulled the data they needed themselves: they queried Datomic directly and ran the analysis on their own machines.
Besides the occasional muttering about Datomic’s alien query language Datalog (most people are used to the far more common SQL), this ‘analytics pipeline’ (or the lack of it) worked well initially. What increasingly did not work well was the declining query performance caused by a rapidly growing customer base.
Furthermore, the increase in data-hungry headcount added fuel to the fire. Analysts resorted to clever tricks like breaking their queries into pieces and concatenating the results on their machines. That works, but not for long.
It’s safe to say that queries to our largest databases really started grinding to a halt in 2016. We realized we couldn’t continue this way and needed our data available in a database that specifically caters to analytical workloads.
At the same time, we disliked the idea of building a myriad of custom data pipelines and staffing data engineering teams to maintain them—commonplace necessities for attacking these kinds of data access problems. We didn’t need that, though, since Nubank’s universal use of Datomic allowed us to implement only one generic type of data extraction pipeline. Within a month, every service was connected using this pipeline and the source data started flowing into our data platform.
We decided to build a self-service data platform. At Nubank we like to invest in platforms and abstractions developed and maintained by specialists in horizontal teams, so that generalists in vertical teams can iterate fast at a high level of abstraction.
For example, Engineers are empowered to deploy their own software. We don’t have a DevOps team that takes care of this type of work. Similarly, our self-service data platform enabled people to create their own datasets (materialized views) on top of source data, and others can build new datasets on top of those, and so forth. We only really needed a small specialized infrastructure team to take care of the platform that processes the growing amounts of data.
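The "datasets on top of datasets" idea can be sketched as a small dependency graph that the platform resolves and materializes. This is a hypothetical illustration in Python, not our actual platform API; the decorator, dataset names, and registry are all made up:

```python
# Registry mapping dataset name -> (builder function, upstream dependencies)
DATASETS = {}

def dataset(name, depends_on=()):
    """Register a dataset builder, recording which datasets it reads from."""
    def register(fn):
        DATASETS[name] = (fn, depends_on)
        return fn
    return register

def materialize(name):
    """Build a dataset by first materializing everything it depends on."""
    fn, deps = DATASETS[name]
    return fn(*[materialize(d) for d in deps])

@dataset("purchases")
def purchases():
    # A 'source' dataset: in reality this would read extracted data.
    return [{"customer": "c1", "amount": 30}, {"customer": "c1", "amount": 70}]

@dataset("spend_per_customer", depends_on=("purchases",))
def spend_per_customer(purchases):
    # A derived dataset built on top of another dataset.
    totals = {}
    for row in purchases:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals

assert materialize("spend_per_customer") == {"c1": 100}
```

The design point is that contributors only declare what their dataset depends on; the platform owns scheduling and materialization, which is what lets generalists iterate without a data engineering team in the loop.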
The self-service data platform has a typical modern ETL architecture. We extract all the data from Datomic and save it to our data lake on a cloud block storage service. Then, we transform it from a Datomic log to a relational table (using what we call ‘contracts’ — see ‘Data extraction and decision making’ in this article). Finally, we load it into an analytical database that everyone in the company can access — from customer excellence to executives.
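As a rough illustration of the transform step, the sketch below pivots an append-only log into the one-row-per-entity relational shape analysts expect. The tuple layout and names are assumptions for illustration; Nubank’s actual ‘contracts’ are more involved:

```python
def datoms_to_table(datoms):
    """Pivot an append-only (entity, attribute, value, tx) log into a
    relational table: one row per entity, one column per attribute,
    keeping the latest value for each (entity, attribute) pair."""
    rows = {}
    for entity, attribute, value, _tx in sorted(datoms, key=lambda d: d[3]):
        rows.setdefault(entity, {"id": entity})[attribute] = value
    return list(rows.values())

datoms = [
    ("cust-1", "name", "Ana", 1),
    ("cust-2", "name", "Bruno", 2),
    ("cust-1", "limit", 2500, 5),
]
table = datoms_to_table(datoms)
assert {"id": "cust-1", "name": "Ana", "limit": 2500} in table
assert {"id": "cust-2", "name": "Bruno"} in table
```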
The platform was a success, at least in terms of rapid adoption. People were happy that they could access all data (again) and started contributing datasets to the platform (see our talk on the São Paulo DS&ML meetup for details). We also clearly improved in terms of query performance.
But over time, as more and more new datasets were contributed, the data lake was becoming messy. Analysts got confused by the large number of datasets with subtle differences and no clear indication of which one to use for their work. At the same time, there was no incentive for data platform users to invest in modeling and reuse, which would organize the mess. Sadly but unsurprisingly, the data in our data platform was becoming a ‘big ball of mud’.
The company started suffering from more and more conflicting or ambiguous data definitions. At the same time, we didn’t know who should carry the responsibility of focusing on these issues. Analysts aren’t expected to study modeling best practices, and engineers generally focus on the transactional, not the analytical side. We found a gap: nobody was focused on governing Nubank’s data.
In the next plot, we roughly model the relative differences of how four data-related chapters (Business Analysts, Data Scientists, Machine Learning Engineers, and Software Engineers) invested their energy in five selected data-related scopes before 2019. So, the further the colored polygon extends from the center, the more energy they spent on that scope.
The plot shows that no chapter took on Analytics & Reporting Data Pipelines or Data Governance & Dimensional Modeling as their primary focus. This is the gap.
Introducing a specialized data role
In an attempt to address the gap above, we decided to introduce a new specialized data role at Nubank.
We called that role Data Analyst, given that certain engineers already working on our data infrastructure identified themselves as data engineers, and that we had found data analyst job openings at other companies that resembled the role we were looking for. The first members of this new chapter joined Nubank in October 2018.
As we scaled the role and learned how to best add value in a strategic way, the identity of this new chapter evolved. Our data analysts act as multipliers, helping their squad to improve their data literacy and the design of data workflows. They promote best practices in data engineering, data modeling, and data governance.
Data analysts also spend 20% of their time working together on strategic data projects that impact the entire company. This recurrent time allocated away from the squad is uncommon at Nubank but essential for us Data Analysts, given that one of our goals is to achieve company-wide horizontal (across squads) data integration.
Below we describe a few examples of initiatives that data analysts at Nubank have been involved with so far. The first two are squad-specific projects; the third is about the aforementioned horizontal data integration, through an initiative we call ‘Core Datasets’.
Automatic data reconciliation
As stated in “Microservices at Nubank, An Overview”, one of our problems is to detect and respond to data value changes in a timely manner. Additionally, when stitching together distributed data from different microservices it can be tricky to notice that values are diverging or inconsistent.
As part of a team that is closely related to Controllership, some Data Analysts built an automatic reconciliation system to solve these issues. The team took inspiration from traditional software testing, which is categorized by unit tests and integration tests. The reconciliation system checks the data lake every day and ensures invariants for our distributed system.
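A minimal sketch of what one such daily invariant check might look like — the service names, shapes, and tolerance are illustrative assumptions, not Nubank’s actual reconciliation system:

```python
def reconcile_balances(ledger_total, card_service_total, tolerance=0):
    """Cross-service invariant, in the spirit of an integration test:
    the ledger and the card service must agree on the total
    outstanding balance within `tolerance`."""
    diff = abs(ledger_total - card_service_total)
    status = "ok" if diff <= tolerance else "divergent"
    return {"status": status, "difference": diff}

# Run daily against the data lake; a divergence triggers an alert.
assert reconcile_balances(10_000, 10_000)["status"] == "ok"
assert reconcile_balances(10_000, 9_900)["status"] == "divergent"
```

The analogy to software testing carries over: a check on one service’s internal consistency plays the role of a unit test, while a check that stitches values from two microservices plays the role of an integration test.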
Continuous auditing

In a similar vein to our automatic reconciliation project above, some Data Analysts in Nubank’s Internal Audit team built a futuristic Continuous Auditing system. Acting as an automated last line of defense, it runs a battery of automatic queries every day. When a check triggers, the system sends its results to another system that assists the auditors in following up on the alerts.
This automation removes the need to hire a large team of auditors, which is typically inevitable at large enterprises. The Data Analysts recently succeeded in teaching the other, non-technical auditors to contribute their own automatic checks to the platform without help.
Data governance/Core datasets
One of the main data governance initiatives the chapter kicked off in early 2020 is the design of core datasets. Core datasets will offer a better experience for data users at Nubank, an alternative to the messy data lake we have today. The ‘core’ label is like a stamp of approval: it reduces cognitive overhead for analysts when they’re looking for the data to use.
The stamp guarantees four things:
- It means that it’s the canonical dataset to use for that grain (what one row represents), made possible through aligning relevant stakeholders for that data across the company.
- There is a team ensuring the long-term stability of the dataset (e.g., to prevent breakage when source systems are refactored—this happens frequently and could break analysis if not responded to carefully).
- We are actively monitoring for anomalies and warn the user if anything’s wrong with the core dataset — before the user finds out.
- We’re meticulous about having consistent names and logic for column names across all core datasets. We rely on Dimensional Modeling techniques and are actively developing tooling (mostly related to conformed dimensions) to help us in this regard. We’ll go in-depth on our star schemas in a later post (we denormalize them).
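As one small illustration of the anomaly-monitoring guarantee above, a check along these lines can catch the typical symptom of an upstream refactor breaking a feed. This is a hypothetical sketch, not our actual tooling; the threshold and window are made-up parameters:

```python
def anomaly_alert(row_counts, drop_threshold=0.5):
    """Warn if today's row count dropped sharply versus the recent
    average — a common symptom of a source-system refactor silently
    breaking a dataset's input feed."""
    *history, today = row_counts
    baseline = sum(history) / len(history)
    return today < baseline * drop_threshold

# Daily row counts for a core dataset, oldest first, today last:
assert anomaly_alert([100, 102, 98, 101, 45]) is True    # sharp drop: alert
assert anomaly_alert([100, 102, 98, 101, 99]) is False   # normal: no alert
```

The point of the guarantee is who gets paged: the owning team is warned and investigates before a downstream analyst ever queries the broken dataset.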
The following quote from the excellent article “The Downfall of the Data Engineer”, from Maxime Beauchemin, captures the context nicely:
“The data warehouse needs to reflect the business, and the business should have clarity on how it thinks about analytics. Conflicting nomenclature and inconsistent data across different namespaces, or “data marts” are problematic. If you want to build trust in a way that supports decision-making, you need a minimum of consistency and alignment.”
We recognize the enormous challenge of aligning the company on the subject of definitions, especially given that little such effort had been made at Nubank before. We were fortunate, though, to find buy-in from data consumers in other chapters ever since we announced our plans. People are, after all, increasingly overwhelmed by the chaos in our data lake and are starting to recognize the value of standardization.
As a result of the hard work of the Data Analysts and stakeholders, we are now shipping the first core datasets. They are leaving ‘alpha’ status and are becoming available to stakeholders in the company — replacing legacy implementations.
To be clear, what is known today as ‘core datasets’ was only a vague ambition when we started the chapter. As of late March 2020, the ambition became reality: we’ve shipped our first core dataset and created a plan for the coming quarter. In that plan, our chapter is the conductor that seeks consensus and orchestrates ownership of core datasets to knowledgeable stakeholders.
Over time, we discovered how we want to address the data governance gap at Nubank. In addition to the core datasets, we are also focussing on data classification and personal data minimization. We believe our current name doesn’t reflect those scopes.
As the next plot shows, Nubank filled the gap with the efforts of its Data Analysts. The Data Analyst role now covers the previously uncovered scopes (Analytics & Reporting Data Pipelines and Data Governance & Dimensional Modeling).
Enter the Analytics Engineer role at Nubank
Over the course of 2019, the Data Analyst chapter grew from 5 to 25 members. There are now data analysts working on a number of squads throughout Nubank, collaborating with business analysts, engineers, data scientists, machine learning engineers, etc. The chapter reached a large enough scale and coverage that we can tackle company-wide data initiatives. But with the scale, the need for more role clarity when interfacing with other functions increased.
Given the name, it is perfectly natural to assume that data analysts are supposed to focus on… well, analyzing data! Except that analyzing data is not a core expectation of this role. As described above, the role focuses on making analytics more productive through data governance, enabled by engineering. So we decided to search for a new name that better describes this role.
From the beginning, the Data Analyst role at Nubank had a strong connection with engineering. To use a software development concept, we essentially ‘forked’ Nubank’s engineering career development framework, adapting some of the expectations while keeping most of them the same. It felt appropriate that the role name should reflect its engineering focus.
Looking at how the industry has been describing specialized data roles lately, we ended up with two main contenders: Data Engineer and Analytics Engineer. There are recent posts describing roles very similar in spirit to the role described here using either data engineer or analytics engineer.
One interesting example of a company having defined both roles is Spotify, where data engineers seem to focus more on lower-level engineering challenges whereas analytics engineers are closer to business domains in line with the role described here.
One advantage of the data engineer term is that it is more widely used and recognized in the industry. However, what a data engineer does varies widely between companies. The more recent analytics engineer term, on the other hand, has been used far less ambiguously. We opted to optimize for clarity:
Nubank Data Analysts are now called Analytics Engineers.
The chapter unanimously agrees that the new name better fits the work we do. At the same time, changing our name is a bit of a gamble, especially for hiring: we were fortunate enough to hire an awesome team of Analytics Engineers with our job posting for Data Analysts.
In other words, would we have found the same people if we had used the name Analytics Engineer in the past? As always, we’ll be closely examining our hiring process to make sure we’re still attracting the right people with our new name.
We’ll also have to work hard on internal and external communication, making sure that everyone at Nubank knows what an Analytics Engineer does. Up until now, we haven’t invested much in spreading role clarity because we wanted to figure out what our ideal scope should look like. We’re much more confident about that today and ready to start spreading the word.