The value of canonicity

What happens when we run an engineering organization by constraining the number of tools in our toolbox “for the greater good”?

A dirt road bordered by blooming flowers.

When people ask about Nubank’s technology stack, the answer is quite short. We use the same few technologies for most of our backend systems: Clojure for production services, Kafka for asynchronous communication, Datomic as our database for high value business data, Scala for our analytical environment, and Flutter for our mobile app.

After seven years of building products, now with more than 600 engineers, one could ask how we settled on these specific technologies. However, the question we want to answer here is:

What happens when we run an engineering organization by constraining the number of tools in our toolbox “for the greater good”?

More specifically, if we prefer having less variance on our choice and usage of technologies, favoring canonical approaches, does that result in a more efficient engineering organization?

What is technology variance? Is it all bad?

“Having less variance” may be more precisely worded as “avoiding non-essential variance”. That is, we want to incentivize change when a new situation compels us to use different technologies that provide better value for a given task (i.e., essential variance). In contrast, we don’t want to pick a new technology just for the sake of using a different one (i.e., non-essential variance). Or, to put it another way, we prefer having canonical ways of doing things.

When prompted with this “fewer technologies” theme, some engineers could react with: “Would you agree to continue using an old language, like COBOL, for the rest of your life?” There’s a fair point here – in many situations, limiting options can feel, well, limiting, especially when the benefits are longer-term and diluted across the organization.

But preferring less variance isn’t the same as avoiding evolution or being unwilling to consider alternatives. It means that in the trade-off spectrum from “COBOL for the rest of life” to “shiny new tech every week”, we lean towards the side of solving similar problems consistently across the whole org (perhaps more towards the COBOL side in this analogy).

“Choosing the COBOL side” is a curious phrase for 2020, but it’s crucial to unpack what this means in practice. We want to use an excellent tool for every job, so if what we have now fits the purpose and alternatives have no clear benefit, we prefer not adding new technologies.

If that’s not the case, we’re not afraid to improve our toolbox and start using a new tool. For example, when building the first version of our ETL jobs, we intentionally decided to use Spark with Scala instead of Clojure. Or when starting to develop services involving Machine Learning, we chose Python, again instead of Clojure.

We saw them as significantly better tools for those situations in the context of the time. Those were strategic and thoughtful decisions, and that’s what we aspire to every time we think of deviating in our usage of technologies.

Inner variance

Another relevant point is that variance is not only present when we pick a brand new technology (e.g., new database or framework). It’s also about how we use the ones we already have. If you pick a programming language, for example, you can see code style, features, frameworks, and libraries as sources of variance. A lot of them are so flexible that you can make two pieces of code written with the same technology look alien to one another. Most of these differences occur organically, especially with a growing number of engineers, as it’s natural for opinions to diverge on how to use a technology. That is why avoiding non-essential variance is necessary not only in a higher level sense (e.g., programming languages, databases) but also at an inner level (e.g., code style, language features) when striving for canonicity at an organizational level.

At Nubank, looking closer at our usage of technologies, we see a high level of consistency in how we use them.

For example, if we select two random production Clojure codebases, code structure (files and folders) for both would be quite similar. They’d likely use the same libraries and frameworks. And, finally, there’s the Clojure language itself. It is flexible, and programmers can use it in different ways (hello macros!), but it’s also simple and encourages canonical approaches to common problems. Looking at these two random services, you would probably think that the same team wrote both pieces of code.

Standardizing usage is not easy to achieve or maintain – it requires intentionality, pervasive code reviewing, senior oversight, and automation (e.g., templates for new services), to name a few.

Although intuitively all this consistency and homogeneity in choice and usage of technologies may sound compelling, the real benefits may not yet be apparent. Why bother with this?

Dependency mitigation

Nubank divides itself into groups of small cross-functional teams, and we aim for them to be self-sufficient when building new things with as few dependencies as possible on others. A team should be able to code, test, and deploy something to production without waiting (i.e., depending) on another team to do some work (e.g., implement features, run pipelines, create new infrastructure). We’ve managed to eliminate many dependencies by relentlessly automating things, but with an ever-growing number of systems, teams inevitably start to specialize and own small portions of the codebase. In that context, engineers will, sooner or later, face tasks that require working outside of their team’s domain and codebase. One typical occurrence of this is when one group needs to access data in a service owned by another team (that is not already exposed by a pre-existing API). A possible result would be for the engineers to depend (i.e., wait) on the other team to create a new endpoint for them to access the data.

We’ve managed to heavily mitigate this kind of dependency by following a simple principle: All software at Nubank should be open for collaboration, which means that any engineer should be capable of and allowed to propose a change to any service. Of course, it’s still a good practice to reach out in advance and align with the service owners. Contributions are reviewed, approved, and merged by the owners, who remain accountable for the service’s reliable operation. In practice, this works better than waiting for owners to change their priorities to accommodate a demand from a separate team. Thus, any engineer has the ability to create the new endpoint themselves.

That’s great in theory, but how easy can it be to code in a foreign codebase that you discovered 5 minutes prior? It could be written in a different programming language to which you’re unaccustomed. Or it could use a new NoSQL database that came out last year that you only read about on Hacker News. Or it could use the same technologies you’re used to but in a significantly different way (e.g., OO-ish code instead of functional). Any of these possibilities can create technological barriers to collaboration.

That’s when having fewer items in our toolbox helps a lot. At Nubank, there’s a very high chance that your team and the foreign one both use Clojure for backend services. And that both use Kafka. And that both use Datomic. And that both have similar code styles.

Therefore, you can focus on understanding the domain, the business problem to be solved, the status quo of the codebase, and how it needs to be evolved.

This ease of changing foreign services can cause issues (often, similar to open source, when contributions aren’t heading in the desired direction of evolution of the codebase), but code reviewing and over-communication are effective guard-rails to avoid problems. After all, we’re not entirely eliminating the dependency between teams, we’re merely streamlining the practical resolution of a potential blocker. We’re keeping our dependencies explicit between services rather than between cards on different backlogs. We shifted that dependency towards lighter processes: alignment and code reviewing.

At Nubank, this collaborative nature of code has been essential for teams to stay in their flow as much as possible. It also helped create a company-wide code ownership mindset that has been easy to evolve with our org structure (somewhat mitigating Conway’s Law). All that was hugely enabled by the fact that we didn’t have technological barriers between teams.

Moving between teams is less painful

In the hyper-growth that Nubank has experienced over the last few years, one thing has been evident: priorities will shift. One consequence is that we often need to create new teams or change the current ones, which inherently involves moving engineers.

What happens, then, when an engineer moves to a different team? Aside from getting used to new people and dynamics, the main challenge is that they need to learn a new technical and business domain: What is the product? Who is the customer? What services do we have? What technologies do we use?

We can’t say that we were able to make this transition irrelevant or a nonissue. But if an engineer from one group can quickly go to another team’s service, understand the code, and propose changes, it probably means that they can also understand a new team’s technical context much faster.

Although we still want to avoid thrashing people between groups and contexts too fast, over the years, engineers have been able to quickly move without worrying too much about learning new technologies. They can focus on understanding the new team’s specific business domain (which is no small feat). The ease of moving engineers gives us more flexibility to allocate engineers to the highest priorities or to better places for them to grow in their careers, without overly disrupting productivity.

High leverage technical improvements

At scale everything eventually breaks, and we want to fix each thing, preferably, only once. Also, at scale, even small improvements to engineering productivity can have massive impacts.

Let’s say you have a company with five engineers, who are all using Java-based services. If you decide to improve their lives by, for instance, putting a linter on the build pipeline, you can be sure that you improved productivity for all of your engineers. Now imagine that alongside your initial team of five engineers, you have another five who use Clojure instead of Java. Investing time in setting up that Java linter will not benefit the Clojure engineers.

Every “virtual split” you have (e.g., programming language, frameworks, databases) means that you’re diminishing your impact when improving tooling. The usage of what you’re changing limits the impact radius of your improvement. If you magnify the previous example to 200, or 1000 engineers, the differences become more evident. Peter Seibel has put it concisely:

“Once your engineering org gets to be a certain size the benefits you can obtain by investing in making all your engineers slightly more productive start to swamp the slight gains that one team might get from doing things their own, slightly different way.”
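The arithmetic behind this “impact radius” can be made concrete with a small sketch. The head counts and the `reach` helper below are illustrative assumptions, not Nubank figures; they only show how the share of engineers reached by a single tooling investment shrinks as stacks split.

```python
def reach(engineers_by_stack: dict[str, int], improved_stack: str) -> float:
    """Fraction of all engineers who benefit from improving one stack's tooling."""
    total = sum(engineers_by_stack.values())
    if total == 0:
        return 0.0
    return engineers_by_stack.get(improved_stack, 0) / total


# Hypothetical organizations, mirroring the Java/Clojure example above.
single_stack = {"java": 5}
split_stack = {"java": 5, "clojure": 5}

print(reach(single_stack, "java"))  # 1.0: the linter helps everyone
print(reach(split_stack, "java"))   # 0.5: half of the engineers see no benefit
```

The same ratio applies at 200 or 1000 engineers; only the absolute number of people left out of each improvement grows.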

When we look at horizontal teams (e.g., infrastructure, engineering productivity, security), the situation is even more apparent. Every “virtual split” means you’re putting more work into their backlogs because improvements and fixes to one technology do not necessarily translate trivially to another.

A concrete example of this is service security. We have implemented it for Clojure services, and every new security-related improvement or fix can be easily rolled out to all of our services. If we had, for example, some Node.js services, we would first need to re-implement all the security logic on this new platform. Then, every time we wanted an improvement, the InfoSec team would need to implement it in both Clojure (JVM-based) code and Node.js code.

Another way to see this is through our common libraries. Pieces of code that are commonly used in multiple services are normally extracted into libraries (or possibly platform services, depending on the case). We have standard libraries for talking to Datomic and DynamoDB, producing to and consuming from Kafka, making HTTP requests to other services, processing positional files, generating PDFs, and much more. For every different runtime we officially support, we increase the effort necessary to maintain and evolve common patterns.
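A rough way to picture that effort is to count the library implementations a platform team would own. The library names below come from the list above; the second runtime and the one-implementation-per-runtime rule are hypothetical simplifications for the sake of the sketch.

```python
from itertools import product

# Shared concerns taken from the paragraph above.
COMMON_LIBRARIES = [
    "datomic", "dynamodb", "kafka", "http-client", "positional-files", "pdf",
]


def implementations_to_maintain(libraries, runtimes):
    """Assume every library needs one separate implementation per supported runtime."""
    return list(product(libraries, runtimes))


one_runtime = implementations_to_maintain(COMMON_LIBRARIES, ["jvm"])
two_runtimes = implementations_to_maintain(COMMON_LIBRARIES, ["jvm", "nodejs"])

print(len(one_runtime))   # 6
print(len(two_runtimes))  # 12
```

Under this assumption, each additional runtime adds a full copy of the shared library set to maintain, before counting the work of keeping the copies consistent with one another.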

Paving a road while also going off-road

Normality is a paved road: It’s comfortable to walk, but no flowers grow on it

Vincent van Gogh

The benefits may seem straightforward, but how and when should we deviate from the canonical approaches and introduce variance? We see our choice of picking fewer technologies and actively leveraging them as creating a “paved road”, which means having a smooth and as effortless as possible ride while coding at Nubank, with polished and efficient tools for the most common jobs. But having a nice paved road doesn’t mean we stay on it all the time. As Van Gogh has said: “Normality is a paved road: It’s comfortable to walk, but no flowers grow on it”.

So, while our main road should be our path of least resistance, accelerating our inner loop for common problems, we sometimes need to go “off-road” and look out for flowers. That should happen when a tool we already use: (1) is not ideal, and there are better options; or (2) doesn’t work at all for the job. Only going off-road is not enough, however. If we keep branching out into new paths, we risk never having the time, energy, and leverage to pave our main road, making it continuously better.

Continuing with the flower metaphor, Peter Seibel memorably articulated the same concept in his 2015 article title: “Let a 1,000 flowers bloom. Then rip 999 of them out by the roots”. That is, teams should be encouraged to experiment with ideas in an autonomous way (i.e., flowers blooming off the paved road) while knowing that most of them will not “succeed”, meaning that they will be deprecated and won’t be actively maintained (i.e., get ripped). But when they do succeed, we invest behind them and make them part of our paved road.

Our mobile technologies are a good example of this. At the beginning of Nubank in 2014, we used Java (Android) and Objective-C (iOS). When better tools appeared, we started using them, and largely migrated to Kotlin (Android) and Swift (iOS). At some point, our bank account team experimented with React Native, a new cross-platform technology at the time. While most of our credit card related app functionality remained native, the bank account screens were entirely React Native.

The team learned a lot along the way and, eventually, decided to experiment with Flutter to see if it would provide a better developer experience and better toolchain solidity. A few features were then coded in Flutter. A few flowers bloomed, and, given the unhealthy fragmentation that was happening, we had to pick which flower to invest behind. You can read more about it in the article, but, spoiler alert, Flutter is now our “mobile paved road”, although still very much under construction.

The blooming and, eventually, ripping of flowers are not necessarily one-time decisions or points in time, though. They are usually a process occurring over months or even years. Here are a few examples from Nubank:

  • We’ve been using a homegrown integration testing framework since the beginning of Nubank, which was later recreated in open-source as Selvage. For the past couple of years, we’ve also been experimenting with a new framework called State-flow. Recently, we chose to standardize around the latter and deprecate other variants. The business value of doing a full, forceful migration is less clear, however, and we expect this transition to continue organically for the time being.
  • A case of something “just not working for the job” was when we tried ClojureScript with React Native. Although we use ClojureScript extensively on the Web, unfortunately, at that time, it had too many caveats to use within our app. We decided ClojureScript wasn’t viable for mobile and ended up going with TypeScript instead.
  • We used to depend on Riemann for all of our monitoring but eventually moved on to the more feature-rich Prometheus ecosystem in a full migration that took a few months.
  • With regard to backend-for-frontend services, we’ve mostly adopted graph APIs (like GraphQL and Pathom) over REST. A clear decision to standardize on one, the other, or live with a mix is in the works.
  • For web interfaces, we’ve used re-frame extensively, and, in the past years, we’ve been experimenting with Pathom and Fulcro. At the same time, our public web page is in TypeScript and React. Flutter Web is also a thing. And we still have a legacy public front-end in Angular 2. In summary, in the web front-end world, there are plenty of flowers blooming (and beauty is in the eye of the beholder for now).

In summary, every time a team thinks we can get a good improvement from a new technology or approach, we experiment with it. After trying it out, sometimes it’s clear that we want to invest and fully complete a migration, promoting a “dirt road” to a paved road. Other times, it’s not so clear, and we need more time and energy to determine which flowers should continue to bloom and which ones we need to let go. The important thing is that teams are aware of these lifecycle dynamics and generally agree on the value of leverage achieved through consistently aligning on canonical approaches companywide, or, in other words, the value of canonicity.

Final thoughts

Investing time in our paved road, focusing on fewer technologies, has created huge benefits for how Nubank engineers work. These benefits keep showing up as we grow our organization beyond hundreds of engineers. But those rewards only came with intentionality.

In a group of engineers, there’s a natural and desired tendency for people to bring their backgrounds to their choices. If everyone is 100% autonomous in deciding which technologies to use and how to use them, teams will likely diverge over time on how to do things. Because of that, we need to be intentional in avoiding non-essential variance. Otherwise, we can’t reap the benefits as an organization. And being deliberate about it means sometimes making hard choices and having hard conversations. Ripping flowers by their roots is no fun.

Given how dedicated we are to using fewer technologies, one risk we’re taking is forming a “monoculture” around the technologies in our paved road (e.g., Clojure, Datomic, Kafka). In such a culture, people would rather praise a technology choice and agree with the group than exercise diversity of thought and critical thinking. We don’t want that. Van Gogh would probably agree that no flowers grow in a dull culture like that either. Having diversity in how we see the world, how we think, and what we try, combined with the psychological safety to challenge one another, is essential for the continued health of our business.

On the opposite side of the “mono/multi-culture” spectrum, we see companies with lots of technology fragmentation (e.g., multiple programming languages with numerous ways of using them). We believe this can easily lead to cultural fragmentation and reduced leverage. For example, people from different programming language communities can have different opinions on how to structure code and services. By itself, not a big deal, but over time these small differences in views and tools, combined with no migration occurring between camps, can build up to become virtually two whole different companies (i.e., cultures) inside a single one. Building culture and evolving it is hard – tribalism makes it harder.

Finally, the challenges keep changing as we grow the company. The alignment and intentionality were easier (but not easy) when we had 50 engineers, but it’s different and trickier for 600. We have our difficult and controversial questions: how much to enforce things, how much to meddle with our blooming flowers, when to branch the “dirt road” back into the paved road. Nevertheless, having a mentality that avoids non-essential variance in our usage of technologies has been a crucial advantage for Nubank’s engineering, allowing us to be more collaborative, flexible, and efficient. If we keep this mindset throughout the years, it won’t matter if we’re opinionated or different from other companies, as long as we continue to have a nice paved road for our engineers to work efficiently and flowers blooming all over our dirt roads.
