In the past at Nubank, we used an End-to-End test suite to find issues across the boundaries of our microservices architecture in a staging environment. In practice, that means the interactions between services are very often backed by actual databases, messaging systems, HTTP requests, and so on.

In general, End-to-End tests are black-box tests in the sense that we stimulate one of the inputs of the system (e.g. by making an HTTP request to an endpoint or producing an asynchronous message to a topic).

The system then produces several interactions between its parts; for example, possibly many other HTTP requests and/or asynchronous messages. We check the validity of these interactions by inspecting specific outputs, for example by calling another HTTP endpoint to verify that the desired effect has been produced.
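
To make that concrete, here is a rough sketch of what such a black-box test might look like in Clojure. The host, endpoints, payloads and timings are hypothetical; the real suite ran against many staging services rather than a single example host.

```clojure
(ns e2e.sketch
  (:require [clj-http.client :as http]
            [clojure.test :refer [deftest is]]))

;; Hypothetical staging host used for illustration only.
(def base-url "https://staging.example.com")

(defn- bill-visible?
  "Black-box check of an output: does another endpoint now show the effect?"
  [customer-id]
  (= 200 (:status (http/get (str base-url "/customers/" customer-id "/bills")
                            {:throw-exceptions false}))))

(deftest requesting-a-bill-eventually-produces-one
  ;; Stimulate one input of the system...
  (http/post (str base-url "/bills")
             {:form-params {:customer-id "42"} :throw-exceptions false})
  ;; ...then poll an output endpoint, since the effect is asynchronous.
  (is (some true? (repeatedly 30 #(do (Thread/sleep 1000)
                                      (bill-visible? "42"))))))
```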

End-to-end in a fintech

As a fintech, quality is of the utmost importance for us. We need our customers to trust us with their money. Our End-to-End test suite complemented our testing strategy to ensure our systems were of very high quality and integrity.

If this is such a well known best practice among tech companies and other banks, why would we consider changing strategies? 

It turns out that this practice has a lot of downsides. In my early days working here I conducted an assessment of engineering pain points. As I spoke to many different people and teams, the struggle with this type of test became a common theme.


A diagnosis of our end-to-end test suite

Most companies still believe that End-to-End Integration Tests are the best way to catch bugs. But they also experience a progressive slowdown of value delivery due to the pain points we uncovered during the assessment:

  1. Waiting. Engineers had to wait more and more to get feedback from this long-running suite;
  2. Lack of confidence. Flaky tests meant that we had to re-run the suite frequently to see whether something was really wrong or just a false alarm;
  3. Expensive to maintain. Manual changes in our staging environment corrupted test data fixtures and maintaining the environment “clean” was a challenge;
  4. Failures don’t point to obvious issues. Test failures were very hard to debug, especially due to our reliance on asynchronous communication, which makes it hard to connect the cause of a failure (a message not published to a queue) with its effect (changes not made in another system);
  5. Slower value delivery. Queueing of commits in the End-to-End suite resulted in less frequent deployments;
  6. Not efficient. Few bugs were caught at this stage. One experiment suggested that, for every 1,000 runs, we had 42 failures, of which only 1 pointed to a real bug;
  7. Not effective. Bugs were still being found in production.

Presentation to engineering leadership

When showing the results of my assessment to the CTO and engineering leadership, I presented the situation: 

“If these trends continue, we’ll take more and more time in our continuous integration and deployment pipelines, and eventually find ourselves stuck in a corner. Folks will commit bigger and bigger batches of changes that will take longer to be deployed, while still being open to the risk of bugs in production”.

A new hope: Contract Testing

This diagnosis was not something novel or specific to Nubank. In my previous experiences I had seen the same situation in many Fortune 500 companies that struggled with their End-to-End test automation. But at Nubank we were determined to change this. One of our Sr Staff Engineers, Rafael Ferreira, ran some numbers and applied queueing theory. He determined that by 2021 Nubank’s End-to-End test suite would take… an infinite time to run! 

Figure: projected run time of our End-to-End test suite, estimated with queueing theory.

So we decided to explore Consumer Driven Contract (CDC) testing as an option.

Contract tests allow us to describe the interactions between our services through expectations to be met by their inputs and outputs. For instance, a billing service needs the customer’s first and last name to generate a bill. So, we may describe the interaction between billing and customer services according to a contract: the customer service exposes the endpoint GET /customers/{id} and expects to receive a valid UUID. 

The billing service in turn, as a consumer of this endpoint, expects to receive two attributes: first_name and last_name (both non-empty strings). Any unexpected change to these assumptions (e.g. the customer service making the last_name attribute optional, meaning that it may be empty) constitutes a contract breakage that may affect the behavior of business flows at runtime and must be caught by contract testing tools.
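
To make the example concrete, here is a minimal sketch of those expectations written with the Plumatic Schema library (the same Schema library mentioned later in this post). The schema names are illustrative rather than production code, and plain s/Str does not by itself enforce non-emptiness.

```clojure
(ns contracts.billing-customer
  (:require [schema.core :as s]))

;; What the billing service (the consumer) expects from GET /customers/{id}.
(s/defschema CustomerForBilling
  {:first_name s/Str          ; required; expected to be non-empty in practice
   :last_name  s/Str
   s/Keyword   s/Any})        ; extra fields the consumer simply ignores

;; A provider-side change that would break the contract:
;; last_name becomes optional/nullable.
(s/defschema CustomerResponseV2
  {:first_name s/Str
   :last_name  (s/maybe s/Str)
   s/Keyword   s/Any})

;; A contract testing tool should flag that CustomerResponseV2 no longer
;; guarantees what CustomerForBilling requires: a non-nil last_name.
```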

So in summary:

  • End-to-End tests are helpful at the beginning of a project when things are still simple and confined to a single team. However, as both project and team grow, they start to become a bottleneck because they demand coordination between teams and infrastructure work to keep them running reliably. 
  • Contract Tests, on the other hand, require less coordination and infrastructure. It means that teams can evolve independently of each other. This is especially useful in a microservices architecture because the number of integration points tends to grow exponentially.

Why we chose to build our own framework

How these contracts are declared and validated depends on the contract testing framework in question. However, a common characteristic is that inputs and outputs are collected and validated without executing black-box tests against actual instances running in production-like environments.

Implementations of this concept vary:

  1. one can verify declarative schemas through a sort of generative test (sketched below); or
  2. test the calls made by a client application (the billing service in our example) against a mock server that responds with the same data returned by the actual server (the customer service). 
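
The first approach can be sketched roughly as follows, assuming the Plumatic schema and schema-generators libraries: generate sample payloads from the provider's declared schema and check each one against the consumer's expectations. The schema names are made up for illustration.

```clojure
(ns contracts.generative-check
  (:require [schema.core :as s]
            [schema-generators.generators :as g]))

;; Provider's declared response schema for GET /customers/{id} (illustrative).
(s/defschema ProviderCustomer
  {:id         s/Int
   :first_name s/Str
   :last_name  s/Str})

;; Consumer's expectation: it only cares about the name fields.
(s/defschema ConsumerCustomer
  {:first_name s/Str
   :last_name  s/Str
   s/Keyword   s/Any})

;; Generate samples from the provider schema and check each one against the
;; consumer schema; any non-nil result from s/check is a contract breakage.
(defn schemas-compatible? []
  (every? nil? (map #(s/check ConsumerCustomer %)
                    (g/sample 100 ProviderCustomer))))
```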

The second approach is the one implemented by Pact. We started with it, since it is a well-established framework for writing Contract Tests. However, two aspects caught our attention in our first experiments:

  • The support for messaging tests was immature in the JVM implementation: most of the critical interactions between our microservices occur through Kafka messages (we favor mutations in asynchronous flows, while HTTP calls are mostly reserved for read-only operations). Therefore, solid support for asynchronous interactions was a crucial factor for us;
  • The JVM version lacked satisfactory support for Clojure, the programming language we use for the vast majority of our microservices.

Our decision

Considering the aspects above, Rafael and our team, including Lead Engineers Rui Hayashi and Alan Ghelardi, decided we should develop our own Contract Testing tool.

In fact, we drew inspiration from many aspects of the Pact framework as well as from the consumer-driven contracts pattern. However, throughout this journey, we realized that even the traditional model of Contract Tests had some downsides that could make their adoption difficult in our circumstances.

In general, Contract Tests depend strongly on the correct (and often complex) initial state of the microservices being tested in order to exercise relevant interactions among them. For very simple interactions (like those that frequently appear in examples of CDC tests) this might not be a problem, but in the context of a financial company with complex business rules spread across a wide variety of services, engineers would certainly struggle, very often, to put their microservices into valid states to write useful tests.

Building Sachem

And so we decided to create Sachem, our very own contract testing framework, to deprecate End-to-End testing in staging as a practice. One interesting thing about this project is that we realized we already had the contracts in our microservices. Clojure has a library called Schema that allows you to richly describe data structures.

It was already common practice to write schemas for every HTTP endpoint and Kafka topic. What we did was to build a tool that collects those schemas and checks if they are compatible. 

Therefore, nothing really changed in terms of communication between teams, but it shows the value of having those contracts in place.
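
To illustrate the idea (Sachem's actual rules are richer), a compatibility check between a provider's declared schema and a consumer's declared schema could look something like the deliberately naive sketch below, again using Plumatic Schema.

```clojure
(ns sachem.sketch
  (:require [schema.core :as s]))

;; Hypothetical declared schemas collected from two services' ports.
(def provider-schema
  {:first_name s/Str
   :last_name  s/Str
   :birthdate  s/Str})

(def consumer-schema
  {:first_name s/Str
   :last_name  s/Str
   s/Keyword   s/Any})   ; the consumer tolerates extra fields

;; Naive rule: every key the consumer requires must be present in the
;; provider's schema with an identical leaf schema. A real tool also has to
;; handle optional keys, nested schemas, enums, and so on.
(defn compatible? [provider consumer]
  (every? (fn [[k required-schema]]
            (= (get provider k) required-schema))
          (dissoc consumer s/Keyword)))

(compatible? provider-schema consumer-schema) ;; => true
```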

Factors considered

We learned in a few experimental sessions with our engineers that, to be successful, our solution should be unintrusive and fit organically into our codebase by leveraging existing aspects of our architecture, preferably without forcing engineers to write complex tests. To proceed, we considered the following factors:

  • Homogeneity. Our microservices are pretty homogeneous – they are mostly written in Clojure, generated by the same template, and follow very similar standards;
  • Schemas. As mentioned, all ports of our services (HTTP clients and servers as well as Kafka consumers and producers, following the Ports and Adapters pattern) declare schemas for their inputs and outputs;
  • Validation. Those schemas are plain Clojure data structures used to parse and validate inputs and outputs at runtime (see the sketch after this list);
  • Violations. In our analysis, we figured out that the most frequent category of bugs caught by End-to-End tests was schema violations.
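
For instance, validation at a port boundary with those plain-data schemas looks roughly like this; the namespace, schema, and message names are made up for illustration.

```clojure
(ns ports.kafka-consumer
  (:require [schema.core :as s]))

;; Hypothetical message schema declared at a Kafka consumer port.
(s/defschema BillIssued
  {:customer-id s/Uuid
   :amount      s/Num
   :due-date    s/Str})

(defn consume! [message]
  ;; s/validate returns the value when it conforms to the schema and
  ;; throws an ex-info describing the mismatch otherwise.
  (let [valid-message (s/validate BillIssued message)]
    ;; ... hand valid-message over to the business logic ...
    valid-message))
```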

With those aspects in mind, we gave up on the idea of using contract tests as a replacement for testing distributed business logic among services and concentrated solely on validating their schemas with Sachem.

One strategy to rule them all?

Getting rid of E2E tests really helped to eliminate the coordination problems. Before Sachem we frequently ran into issues, since each team had to take its turn in the E2E pipeline, and teams would barter to skip ahead in the queue to get their changes into production faster, causing friction with other teams.

The engineering costs of keeping the tests working, waiting to get code into production, and so on, were the main drivers of this change. It used to take at least two hours, in the best scenario, to get some code into production; it could take more than a day. Just think of the cost of context switching between tasks while you wait for something else to go to production.

At first, the goal was to completely remove the End-to-End tests, but in the long term we found that we missed some of their benefits. Mainly, Contract Tests can catch structural incompatibilities, but they are not good at testing behavior.

The guarantee

But to keep our high quality standards, we also needed to guarantee the critical behaviour of our applications.

So we found another way to complement our suites, by building what we call acceptance tests (not to be confused with Gherkin-style acceptance tests).

The main difference from the old E2E tests is that they encompass only a subset of services and don’t require spinning up a production-like environment (the services run in memory on a single JVM, and HTTP/Kafka communication is replaced by in-process communication). They are used for specific flows that we find too critical to rely on Contract Tests alone.
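
As a toy illustration of the in-process idea (not our actual test harness), the "Kafka" between two services can be replaced by a plain function call within the same JVM:

```clojure
(ns acceptance.sketch
  (:require [clojure.test :refer [deftest is]]))

;; In-process stand-in for Kafka: a map of topic -> handler function.
(def handlers (atom {}))

(defn subscribe! [topic handler]
  (swap! handlers assoc topic handler))

(defn publish! [topic message]
  ((get @handlers topic) message))   ; a direct call instead of a real broker

;; A toy "billing service" consuming a topic and recording its effect.
(def issued-bills (atom []))

(subscribe! :bill-requested
            (fn [{:keys [customer-id amount]}]
              (swap! issued-bills conj {:customer-id customer-id
                                        :amount      amount})))

(deftest bill-is-issued-when-requested
  (publish! :bill-requested {:customer-id "42" :amount 100M})
  (is (some #(= "42" (:customer-id %)) @issued-bills)))
```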

We also have an experimentation platform whereby people can run experiments against a subset of our customer base. It’s common practice to use techniques like feature flags, percentage rollouts, A/B testing, and so on. This has been working well as a purely technical testing mechanism, but most importantly as a way to gather business insights on new features.
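
For example, a percentage rollout can be as simple as hashing a stable customer identifier into a bucket; the sketch below is generic, not our experimentation platform.

```clojure
(ns rollout.sketch)

(defn enabled?
  "Deterministically enables `feature` for roughly `percentage`% of customers
  by hashing the (feature, customer-id) pair into a bucket from 0 to 99."
  [feature customer-id percentage]
  (< (mod (hash [feature customer-id]) 100) percentage))

;; Enabled for ~10% of customers, always the same ones for this feature.
(enabled? :new-bill-layout "customer-123" 10)
```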

We have tried to implement distributed tracing instrumentation but turned it off because of a mix of low uptake from our internal users and the difficulty of maintaining the infrastructure. We intend to revisit that in the future.

The results

The results were quite remarkable, especially for two important metrics:

  • Cycle time: the time from merging something into the main branch to deploying it in production went from unpredictable (hours or even days) to about 20 minutes.
Figure: cycle time went from unpredictable to about 20 minutes after adopting Contract Testing.
  • The number of deploys per week: we were doing at most a hundred deploys a week with E2E. After Contract Testing, that number started to grow exponentially; today, we are on the order of a thousand deploys per week. You may say that was just the growth of the company, but since E2E requires a queue, we would always be bound to a maximum number of deploys per week, which also means the cycle time would get worse and worse.
Figure: the number of deploys per week after adopting Contract Testing.

It is important to mention that these are two of the four “Accelerate” metrics. The Accelerate book shows that companies that excel at those metrics are among the high performers in the industry.


I would like to thank César Vortmann for the inspiration for this post and the questions that drove it. We have previously talked about CDC in this podcast and in this video about how we do end-to-end tests for our microservices architecture. It became clear people wanted to hear more, and we hope this post will help folks out there who are considering these different testing strategies. Also, thanks to Rui Hayashi and Alan Ghelardi for being co-writers in getting answers for this article, and to Rafael Ferreira, Paulo Victor and Ezequiel Siddig for their review.

Check our job opportunities