Engineering operational excellence, a case of continuous improvement

How we pursue operational excellence as we navigate the conundrum of implementing processes that minimize problems and prevent incidents.

Written by: Tiago Fabre and Fredy Gadotti


Business operations can differ a lot from company to company. Nubank, for its part, requires operational processes with high standards, because even a minor problem can impact the financial lives of thousands of clients.

The company has a homogeneous tech stack, with thousands of services using Clojure, Kafka, Datomic, and DynamoDB, an app using Dart/Flutter, and a data pipeline using Scala. Product engineers at Nubank have specific operational needs on a daily basis, including developing new features, running async background jobs, integrating with partners, supporting customers with corner cases, debugging, monitoring infrastructure, saving costs, and so on.

Keep reading to learn more about engineering operational excellence at NuSeguros!

In one of our business units, NuSeguros, the pursuit of operational excellence is a thrilling adventure. As we navigate through this exciting journey, we’re constantly seeking to solve the conundrum of how to implement processes that minimize problems and prevent incidents. Not only do we strive to create the best possible customer experience, but we also endeavor to cultivate a supportive environment for our skilled engineers.

On their quest for operational excellence, companies can encounter many challenges. The most perplexing of these is the bottleneck faced by teams in general, which often grapple with one of two extremes: overly rigorous processes or a complete lack of procedures for dealing with operations.

As we dive deeper into the realm of operational processes, we often uncover common problems and errors in the discovery phase, before action plans are set into motion. The most insidious of these issues arise when operational processes are not well defined, leading to unidentified root causes and real problems left unresolved. This, in turn, can create a snowball effect that threatens to derail the mission of operational excellence.

The creation of an operational excellence environment at NuSeguros, for example, starts with leadership sponsorship, a key advantage that is sometimes neglected in the market. We did not have to convince leadership or try to drive a big change from the bottom: they already knew about the importance of operations and supported it from the beginning. Leadership plays a big role in the creation of an innovative environment through:

  • Blameless culture;
  • Supporting continuous improvement;
  • Prioritizing root causes;
  • Reinforcing third parties’ high standards;
  • And balancing short-term and long-term initiatives.

In addition to that support, metrics and objectives are set to make the expectations clear to everyone and to help the team’s evolution through the process. This is essential to avoid the Abilene paradox: a state in which people, often afraid of conflict, don’t confront the status quo, stagnating the whole process.

Here are some examples of objectives that we have here at Nubank:

  • Learn from mistakes: we aim to deal with incidents in a methodical way, investigating their root causes, tackling them, and preventing them from happening again;
  • Have more time to add business value: this is highly connected with the previous one, because when we tackle root causes and avoid repetitive tasks that usually only address symptoms, we can focus on tasks that bring value to our customers;
  • Offer the best customer experience: by solving root causes and having more time to add business value, we can focus on customer needs to provide the best experience.

There is no way to ensure we are reaching our objectives without metrics, so each objective should have at least one metric attached to it. For the objectives above, we track the following metrics (a rough sketch of how they can be computed follows the list):

  • Created vs Solved root causes: this can show whether we are facing new problems or solving the known ones;
  • Low/high severity tickets: this metric can give us an idea of how much time is being spent on operational problems. If the volume is high and the severity is low, we may be facing a scenario where we need to prioritize more tech debts or initiatives that can handle multiple problems at the same time;
  • Error rate and response time: these metrics can be a proxy to identify any kind of bad experiences from the customer’s point of view.
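
To make these metrics less abstract, here is a minimal sketch, in Python, of how they could be computed from ticket data. The record fields, names, and numbers are illustrative assumptions, not part of Nubank’s actual tooling.

```python
from dataclasses import dataclass

@dataclass
class RootCause:
    identified_at: str  # when the root cause was created (ISO date)
    solved: bool        # whether it was fixed for good

def created_vs_solved(root_causes: list[RootCause]) -> tuple[int, int]:
    """Compare how many root causes were opened vs. how many were closed."""
    created = len(root_causes)
    solved = sum(1 for rc in root_causes if rc.solved)
    return created, solved

def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests: a proxy for bad customer experiences."""
    return errors / requests if requests else 0.0

# Example with made-up numbers:
causes = [RootCause("2023-03-01", True), RootCause("2023-03-08", False)]
print(created_vs_solved(causes))        # (2, 1): one known problem is still open
print(f"{error_rate(42, 10_000):.2%}")  # 0.42% of requests failed
```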

Alongside these objectives, we also have processes to make sure the expectations are being met, and we can break them into planning, operating, and evolving phases.

Planning

During the design phase of new features, some topics need to be reviewed in order to define the task. First, each feature needs metrics and target objectives, known as SLIs and SLOs; these define whether a feature is malfunctioning. We also need to think about ways to handle problems when they happen. To guarantee that, we can have playbooks and dashboards to identify problems and set standard ways of solving them.

This can look like an onerous process, but we already have a lot of metrics exported and dashboards created by default, not to mention good tooling to set up custom metrics and alerts.
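
As a rough illustration of what an SLI/SLO pair can look like, here is a minimal sketch in Python. The `Slo` class, the 99.9% target, and the endpoint name are assumptions for this example, not Nubank’s real definitions.

```python
from dataclasses import dataclass

@dataclass
class Slo:
    name: str         # which SLI this objective refers to
    target: float     # e.g. 0.999 means 99.9% of events must be good
    window_days: int  # evaluation window for the objective

    def is_violated(self, good_events: int, total_events: int) -> bool:
        """True when the measured SLI falls below the agreed target."""
        if total_events == 0:
            return False
        return (good_events / total_events) < self.target

# Hypothetical SLO for a policy-purchase endpoint:
availability = Slo(name="policy-purchase availability", target=0.999, window_days=30)
print(availability.is_violated(good_events=99_900, total_events=100_000))  # False: exactly on target
```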

Operating

After the creation and testing stages, we need to release the new features to our customers. Things we assumed during development might be wrong: the number of accesses could be higher than expected, and infrastructure pieces out of our control can fail, such as cloud providers or third parties.

The things that can go wrong are countless. This is why we always need engineers ready to react to these unwanted events and mitigate problems as soon as possible.

24/7 engineer on-call

Engineers at Nubank have rotations that work as usual during business hours, to make sure that everything is working properly. After their regular working hours, they’re assigned to the high-severity tickets defined in the design phase. In the Insurance group, each engineer wears this hat for a whole week. When the shift ends, a handover is done with the next on-call engineer, who takes over the remaining tickets from the week, since it’s not always possible to solve everything during the rotation.

Ticket tracking platform

In order to achieve operational excellence, we should have the proper tools to help track what happened during the on-call week. As Peter Drucker says, “you can’t improve what you don’t measure.” That’s why any incident that violates the thresholds defined in the SLOs automatically opens a ticket, with a severity to be verified by the on-call engineer. SLO violations are not the only source of tickets: the support team can open tickets whenever a customer needs help. No matter the source, all tickets are centralized in the tracking platform that Nubank uses. This way, the team can prioritize the most important tasks.
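
A minimal sketch of how an SLO violation could turn into a ticket with a severity. The `open_ticket` helper and the severity thresholds are made up for illustration; the real tracking platform and its API are internal.

```python
def severity_for(error_rate: float) -> str:
    """Illustrative mapping from how badly the SLO is burning to a ticket severity."""
    if error_rate >= 0.05:
        return "high"    # pages the on-call engineer immediately
    if error_rate >= 0.01:
        return "medium"
    return "low"         # handled during business hours

def open_ticket(title: str, severity: str) -> dict:
    """Stand-in for the internal tracking platform: just builds the ticket payload."""
    return {"title": title, "severity": severity, "status": "open"}

# When a threshold defined in the design phase is crossed, a ticket is opened automatically:
measured_error_rate = 0.02
if measured_error_rate > 0.001:  # illustrative SLO threshold
    ticket = open_ticket("policy-purchase error rate above SLO", severity_for(measured_error_rate))
    print(ticket)  # severity 'medium', status 'open'
```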

To give visibility, a message is sent to a monitoring Slack channel. Depending on the incident severity, it may also trigger an alarm telling the engineer to engage with the incident within the agreed engagement time.
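
To sketch the visibility part, the snippet below posts a short incident summary to a Slack channel through an incoming webhook. The webhook approach and URL are assumptions; the integration Nubank actually uses is not described here.

```python
import json
import urllib.request

def notify_channel(webhook_url: str, ticket: dict) -> None:
    """Post a short incident summary to a monitoring Slack channel via an incoming webhook."""
    message = {"text": f":rotating_light: [{ticket['severity']}] {ticket['title']}"}
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)  # fire and forget; real code would handle errors and retries

ticket = {"title": "policy-purchase error rate above SLO", "severity": "medium"}
# notify_channel("https://hooks.slack.com/services/<hypothetical>", ticket)
```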

Having this kind of tracking gives us material to revisit the initial objectives in light of what was solved in the past, for example, how many customers were affected or what the root cause was.

Playbooks

This is one of the most important things the on-call engineer can rely on! After identifying a problem, the engineer should verify whether the issue is known or not. When a known issue appears, the engineer can follow a Standard Operating Procedure (SOP) to mitigate the problem faster and guarantee that everything will be running smoothly when customers use the insurance.

As part of the on-call week, the engineer must create new playbooks for new errors and update the previous playbooks with new relevant information, as the environment is a living thing and changes from time to time as new features are implemented.

The main objective of the playbooks is to apply a standard procedure to mitigate the problem caused by an incident; the definitive fix is usually prioritized among other tasks during the regular development sprint. Playbooks are helpful to new engineers during on-call rotations, or even to the engineer who created the playbook in the past, as they reduce cognitive load and ensure a standard approach to the problem.
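
One way to picture a playbook collection is as a simple index from alert name to SOP steps, as in the sketch below. Both the structure and the steps are made up for illustration.

```python
# Hypothetical playbook index: alert name -> ordered SOP steps for the on-call engineer.
PLAYBOOKS: dict[str, list[str]] = {
    "policy-purchase error rate above SLO": [
        "Check the service dashboard for error spikes",
        "Verify whether the partner integration is degraded",
        "If messages piled up, reprocess the dead-letter queue",
        "If this is a new failure mode, open a root-cause ticket and write a new playbook",
    ],
}

def lookup_playbook(alert_name: str) -> list[str]:
    """Return the SOP for a known issue, or a reminder to create one for a new issue."""
    return PLAYBOOKS.get(alert_name, ["Unknown issue: investigate and create a new playbook"])

for step in lookup_playbook("policy-purchase error rate above SLO"):
    print("-", step)
```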

Evolving

The on-call process is an endless journey, and we always try to reflect on what happened during the rotation and how we can smooth it out in the future and recover ever faster. Here are some of the rituals we follow:

Weekly review meeting

The shifts rotate every Wednesday at noon, so before each one ends we hold a meeting to review how many tickets we had, their severity, their impact, and more. During these meetings we analyze all the metrics and discuss them: this helps other engineers learn how to react to new possible issues.

For every incident, we have to analyze and discover the root cause because we want to avoid problem repetition during the rotations. So, at every meeting, we check the new root causes created and which of those were solved for good.

This meeting is where we review our initial objectives and try to improve the overall process by discussing what happened during the rotation.

Postmortem

Running a postmortem is one of the most important tools a company can use. People make mistakes, but we can avoid them with consolidated processes. One of the most effective ways to improve the process is to track everything that happens during an outage. It’s necessary to gather the events and see what we could have done better to prevent the issue from happening.

Usually, creating mechanisms is the safest way to ensure the good behavior of the system. To achieve this, we create alarms, guardrails, verification controls, and whatever else we can, rather than relying on good intentions alone. There is a good thread on Twitter about how Amazon relies on mechanisms.

Running a postmortem with the five-whys technique is an amazing starting point, and you can improve on it as soon as you start to figure out the real root cause of the issue. For example, a chain of whys might run: a purchase request timed out; the database query was slow; an index was missing; the schema change shipped without review; there was no checklist for schema migrations.

Action items

Every meeting should generate one or more action items, and these items need to have an owner and a deadline, which is usually the next weekly meeting. These items can be bug fixes, the creation of new alarms, playbook improvements, or conversations with third parties, for example. The most important thing is to have an owner and a due date; otherwise you can be sure the problem will pop up again. Solving small problems may not look like much in the beginning, but continuous improvement is what makes the process great.
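
As a small illustration of the owner-plus-deadline rule, here is a sketch that flags overdue action items before the next weekly review. The fields and dates are assumptions based on the process described above.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    due: date  # usually the next weekly review meeting

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Items past their deadline should be raised in the weekly review, not forgotten."""
    return [item for item in items if item.due < today]

items = [
    ActionItem("Create alarm for partner timeout spike", "on-call engineer", date(2023, 3, 15)),
    ActionItem("Update the policy-purchase playbook", "feature owner", date(2023, 3, 22)),
]
for item in overdue(items, today=date(2023, 3, 20)):
    print(f"OVERDUE: {item.description} (owner: {item.owner})")
```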

Conclusion

As we conclude our exploration of operational excellence at NuSeguros, it’s essential to understand that this journey is far from over. Operational excellence is not a one-time goal, but rather a continuous process of improvement and adaptation, always leaving room for further growth.

The key to achieving this lies in creating an environment where both experienced engineers and newcomers feel comfortable challenging existing solutions. This dynamic atmosphere not only benefits our customers, who enjoy outstanding solutions, but also empowers our engineers to focus their time and energy on impactful initiatives that drive progress.

In closing, the pursuit of operational excellence is an ongoing commitment. Together, we will continue to learn, innovate, and improve, ensuring that we consistently deliver excellence in all that we do.