At Nubank, the reliability squad is continuously working to improve our incident management procedure by providing tools, better processes, and much more. Our goal is to support our engineers as they mitigate operational issues in a healthy environment, one that is based on a blameless culture and complies with all the regulations that apply to financial companies.
Any kind of issue affecting our systems that, in some way, impacts our customers can be considered a technical incident; it’s identified by our monitoring systems and must be fixed as soon as possible by our engineering team.
An incident can be divided into two parts: the first is the incident handling itself, and the second is what happens after the incident, such as action plans. Let’s take a tour and see how we deal with these situations, which we try to avoid but which occasionally happen. Being ready for a fast and safe recovery is as important as avoiding incidents: it lets us mitigate impact and provide the best experience to keep our customers happy.
Identifying an Incident
Our alerting system is a subject for another post but, in short, squads can create custom alerts for their services, and each service also has a set of default alerts, such as “service down”. Squads are notified on their Slack channel, and the on-call engineer from the squad responsible for the system is paged by OpsGenie. If an incident is identified, they have to start working on it immediately.
Opening a Crash
We follow a simple framework in which the first step is to “open a crash”. This means to notify the entire company that we are facing an incident and Nubankers are already dealing with it.
Identified incidents are reported using a bot through Slack (our main internal communication tool). This automation centralizes all the management of the incident: people use it to create, edit, and close crashes. The main benefits of using it are to organize the situation, trigger the other stakeholders (such as the risk and compliance team), and give the incident proper visibility across the company. Besides that, we are also able to collect data about incidents and extract key metrics, like our MTTR (one of the Accelerate metrics).
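As an illustration of the kind of metric this data makes possible, here is a minimal sketch of an MTTR calculation. It assumes each incident is recorded as a pair of opened and closed timestamps; that representation, the function name, and the sample data are hypothetical and are not taken from the bot itself.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average the (closed - opened) duration over incidents that are closed.

    `incidents` is an iterable of (opened, closed) datetime pairs; `closed`
    is None for incidents that are still open, and those are skipped.
    Returns a timedelta, or None when there is no closed incident.
    """
    durations = [closed - opened for opened, closed in incidents if closed is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data, for illustration only.
incidents = [
    (datetime(2021, 3, 1, 10, 0), datetime(2021, 3, 1, 10, 30)),  # 30 minutes
    (datetime(2021, 3, 5, 14, 0), datetime(2021, 3, 5, 15, 30)),  # 90 minutes
    (datetime(2021, 3, 9, 8, 0), None),                           # still open
]

mttr = mean_time_to_restore(incidents)
print(mttr)
```

Open incidents are left out on purpose, since their duration is not known yet; including them would understate the real time to restore.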
Before opening a crash, the engineer involved first needs to determine the severity level, classifying the incident between 1 (critical incident) and 5 (cosmetic issue). The classification criteria cover availability, the number of customers affected, the product affected, regulatory matters, and others.
The main information needed to open a crash is:
- Severity: The severity of the incident, following the pattern described above.
- Brief description: A brief description of the issue.
- Affected countries: Countries where we have operations being affected.
- Point: The engineer acting as a focal point of the crash, coordinating all efforts to fix it.
- Comms: The engineer responsible for reporting the crash status to the company and giving enough information about it to anyone who asks.
After submitting, a summary of the incident will be posted in Slack notifying the appropriate teams about the crash while engineers are working on fixing it.
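The fields above can be pictured as a small record that is validated before the summary is posted. This is only a sketch under stated assumptions: the `Crash` class, its field names, and the summary layout are hypothetical, and the real bot may store and format things differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Crash:
    """Hypothetical record holding the information needed to open a crash."""

    severity: int                  # 1 (critical incident) to 5 (cosmetic issue)
    description: str               # brief description of the issue
    affected_countries: List[str]  # countries whose operations are affected
    point: str                     # engineer coordinating the fix
    comms: str                     # engineer reporting status to the company

    def __post_init__(self):
        # Reject values outside the 1-5 severity scale described above.
        if not 1 <= self.severity <= 5:
            raise ValueError("severity must be between 1 (critical) and 5 (cosmetic)")
        if not self.description.strip():
            raise ValueError("a brief description is required")

    def summary(self) -> str:
        """Build a plain-text summary (the layout is an assumption)."""
        countries = ", ".join(self.affected_countries) or "none reported"
        return (
            f"[SEV {self.severity}] {self.description}\n"
            f"Affected countries: {countries}\n"
            f"Point: {self.point} | Comms: {self.comms}"
        )


# Hypothetical example values.
crash = Crash(
    severity=2,
    description="Card authorizations are failing intermittently",
    affected_countries=["Brazil"],
    point="alice",
    comms="bob",
)
print(crash.summary())
```

Validating at creation time means a crash can never be announced with a severity outside the agreed scale or without a description.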
Working on it
In this step, as you may imagine, anything can happen. People usually open a voice call and start working on debugging and fixing the issue, operations teams start preparing understandable explanations for our clients, and the focus of the engineering team is to mitigate the impact and recover the system back to its proper state.
At this point, it’s important that everyone who is able to help gets involved (especially in high-severity incidents), and that the Nubanker in charge of comms keeps updating the incident thread with news, so everyone in the company can follow it in real time.
After the crash is completely fixed, and nothing unusual is happening, the crash can be closed using our bot and everything is fine again!
Blameless culture and Postmortem
Postmortem is essential in incident management. Its main objective is to ensure that companies learn from crashes, register them, and ensure knowledge sharing about them.
“The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.” (Google SRE Book)
At Nubank, we write a postmortem for every high-severity crash, and we recommend one for crashes of any severity. After the crash is closed, engineers should write a document about it, following a specific template with these topics:
- Summary: A brief summary of the crash containing severity level, point, comms, detection time, resolution time, and a description.
- Timeline of crash events: A timeline of all relevant elements involving the crash.
- Actions performed to solve the issue: A list of all actions taken to solve it.
- Customer & Business Impact: Brief description of business and customer impact of the incident.
- Root cause & Contributing factors: A description, written after a deep analysis, of the root causes and contributing factors of the crash. At this point, we encourage people to use the 5 Whys technique to understand the root cause in depth.
- Meeting Notes: Any note about the incident that can be useful.
- Action Items: A list of action items that need to be taken to prevent the crash from happening again and to help us recover fast from future incidents.
- Regulatory: Regulatory information about the crash that we need to report to the central bank.
- References: Any reference needed, like useful links, papers, etc.
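The topics above could be kept as a reusable skeleton, with a small check for sections that are still empty. This is a minimal sketch: the section names come from the list above, while the function names, the placeholder text, and the plain-text layout are assumptions made for illustration.

```python
# Section names taken from the template topics listed above.
POSTMORTEM_SECTIONS = [
    "Summary",
    "Timeline of crash events",
    "Actions performed to solve the issue",
    "Customer & Business Impact",
    "Root cause & Contributing factors",
    "Meeting Notes",
    "Action Items",
    "Regulatory",
    "References",
]


def render_postmortem_skeleton(title):
    """Return an empty plain-text postmortem with one block per section."""
    lines = [f"Postmortem: {title}", ""]
    for section in POSTMORTEM_SECTIONS:
        lines.append(section)
        lines.append("(to be filled in)")  # hypothetical placeholder text
        lines.append("")
    return "\n".join(lines)


def missing_sections(filled):
    """List template sections that are absent or blank in `filled`.

    `filled` maps a section name to the text written for it.
    """
    return [s for s in POSTMORTEM_SECTIONS if not filled.get(s, "").strip()]


skeleton = render_postmortem_skeleton("Intermittent authorization failures")
todo = missing_sections({"Summary": "SEV 2, resolved in one hour."})
print(skeleton)
print(todo)
```

A check like `missing_sections` could be run before a postmortem is published, so that no topic from the template is skipped by accident.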
After this document is published, it’s available for the entire company to read and learn from it, and engineers start to work on the action plan to prevent it from happening again.
We wouldn’t have a healthy environment for dealing with crashes and postmortems if we didn’t live in a blameless culture: we don’t try to find a culprit, but rather try to understand what happened and what needs to be done so that it doesn’t happen again.
“Blameless culture needs to exist, and not as a rule, but as a culture of the entire company, people need to not point fingers at someone, but find the root cause, take actions to not happen again, and learn a lot from it.”
As a celebration of our blameless and postmortem culture, we have a monthly meeting with the entire company, where people involved in some crashes from the current month share lessons learned, and actions to be taken.
A common way of reacting to incidents at Nubank is to say “fascinante” (“fascinating” in English) while putting our hands above our heads (it has become a Slack reaction now that we work from home). This truly symbolizes the way we deal with incidents here: sometimes they happen, and when they do, we consider them fascinating and love to learn from them.
This is a picture of this meeting before the pandemic with everybody reacting with “fascinante”:
Our incident management process is constantly being updated so that it always works in the simplest and most effective way. Future changes will happen (they’re always happening), but more important than the process is the culture: people acting blamelessly, helping each other, and always trying to improve and provide our customers with the best experience possible.
Blameless culture is the most important aspect of our incident management.