How we do On-Call Rotations at Nubank

System stability is a core part of our business, but sometimes things go awry. Here is the engineering team's strategy for being ready to respond 24/7.

So what is on-call? On-call is when an engineer is available to respond immediately to a service malfunction, at any time, any day of the year. It usually entails some sort of automatic alerting system, paired with a way of notifying the engineer.

For the scope of this article, we define an Alert as an automated, preconfigured check that fires when a given threshold, such as error rate or available memory, is reached.

A Notification or Page is usually triggered by the alert and reaches the engineer through a mobile app notification or phone call.

The on-call engineer is expected to be able to respond; that is, they have the necessary tools at hand, like internet access and a laptop, and are qualified to handle the incident.

There are other models for this, notably in companies that have a dedicated team, usually called SREs (Site Reliability Engineers), who are on call for all systems.

How to define quality for on-call?

Two primary metrics can track the quality of an on-call rotation: Mean Time To Acknowledge (MTTA) and Mean Time To Resolve (MTTR).

The first one tracks how quickly an engineer acknowledges a page, and it reveals how healthy a given rotation is. The other tracks how quickly an acknowledged page is resolved; it shows how good the tooling and documentation are. 
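As a rough illustration of the two metrics, here is a minimal Python sketch. The Page record and its field names are assumptions made for this example, not the schema of any particular paging tool, and MTTR is measured from acknowledgement to resolution, matching the definition above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Page:
    # Hypothetical record of one page; the field names are assumptions.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mtta(pages: list[Page]) -> timedelta:
    """Mean time from a page firing to an engineer acknowledging it."""
    seconds = mean((p.acknowledged_at - p.triggered_at).total_seconds() for p in pages)
    return timedelta(seconds=seconds)


def mttr(pages: list[Page]) -> timedelta:
    """Mean time from acknowledgement to resolution."""
    seconds = mean((p.resolved_at - p.acknowledged_at).total_seconds() for p in pages)
    return timedelta(seconds=seconds)
```

Tracking both numbers per rotation, rather than per company, makes it easier to see which rotation is unhealthy and which one lacks tooling or documentation.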

With that in mind, it would be natural to assume that the group best suited to be on call for a set of services is the same group that builds and maintains them.

However, more often than not, teams consist of two to eight people, so each member would be on call many days a month, which leads to the next point.

Being on call is not an enjoyable activity

Picture the unlikely case that your company produces software that never malfunctions. Even so, it’s not fun to be trapped at home, having to bring your work phone to the bathroom, not being able to have some wine, and knowing that, at any point, you might be woken up by the dreadful, harsh sound of the pager.

I have been woken up in the middle of the night a few times, full of adrenaline, already reaching for the laptop. Going back to sleep afterward is challenging.

Given that, it seems to be a desirable goal to have engineers on call as few times per month as possible.

But how do we reconcile that with quality?

Making trade-offs

To reduce the amount of on-call time, you need to have more people in the rotation. Apart from making teams larger than intended, the only way to do that is to have multiple teams share on-call duty for all the systems in a pool, meaning that the people who are on call are not necessarily in charge of maintaining all of those systems.

This situation may cause anxiety: how can I be on call for a system I don’t know?

Here are the prerequisites to make it possible:

Monitoring & Tools

Without proper monitoring, it is impossible to achieve this goal. You must have comprehensible dashboards and troubleshooting tools.

Documentation

No alert should be created without a very thorough runbook.

Runbooks must be created and tested with engineers outside of the team. When someone writes a runbook, they will inadvertently make assumptions about the knowledge of the engineer who will read it. “Connect to the production server” might mean nothing to someone else.

Remember that the person reading the runbook is under high stress. They are worried and agitated. The last thing they need is to find out that the links in the runbook lead to nowhere.

A runbook starts with a link to the relevant dashboard showing the data that triggered the alert.

Then, it lists instructions, with IPs, bash commands, etc., to troubleshoot and restore service.
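To make that structure concrete, here is a small Python sketch of a runbook record with a basic sanity check. The Runbook class, its fields, and the two rules are assumptions for illustration; real runbooks usually live in a wiki or repository rather than in code.

```python
from dataclasses import dataclass


@dataclass
class Runbook:
    # Hypothetical shape: a dashboard link first, then ordered instructions.
    alert_name: str
    dashboard_url: str
    steps: list[str]


def problems(runbook: Runbook) -> list[str]:
    """Return a list of structural issues; an empty list means none were found."""
    issues = []
    if not runbook.dashboard_url.startswith(("http://", "https://")):
        issues.append("missing dashboard link")
    if not runbook.steps:
        issues.append("no troubleshooting steps")
    return issues
```

A check like this only catches missing pieces. It does not replace having an engineer from another team walk through the runbook.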

Absolutely no flaky alerts

When a team is on call for their own systems, they know that some alerts are a little flaky. “This one always fires on Friday evenings and it auto-heals in three minutes,” or “This one fires off every time there is a deployment.” They happily acknowledge the page and go back to whatever they were doing.

This absolutely cannot happen when you have shared on-call responsibilities. The other engineer won't know that the alert is flaky. They will wake up and open their laptop, only to find the alert already resolved, which is not a good way to make friends.
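One way to spot candidates for cleanup is to look for alerts that repeatedly resolve on their own within minutes. The sketch below assumes alert history is available as (name, fired_at, resolved_at) tuples; that shape and both thresholds are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def flaky_candidates(events, max_duration=timedelta(minutes=5), min_count=2):
    """Return names of alerts that resolved very quickly at least `min_count` times.

    `events` is an iterable of (alert_name, fired_at, resolved_at) tuples.
    """
    short_lived = Counter(
        name
        for name, fired_at, resolved_at in events
        if resolved_at - fired_at <= max_duration
    )
    return sorted(name for name, count in short_lived.items() if count >= min_count)
```

The output is only a list of suspects. The owning team still has to decide whether each alert should be tuned or removed.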

Configuring the rotation

Days of the week

There are many possible configurations, with weekly or daily rotations being very popular. After many iterations and retrospectives, we concluded, unsurprisingly, that people care way more about their weekends than weekdays.

Accordingly, our recommended setup is five shifts per week, which results in the following number of on-call hours outside of the standard 8 hours in the office:

  • Monday – Tuesday: 32h
  • Wednesday – Thursday: 32h
  • Friday: 16h
  • Saturday: 24h
  • Sunday: 24h

Shifts usually start at a time when most engineers are in the office, say 10 am, and finish at 10 am one or two days later, depending on the shift.

That means five different people are necessary to fill up all shifts for a week’s worth of on-call.
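The hours in the list above follow from simple arithmetic: each covered day counts for 24 hours, minus 8 office hours for every weekday in the shift. A short Python sketch of that calculation, with names chosen for this example:

```python
OFFICE_HOURS_PER_WEEKDAY = 8  # the standard 8 hours in the office

# (shift name, weekdays covered, weekend days covered)
SHIFTS = [
    ("Monday-Tuesday", 2, 0),
    ("Wednesday-Thursday", 2, 0),
    ("Friday", 1, 0),
    ("Saturday", 0, 1),
    ("Sunday", 0, 1),
]


def off_hours(weekdays: int, weekend_days: int) -> int:
    """Hours on call outside office time for a shift covering these days."""
    total = 24 * (weekdays + weekend_days)
    return total - OFFICE_HOURS_PER_WEEKDAY * weekdays


hours_by_shift = {name: off_hours(wd, we) for name, wd, we in SHIFTS}
```

The five shifts add up to 128 hours, which is the 168 hours in a week minus 40 office hours.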

Two layers: primary and best-effort

Most tools provide a way to have a multi-layered on-call rotation.

Our recommended configuration is as follows:

  1. Primary rotation with everybody in the pool.
     • Mandatory.
     • Only one person at a time.
  2. Secondary rotation with all the engineers from the team that owns the alert.
     • Best-effort: people in this rotation are not expected to be available and ready to respond.
     • If a page reaches this rotation, everyone in it gets paged at the same time.

With this configuration, let’s say that the primary engineer gets paged, follows the runbook, but is unable to restore service. When they escalate the page, it will notify all members of the secondary rotation.
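The escalation behavior can be modeled in a few lines of Python. This is a sketch of the policy only, with a hypothetical Rotation class and made-up names. In practice the paging tool's escalation settings implement it.

```python
from dataclasses import dataclass, field


@dataclass
class Rotation:
    # Primary layer: exactly one person at a time, drawn from the whole pool.
    primary_on_call: str
    # Secondary layer: owning team name -> all engineers on that team.
    owners: dict[str, list[str]] = field(default_factory=dict)

    def notify(self, owning_team: str, escalated: bool = False) -> list[str]:
        """Return who gets paged for an alert owned by `owning_team`."""
        if not escalated:
            return [self.primary_on_call]
        # Best-effort layer: page the whole owning team at the same time.
        return list(self.owners.get(owning_team, []))
```

For example, Rotation("ana", {"payments": ["bia", "caio"]}).notify("payments", escalated=True) pages both engineers on the payments team at once.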

The fact that the secondary rotation is best-effort can make people nervous. What if no one can respond?

We certainly shared this concern; however, after actual experimentation and dozens and dozens of pages, not once did we have issues with the secondary layer not responding.

Implementing it

Assuming that your company already has on-call rotations in place, usually one per team, my recommendation is to start small.

First, pick two teams to merge and talk it through. If the teams have some intersection in context, it might be a little easier.

Then, get your hands on some alerting statistics: how many alerts fire per month, and how many of those fire outside working hours. If you can, also find out how many resolved on their own.
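If your alerting tool can export alert history, these numbers take only a few lines to compute. The sketch below assumes each alert is a (fired_at, auto_resolved) pair and that working hours are 10:00 to 18:00 on weekdays; both are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def is_outside_working_hours(ts: datetime, start_hour: int = 10, end_hour: int = 18) -> bool:
    """True on weekends or outside the assumed weekday working window."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def monthly_stats(alerts):
    """Summarize (fired_at, auto_resolved) pairs per calendar month."""
    stats = defaultdict(lambda: {"total": 0, "off_hours": 0, "auto_resolved": 0})
    for fired_at, auto_resolved in alerts:
        month = stats[fired_at.strftime("%Y-%m")]
        month["total"] += 1
        month["off_hours"] += is_outside_working_hours(fired_at)
        month["auto_resolved"] += bool(auto_resolved)
    return dict(stats)
```

The same summary is also useful material for a regular review of the rotation.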

With these data in hand, have the two teams clean up the alerts: remove some, fine-tune others. Definitely write runbooks for all of them.

Also, this might be the right moment to read or reread Rob Ewaschuk’s excellent Philosophy on Alerting at Google. We have found that most teams have at most four or five critical alerts that should wake people in the middle of the night, and often fewer.

In general, critical alerts should point to symptoms that affect users, not merely elevated error rates or backlogging in a given service.

With the alerts cleaned up, it is time to configure the rotation.

For the first week or two, to get people a little more confident, you can still keep the per-team mandatory rotation but have the alerts route to the merged rotation first. 

As the teams get more comfortable, and hopefully happier with the new setup, you can slowly add other teams to the rotation, until it includes either every engineer in the company or a large enough number of them. If you have twenty people in the rotation, each person would be on call for roughly one shift per month.
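The one-shift-per-month figure comes from dividing the shifts in an average month by the number of people in the rotation. A quick Python sketch, assuming the five-shifts-per-week setup described earlier:

```python
SHIFTS_PER_WEEK = 5
WEEKS_PER_YEAR = 52


def shifts_per_person_per_month(people: int) -> float:
    """Average number of shifts each person covers in a month."""
    shifts_per_month = SHIFTS_PER_WEEK * WEEKS_PER_YEAR / 12
    return shifts_per_month / people
```

With twenty people this comes to about 1.08 shifts per person per month. With a single five-person team it is about 4.33.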

It is strongly recommended that you book a monthly retrospective with all the engineers in the merged rotation, in which you also share alert statistics.

It is also important to have a channel for all the engineers in the rotation to discuss current alerts, poke people for missing runbooks, rage about the flaky alert that woke them up the night before, and negotiate shift swaps.

Onboarding new engineers on the rotation

This is the recommended way to reduce the anxiety of a new person joining the rotation:

  1. Remind them that the objective of the person on call is to restore service. It is not to fix underlying bugs, understand the whole architecture, etc. Simply follow the runbook, and if it doesn’t restore service, escalate the page.
  2. Explain to them that they are not supposed to be online, checking messages and emails. They are expected to respond to an automatically triggered alert.
  3. Grab an old alert that happened a few days before, have them follow the instructions in the runbook, and show them the dashboards as they looked at the time of the event.
  4. And finally, add them to the channel and the rotation.

Common problems

1. Alerts during working hours

Although it is possible to silence alerts in Prometheus’ Alert Manager, people commonly forget to do so preemptively.

Solving this is surprisingly tricky. Even though the tools have configurations for the time of the day, they don’t have local holiday calendar support.

Unless you are willing to remember to update the configuration manually for those holidays, you might end up with alerts routing only to the best-effort layer.
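The routing decision itself is simple once a holiday list exists. The Python sketch below assumes that pages go straight to the owning team during working hours and to the primary on-call otherwise. The function name, the working-hours window, and the holiday dates are placeholders, and the calendar still has to be maintained by hand.

```python
from datetime import date, datetime

# Hand-maintained list; these dates are placeholders, not a real calendar.
LOCAL_HOLIDAYS = {date(2024, 5, 1), date(2024, 12, 25)}


def first_layer(ts: datetime, holidays=LOCAL_HOLIDAYS) -> str:
    """Pick the layer that should be paged first at time `ts`."""
    is_weekday = ts.weekday() < 5
    in_office_hours = 10 <= ts.hour < 18
    if is_weekday and in_office_hours and ts.date() not in holidays:
        return "owning-team"
    # Nights, weekends, and holidays go to the mandatory primary layer.
    return "primary"
```

Without the holiday check, a page at 11 am on a holiday would be treated as a working-hours page and reach only people who are not expected to respond.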

2. Flaky alerts

These are the bane of shared on-call. People get understandably and rightfully upset when they fire. It is paramount that the leader of each team takes time to fine-tune flaky alerts as soon as possible, immediately or the very next day.

3. Too many alerts for a given team

If one team causes a disproportionate number of alerts compared to the others, people will get bitter. It is possible to set “maximum quotas” per team, per month, and if they are exceeded, the team reverts to local on-call duties.

We have not needed to implement such a policy, but we have contemplated it in the past. Managing shared on-call morale is essential.
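For illustration only, a quota check could look like the Python sketch below. The input shape (one owning-team name per page in the month) and the quota value are hypothetical.

```python
from collections import Counter


def teams_over_quota(pages_by_team, monthly_quota: int) -> list[str]:
    """Return the teams whose page count for the month exceeds the quota.

    `pages_by_team` is an iterable with one owning-team name per page.
    """
    counts = Counter(pages_by_team)
    return sorted(team for team, n in counts.items() if n > monthly_quota)
```

A team that shows up in the result would go back to its own rotation until its alerts are cleaned up.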

4. Adding and removing people from the rotation

OpsGenie doesn’t do an amazing job of keeping the existing schedule when adding or removing people from the rotation. It seems to be a difficult problem to solve, given the open-ended nature of on-call schedules.

Remember to announce these changes in the rotation channel, so that people can check if anything has changed.

5. A particular team has too much specific knowledge

Sometimes a team’s technology is too different from everyone else’s, making runbooks nearly impossible to write.

When trying to implement this, some teams will claim that this is the case. Analyze every situation with care and work with the team to understand whether it is the moment to make foundational changes or not.

If not, the team should carry on with its rotation.

Tooling

Here at Nubank, we use Prometheus’ Alert Manager for alerting, and OpsGenie for notifying.

All our systems are already monitored through Prometheus, so using its Alert Manager was the obvious choice.

Conclusion

Over time, as the rotations evolved, we have observed some beneficial side effects apart from personal well-being and on-call quality:

  • Better alerts
  • Better dashboards
  • Better runbooks
  • More homogeneous technology across the company

And, most importantly, better systems that alert less and self-heal in more cases.
