A gentle and practical introduction to A/B tests

A mini guide on what A/B tests are, why to use them and how to run reliable A/B testing experiments


Written by: Isabela Piccinini
Reviewed by: Paulo Rossi, Felipe Almeida

Everyone struggles with the uncertainties involved in decision making at some point. So consider how much more complicated this can get in business environments, with the additional pressure of stakeholders and customer satisfaction.

Luckily for us, our brains are built to evaluate the quality of the available evidence and then deliberate on the best course of action, which guides the decision-making process and reduces uncertainty.

Wouldn’t it be nice if we had a methodology to help us along this hard and important path and make our lives a bit easier? This is where A/B testing can be an amazing ally! It’s a way of experimenting with new ideas, measuring the impact of our changes in a controlled environment and learning from them before scaling solutions. It helps us make data-driven decisions with higher confidence about the actual effect that possible changes have on business strategy and end users. It may seem like magic, but it’s science!

So, how can we improve the way we make decisions and learn from users? Stay with us: in this article we’ll introduce you to A/B testing in a straightforward and practical way so you can put it into action right away.

A/B tests in a nutshell

In a company context, A/B testing is a way to perform a controlled and randomized experiment to compare two different versions (A and B) of some product or service. It uses statistical tests to establish a causal relationship between a treatment (variation) and an effect (consequence).

As with any experiment, we want to validate a hypothesis using one or more metrics of interest.

The idea behind an A/B test is to challenge the existing version of your product or service (A, also known as the control group) with a new version that has one single specific change (B, known as the treatment group). The first one serves as a baseline against which you compare the performance of the change you want to make. You randomly split traffic between the groups and compare the chosen performance indicators against each other. Whenever you find a difference that you are confident about and that confirms your initial hypothesis, you can roll out the new version of your product/service.

Let’s use a simple example as follows:

Let’s say you have an initial page in your app that currently has a button to go forward and you want to assess if changing the character “>” to the word “go” will increase traffic. You then have:

Hypothesis: Changing the button content displayed will increase user access volume by 5%.

Target metrics: Volume of clicks and number of users that click on the button.

While an experiment might sound simple, in order to trust your results you have some important ingredients and procedures to consider, and that’s why we’ll discuss them further ahead. But first, why should we care about running A/B tests?

Why are A/B tests so important?

Experimentation should be ingrained in the product development culture. It allows us to make incremental improvements based on real feedback. It also enables us to test each change with a small, representative sample of real-life users, so we can collect valuable information to build on as we iterate through our product development cycle.

Experimentation is a booster for agile culture while reducing uncertainty and giving insights on what works and what doesn’t for our target audience. 

If the experiment goes well, we can be confident about rolling out the change to all customers. Otherwise, we fail positively: we get the chance to learn from the data collected, which sparks new ideas. This way, no harm is done to the business, and we don’t spend a lot of human and financial resources on something that does not bring the expected results.

Besides that, an experimentation culture fosters curiosity and helps fight strong opinions (aka HiPPO, the highest paid person’s opinion). We humans have a strong confirmation bias: we happily accept ideas and results that fit our way of thinking and, at the same time, we tend to question or doubt results that diverge from our assumptions. Knowing that, we can use A/B tests to our advantage, since they enable bias-free impact measurement (at least when best practices for experiment design are followed and we do have a randomized sample, as we’ll discuss later on). So we choose the best idea validated by our experiment.

One accurate measurement is worth more than a thousand expert opinions.

(Admiral Grace Hopper)

A lot has been said so let’s summarize experimentation culture in four main pillars to keep in mind:

  1. Support your decision with reliable data
  2. Learn fast from the process and the results
  3. Allocate resources in a smart way
  4. Better understand what your customer needs

How to run an A/B test?

I promised this would be a practical guide, so let’s talk about some important steps to conduct a trustworthy A/B test. Of course this list is not exhaustive, but it is a good starting point.

Planning your experiment design

Before running an experiment, we must plan it carefully to reduce the chance of errors. In this case, errors can lead to misleading conclusions that harm our business or make us miss great opportunities.

1. Determine a good hypothesis

The first step in ensuring the reliability of your test is to clearly define your objective (what you want to improve on your product/service). A well-defined objective helps in the elaboration of a well-structured hypothesis.

The hypothesis can be established using an IF-THEN clause: IF some action is taken, THEN some consequence is expected.

In the button example given above:

Goal: Increase user engagement on the app.

Hypothesis: IF we change the button content displayed, THEN we can increase user access volume by 5%.

An A/B test aims to validate this chosen hypothesis by weighing two statistical hypotheses against each other:

Null hypothesis (H0): there is no significant difference between the treatment and control variants for the metric of interest.

Alternative hypothesis (H1): there is a significant difference between the treatment and control variants for the metric of interest.

In order to run a valuable A/B test, it’s important to limit yourself to one single change (variable) at a time; otherwise you end up with combined effects. Remember that we use experiments to attribute causality, so if you have more than one change going on, how can you know for sure what is actually driving the observed results? If you really need to test multiple variables simultaneously, there are design approaches for that, such as multivariate (factorial) tests.

2. Determine the group treatments

One group must represent the status quo with the as-is experience (no changes whatsoever), while the other represents the to-be experience, with the actual change that challenges the current version. We can even add more variations of the same feature if we want to (commonly referred to as an A/B/n test).

For our previous example, we could have:

Control: button with “>” character.

Treatment A: button with “go” word.

Treatment B: button with no character or word at all.
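
To make the split concrete, here is a minimal Python sketch of one common way to assign users to these groups: hashing the user ID together with an experiment name, so the same user always lands in the same group. The variant labels, the experiment name "button-copy" and the function assign_variant are hypothetical and only for illustration; real experimentation platforms may work differently.

```python
import hashlib
from collections import Counter

# Hypothetical labels for the three groups in the button example.
VARIANTS = ["control", "treatment_a", "treatment_b"]


def assign_variant(user_id: str, experiment: str, variants=VARIANTS) -> str:
    """Deterministically map a user to one of the variants.

    The experiment name acts as a salt, so the same user can land in
    different groups in different experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # Simulated user IDs; each group should receive roughly a third of them.
    counts = Counter(
        assign_variant(f"user-{i}", "button-copy") for i in range(30_000)
    )
    print(counts)
```

Because the assignment depends only on the user ID and the experiment name, a returning user keeps seeing the same button, which keeps the experience consistent throughout the test.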

3. Determine success metrics

Now that you know what you want to improve, you need to choose how you will measure the impact of the improvement. For this, you can use three categories of intelligible metrics:

  • Goal metrics: set of metrics that defines what is important for the organization as a whole. These are long term north stars for the company.
  • Target metrics (or driver metrics): these are the ones your experiment was designed to optimize. They help you decide whether your hypothesis is true or not.
  • Guardrail metrics: you can think of these as safeguards (what the organization is not willing to give up). They are helpful for protection and for the trade-off analysis on whether to launch the new feature or not.

For our previous example, we could have:

Goal metric: Time spent in the app (as this can be a proxy of user engagement) or the customer NPV (net present value).

Target metrics: Volume of clicks and number of users that click on this button (as this is a proxy of user interest in the new feature).

Guardrail metrics: Volume of crashes in the app and number of tickets opened in the customer service (as user satisfaction can be something very important to protect).

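A good tip is to choose relative metrics such as percentages or rates (for example, the share of users who click instead of the raw number of clicks), so that differences in group size don’t dilute or distort results.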
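
As a small illustration of why relative metrics help, the Python sketch below compares two groups of different sizes. All numbers are made up, and the GroupStats class is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    name: str
    users: int     # users exposed to this variant
    clickers: int  # users who clicked the button at least once

    @property
    def click_rate(self) -> float:
        """Share of exposed users who clicked (a relative metric)."""
        return self.clickers / self.users


# Made-up numbers, for illustration only.
control = GroupStats("control", users=10_000, clickers=1_000)
treatment = GroupStats("treatment", users=8_000, clickers=880)

for group in (control, treatment):
    print(f"{group.name}: {group.clickers} clickers, rate = {group.click_rate:.1%}")

# Relative lift of the treatment rate over the control rate.
relative_lift = treatment.click_rate / control.click_rate - 1
print(f"relative lift = {relative_lift:.1%}")
```

In this made-up case the treatment group has fewer clickers in absolute terms, simply because it has fewer users, yet its click rate is higher. Looking only at raw counts would point in the wrong direction.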

4. Determine the sample size

The next step is to estimate the sample size needed to detect the effect you specified in your initial hypothesis.

We want our results to be relevant and to be correctly identified, so we need three important inputs to calculate the sample size:

  • Statistical power: represents the probability of the test rejecting the null hypothesis when it should be rejected. All else being equal, the larger your sample size, the greater your statistical power. Common practice for A/B tests is using 80%.
  • Statistical significance level (aka alpha): the probability of rejecting the null hypothesis when it is actually true (a false positive). It works as the threshold for the p-value: when the p-value falls below alpha, we consider the observed change unlikely to be due to chance alone. Common practice for A/B tests is using 5%.
  • Effect size: the minimum effect that the test should be able to detect with the chosen power, if this effect actually exists (from our example hypothesis, a 5% increase in user access volume).

With that information in hand, you can use a free tool such as G*Power to calculate the necessary sample size for your experiment.
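
If you prefer code, the three inputs can be combined in a few lines. The Python sketch below uses the usual normal-approximation formula for comparing two proportions with a two-sided test. It assumes the metric is a click rate, a baseline rate of 10% (a made-up value) and that the 5% in our hypothesis is a relative increase; the function name sample_size_per_group is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate: float, relative_lift: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH group for a two-sided test
    comparing two proportions (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # rate we hope to see in treatment

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance quantile
    z_power = NormalDist().inv_cdf(power)          # power quantile

    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Assumed baseline click rate of 10% and a 5% relative increase.
    print(sample_size_per_group(baseline_rate=0.10, relative_lift=0.05))
```

With these assumed inputs the result is in the tens of thousands of users per group, which shows how quickly the required sample grows when the effect you want to detect is small. Dedicated tools may use slightly different formulas, so expect small differences between calculators.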

Be aware that randomization is key to making A/B testing trustworthy. A biased sample does not represent the population of interest, so the results of the experiment may be compromised and unrealistic. You may therefore need to run some sanity checks to ensure your sample is bias-free.
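
One sanity check that is often used for this purpose is a sample ratio mismatch (SRM) test: if you planned a 50/50 split, the observed group sizes should be close to 50/50. The Python sketch below runs a chi-square goodness-of-fit test on the group sizes. The function name srm_p_value and the example counts are made up, and the cut-off you choose for raising an alarm is your own decision.

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_control_share: float = 0.5) -> float:
    """p-value of a chi-square goodness-of-fit test (1 degree of freedom)
    comparing observed group sizes with the planned split."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control

    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)

    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    # Made-up counts: the first split looks healthy, the second is suspicious.
    print(srm_p_value(5_000, 5_050))
    print(srm_p_value(5_000, 5_400))
```

A very small p-value here suggests that the split itself is broken (for example, because of a bug in the assignment or in the logging), and in that case the experiment results should not be trusted until the cause is found.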

Analyzing your experiment results

Finally we get to run our A/B test and interpret the results to check if our initial assumption was correct. To do so, we use statistical tests to decide whether to reject the null hypothesis. This analysis includes the following steps:

  • Calculating the test statistic;
  • Calculating the corresponding p-value;
  • Comparing the observed p-value against the significance level (alpha): if the p-value < alpha, you can reject the null hypothesis (meaning the observed effect is statistically significant, which supports your assumption). Otherwise, you fail to reject the null hypothesis: either the effect does not exist or your experiment doesn’t have enough power to detect an effect so small;
  • Calculating the margin of error;
  • Calculating the confidence interval (be aware that non-overlapping intervals for the two groups imply statistical significance, but the reverse is not necessarily true).
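
The steps above can be sketched in Python for the button example. This is one possible approach, a two-sided z-test for the difference between two click rates, and it assumes the metric is the share of users who clicked. The function name two_proportion_z_test and the counts are made up for illustration; other metrics may call for other tests.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(clicks_a: int, users_a: int,
                          clicks_b: int, users_b: int,
                          alpha: float = 0.05) -> dict:
    """Compare the click rate of treatment (B) against control (A)."""
    p_a = clicks_a / users_a
    p_b = clicks_b / users_b
    diff = p_b - p_a

    # Test statistic: uses the pooled rate, since H0 says both rates are equal.
    p_pool = (clicks_a + clicks_b) / (users_a + users_b)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = diff / se_pooled

    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Margin of error and confidence interval for the difference (unpooled).
    se_diff = math.sqrt(p_a * (1 - p_a) / users_a + p_b * (1 - p_b) / users_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_diff

    return {
        "difference": diff,
        "z": z,
        "p_value": p_value,
        "margin_of_error": margin,
        "ci_low": diff - margin,
        "ci_high": diff + margin,
        "reject_null": p_value < alpha,
    }


if __name__ == "__main__":
    # Made-up counts: 10,000 users per group, 1,000 vs 1,100 clickers.
    for key, value in two_proportion_z_test(1_000, 10_000, 1_100, 10_000).items():
        print(f"{key}: {value}")
```

Here the confidence interval is computed for the difference between the groups, so the quick check is whether it excludes zero. The sketch does not handle edge cases such as groups with no clicks at all.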

The usual spotlight at this stage is on target metrics, since those are the ones we primarily want to optimize. But it is just as relevant to assess whether any guardrail metrics were negatively impacted and whether the impact stays within an acceptable limit. Not only can this block you from continuing the experiment and eventually rolling out the new experience, but it is also very important that you know about this impact and already have an action plan in hand (which could be, for example, interrupting the test).

Some questions you can ask yourself before making your decision on rolling out the new feature or not are:

  • Has the planned power been reached (minimum number of users as designed for the experiment)?
  • Are results statistically significant for your target metrics?
  • Have any guardrail metrics been negatively impacted?
  • Are there tangible losses (e.g. cost) and/or intangible losses (e.g. brand positioning) if this rollout does not occur?

Always remember that the experiment provides evidence for decision making, so the decision can only be as good as the data we are using.

“In God we trust. All others must bring HIGH QUALITY data.”

(adapted from W. Edwards Deming)

I hope this brief guide can help you through your experimentation journey. Happy A/B testing! 🙂

To learn more about A/B tests, check out the recording of Giovana’s and Isabela’s talk at the Building Nu Meetup:

A/B test Meetup’s video recording
