A/B Testing

  1. True or False
  2. Interpreting the Results
  3. Multiple Alternatives
  4. A/B Testing and Code Testing
  5. Let the Experiment Decide

A/B tests (or “split tests”) are experiments you can run to compare the performance of different alternatives. A classic example is using an A/B test to compare two versions of a landing page to find out which one leads to more registrations.

You can use A/B tests to gauge interest in a new feature, measure the response to a feature change, improve the site’s design and copy, and so forth. In spite of the name, you can use A/B tests to compare more than two alternatives.

“If you are not embarrassed by the first version of your product, you’ve launched too late” — Reid Hoffman, founder of LinkedIn

True or False

Let’s start with a simple experiment. We have this idea that a bigger sign-up link will increase the number of people who sign up for our service. Let’s see how well our hypothesis holds.

We already have a metric we’re monitoring, and our experiment will measure against it:

ab_test "Big signup link" do
  description "Testing to see if a bigger sign-up link increases number of signups."
  metrics :signup
end

Next, we’re going to show some of our visitors a bigger sign-up link:

<% bigger = "font:14pt;font-weight:bold" if ab_test(:big_signup_link) %>
<%= link_to "Sign up", signup_url, style: bigger %>

Approximately half the visitors to our site will see this link:

Sign up

The other half will see this one:

Sign up

Interpreting the Results

An A/B test has two parts. We just covered the part that decides which alternative to show. The second part measures the effectiveness of each alternative, which happens as a result of measuring the metric.

Remember that we’re measuring signups, so we already have this in the code:

class SignupController < ApplicationController
  def signup
    Vanity.track!(:signup) # record a conversion against the :signup metric
  end
end

We’re going to let the experiment run for a while and track the results using the dashboard, or by running the command vanity report.

Vanity splits the audience randomly — using cookies and other mechanisms — and records who got to see each alternative, and how many in each group converted (in our case, signed up). Dividing conversions by participants gives you the conversion rate.
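One common way to get a stable split is to hash the visitor's identity together with the experiment name, so the same visitor always lands in the same group. The sketch below illustrates that idea only; `pick_alternative` and the identity string are hypothetical names, and Vanity's own assignment logic may differ.

```ruby
require "digest"

# Hypothetical helper: map a visitor identity to one alternative, deterministically.
def pick_alternative(experiment, identity, alternatives)
  digest = Digest::MD5.hexdigest("#{experiment}/#{identity}")
  # Interpret the hex digest as a big integer and reduce it to an index.
  alternatives[digest.to_i(16) % alternatives.size]
end

choice = pick_alternative("big_signup_link", "visitor-123", [false, true])
puts choice.inspect
```

Because the choice depends only on the experiment name and the identity, repeat visits see the same alternative without storing the assignment anywhere.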

Vanity will show the conversion rate for each alternative, and how that conversion compares to the worst performing alternative. In the example above, option A has 80.6% conversion rate, 11% more than option B’s 72.6% conversion rate (72.6 * 111% ~ 80.6%).

(These large numbers are easily explained by the fact that this report was generated from made-up data.)
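The arithmetic behind those figures can be reproduced in a few lines. The participant and conversion counts below are invented to match the 80.6% and 72.6% rates quoted above, and the helper names are illustrative rather than part of Vanity.

```ruby
# Conversion rate as a percentage: conversions divided by participants.
def conversion_rate(conversions, participants)
  return 0.0 if participants.zero?
  100.0 * conversions / participants
end

# Relative difference of `rate` over `baseline`, as a percentage.
def improvement_over(rate, baseline)
  100.0 * (rate - baseline) / baseline
end

rate_a = conversion_rate(403, 500) # made-up counts
rate_b = conversion_rate(363, 500)
puts format("A: %.1f%%, B: %.1f%%", rate_a, rate_b)                        # A: 80.6%, B: 72.6%
puts format("A is %.0f%% better than B", improvement_over(rate_a, rate_b)) # A is 11% better than B
```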

It takes only a handful of visits before you’ll see one alternative clearly performing better than all others. That’s a sign that you should continue running the experiment: small sample sizes tend to give random results.

To get actionable results, you want a large enough sample. More specifically, you want to look at the probability. Vanity picks the top two alternatives and calculates a z-score to determine the probability that the best alternative performed better than the second best. It presents that probability, which should tell you when it is a good time to wrap up the experiment.
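To make the z-score idea concrete, here is a minimal sketch of a standard two-proportion z-test with an unpooled standard error, converted to a one-sided probability with the normal distribution. It is an assumption that this matches Vanity's calculation in spirit only; the library's exact formula and rounding may differ, and the counts are made up.

```ruby
# z-score for the difference between two conversion rates (unpooled standard error).
def z_score(conv_best, n_best, conv_second, n_second)
  p1 = conv_best.to_f / n_best
  p2 = conv_second.to_f / n_second
  std_err = Math.sqrt(p1 * (1 - p1) / n_best + p2 * (1 - p2) / n_second)
  return 0.0 if std_err.zero?
  (p1 - p2) / std_err
end

# Area under the standard normal curve to the left of z, as a percentage.
def probability(z)
  50.0 * (1 + Math.erf(z / Math.sqrt(2)))
end

z = z_score(130, 1000, 100, 1000) # 13% vs 10% conversion, 1000 participants each
puts format("z = %.2f, probability = %.1f%%", z, probability(z))
```

With identical rates the z-score is 0 and the probability is 50%, which is another way of saying the data cannot tell the two alternatives apart.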

This is the part that gets most people confused about A/B testing. Let’s say we ran an experiment with two alternatives and we notice that option B performs 50% better than option A (A * 150% = B). We calculate from the z-score a 90% probability.

“With 90% probability” does not mean 90% of the difference (50%); it does not mean B performs 45% better than A (90% * 50% = 45%). In fact, it doesn’t tell us how well B performs relative to A. Option B may perform exceptionally well during the experiment and not so well later on.

The only thing “with 90% probability” tells us is the probability that option B is somewhat better than option A. And that means 10% probability that the results we’re seeing are totally random and mean nothing in the long run. In other words: 9 out of 10 times, B is indeed better than A.

If you run the test longer to collect a larger sample size you’ll see the probability increase to 95%, then 99% and finally 99.9%. That’s big confidence in the outcome of the experiment, but it might take a long time to get there.

You might want to instead decide on some target probability (which could change from one experiment to another). For example, if you pick 95% as the target, you’re going to act on the wrong conclusion 1 out of 20 times, but you’re going to finish your experiments faster, which means you’ll get to iterate quickly and more often. Fast iterations are one way to improve the quality of your software.

You’ll want to read more about A/B testing and statistical significance.

Multiple Alternatives

Your A/B tests can have as many alternatives as you like, with two caveats. The first is that the more alternatives you have, the larger the sample size you need, and so the longer it will take to find out the outcome of your experiment. You want alternatives that are significantly different from each other: testing two pricing options at $5 and $25 is fast, while testing all the prices between $5 and $25 at $1 increments will take a long time to reach any conclusive result.

The second caveat is that right now Vanity only scores the two best-performing alternatives. This may be an issue in some experiments; it may also be fixed in a future release.

To define an experiment with multiple alternatives:

ab_test "Price options" do
  description "Mirror, mirror on the wall, who's the better price of them all?"
  alternatives 5, 15, 25
  metrics :signup
end

The ab_test method returns the value of one of the chosen alternatives, so in your views you can write:

<h2>Get started for only $<%= ab_test(:price_options) %> a month!</h2>

If you don’t give any values, Vanity will run your experiment with the values false and true. Here are other examples for rendering A/B tests with multiple values:

def index
  # alternatives are names of templates
  render template: Vanity.ab_test(:new_page)
end

<%= ab_test(:greeting) %> <%= current_user.name %>

<% ab_test(:features) do |count| %>
  <%= count %> features to choose from!
<% end %>

Weighted alternatives & Multi-arm bandits

By default, for n variations, Vanity will assign each of those with a probability of 1/n. If non-uniform weights for alternatives are desired, weights can be assigned to different alternatives. For example:

ab_test "Background color" do
  metrics :coolness
  alternatives "red" => 10, "blue" => 5, "orange" => 1
  default "red"
end

This would make “red” 10 times as likely to appear as “orange”. (Note that these are weights, not percentages, so the probability of assigning red above is about 62%, while blue is 31% and orange is 6%.) This is useful, for example, to assign a higher weight to a ‘control’ variation to ensure that most users continue having the default experience while assigning lower probabilities to test variations.
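The weight-to-probability conversion is just each weight divided by the sum of all weights (here 16). The sketch below shows that calculation along with one simple way a weighted pick could work; `weighted_pick` is a hypothetical helper for illustration, not Vanity's implementation.

```ruby
weights = { "red" => 10, "blue" => 5, "orange" => 1 }

# Each probability is the alternative's weight over the total weight.
total = weights.values.sum
probabilities = weights.transform_values { |w| w.to_f / total }
probabilities.each { |name, p| puts format("%-6s %.2f%%", name, p * 100) }

# Hypothetical weighted pick: walk the cumulative weights until the draw is covered.
def weighted_pick(weights, rng = Random.new)
  draw = rng.rand * weights.values.sum
  weights.each do |name, weight|
    return name if draw < weight
    draw -= weight
  end
  weights.keys.last # guard against floating-point edge cases
end

puts weighted_pick(weights)
```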

Multi-arm bandits

Another alternative to uniform splits of traffic is called “multi-armed bandits”, and the specific implementation included is Bayesian. In this mode, most traffic is sent to the currently best performing alternative (called exploitation). The minority of traffic is split between the available alternatives (called exploration). Since worse performing alternatives receive much less traffic, this leads to higher average conversion rates in bandit-driven experiments than traditional A/B split testing. The disadvantage is that in the worst case scenario, it can take a lot more traffic to declare a conclusion with statistical significance.

This can be enabled in the experiment definition, for example:

ab_test "noodle_test" do
  alternatives "spaghetti", "linguine"
  metrics :signup
  score_method :bayes_bandit_score
  rebalance_frequency 100
end

Note: Setting the score method to `bayes_bandit_score` won’t adjust alternative probabilities (i.e., you won’t get the benefit of maximizing conversions) unless you also set `rebalance_frequency`, which controls how many impressions must pass before rebalancing alternative probabilities.

Also note that the impression count is stored in memory and is therefore reset when your app restarts.
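The exploitation/exploration trade-off described in this section can be sketched with a generic Bayesian approach: model each alternative's conversion rate with a Beta posterior, estimate by sampling how often each alternative comes out on top, and use those frequencies as traffic weights. This is a probability-matching sketch under stated assumptions (uniform prior, made-up counts, hypothetical helper names), not a copy of Vanity's `bayes_bandit_score`.

```ruby
# Draw from Gamma(k, 1) for a positive integer k, as a sum of k exponential draws.
def sample_gamma(k, rng)
  (1..k).sum { -Math.log(1.0 - rng.rand) }
end

# Draw from Beta(a, b) for positive integers a and b, via two gamma draws.
def sample_beta(a, b, rng)
  x = sample_gamma(a, rng)
  y = sample_gamma(b, rng)
  x / (x + y)
end

# Estimate the probability that each alternative has the highest conversion rate.
# `stats` maps a name to { participants:, conversions: }.
def bandit_weights(stats, draws: 2000, rng: Random.new(42))
  wins = Hash.new(0)
  draws.times do
    best = stats.max_by { |_name, s|
      sample_beta(s[:conversions] + 1, s[:participants] - s[:conversions] + 1, rng)
    }
    wins[best.first] += 1
  end
  stats.keys.to_h { |name| [name, wins[name].to_f / draws] }
end

stats = {
  "spaghetti" => { participants: 200, conversions: 20 },
  "linguine"  => { participants: 200, conversions: 40 }
}
p bandit_weights(stats)
```

The better-converting alternative ends up with most of the weight, while the other keeps a small share as long as the data leave some chance that it is actually the best.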

A/B Testing and Code Testing

If you’re presenting more than one alternative to visitors of your site, you’ll want to test more than one alternative. Don’t let A/B testing become A/broken.

You can force a functional/integration test to choose a given alternative:

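def test_big_signup_link
  # Force the experiment to show the bigger link for this test
  Vanity.playground.experiment(:big_signup_link).chooses(true)
  get :index
  assert_select "a[href=/signup][style^=font:14pt]", "Sign up"
end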

Here’s another example using Webrat:

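def test_price_option
  [19, 25, 29].each do |price|
    # Force the experiment to this price before loading the page
    Vanity.playground.experiment(:price_options).chooses(price)
    visit root_path
    assert_contain "Get started for only $#{price} a month!"
  end
end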

You’ll also want to test each alternative visually, from your Web browser. For that you’ll have to install the Dashboard, which lets you pick which alternative is shown to you.

Once the experiment is over, simply remove its definition from the experiments directory and run the test suite again. You’ll see errors in all the places that touch the experiment (from failing to load it), pointing you to what parts of the code you need to remove/change.

Let the Experiment Decide

Sample size and probability help you interpret the results. You can also use them to configure an experiment to complete itself automatically.

This experiment will conclude once it has 1000 participants for each alternative, or a leading alternative with probability of 95% or higher:

ab_test "Self completed" do
  description "This experiment will self-complete."
  metrics :coolness
  complete_if do
    alternatives.all? { |alt| alt.participants >= 1000 } ||
      (score.choice && score.choice.probability >= 95)
  end
end

When it reaches its end, the experiment will stop recording conversions, choose one of its alternatives as the outcome, and switch every usage of ab_test to that alternative.

By default Vanity will choose the alternative with the highest conversion rate. This is most often, but not always, the best outcome. Imagine an experiment where option B results in fewer conversions, but higher-quality conversions, than option A. Perhaps you’re interested in option B conversions if you’re losing no more than 20% compared to option A. Here’s a way to write that outcome:

ab_test "Self completed" do
  description "This experiment will self-complete."
  metrics :coolness

  complete_if do
    score(95).choice # only return choice with probability >= 95
  end

  outcome_is do
    a, b = alternatives
    b.conversion_rate >= 0.8 * a.conversion_rate ? b : a
  end
end
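To see how the 80% rule plays out with numbers, the comparison can be tried outside an experiment. The `Alternative` struct and `pick_outcome` method below are hypothetical stand-ins that mimic only the `conversion_rate` attribute used in the rule, and the rates are made up.

```ruby
# Hypothetical stand-in for an experiment alternative; only the rate matters here.
Alternative = Struct.new(:name, :conversion_rate)

# Prefer b as long as it keeps at least 80% of a's conversion rate.
def pick_outcome(a, b)
  b.conversion_rate >= 0.8 * a.conversion_rate ? b : a
end

a = Alternative.new("A", 0.10)
puts pick_outcome(a, Alternative.new("B", 0.085)).name # B: keeps 85% of A's rate
puts pick_outcome(a, Alternative.new("B", 0.07)).name  # A: B keeps only 70% of A's rate
```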