An R package for A/B testing leveraging pre-period data

What does it do?

The abpackage R package implements PrePost, a Bayesian approach for estimating the treatment effect in A/B testing. When pre-period data are available, the method uses them to obtain a more accurate estimate of the treatment effect.

How does it work?

For each metric, the names “pre” and “post” indicate the periods before and after the start of the experiment, respectively. The names “control” and “treatment” indicate the two condition groups.

First, the method estimates the mean and variance of the metric in the pre-period. Second, it estimates the means and variances in the post-period conditionally on the estimate of the mean in the pre-period.

For each metric, PrePost returns an estimate of the percent change between the mean of the treatment and the mean of the control in the post-period. It also computes the difference between the two post-period means.
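In the notation of Section 5 below, the estimand is the percent change

\[ 100 \times \frac{\mu_2 - \mu_1}{\mu_1}, \]

where \(\mu_1\) and \(\mu_2\) are the post-period means of the control and treatment groups. In the single-metric example below, for instance, this is \(100 \times 0.8 / 200 = 0.4\%\).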

1. Example: single metric

Let’s generate and plot some synthetic data. In this case the true percent change is 0.4% (0.8 / 200 = 0.004):

library(abpackage)
library(ggplot2)

set.seed(1)
n <- 20
mu.pre <- 200   # pre-period mean, shared by both groups
mu.trmt <- 0.8  # additive post-period effect in the treatment group
mu.ctrl <- 0    # no post-period effect in the control group
trmt.pre.data <- rnorm(n, mu.pre)
ctrl.pre.data <- rnorm(n, mu.pre)
# Adding the pre-period values induces the pre-post correlation that
# PrePost exploits.
trmt.post.data <- rnorm(n, mu.trmt) + trmt.pre.data
ctrl.post.data <- rnorm(n, mu.ctrl) + ctrl.pre.data
data <- data.frame(pre = c(ctrl.pre.data, trmt.pre.data),
                   post = c(ctrl.post.data, trmt.post.data),
                   condition = factor(c(rep("control", n),
                                        rep("treatment", n))),
                   metric = rep("my metric", 2 * n))

ggplot(data, aes(pre, post, color = condition)) + geom_point()

Now we can estimate the percent change between treatment and control using the function PrePost. The resulting credible interval contains the true percent change.

PrePost(data)
## Significant tests (p.threshold = 0.05, p.method = none): 1 out of 1 (100.00%).
## 
## 95% credible intervals for (%) percent change between treatment and control:
##       metric  2.5%   50% 97.5% p.value significant
##  1 my metric 0.085 0.394 0.706   0.012          * 
##  Significant metrics are identified by *.

We can compare the result with the model where the pre-period is omitted. The true percent change is still contained in the credible interval, but the interval is substantially wider:

PrePost(dplyr::select(data, -pre))
## Significant tests (p.threshold = 0.05, p.method = none): 1 out of 1 (100.00%).
## 
## 95% credible intervals for (%) percent change between treatment and control:
##       metric  2.5%   50% 97.5% p.value significant
##  1 my metric 0.054 0.517 0.982   0.029          * 
##  Significant metrics are identified by *.
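To quantify "substantially wider", we can compare the interval widths using the quantiles printed above:

0.706 - 0.085  # width with the pre-period:    ~0.62
0.982 - 0.054  # width without the pre-period: ~0.93, roughly 50% wider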

2. Example: multiple metrics

Let’s generate some data from 10 hypothetical metrics using the SampleData function. We assume a 1% increase in the treatment group for the first 3 metrics, and a 1% decrease in the treatment group for the fourth metric. For the remaining 6 metrics we assume that there is no difference between the treatment and the control. We fix the pre-post correlation at 0.8, which is commonly observed in experiments on large-scale online services.

set.seed(1)
n.metrics <- 10
n.observations <- 20
mu.pre <- 100
sigma.pre <- 1
rho.ctrl <- 0.8
rho.trmt <- rho.ctrl
mu.ctrl <- mu.pre
trmt.effect.inc <- 1.01
trmt.effect.dec <- 0.99
no.trmt.effect <- 1.00
mu.trmt <- mu.pre * c(rep(trmt.effect.inc, 3), trmt.effect.dec, rep(no.trmt.effect, 6))
sigma.ctrl <- 1.8
sigma.trmt <- sigma.ctrl
data <- SampleData(n.observations = n.observations,
                   n.metrics = n.metrics,
                   mu.pre = mu.pre,
                   sigma.pre = sigma.pre,
                   rho.ctrl = rho.ctrl,
                   rho.trmt = rho.trmt,
                   mu.ctrl = mu.ctrl,
                   mu.trmt = mu.trmt,
                   sigma.ctrl = sigma.ctrl,
                   sigma.trmt = sigma.trmt)

Let’s look at the data.

head(data)
##      metric       pre      post condition
## 1 metric 01  99.37355 101.09040 treatment
## 2 metric 01 100.18364 102.10915 treatment
## 3 metric 01  99.16437  99.87722 treatment
## 4 metric 01 101.59528 101.14870 treatment
## 5 metric 01 100.32951 102.14390 treatment
## 6 metric 01  99.17953  99.75791 treatment

Now we estimate the treatment effect for each of the 10 metrics using the function PrePost. For each metric, PrePost computes the credible interval and determines whether the test is statistically significant after correcting for multiple testing. When testing several hypotheses, a stricter criterion than the classical "does the interval overlap zero?" is recommended to avoid too many false positives. The correction is based on the p.adjust function from the stats package in R: the method is passed via p.method (default p.method = "none", i.e., no correction) and the significance threshold via p.threshold (default p.threshold = 0.05).

The method correctly detects the ~1% increase for the first 3 metrics and the ~1% decrease for the fourth.

(ans <- PrePost(data, p.method = "BH"))
## Significant tests (p.threshold = 0.05, p.method = BH): 4 out of 10 (40.00%).
## 
## 95% credible intervals for (%) percent change between treatment and control:
##        metric   2.5%    50%  97.5% p.value significant
##  1  metric 01  0.178  0.878  1.585   0.013          * 
##  2  metric 02  0.683  1.445  2.202   0.000          * 
##  3  metric 03  0.472  1.187  1.910   0.001          * 
##  4  metric 04 -1.854 -1.114 -0.371   0.003          * 
##  5  metric 05 -0.854 -0.055  0.752   0.892            
##  6  metric 06 -1.406 -0.645  0.119   0.098            
##  7  metric 07 -0.774 -0.052  0.676   0.887            
##  8  metric 08 -0.619  0.124  0.872   0.741            
##  9  metric 09 -1.251 -0.399  0.465   0.360            
##  10 metric 10 -0.900 -0.031  0.843   0.944            
##  Significant metrics are identified by *.

In the plot below, each bar shows the 95% credible interval for the percent change between treatment and control for one of the 10 metrics. Significant metrics are plotted in green (positive) or red (negative), while non-significant metrics are plotted in grey.

plot(ans)

If we only want to plot the statistically significant metrics, we can use the argument only.sig = TRUE. This is particularly useful when testing a large number of hypotheses.

plot(ans, only.sig = TRUE)

Let’s repeat the analysis without using the pre-period. In this case only 1 of the 4 impacted metrics is identified.

data.no.pre.period <- dplyr::select(data, -pre)

(ans.no.pre.period <- PrePost(data.no.pre.period,
                              p.method = "BH"))
## Significant tests (p.threshold = 0.05, p.method = BH): 1 out of 10 (10.00%).
## 
## 95% credible intervals for (%) percent change between treatment and control:
##        metric   2.5%    50%  97.5% p.value significant
##  1  metric 01 -0.092  0.955  2.017   0.076            
##  2  metric 02  0.924  2.060  3.212   0.000          * 
##  3  metric 03  0.251  1.522  2.808   0.018            
##  4  metric 04 -2.515 -1.308 -0.093   0.034            
##  5  metric 05 -0.838  0.292  1.437   0.610            
##  6  metric 06 -0.926  0.361  1.670   0.579            
##  7  metric 07 -1.445 -0.038  1.372   0.956            
##  8  metric 08 -0.948  0.183  1.322   0.750            
##  9  metric 09 -2.116 -0.811  0.522   0.230            
##  10 metric 10 -1.543 -0.246  1.056   0.710            
##  Significant metrics are identified by *.
plot(ans.no.pre.period)

Looking at the plot, one might wonder why 3 credible intervals do not overlap zero, yet only 1 metric is identified as statistically significant. This is due to the multiple testing correction; in this example the Benjamini-Hochberg correction is used.
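We can verify this with the p.adjust function from the stats package, applied to the (rounded) p-values printed above; only metric 02 stays below the 0.05 threshold after the correction:

p.values <- c(0.076, 0.000, 0.018, 0.034, 0.610,
              0.579, 0.956, 0.750, 0.230, 0.710)
round(p.adjust(p.values, method = "BH"), 3)
##  [1] 0.190 0.000 0.090 0.113 0.833 0.833 0.956 0.833 0.460 0.833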

3. Reshape data

Data pulled with SQL often have one column per metric.

head(data)
##   cond obs period  metric 1 metric 2  metric 3  metric 4
## 1 ctrl   1  after 102.08223 100.5648 100.71208 100.76281
## 2 ctrl   1 before 102.00572 100.7372 101.13497 100.77209
## 3 ctrl   2  after  98.93540 102.2627 101.35395  99.20767
## 4 ctrl   2 before  97.92943 102.3213 101.11193  99.85916
## 5 ctrl   3  after 101.96787 100.7990  99.77438 101.20842
## 6 ctrl   3 before 103.05574 100.3489  99.12922 100.39309
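To experiment with this section without a database, here is a minimal sketch that builds a data frame with the same layout (the values are arbitrary and wide.data is only a stand-in for the table above):

set.seed(1)
n.obs <- 20
wide.data <- expand.grid(obs = seq_len(n.obs),
                         cond = c("ctrl", "trmt"),
                         period = c("before", "after"))
for (m in paste("metric", 1:4)) {
  wide.data[[m]] <- rnorm(nrow(wide.data), mean = 100)
}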

The data can be reshaped using the function ReshapeData. Column names and factor levels can also be passed to the function in case the input data do not use the canonical ones.

reshaped.data <- ReshapeData(data,
                             observation.col = "obs",
                             condition.col = "cond",
                             condition.levels = c("ctrl", "trmt"),
                             pre.post.col = "period",
                             pre.post.levels = c("before", "after"))

head(reshaped.data)
##     metric condition      post       pre
## 1 metric 1   control 102.08223 102.00572
## 2 metric 1   control  98.93540  97.92943
## 3 metric 1   control 101.96787 103.05574
## 4 metric 1   control  99.60563  99.73865
## 5 metric 1   control  99.85335  99.54561
## 6 metric 1   control 100.96532 100.15756

4. Check pre-period balance

PrePost assumes that the distributions of the control group and the treatment group are identical in the pre-period. The function PreCheck can be used to make sure that there is no systematic bias between the two groups in the pre-period.

Let’s generate data from 100 hypothetical metrics using the default values of the SampleData function.

set.seed(1)
n.metrics <- 100
data <- SampleData(n.metrics = n.metrics)
pre.period.check <- PreCheck(data)
head(pre.period.check)
##       metric p.value misalignment
## 1 metric 001 0.84866             
## 2 metric 002 0.18896             
## 3 metric 003 0.52048             
## 4 metric 004 0.68500             
## 5 metric 005 0.40038             
## 6 metric 006 0.07114            *

If pre-period observations were generated independently across metrics, and identically across conditions within each metric, then we would expect 5% of metrics to be classified as "*" (light misalignment, 0.05 < p-value < 0.10), 4% as "**" (medium misalignment, 0.01 < p-value < 0.05), and 1% as "***" (heavy misalignment, p-value < 0.01).
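These expected percentages follow from p-values being Uniform(0, 1) under the null hypothesis; a quick check:

c(light  = punif(0.10) - punif(0.05),  # 0.05 < p < 0.10 -> 0.05
  medium = punif(0.05) - punif(0.01),  # 0.01 < p < 0.05 -> 0.04
  heavy  = punif(0.01))                # p < 0.01        -> 0.01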

Let’s see what these percentages look like for our dataset.

table(pre.period.check$misalignment) / n.metrics
## 
##         *   **  *** 
## 0.94 0.04 0.01 0.01

The proportion of misaligned metrics is consistent with what we would expect in a balanced pre-period.

Now that we have verified that the pre-period is balanced, we can move on and analyze the metrics with PrePost.

ans <- PrePost(data)

5. Model assumptions

PrePost assumes that in the pre-period the observations in the control group and the treatment group are identically distributed; specifically, they are Normally distributed \[ X_{i,j} \sim Normal(\mu_0, \sigma_0^2), \] where the index \(i\) represents the observation and \(j\) represents the condition group (\(j=1\) for control, \(j=2\) for treatment).

In the post-period, observations within the control group and within the treatment group are independent, but the two groups are not identically distributed: \[ Y_{i,j} \sim Normal(\mu_j, \sigma_j^2). \]

PrePost leverages the correlation between the pre-period and the post-period \[ cor(X_{i,j}, Y_{i,j}) = \rho_j \] to get tighter credible intervals and more accurate point estimates than classic post-period based approaches.
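To see where the tighter intervals come from, note that if \((X_{i,j}, Y_{i,j})\) is bivariate Normal with the moments above (an assumption consistent with the formulas in this section), then \[ Y_{i,j} \mid X_{i,j} \sim Normal\left(\mu_j + \rho_j \frac{\sigma_j}{\sigma_0} (X_{i,j} - \mu_0), \; \sigma_j^2 (1 - \rho_j^2)\right), \] so conditioning on the pre-period shrinks the residual variance by the factor \(1 - \rho_j^2\). With \(\rho_j = 0.8\), as in the examples above, the conditional variance is only 36% of the marginal variance.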

6. Manuscript

Soriano, J. (2017). Percent Change Estimation in Large Scale Online Experiments. arXiv:1711.00562. https://arxiv.org/abs/1711.00562