Understanding A/B test results
Master the analysis of the outcomes of A/B tests conducted through the A/B Testing Hub. Learn what each number means and how you can use it to better understand and interpret test outcomes.
This article will help you
- Know what each number on the test outcome report means
- Understand the test progress graph of sequential tests
- Learn how to use the analysis to make good post-test inferences
Basic test information & actions
The top of the page features the test's name as it was entered, as well as its start date and end date. If the test is yet to finish, its planned end date is shown instead.
header of the test analysis page
The quick navigation below that is used to easily jump to parts of the test analysis. The sharing options on the right include the ability to create a shareable URL that can then be used to provide anyone with access to the test analysis page that you currently see. Alternatively, export the page as a PDF by utilizing the "Print as PDF" feature available in most modern browsers.
Test actions & navigation
At the far right is the three-dot drop-down menu, which includes key actions specific to the test at hand. You can:
- click "Test data" to view and edit the raw test data used to produce the test analysis. The option to reset the test is also there.
- go to "Test design" to view the statistical design parameters of the test (confidence threshold, power, etc.) and its expected duration under various conditions (sequential tests only). You can also duplicate the test by using the "Create a similar test" button on that screen.
- use the "Add test data" / "Manage data link" links to manually add data or view and adjust the data link parameters, if necessary.
- edit the basic test information such as the test's name, the project it belongs to, and its start date (if the test has no data collected or has just been reset).
- delete the test entirely
The test overview section contains the key information about an A/B or A/B/N test, starting with an indication of its current status. Hovering on it will reveal more information. For sequential tests the status can be one of:
- Pending - the test's start date is in the future, or its data source has yet to be configured.
- Ongoing (gathering data) - the test expects data to be entered / pushed using a data link
- Completed - the test has ended with an outcome in either direction
- Stopped - the test was manually stopped by the user for whatever reason.
For fixed sample tests there is either a prompt to enter data, or the test outcome statistics are presented. If a sequential test is still ongoing, raw data and preliminary estimates are shown, alongside an indication that the statistics suggest continuing the test until a boundary is crossed. If a sequential test has completed, final statistics are shown.
Key outcome statistics
main test outcome statistics
The estimated lift and the estimated conversion rate or mean are key numbers to report. The estimated lift is the maximum likelihood estimate, which in the case of a fixed sample test has the same value as the observed lift. Importantly, for sequential tests it includes adjustments to account for the different statistical model used.
These adjustments partially compensate for the bias in the raw data which is introduced by using sequential testing. The estimated lift, for example, may be lower or higher than the observed one, depending on when a test has stopped. The confidence, which is just 1 minus the p-value, is likewise adjusted, and so are the confidence intervals.
The confidence intervals are useful in communicating the uncertainty surrounding the estimated effect size. Wider intervals mean a wider margin of error. While you can think of the estimated lift as a best guess as to what the true lift is, the lower bound of an interval can be viewed as a lower bound on the effect size that can be inferred with the desired level of confidence. Another way to think of it is that it shows you what the upper range of a null hypothesis could be while still being rejected with the observed data.
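To make the relationship between lift, p-value, and confidence concrete, here is a minimal Python sketch of the unadjusted, fixed-sample calculation for a conversion rate metric. It assumes a one-sided z-test with a pooled standard error; the function name and the input numbers are illustrative only, and it does not reproduce the adjustments applied to sequential tests.

```python
from math import sqrt
from statistics import NormalDist


def fixed_sample_summary(conv_a, n_a, conv_b, n_b):
    """Unadjusted one-sided z-test for a difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Relative lift of the variant (B) over the control (A).
    lift = (rate_b - rate_a) / rate_a
    # Pooled standard error under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # One-sided p-value: probability of a z-score this high if there is no lift.
    p_value = 1 - NormalDist().cdf(z)
    return {"lift": lift, "z": z, "p_value": p_value, "confidence": 1 - p_value}


# Illustrative numbers, not taken from any real test.
summary = fixed_sample_summary(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"observed lift: {summary['lift']:.1%}")
print(f"confidence:    {summary['confidence']:.1%}")
```

For a sequential test, the numbers on the report will generally differ from what this sketch returns, because the report's estimates account for the interim analyses.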
If the test has more than one variant against a control (A/B/N) then all outcome statistics shown are for the best performing variant.
Please note that calculating non-adjusted p-values and/or confidence intervals on data from a sequentially monitored test is a serious violation of the assumptions behind the statistical model of simpler tests, where a fixed sample is assumed. This is often referred to as peeking, and when sequential monitoring is unaccounted for in the computed statistic, it completely destroys the statistic's credibility. Comparing values from any of our fixed sample calculators with what you see in a sequential analysis is meaningless.
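The effect of peeking can be illustrated with a small simulation. The sketch below is an assumption-laden toy model, not the platform's engine: it simulates A/A tests (no true effect) with unit-variance per-user differences, checks an unadjusted one-sided z-score at each look, and counts how often a "winner" is declared at any look. All names and parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(looks, alpha=0.05, users_per_look=200, runs=2000, seed=1):
    """Share of simulated A/A tests declared winners at any of the looks."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha)  # unadjusted one-sided threshold
    false_positives = 0
    for _ in range(runs):
        total, n = 0.0, 0
        for _ in range(looks):
            # Sum of per-user differences gathered since the previous look.
            # Under the null it is normal with mean 0 and variance users_per_look.
            total += rng.gauss(0.0, sqrt(users_per_look))
            n += users_per_look
            if total / sqrt(n) > critical:  # naive z-score check at this look
                false_positives += 1
                break
    return false_positives / runs


print("single analysis:", false_positive_rate(looks=1))
print("ten unadjusted looks:", false_positive_rate(looks=10))
```

With a single analysis the rate stays close to the nominal 5%, while ten unadjusted looks push it well above that, which is why statistics from a sequential test need to be adjusted.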
Below the key outcomes are other important numbers such as the test's duration and its total sample size - the number of users or sessions in all variants and the control.
If it is a sequential test, the efficiency gained compared to a fixed-sample test with the same design parameters is shown as well. For example, a 30% efficiency gain means that it took 30% fewer users (and likely 30% less time) to run the test than it would have otherwise. This translates directly to lower realized losses or higher realized gains, depending on the outcome of the test: losses if the variant is under-performing the control and gains otherwise.
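As a quick illustration of the arithmetic, the following Python sketch computes an efficiency gain as the share of users saved relative to a fixed-sample test. The formula is an assumption based on the "30% fewer users" description above, and the function name and sample sizes are hypothetical.

```python
def efficiency_gain(actual_sample_size, fixed_sample_size):
    """Fraction of users saved relative to the equivalent fixed-sample test."""
    if fixed_sample_size <= 0:
        raise ValueError("fixed_sample_size must be positive")
    return 1 - actual_sample_size / fixed_sample_size


# Hypothetical test: stopped at 70,000 users instead of a fixed 100,000.
gain = efficiency_gain(actual_sample_size=70_000, fixed_sample_size=100_000)
print(f"efficiency gain: {gain:.0%}")
```

A negative value would indicate that the sequential test needed more users than its fixed-sample counterpart.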
Test progress and stopping boundaries
For sequential tests the graph to the right of the test overview shows the test decision boundaries. A well-designed test should have an efficacy boundary and a futility boundary similar to the graph below.
sequential testing boundaries
The test's z-score statistic is plotted as it moves between them. It shows how many standard deviations away the observed data is from the data expected under the null hypothesis. If it is sufficiently far away, it may cross the efficacy boundary, at which point the test is automatically stopped and a "winner" is declared. Note that for tests with multiple variants only the performance of the best performing variant at each stage is shown.
Should the statistic cross the futility boundary, the test would be automatically stopped with a failure to reject the null hypothesis if a binding futility bound was chosen. If the bound is non-binding, a prompt appears asking whether you want to end the test or gather more data. It is generally suggested to end the test, unless there is compelling evidence external to the test data suggesting that the boundary cross be ignored.
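The decision logic described above can be sketched in a few lines of Python. This is only an illustration: the boundary values below are made up rather than derived from a real test design, the function name and return labels are hypothetical, and treating a statistic exactly on a boundary as a crossing is an assumption.

```python
def check_boundaries(z_score, efficacy, futility, futility_binding=True):
    """Classify one interim analysis against its pair of stopping boundaries."""
    if z_score >= efficacy:
        return "stop_efficacy"   # winner declared
    if z_score <= futility:
        if futility_binding:
            return "stop_futility"   # failure to reject the null hypothesis
        return "prompt_user"         # end the test or gather more data
    return "continue"


# Made-up (efficacy, futility) boundaries for a four-analysis test.
boundaries = [(3.2, -1.0), (2.6, 0.0), (2.2, 0.8), (1.9, 1.9)]
z_scores = [0.9, 1.4, 2.3]  # hypothetical z-scores at the first three analyses
for stage, (z, (efficacy, futility)) in enumerate(zip(z_scores, boundaries), start=1):
    print(stage, check_boundaries(z, efficacy, futility))
```

In this made-up example the statistic stays between the boundaries for two analyses and crosses the efficacy boundary at the third.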
Mismatch between plan and reality
In case of significant mismatch between the A/B test's schedule and the actual number and timing of analyses, you may see an alert or warning containing information about analysis plan adjustments made by the engine. These typically include adding more analyses to compensate for a lower than expected rate of information accrual per unit time. For example, three analyses may be added to a test planned for eight analyses in order to reach the target number of users if by analysis eight there were just 60% of the expected users by that point.
Note that while such adjustments do not affect the validity of an A/B test, it is best for the expected rate at which users accrue to the test to be estimated as close to reality as possible.
In the opposite case, where users are entering the test much more quickly than expected, no adjustments would be made, but the test may end up taking much less time than expected, raising concerns about external validity.
For all tests a graph plotting three confidence intervals is shown, with a vertical line at the point estimate (estimated lift %).
effect size estimate and confidence intervals
The uppermost interval is a familiar two-sided interval constructed at a confidence level of 1 − 2α, where α is the chosen significance threshold. For example, if a test was planned with a confidence level of 95% (α = 5%), the two-sided interval will be a 90% confidence one (100% − 2 · 5% = 90%). Below it, two one-sided intervals are plotted with a confidence level corresponding to the confidence threshold for the test.
Since all tests on the platform are one-sided tests, the corresponding intervals are also one-sided. One-sided intervals are also less prone to misinterpretation. However, since most users are familiar with two-sided intervals, a two-sided interval is also present, but its level is adjusted to match the hypothesis test and avoid confusion.
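The correspondence between the two interval types can be checked numerically. The Python sketch below is a simplified illustration, assuming an unadjusted normal-approximation interval for the absolute difference between two conversion rates (the report itself plots lift % and applies its own adjustments). Function and key names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def difference_intervals(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Unadjusted normal-approximation intervals for rate_b - rate_a."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a
    # Unpooled standard error of the difference in proportions.
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    # One critical value serves both interval types: a two-sided interval at
    # level 1 - 2*alpha leaves alpha in each tail, just like a one-sided
    # interval at level 1 - alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return {
        "two_sided": (diff - z_crit * se, diff + z_crit * se),
        "one_sided_lower": (diff - z_crit * se, float("inf")),
        "one_sided_upper": (float("-inf"), diff + z_crit * se),
    }


# Illustrative numbers, not taken from any real test.
intervals = difference_intervals(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.4f}, {high:.4f}]")
```

The lower bound of the 90% two-sided interval coincides with the bound of the 95% one-sided lower interval, which is what matching the interval level to the one-sided hypothesis test means in practice.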
Table view of outcome statistics
The information shown in the Test overview screen is also shown in a slightly expanded form as a table below it. It is most useful when conducting an A/B/N test, since it lets you view the performance of each variant, whereas the numbers covered so far only show outcome statistics for the best performing variant.
Trend graphs are plotted for all sequential tests. These can be used to spot unusual sharp turns in the statistics that might hint at a technical issue or a concern about generalizability.
cumulative observed data for each variant and the control
A sample ratio mismatch test is performed automatically and its output displayed in a small dedicated section. If the SRM p-value is below 0.001 the result is displayed higher up and is highlighted in red, signalling the need for attention.
A detected sample ratio mismatch should be treated very seriously because, if real, it would make all of the other statistics untrustworthy due to likely violations of the statistical model. Pinpointing the reason for an SRM can be quite challenging due to the many possible sources, but it is a must, especially if one sees low SRM p-values with any regularity.
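For readers who want to see what an SRM check involves, here is a minimal Python sketch using a chi-square goodness-of-fit test. It is an assumption that this matches the kind of test run by the platform, and the sketch covers only the two-arm case (one degree of freedom); an A/B/N test would need a chi-square distribution with more degrees of freedom. The function name and user counts are hypothetical.

```python
from math import erfc, sqrt


def srm_p_value(users_control, users_variant, expected_ratio=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm traffic allocation."""
    total = users_control + users_variant
    expected_control = total * expected_ratio
    expected_variant = total * (1 - expected_ratio)
    chi2 = ((users_control - expected_control) ** 2 / expected_control
            + (users_variant - expected_variant) ** 2 / expected_variant)
    # With one degree of freedom the chi-square tail probability reduces to
    # erfc(sqrt(chi2 / 2)), so no external statistics library is needed.
    return erfc(sqrt(chi2 / 2))


# Hypothetical user counts for a test planned with a 50/50 split.
for control, variant in [(10_000, 10_100), (10_000, 10_500)]:
    p = srm_p_value(control, variant)
    flag = "possible SRM" if p < 0.001 else "no SRM detected"
    print(control, variant, f"p = {p:.5f}", flag)
```

The 0.001 cutoff mirrors the threshold mentioned above; a difference of 100 users out of roughly 20,000 is unremarkable, whereas a difference of 500 falls below it.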
The executive summary is a text version of all of the A/B test's design parameters and outcome statistics. It is written in a way that makes it easy to understand how the test was planned, what its goal was, and what the outcome says in relation to that goal. You may want to consult it whenever you are unsure about the interpretation of a number in the test analysis.
The technical notes
Those with a deeper interest in statistics may want to consult the technical notes accompanying each test result. They are customized depending on the test specifics and include information about the statistical models, tests, and applicable adjustments made in computing everything shown in the test results page.
- Always use the Estimated lift and Estimated conversion rate or mean instead of the raw outcomes
- Make use of confidence intervals to communicate uncertainty to clients or stakeholders
- Consult the Executive summary and the Technical notes for useful information
Last updated: Apr 06, 2023