Creating a new A/B test - a complete guide
Learn how to design a statistically rigorous A/B test that optimizes business return regardless of its outcome. The article covers creating a new test using the A/B testing hub and choosing the key test parameters.
This article will help you
- come up with a good name for your A/B test
- assign A/B tests to projects
- specify the type of test hypothesis
- extract historical data needed for sample size estimation
- intelligently choose a test's duration and confidence level
Creating a new A/B test in the A/B testing hub is the recommended approach for planning and analyzing online experiments with Analytics Toolkit. It allows you to employ our state-of-the-art sequential testing engine and to design tests for optimal ROI. Even if you prefer to stick to fixed-sample tests and want to choose all statistical parameters yourself, it is best to create the test in this manner so it becomes part of your collection of tests for easy reference and meta-analyses.
General test information
To begin, log in and navigate to the Dashboard or the A/B testing hub, then click on the "Create new A/B test" button. You will proceed to the General Test Information form as shown below:
basic test info screenshot
Basic meta information
Make sure the test name is descriptive and unique enough that you can easily find the test in the future. Including a test ID is good practice if you already have a sound naming convention. Including the test date can sometimes be helpful, but since the interface allows sorting and filtering by test start and end date, it should not be needed. An example of a good name is "FEUX-224 PDP CTA placement": "FE" is short for "Front End", "UX" means the test is part of a user-experience change, and 224 is a unique identifier for the test. The rest describes the page type the test runs on and the main change in the tested variant (PDP for "Product Details Page", CTA for "Call to Action").
Since most of the time the test would already be named in the A/B testing delivery platform, it is best to use the same name in Analytics Toolkit for easier cross-referencing.
You can also specify a project to which to assign the test from the list of your projects. Projects are useful for organizing groups of A/B tests so that you can have an overview of them with a single click, or to perform meta-analyses on them with ease. A project can be a client (if you are a CRO agency), or a part of the website, or a type or category of tests ("Front-end", "Back-end", "Shopping Cart", "Search algorithm", etc.).
A project can also be created on the spot by selecting "Add a new project" from the project drop-down.
Framing the test duration
The difference between the hard deadline and the start date is used to limit the possible test duration in the final step of the test creation flow. For example, a hard deadline 28 days after the start date means the engine will not present or explore any test duration greater than 28 days.
The starting date is the expected date on which the test will be launched. If data is to be extracted automatically using one of our data connections then this needs to be set precisely or updated once known for sure.
In some cases you would want to create a test in Analytics Toolkit for a split test already in progress. That is not an issue, as the starting date can be in the past. It is also helpful if you want to re-analyze one of your already completed tests using our statistical engine.
The hard deadline is the date by which the test must have ended so that a decision to implement a variant or stick with the control can be made. This date is very important, as it is used to limit the search for an optimal test duration in Advanced mode and to show only relevant possible durations in Basic mode.
Not specifying this date may lead to a worse experience at the last stages of the test creation wizard. For example, ROI might benefit from a test which is a couple of weeks longer, but our optimizer engine will only explore a possible test duration which does not go beyond the specified hard deadline. This might result in missed opportunities. In case there is no set hard deadline, you can set it generously into the future.
Basic statistical parameters
The basic statistical parameters determine the characteristics of the statistical hypothesis test and any adjustments that will be applied to estimates such as the lift estimate, p-values, and confidence intervals. These parameters serve as the basis for computing more advanced parameters, such as the test's statistical power, its optimal significance threshold, optimal duration, and others, in the last step of the A/B test creation wizard.
Test hypothesis type
The test hypothesis type specifies the 'win condition' of a test. By choosing simple superiority, a demonstrated positive effect of any size is sufficient to prefer a variant over the control. This is the textbook use of significance tests, where the goal is to demonstrate that the variant outperforms the control. It is suitable whenever there are:
- no ongoing costs associated with adopting a winning variant
- no large upfront costs for implementing a variant, on top of conducting the test itself
specifying the null and alternative hypothesis
A non-inferiority hypothesis should be used when a variant could be worse than the control by up to a specified margin and still be preferred. Such scenarios typically entail savings accrued by implementing a variant which are not captured by the main test metric. For example, one might save on server infrastructure costs, support costs, shipping costs, logistics costs, etc., none of which is typically captured in a test's primary metric. See the glossary entry on non-inferiority tests for more.
To specify the non-inferiority margin, use the slider to select a value, or type it directly into the input field. The visual representation below will update on the fly to give you a better idea of the choice you are committing to. For example, setting the margin to 5% means that you would be happy to implement a variant even if its true performance is up to 5% worse than the control (e.g. -4.99%).
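The effect of the margin on the test can be illustrated in code. Below is a minimal sketch of a one-sided z-test for two proportions under a relative non-inferiority margin; the function name, the normal approximation, and the example numbers are illustrative assumptions, not the exact method used by our statistical engine:

```python
from statistics import NormalDist
import math

def noninferiority_z_test(x_c, n_c, x_v, n_v, margin_rel):
    """One-sided z-test of H0: p_v <= p_c * (1 - margin_rel)
    against H1: p_v > p_c * (1 - margin_rel). Normal approximation."""
    p_c, p_v = x_c / n_c, x_v / n_v
    shifted_null = p_c * (1 - margin_rel)  # null hypothesis shifted by the margin
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = (p_v - shifted_null) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# hypothetical data: the variant converts slightly worse than the control,
# but well within a 5% relative margin
z, p_value = noninferiority_z_test(x_c=1000, n_c=20000,
                                   x_v=985, n_v=20000, margin_rel=0.05)
```

Note that a positive z with a p-value above the chosen significance threshold means non-inferiority is not yet demonstrated at that sample size, even though the observed difference is within the margin.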
Number of test variants (A/B/N)
specifying the number of treatment groups
This is a straightforward control for selecting the number of so-called 'treatment groups' that will be tested against the control. For the most common A/B test, with one control group and one group with the changes implemented, simply leave the default selection of "1" (variant). For an A/B/N test, select the corresponding number of test groups.
It is important to note that testing more groups against the control either increases the required test duration or increases the minimum detectable effect (a.k.a. minimum reliably detectable effect), assuming the confidence threshold and all other parameters stay the same. Therefore, do not add extra variants lightly, especially if the differences they introduce are trivial: doing so may increase the chance of a false negative with no benefit to offset that risk. In such cases, consider running a follow-up test instead.
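The cost of extra variants can be illustrated with a standard fixed-sample approximation. The sketch below uses a Bonferroni correction of the significance threshold as a stand-in for whatever multiplicity adjustment is actually applied; the formula and all numbers are illustrative assumptions:

```python
from statistics import NormalDist
import math

def n_per_group(p_base, mde_rel, alpha=0.05, power=0.80, variants=1):
    """Approximate per-group sample size for a one-sided two-proportion
    z-test, with a Bonferroni adjustment for multiple variants."""
    alpha_adj = alpha / variants                 # Bonferroni correction
    p_var = p_base * (1 + mde_rel)               # rate under the MDE
    z_a = NormalDist().inv_cdf(1 - alpha_adj)
    z_b = NormalDist().inv_cdf(power)
    var_sum = p_base * (1 - p_base) + p_var * (1 - p_var)
    return math.ceil((z_a + z_b) ** 2 * var_sum / (p_var - p_base) ** 2)

# more variants -> larger required sample size per group,
# with all other parameters held equal
for m in (1, 2, 3):
    print(m, n_per_group(p_base=0.04, mde_rel=0.10, variants=m))
```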
Note that the number of variants cannot be changed once a test has started. Dropping variants is not an option, nor is adding more variants.
Baseline characteristics of the primary metric
The primary metric (primary KPI) is what is used to determine the outcome of the A/B test: implement a variant or stick with the status quo. It can be a straightforward ratio such as a conversion rate or a composite metric incorporating the output of things like LTV models, etc.
An often used continuous metric is average revenue per user (ARPU) whereas all types of conversion rates are binomial metrics. Note that if users convert more than once the resulting metric is necessarily continuous (e.g. average transactions per user).
Ideally, the specified baseline should be the expected average value of the metric for the duration of the test. That is, if you have predictive models, it is best to use their output here. If not, a simple extrapolation from the average over the past several weeks or several months should suffice in most cases where extreme shocks or seasonal swings are not expected during the A/B test duration.
If the metric is a rate, then you only need to enter two parameters:
specifying a binomial baseline
The standard deviation (σ) is analytically computed based on the conversion rate.
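For reference, the computation is simply σ = √(p·(1−p)), where p is the baseline conversion rate. A quick sketch (the 4% rate is an arbitrary example):

```python
import math

def binomial_sd(conversion_rate: float) -> float:
    """Standard deviation of a binomial (0/1 per user) metric."""
    p = conversion_rate
    return math.sqrt(p * (1 - p))

print(binomial_sd(0.04))  # ~0.196 for a 4% baseline conversion rate
```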
If the metric is a different type of average, then the standard deviation must be computed separately. This requires that individual data points are stored for each session or user and can be made available.
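Given the raw per-user values, the two summary statistics can be computed with any statistics package. A minimal sketch using Python's standard library (the revenue values are made up):

```python
import statistics

# hypothetical revenue per user, one value per user (zero for non-buyers)
revenue_per_user = [0.0, 0.0, 49.9, 0.0, 120.0, 0.0, 0.0, 35.5, 0.0, 60.0]

average = statistics.mean(revenue_per_user)
sd = statistics.stdev(revenue_per_user)  # sample standard deviation (n - 1)

print(f"average: {average:.2f}, standard deviation: {sd:.2f}")
```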
There are several ways to input the average and standard deviation. The most straightforward one is to simply enter them directly.
Directly entering the summary statistics is useful if a different software has already computed them from the raw data. In such a case you do not need Analytics Toolkit to do these computations for you.
entering the average and standard deviation directly
A file upload is a versatile option which can be used with any platform that can output raw data convertible to CSV, TSV, or XLSX (Comma Separated Values, Tab Separated Values, Excel spreadsheet). The file should contain the raw data in its first column, one value per row. If an .xlsx file with multiple sheets is uploaded, it should be saved with the relevant sheet active.
uploading a file
The file is immediately processed and the values of the average and standard deviation fields are filled automatically. An error will be displayed in case of upload or parsing issues.
If the metric is average revenue per user or average revenue per session and the data storage is Google Analytics, a link to the Google Analytics transactions / sales performance report can be pasted and the relevant data will be extracted automatically. Obviously, you should have your Google Analytics account linked and the relevant property enabled through the Manage Data Sources screen.
specifying a Google Analytics report
Note that the most important thing is to load the report with the correct time window and the correct segment selected (the default segment is "All Users", which is fine if that is the target audience for the A/B test). Once you paste the URL, a message will appear indicating that data extraction is under way. It should take a few seconds, after which the average and standard deviation fields will be populated with the correct values.
A fourth option is to paste raw data into a text field. It is a method of last resort, but will work just as well. Make sure each value is in a separate row, one row per user or session.
pasting raw data
As with the previous two extraction methods, the fields will populate immediately upon pasting the data and a "Data parsed." message will appear in its place.
At this step an important choice must be made: either use our Advanced planning approach to arrive at a statistical design with optimal duration and confidence threshold, or go through Basic planning and select the characteristics of the statistical test yourself. In either case, make sure to adjust the test characteristics so the statistical test matches the change being tested and the business consequences of committing an error.
Advanced planning (ROI-optimal test)
Advanced planning introduces business considerations into the statistical plan of an A/B test. Incorporating business information allows our engine to examine different possible test plans and arrive at an optimal plan and present it in terms of the significance threshold and the test’s duration. The presented plan is optimal in the sense that it achieves the best possible balance between the various business risks and the possible positive effects. Learn more about A/B tests planned for optimal ROI by reading the glossary entry on risk-reward analysis and the referenced articles.
A lot of the business information is optional, but the more you enter, the more accurate the outcome will be. Be sure to consult the popover information for each field for an in-depth guide on how to make correct use of it.
By using the Advanced flow, a user can arrive at two test plans, each ROI-optimal in its category. One is optimal among the possible single evaluation tests, the other among the possible sequential evaluation tests. The latter typically outperforms the former, but the difference can vary depending on how many analysis stages are possible within the optimal test duration window.
comparison of optimal ROI test options
The leading characteristic here is the marginal improvement: an estimate of the value added by testing, showing how much the risk vs. reward ratio is improved by running the test versus simply implementing the proposed change outright.
It is key that this calculation is based on the assumptions entered in the previous steps. If these align well with the assumptions of key stakeholders and decision-makers, and the improvement is greater than zero, then they should be on board with running the test and abiding by its outcome. This framework guarantees optimal decision-making under the assumption that the input parameters reflect reality.
In some cases it may be difficult to obtain or estimate these inputs for all tests and not all cases warrant such a rigorous approach. This is why we also offer the classic basic planning approach, despite recommending the Advanced flow for most online experiments.
Basic planning (classical test planning)
Basic planning requires no business details, but this also means the optimality of the A/B test with regard to the business objectives can be questionable, since the user must rely largely on intuition when choosing a significance threshold and test duration. In such cases users tend to fall back on default values, which can significantly reduce the value of an experiment, depending on the type of decision it is meant to inform.
The basic interface should be familiar, as it only requires that you specify the type of monitoring to be used and a required confidence level (confidence threshold). Notably, it is not necessary to specify a minimum effect of interest or a level of statistical power.
In terms of monitoring schemes, the A/B testing hub offers two distinct possibilities. Sequential monitoring uses the flexible and efficient AGILE statistical method, which should be preferred in most situations. Single evaluation monitoring is a classic fixed-sample test in which the outcome is evaluated only once, at a prespecified moment in time or observed sample size. It has its benefits, but comes with significant practical limitations.
choice of A/B test statistical properties
When choosing the confidence level, consider what is at stake with the test in question. High-impact decisions typically require a higher confidence threshold (lower uncertainty), while more trivial decisions, such as quality assurance of changes to non-critical components, may be tested with a higher level of uncertainty to facilitate a faster implementation cycle. In general, it is best to perform a risk-reward analysis of some kind, or simply use the Advanced flow, which has such an analysis at its core.
Once these selections are made, click on "Estimate test plan" to examine various scenarios and determine how long to run the test for based on the achieved statistical power (sensitivity) at different test duration / sample size options. Different possibilities for a test duration and hence sample size will be presented, with calculated minimum detectable effect at several power levels of typical interest.
determining the test-duration / power
Clicking on each duration / sample size option reveals links that can be used to plot the full power function at that sample size (maximum sample size, if Sequential), as well as the expected actual sample size of AGILE tests under different hypothetical true effect sizes. Examining the power function is helpful in determining the capability of the test to reveal true discrepancies of varying magnitudes. Hovering the graph will reveal the exact power at different true effect sizes. Exploring the AGILE sample size gives an idea of the time savings and efficiency gained through sequential monitoring. Hover the graph for detailed information.
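The shape of such a power function can be sketched with the usual fixed-sample normal approximation (the baseline rate, sample size, and lifts below are arbitrary examples; sequential AGILE tests have a different, more involved power calculation):

```python
from statistics import NormalDist
import math

def power_at(true_rel_lift, p_base, n_per_group, alpha=0.05):
    """Power of a one-sided two-proportion z-test at a given
    true relative lift (fixed-sample normal approximation)."""
    p_var = p_base * (1 + true_rel_lift)
    se = math.sqrt(p_base * (1 - p_base) / n_per_group
                   + p_var * (1 - p_var) / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - (p_var - p_base) / se)

# power rises with the true effect size, tracing the power function
for lift in (0.02, 0.05, 0.10, 0.15):
    print(f"{lift:.0%} true lift -> power {power_at(lift, 0.04, 30000):.2f}")
```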
choice of AGILE test characteristics
If a sequential monitoring test is chosen, the frequency of the analyses should also be chosen. Sticking to a seven-day period between analyses is recommended to normalize any day-of-week effects, but any value between 1 and 90 days can be chosen. Note that the test may stray from this strict schedule (depending on the data entry mechanism) without compromising statistical validity; however, doing so may harm the generalizability of the A/B test results.
Advanced options, such as the choice of a futility boundary type and the type of power calculation used when there is more than one test variant, are also presented here. Please consult the tooltips for detailed information on these choices; they are generally only useful to advanced users and are hidden by default.
Key recommendations
- use the Advanced planning flow to arrive at a statistical design with optimal business ROI
- select a Sequential evaluation test to gain efficiency and flexibility to stop it early if results are overly positive or negative
- choose a Single evaluation test when you are certain there will be no need to call a test before reaching its target sample size
What to watch out for
- the starting date of a test is very important if using any automated data extraction method. Make sure to update it if necessary.
- specifying a hard deadline is optional, but if you do not specify one, a sixteen-week deadline is assumed
- a test's plan cannot be altered once it has been saved
Last updated: Feb 04, 2022