Note: This is the third post in a three-part series, each post focusing on one type of test result that is tough to deal with. Read the other two articles on highly mixed data and on the original page beating the new variations.
Ready for the toughest of all test results? I brought in Widemile’s Chief Scientist, Vladimir Brayman, for this post to help me with some of the concepts around this topic. The last of the three is the test whose results just won’t stabilize.
How does this happen?
As long as you have homogeneous traffic and enough time, a test should stabilize. Unfortunately, this is not always possible, and I don’t know anyone with unlimited time. The most obvious way this occurs is when a test is designed too large, meaning you don’t have enough converting traffic for the number of variations you are trying to test.
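A rough way to check whether a design is too large is to estimate the traffic it needs before launching. The sketch below uses the standard normal-approximation sample size for comparing two conversion rates. The function names, the 5% significance level and the 80% power are assumptions chosen for illustration, and it treats every variation as a separate arm of a simple split test with no correction for multiple comparisons, so read the output as a ballpark figure only.

```python
import math
from statistics import NormalDist


def visitors_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed in each variation to detect a relative
    lift over the baseline rate with a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_finish(num_variations, daily_visitors, baseline_rate, relative_lift):
    """Days of traffic needed if visitors are split evenly across variations."""
    per_variation = visitors_per_variation(baseline_rate, relative_lift)
    return math.ceil(per_variation * num_variations / daily_visitors)


if __name__ == "__main__":
    # Hypothetical numbers: 5% baseline conversion, 20% relative lift,
    # 1,000 visitors per day.
    for variations in (2, 4, 8, 16):
        print(variations, "variations:",
              days_to_finish(variations, 1000, 0.05, 0.20), "days")
```

Under these assumptions the required time grows in direct proportion to the number of variations, which is why an oversized design may never settle within the time you actually have.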
Additionally, getting homogeneous traffic is not always easy. If your sources are too different, you can have problems. Text ads, banner ads, e-mail campaigns and even Yahoo vs. Google traffic may behave differently. The worst case is when these sources of traffic are added mid-test. I have had tests where an e-mail campaign was run at the end of a test without my knowledge (until I asked about the huge spike in traffic!).
You also can’t control the traffic that comes to your page from sources like PR, blogs, seasonal events and news. This goes back to part 1, about highly mixed data; everything there applies to this case too.
A test also may not stabilize because it is designed with elements that are too similar. The same thing can happen when two elements are different but have approximately the same amount of impact. In these situations, your data will go back and forth on which of them is the winner.
Anything outside of your page that has a large influence can destabilize your test, and this includes pieces of your funnel. One symptom of this is when your clickthroughs are fairly consistent but the full conversions are not. If you are testing a landing page and the sign-up process after it is very kludgy and difficult for users, then it can have a large impact on your test’s ability to stabilize. This is especially true if the experience changes from visitor to visitor. An example of this is visitors bailing from a purchase funnel because shipping to their area is prohibitively more expensive than to other areas. Although they would have converted if shipping had been within the average price range, they ended up not converting because of something encountered outside of the landing page, skewing your results. Some of this happens in almost every test, but the magnitude of its impact depends on what exactly occurs.
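One rough way to look for the symptom described above is to compare how much the daily clickthrough rate and the daily full conversion rate each move around. The sketch below uses the coefficient of variation (standard deviation divided by mean) for that comparison. The function name and the sample figures are made up for illustration. Keep in mind that a lower rate is naturally noisier for the same traffic, so a higher number for conversions is only a hint to investigate the funnel, not proof of a problem.

```python
from statistics import mean, pstdev


def coefficient_of_variation(daily_rates):
    """Relative day-to-day spread of a rate: standard deviation / mean."""
    average = mean(daily_rates)
    if average == 0:
        raise ValueError("mean rate is zero, spread is undefined")
    return pstdev(daily_rates) / average


if __name__ == "__main__":
    # Hypothetical daily rates for one variation over a week.
    clickthrough = [0.30, 0.31, 0.29, 0.30, 0.32, 0.30, 0.29]
    conversion = [0.020, 0.050, 0.010, 0.040, 0.015, 0.045, 0.020]
    print("clickthrough spread:", round(coefficient_of_variation(clickthrough), 3))
    print("conversion spread:  ", round(coefficient_of_variation(conversion), 3))
```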
What can you do to prevent this?
If you are using a testing tool different from the one you normally track your conversions with, make sure you run a baseline test so that you can compare the numbers your testing tool gives you with the ones your conversion analytics produces. They should be within about 10%-15% of each other over a week or so. Finding a large discrepancy here will save you from headaches down the line. This essentially double-checks the expected traffic numbers by ensuring you are measuring your current conversions correctly, which allows you to design a test of the appropriate size. By appropriate size, I mean that you have enough testing time and that you will get enough traffic within that time.
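The baseline comparison can be reduced to a simple check. This is a minimal sketch: the function names are hypothetical, the analytics count is arbitrarily treated as the reference value, and the default tolerance of 15% is the upper end of the range mentioned above.

```python
def relative_discrepancy(tool_count, analytics_count):
    """Difference between the two counts, relative to the analytics count."""
    if analytics_count == 0:
        raise ValueError("analytics_count must be non-zero")
    return abs(tool_count - analytics_count) / analytics_count


def baseline_ok(tool_count, analytics_count, tolerance=0.15):
    """True if the testing tool and analytics agree within the tolerance."""
    return relative_discrepancy(tool_count, analytics_count) <= tolerance


if __name__ == "__main__":
    # Hypothetical one-week conversion counts from each system.
    print(baseline_ok(tool_count=460, analytics_count=500))  # 8% apart
    print(baseline_ok(tool_count=350, analytics_count=500))  # 30% apart
```

If the check fails, it is worth finding out why the two systems disagree before designing the test around either number.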
It is easier said than done, but it is important to watch for new traffic that may be driven to your page and to segment it out. Since this shares some of the same problems as highly mixed data, those solutions apply here too.
Don’t cut your tests short unless you think more data won’t solve the problem. If you stop before reaching stabilization, you have wasted all the time you spent testing, since your data is inconclusive. Always try to be as conservative as possible and end tests only when you are very confident that the test has stabilized or that there is no other choice.
Think about restarting the test if it isn’t stable. Use a smaller design. Pick the important factors (pieces) and the levels (variations) that you think will perform well and that are drastically different from each other. This keeps elements from looking unstable as they flip-flop for the top spot.
If your only problem is that two variations are vying for the winning position, then they likely perform about the same. It is probably not worth your time to wait for them to stabilize; stopping the test and going with either of them will likely make little difference to your conversion rates.
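To put a number on "perform about the same", one option is a two-proportion z-test on the two front-runners. The sketch below is a generic textbook test and not necessarily what any particular testing tool computes. The function name and the counts are hypothetical. A large p-value only says the data cannot separate the two variations; it does not prove they are identical.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, visitors_a, conv_b, visitors_b):
    """Two-sided p-value for the difference between two conversion rates,
    using the pooled two-proportion z-test."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        return 1.0  # no conversions (or all conversions) in both groups
    z = (rate_a - rate_b) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical counts for the two variations trading places at the top.
    p = two_proportion_p_value(500, 10000, 510, 10000)
    print("p-value:", round(p, 3))
```

With counts this close, the p-value is far above the usual 0.05 cutoff, which fits the advice to simply pick one and move on.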
The problem of outside funnel influence is a bit harder, but not impossible, to solve. The best solution is to segment out the users that are determined to be unqualified. For example, if you only ship to or work with US customers and businesses, then filter out any users that are outside of the US and do your analysis from there. This can be done at the data level if you can tell where the data came from; otherwise, it can be done with a splitter or qualification page that leads people into the appropriate funnel first. This may itself affect your overall conversions, though, so these methods should be tested carefully as well.
From my experience, the problems I’ve listed in these three posts are either preventable or unlikely to occur. The value of having an optimization expert is that they can avoid these situations or, at the very least, extract useful lessons when they do happen. Having said that, don’t be scared to test. Once you get the hang of it, it is a lot of fun and one of the keys to effectively growing and maturing your online marketing campaigns.
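Filtering at the data level might look like the following sketch. It assumes each visit is available as a record with "variation", "country" and "converted" fields. Those field names and the function name are made up for illustration, and your testing tool's export will almost certainly look different.

```python
def conversion_rates_by_variation(visits, allowed_countries=("US",)):
    """Conversion rate per variation, counting only qualified visitors.

    visits: iterable of dicts with 'variation', 'country', 'converted' keys.
    """
    totals = {}  # variation -> (visitors seen, conversions)
    for visit in visits:
        if visit["country"] not in allowed_countries:
            continue  # unqualified visitor, leave out of the analysis
        seen, converted = totals.get(visit["variation"], (0, 0))
        totals[visit["variation"]] = (seen + 1, converted + int(visit["converted"]))
    return {name: conv / seen for name, (seen, conv) in totals.items()}


if __name__ == "__main__":
    # Hypothetical visit log.
    sample = [
        {"variation": "A", "country": "US", "converted": True},
        {"variation": "A", "country": "CA", "converted": False},
        {"variation": "A", "country": "US", "converted": False},
        {"variation": "B", "country": "US", "converted": True},
        {"variation": "B", "country": "DE", "converted": False},
    ]
    print(conversion_rates_by_variation(sample))
```

In this made-up log, the non-US visitors who could never have converted no longer drag down either variation's rate.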