Archive for the 'Terminology' Category

Breaking down multivariate testing (Billy’s Optimization Guide Part 2)

If you missed it, see Part 1 (A/B Split Testing).  Update: Part 3 on Rules for a Successful Multivariate Test is here.

The technical and statistical aspects of multivariate testing can be complicated, but to design successful tests you don’t need to know everything, just the basics of how it works and some guidelines. I’m assuming you already have some understanding of multivariate testing; however, I want to cover the basics and make sure we’re on the same level before going into how to design good multivariate tests.

Check out the wireframe below.  Pretty standard for a landing page, right?  To properly design a multivariate test, we have to look at the page in a certain way.  Using three key terms, factors, levels and experiments, we can break down a test and describe its framework.

Factor: An element of the Web page (headline, image, text) being tested.  The element can also be groups of content, e.g. left column, button and hero shot together, or all banner ads on the page.

Level: Content that is assigned to a specific factor to be tested.  For example, one variation of a hero shot.

Below are 4 factors from our example page (headline, hero shot, offer and button), and then each of those factors with 4 levels, represented by the different colors. Note that the levels of one factor do not have to relate in any way to the levels of other factors.

The last term, experiments, makes use of both factors and levels.

Experiment: A unique combination of levels used during a test.

Here you can see 4 different experiments. Each experiment is different and holds a different combination of levels. Note that there are actually many more variations (4×4×4×4 = 256 combinations).
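If you want to see where the 256 comes from, you can enumerate the combinations yourself. Here is a minimal Python sketch; the level names (H1, S1 and so on) are placeholders I made up for illustration, not the actual content of the example page.

```python
from itertools import product

# Hypothetical level names: four factors with four levels each,
# mirroring the example page (headline, hero shot, offer, button).
factors = {
    "headline": ["H1", "H2", "H3", "H4"],
    "hero_shot": ["S1", "S2", "S3", "S4"],
    "offer": ["O1", "O2", "O3", "O4"],
    "button": ["B1", "B2", "B3", "B4"],
}

# Each experiment is one unique combination of levels, one level per factor.
experiments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(len(experiments))  # 256
print(experiments[0])
```

Each dictionary in the list is one experiment in the sense defined above.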

Essentially, a multivariate test involves showing these experiments randomly to live traffic while tracking how each experiment performs. The one that performs the best wins. Each experiment is shown to many people, but each person only sees one experiment. (There is some complexity in this. If you are still confused or want to know more, go to my primer on full and fractional factorial testing.)
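To picture "each person only sees one experiment," here is a rough Python sketch. It is only an assumption about how this could be done: it hashes a visitor ID to pick an experiment so the same visitor always gets the same one. Real testing tools typically rely on cookies and true random assignment, and the function names and the visitor ID format here are hypothetical.

```python
import hashlib
from collections import Counter

NUM_EXPERIMENTS = 256  # 4 x 4 x 4 x 4 from the example above

views = Counter()        # how many visitors saw each experiment
conversions = Counter()  # how many of those visitors converted

def assign_experiment(visitor_id, num_experiments=NUM_EXPERIMENTS):
    """Map a visitor to one experiment index, stable across repeat visits."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_experiments

def record_view(visitor_id):
    """Count a page view against the visitor's assigned experiment."""
    experiment = assign_experiment(visitor_id)
    views[experiment] += 1
    return experiment

def record_conversion(visitor_id):
    """Count a conversion against the visitor's assigned experiment."""
    conversions[assign_experiment(visitor_id)] += 1
```

Dividing `conversions[i]` by `views[i]` then gives the observed conversion rate for experiment `i`.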

In my next post, I will use these terms to outline the rules for creating a great multivariate test.

An Essential Primer on Full and Fractional Factorial Test Design

What are full and fractional factorial test designs? How do they relate to optimization and what about interactions?

Once you get down and dirty with testing, these questions matter. Whether selecting an optimization platform or trying to thoroughly understand the tests you are building, grasping these concepts will put you in greater control and allow you to design and analyze your tests more effectively.

As simply as possible, I hope to educate you and other marketers about full and fractional factorial test designs and why fractional factorial is the best choice for multivariate testing of online campaigns.

Note: “Partial factorial” and “fractional factorial” are the same. Also, if you don’t have a thorough understanding of experiments and interactions, please read those first.

The tests used in optimization are from the design of experiments field. (From Wikipedia: “Design of experiments is the design of all information-gathering exercises where variation is present, whether under the full control of the experimenter or not.”) The two types of tests I will focus on are fractional factorial and full factorial.

Here is an example I will use to explain these concepts. Below is a test matrix outlining a test for a landing page with 5 factors with 2 levels each. Don’t let the vocabulary scare you away; this simply means that there are 5 parts of the page being tested and 2 variations of each.

Recipe Matrix: 5 factors = 5 parts (hero shot, headline, etc.) and 2 levels = 2 variations

These factors and their respective levels make up the possible combinations for a landing page. The combinations displayed are called experiments.

Let’s calculate the total number of experiments possible (even if you know how to do this already, it is important for understanding the distinction between fractional and full factorial). There are 2 levels for each factor, so you can have 2×2×2×2×2 (2 to the 5th power) = 32 possible experiments. This means there are exactly 32 combinations of hero shots, headlines, sub headlines, button text and main copy from our matrix outlined above. Note that if we add another factor, it becomes 2 to the 6th power, or 64 possible experiments. Similarly, if you add 2 more levels to any one of the existing 5 factors, the count also increases from 32 to 4×2×2×2×2 = 64 experiments.
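The arithmetic above is just the product of the number of levels in each factor. A small Python sketch (the function name is mine, purely for illustration) reproduces the three counts:

```python
from math import prod

def count_experiments(levels_per_factor):
    """Total experiments = product of the number of levels in each factor."""
    return prod(levels_per_factor)

print(count_experiments([2, 2, 2, 2, 2]))     # 32: five factors, two levels each
print(count_experiments([2, 2, 2, 2, 2, 2]))  # 64: one more two-level factor
print(count_experiments([4, 2, 2, 2, 2]))     # 64: two more levels on one factor
```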

In testing, each experiment must get a minimum number of measurable conversions, known as the sample size per experiment. This ensures that there is enough data for a solid statistical analysis. Therefore, the more experiments you have, the more conversions you need. You can also think of conversion data as time, since the longer you leave your web page up, the more data you get.

Now we’re ready to go back to the difference between the two test designs. Full factorial testing requires that every possible experiment combination is shown, so our 5-factor test would need to display all 32 experiments. This means that if the sample size is 100 conversions per experiment, 32 × 100 = 3,200 conversions will be required. Fractional factorial works differently: it displays a much smaller number of experiments, about 8 in this case, so it would need about 800 conversions.
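To make the 32-versus-8 comparison concrete, here is a hedged Python sketch that builds one possible 8-experiment fraction of the 5-factor test. The two levels are coded as -1 and +1, and the rules used to derive the fourth and fifth factors (D = A×B, E = A×C) are a textbook-style choice I picked for illustration, not necessarily what any particular platform uses.

```python
from itertools import product

SAMPLE_SIZE = 100  # conversions per experiment, from the example above

# Full factorial: every combination of 5 two-level factors (coded -1 / +1).
full = list(product([-1, 1], repeat=5))

# One possible 8-run fraction: vary A, B and C freely, then derive D and E.
# The generators D = A*B and E = A*C are an illustrative assumption.
fraction = [(a, b, c, a * b, a * c) for a, b, c in product([-1, 1], repeat=3)]

print(len(full), len(full) * SAMPLE_SIZE)          # 32 3200
print(len(fraction), len(fraction) * SAMPLE_SIZE)  # 8 800
```

Every factor still appears at each of its two levels in exactly half of the 8 experiments, which is what lets the analysis compare levels fairly even though most combinations are never shown.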

Since full factorial gathers additional data, it reveals all possible interactions, but as seen by the numbers above, there is a trade-off. More data equals more information but more data also equals a longer test duration. The minimum data requirements for full factorial are very high since you are showing every experiment.

Even if you are using full factorial to get the same amount of information as a fractional factorial test, it will take more time since you need more data to see statistically relevant differences between the many experiments.

You might be wondering how fractional factorial can be accurate if interactions are possible.

Random interactions of high relevance are very rare, especially when looking for interactions of more than 2 factors. You really need to design tests where you look for meaningful interactions that are based on true business requirements rather than hoping for a random and low influence interaction between a red button, a hero shot and a headline.

Whatever the interaction is, you need to be able to understand your audience and infer why there was an interaction in the first place; only then are you ready to start designing for interactions.

Tests should not be filled with random levels; they should be carefully designed for success by focusing on testable hypotheses about the audience. Could a 1 pixel drop shadow on a button interacting with the copyright statement ever be truly significant, and not a victim of random error? Is it worth sacrificing thousands of conversions to learn a lesson that won’t result in any relevant increase of real world conversions?

Some interactions might make sense to measure, while others are not worth measuring because of the amount of testing time they add.

This brings me to fractional factorial. It is possible for fractional factorial tests to detect interactions. How so? Using our example of a 5-factor test, fractional factorial can include everything from only main-effects all the way to 4-factor interaction effects. Full factorial’s only difference is that it is the full extension and includes the 5-factor interaction effects.

Fractional factorial is not a one-trick pony; it is a continuum ranging from testing for no interactions (only main effects) to one factor less than full factorial. It is exactly what the name fractional implies; even one less is a “fraction” of full factorial. It gives you the power to make trade-offs, from testing only main effects to testing for interactions, based on intelligent test design.
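One way to picture this continuum is to count how many effects exist at each level of interaction. The Python sketch below does that for our 5-factor example, assuming every factor has 2 levels; the variable names are mine.

```python
from math import comb

NUM_FACTORS = 5

# For two-level factors, the number of k-factor effects is "5 choose k".
effects_by_order = {k: comb(NUM_FACTORS, k) for k in range(1, NUM_FACTORS + 1)}

for order, count in effects_by_order.items():
    label = "main effects" if order == 1 else f"{order}-factor interactions"
    print(f"{label}: {count}")

print(sum(effects_by_order.values()))  # 31
```

The 5 main effects are a small slice of the 31 total effects, and a full factorial test commits you to gathering enough data for all of them, including the single 5-factor interaction.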

Once you decide to test for all possible interactions, you are committing to a full-factorial test and incur the associated traffic requirements. I’d love to see a test design that is designed for full interactions and still makes sense! Not having the ability to reduce the number of interactions is a huge detriment rather than a benefit of solutions limited to full-factorial testing.

Radically shorter test times allow for many more smart marketing ideas to be tested and adapted based on what you learn from each test run. You, the marketer, have the ability to analyze your results and tweak follow-on tests to capitalize on what you learn. This common-sense approach is what hypothesis-based testing is all about and is very powerful. Focus on testing smart ideas to increase your conversion rate – that’s what matters most.

The graph below illustrates how much information is gained and the amount of testing needed, based on the number of interactions tested.

In my experience, the red area shows how valuable the data is based on which effects are being tested, while the blue area shows the amount of data (or time) needed to gather the data to confirm those effects. The x-axis goes from left to right, from main effects to full factorial (5-factor effects).

At Widemile, we believe it is more effective to perform quick, successive tests detecting only main effects rather than randomly hoping for interactions. While interactions might give you small or even large gains, they will likely never trump the gains from additional testing, nor make up for the time and money lost looking for random interactions. The additional time required for full factorial tests is large, and not many marketers want to wait more than a month for a test to complete.

Fractional factorial is preferred by a few camps, including Widemile, Omniture’s Test&Target (formerly Offermatica) and Interwoven’s Optimost. Full factorial is used in Google’s free Website Optimizer and some tools offered by smaller providers.

Testing for all interactions sacrifices a lot of time. With the speed that audiences, marketing campaigns and seasons can change, it is important to get the most testing done in the least amount of time without sacrificing the quality of the data. Fractional factorial allows you to do just that, making it the wisest choice for multivariate testing.

Interactions

I’ve written an extended definition for interactions in preparation for a long post about full and fractional factorial. Understanding interactions is a critical part of understanding full and fractional factorial as well. I look forward to clearing up some misconceptions about fractional factorial test design. Hopefully I’ll have the post done next week.

CC photo credit: Zeno_

Optimization Glossary

Someone at the office has put together a great glossary, so I modified it slightly and have posted the glossary as its own page (it has a tab dedicated to it above now.) It is in a usable but not optimal form right now, so I’ll be updating it every now and then. I have also decided to add expanded definitions, in the form of separate pages dedicated to a single word. The first word to be done was “experiment.” Please check it out and let me know what you think.

In the past, I have stepped away from using technical language and jargon but, with this glossary, I will begin using the language I use at the office. My hope is to acclimate others and help them understand the terminology used by myself and others at Widemile and around the industry.

What is Taguchi? How does it relate to testing?

the Taguchi method

Multivariate testing is a buzzword these days, but the buzzword of multivariate testing seems to be Taguchi. However, that term is being abused. Do you know what Taguchi really means? I wasn’t even positive, so to get some background, I did some research and talked with Vladimir (Widemile’s Chief Scientist).

The name and method come from Genichi Taguchi. His method, also known as Robust Design, attempted to improve product manufacturing quality. Therefore it falls into an area of engineering called Quality Engineering.

Does this sound aligned with website testing? Not really, and this is the problem of using the term Taguchi with web site testing. The goals of manufacturing and the goals of a website are not the same.

What most people are attempting to grasp when using the term Taguchi is fractional factorial test design. (I discussed this at length in my post about the difference between Widemile’s technology and Google Optimizer.) The Taguchi method uses a fractional factorial test design and is under the umbrella of fractional factorial testing but is not the only or best fractional factorial method. In fact, even within manufacturing, the Taguchi method was the inspiration for many new techniques but many statisticians find it flawed.*

It is important to differentiate the Taguchi method from fractional factorial test design since one is a basis for manufacturing while the other is purely related to design of experiments. You need to ensure that the math and science behind your testing is based on methods that have the end goal of optimizing your website only. So if your testing tool uses the Taguchi method for testing, you had better ask what that really means.

So does Widemile use Taguchi? We don’t use the Taguchi method; however, we do use fractional factorial test design. I like to say that our platform goes beyond Taguchi because it was specifically made for optimizing web content.

Don’t get sucked into the Taguchi method; it is just a buzzword used by your fellow marketers. Just because a technology doesn’t use Taguchi doesn’t mean you should count it out.

*Read more after the jump for Vladimir’s explanation of the Taguchi method and its criticisms

Google Optimizer is slow (or Not all Multivariate Testing is the same)

*Update: Hello!  If you’ve found this article after reading the book Always Be Testing, I encourage you to take a look at a more recent and in-depth article I’ve written here: An Essential Primer on Full and Fractional Factorial Test Design.  Thanks for visiting!

Without knowing it, people might assume that there’s only one method of multivariate testing, and that it was figured out long ago by math and statistics wizards. I have learned otherwise from Widemile’s personal math wizard, Chief Scientist Vladimir Brayman.

(Just as a side note, he does not have a typical office. Rather than papers and folders strewn about, he has statistic and math books. Lucky for me though, he has a great skill at distilling all the goodness in those books and teaching me what I need to know, in a way I understand.)

Most recently, we discussed why Widemile’s technology trumps Google Optimizer.

Widemile vs Google

Having a strong creative team and testing experts ensures better results than giving a marketer a tool like Google Optimizer, that’s easy for most people to understand. But explaining how Widemile’s technology can test more, faster, is a little more complicated.

Let’s explore how Google’s testing works versus Widemile’s. Google Optimizer uses full factorial test design, meaning it creates a page for every combination of your tested page elements. So if you wanted to test 4 different hero shots, 4 buttons and 4 headlines, that would require 4×4×4 = 64 page combinations. The disadvantage of this method is that you need significant traffic for each of the 64 pages. This means you need either a lot of traffic or a lot of time; most companies will need both.

To solve this, Widemile’s optimization platform uses fractional factorial test design. This method tests only a small fraction of the total possible page combinations and uses statistical analysis to derive almost all of the same information that would be found in a full factorial test. This works because only marginal information is gained by testing all 64 page combinations, while testing a few important combinations tells us nearly everything we need to know.
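To give a feel for how a handful of combinations can reveal each element's influence, here is a toy Python sketch. Everything in it is an assumption for illustration: it uses 5 two-level factors rather than the 4×4×4 example, an 8-run fraction with generators I chose (D = A×B, E = A×C), made-up "true" lifts, and a simulated audience with no interactions and no noise. It is not Widemile's algorithm.

```python
from itertools import product

# 8-run fraction of a 5-factor, 2-level test (levels coded -1 / +1).
# The generators D = A*B and E = A*C are an illustrative assumption.
runs = [(a, b, c, a * b, a * c) for a, b, c in product([-1, 1], repeat=3)]

# Made-up numbers: a base conversion rate and each factor's true lift.
BASE_RATE = 0.05
TRUE_EFFECTS = [0.010, 0.004, 0.000, -0.002, 0.006]

def simulated_rate(run):
    """Additive model with no interactions and no noise (a strong assumption)."""
    return BASE_RATE + sum(e * x / 2 for e, x in zip(TRUE_EFFECTS, run))

rates = [simulated_rate(run) for run in runs]

def main_effect(factor_index):
    """Average rate at the +1 level minus average rate at the -1 level."""
    high = [r for run, r in zip(runs, rates) if run[factor_index] == 1]
    low = [r for run, r in zip(runs, rates) if run[factor_index] == -1]
    return sum(high) / len(high) - sum(low) / len(low)

estimates = [main_effect(i) for i in range(5)]
print([round(e, 4) for e in estimates])
```

Under these idealized conditions the 8 runs recover the same main effects that all 32 combinations would. The catch is that in this particular fraction a real interaction between the first two factors would be mixed into the estimate for the fourth, which is exactly the trade-off being debated here.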

Google actually criticizes fractional factorial test design (look here where it says “A note about ‘fractional factorial testing’”), saying that it requires the same number of impressions but cannot derive the depth of conclusions that a full factorial design can. While it is true that full factorial squeezes out the most information, that comes at the cost of extending the test many times longer than a fractional factorial test, all to learn the smallest influences.

Doing successive tests to find high influence items with fractional factorial testing will get much higher gains than getting every ounce of information out of one extremely long full factorial test. In addition, with a carefully designed fractional factorial test you can learn all the major influences and the interactions between elements on the page.

Fractional factorial test design gets you a completed test in weeks rather than months or years even, and because of that, you can test more than you would normally be able to in the same time frame. You can either test more in one larger test, or do many smaller successive tests.

Not to say that Google Optimizer isn’t a great tool, especially since it is free, but any company that spends thousands of dollars on SEM has a lot to gain by using technology that gets rapid results.

If you have any questions about this, let me know and I’ll try to answer them or get you an answer.

Multivariate testing: a quick primer

Want to quickly get up to speed on multivariate testing? This post is designed to help you grasp the basics of multivariate testing so that you can get started talking about it and even running your own tests. While it’s hard to design great tests, it’s easy to get good results with only a little education. I will definitely be teaching more about multivariate testing soon, but get this stuff down first!

Note: Since the industry is new, there isn’t consistency in much of the vocabulary of multivariate testing, so I will try to use generic terms.

Billy’s Multivariate Testing Primer

What is Multivariate Testing?: Testing multiple versions of a page to determine a set of elements* providing the highest conversion rate.

*Elements can be any content (text or image) on a page, typically hero shots, buttons, button text, headlines and text blocks. Sometimes it can refer to position/layout also.

What happens:

Multivariate testing diagram

  • Visitors come to a page and are shown a random version of the page
  • Conversions are tracked based on which page they saw
  • Once a statistically significant number of conversions is reached, analysis determines the elements on a page that create the highest conversion rate
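As a rough illustration of the "statistically significant" step above, here is a Python sketch that compares the conversion rates of two page versions using a two-proportion z-test. That test is a common textbook choice that I am swapping in for simplicity; multivariate tools use more involved analysis across all elements, and the visitor and conversion counts in the example call are made up.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: version A converts 100 of 2,000 visitors, version B 150 of 2,000.
z, p_value = two_proportion_z_test(100, 2000, 150, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the two versions is unlikely to be random noise; with too few conversions, even a real difference will not clear that bar.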

Strengths:

  • If used correctly, it homes in on the correct messaging direction, and then the exact messaging can be found using further testing
  • Allows for quicker testing of multiple elements than split or A/B testing
  • Analysis derived from live visitor data
    • Proves winning page elements to be better than others

Weaknesses:

  • Medium learning curve
    • Test results easily ruined by mistakes or poor methodology
  • Requires a minimum number of conversions
  • Code must be added to web page
  • Cookies and JavaScript must be enabled by visitors

Keep in mind:

  • Increasing the number of elements being tested increases the time needed for the test.
  • Number of conversions over time determines test size and length
  • The faster a page gets conversions, the shorter the test will be and/or the more things you can test
  • More conversions are better
  • A shorter time frame is better than a long time frame, but don’t go shorter than 2 weeks
  • Don’t test things that are too similar; look for different segments or messaging to pursue
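The relationship between conversions, test size and test length can be sketched with some simple arithmetic. In the Python below, the function name and the example traffic numbers are hypothetical; the only figure taken from the list above is the 2-week minimum.

```python
from math import ceil

MIN_DAYS = 14  # the two-week floor suggested above

def estimated_test_days(num_pages, conversions_per_page, conversions_per_day):
    """Days needed for every page version to reach its conversion target."""
    total_needed = num_pages * conversions_per_page
    return max(MIN_DAYS, ceil(total_needed / conversions_per_day))

# Hypothetical site: 50 conversions a day, 100 conversions wanted per page version.
print(estimated_test_days(16, 100, 50))  # 32 days for 16 page versions
print(estimated_test_days(4, 100, 50))   # 14 days: 8 by the math, held to the floor
```

Testing more elements raises the number of page versions, which is why it stretches the test, and a page that converts faster shrinks the result or leaves room to test more.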

Other concepts:

  • Using it with split testing:
    • Use split tests to determine the best layout with a template test. Test layout against layout with the same content (see this article for more advice.)
    • Afterwards, use multivariate testing to try out different messaging and refine it with continual tests.
    • If you want to try new layouts later on, go back to split testing.
  • Advanced: There are two types of multivariate tests: Full factorial and fractional/partial factorial
    • Full factorial means every version is shown. If you have 4 headlines and 4 buttons, that is 4×4 = 16 combinations, so 16 pages are used in the test.
    • Partial factorial is when only a portion of the total possible combinations are shown to visitors. This relies on statistical formulas and algorithms to determine the influence of the various factors since not every page combination is shown.
    • Full factorial takes a much longer time, so partial factorial allows for more testing in a quicker time frame.

There’s still a lot more to teach, even about some of the things mentioned here, but I hope this is enough to get you off the ground and really digging into testing. Google has their free tool if you want to try out multivariate testing.

5 quick tips to effective A/B and A/B split testing

While multivariate might be the hottest testing subject, you can’t beat a good split test in certain situations.

If you already know the difference between A/B and A/B split tests, skip this part, otherwise it’s a quick read.

A/B testing is when you test one page and then replace it with a new page, so the two versions run consecutively, one after another, rather than at the same time. A/B split testing is when you test two pages at once, with traffic distributed between the two pages simultaneously.

A/B Test

A/B split Test (note you can split different %’s)

Split test
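Splitting traffic by percentage can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular tool does it; the page names and the 70/30 split are made up.

```python
import random

def choose_page(weights, rng=random):
    """Pick one page at random, in proportion to its share of traffic."""
    pages = list(weights)
    return rng.choices(pages, weights=[weights[p] for p in pages], k=1)[0]

# Hypothetical split: 70% of visitors see the original, 30% see the challenger.
split = {"page_a": 70, "page_b": 30}

rng = random.Random(0)  # seeded only so the demo is repeatable
shown = [choose_page(split, rng) for _ in range(10_000)]
print(shown.count("page_a") / len(shown))  # close to 0.70
```

Because both pages receive visitors during the same period, seasonal swings in traffic affect them equally.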

Now for the good stuff! I’ll try to keep this short so let’s start:

  1. A/B testing is out: Use split testing instead. Split testing is more accurate since it uses the same time period of traffic. Traffic during Halloween is different from traffic during Christmas, so testing one page at one time and one at another will skew your results making a relevant comparison impossible. Use A/B only if you don’t have permission/the capability to do split testing (Google Optimizer is free and allows split testing!)
  2. Template test = A/B Split test: This is the sweet spot for split tests. Use one if you want to try a new layout/template against your old one or if you want to test two new pages. Here’s my template test primer if you need to brush up.
  3. One exception, one lesson: During these tests everything should be the same except for one thing on the page. If you try to introduce 2 or more changes into an A/B split or A/B test, you won’t know which change improved your page. The only time I might have multiple changes is for template tests, where the new template can’t use the previous creative effectively. Still, emulate the creative as closely as possible for the new template.
  4. Be ready for your next test: Since these tests are easier to execute, you should also have an easier time getting the next test ready to go for when the first one finishes. Make tests ahead of time so that when the current test completes you can flip the switch and quickly get it measured and done with.
  5. Learn from the first test: You already completed a test, so what does that tell you about what you should test next time? If a graphic-heavy template beat the cleaner template, try testing against an even more graphic-heavy template. Find where your customers lie and pinpoint it by seeing what each test tells you. This is a game of Marco Polo. Your customers are shouting, “Polo!” with each test, so follow them!

A/B split tests keep it simple, and that is their strength. As long as you control anything that might confuse the test (like introducing new content), you can find winners and make a great page. However, after A/B split testing, multivariate testing should be brought in to really pull more out of your page. But that’s a whole other blog….

Two types of tests (and I'm not talking MV or Split)

There are a lot of things to know to be a pro at testing, so I’ll cover some of the basics every week. Hopefully I’ll have a nice inventory of posts that can be used as a resource for you or anyone else to quickly learn the testing language and methods.

A good starting point to learning about testing methodology is to wrap your head around the idea of template and in-place tests.

A template test is a test using a new template in competition with the original template and is best done as a split test. A template test basically means you have all the same content but in different positions on the page. This is an example template:

Template 1

If this is my original template and I am doing a template test, I would switch the items around, but have the same exact content (hero shot, headline, price, etc.) to make a new template like this:

Template 2

These two templates would compete in a split test. A note though, even if I only move one thing, say I bring the Call-to-Action Button above the First Party Validation, it is still considered a template test.

However, if all I am doing is swapping out the content in those positions then it is an in-place test. Usually a multivariate test is an in-place test, trying out new headlines, images and text within the positions of the original content.

Typically you want to do a template test first to find the best positioning and then an in-place test to find the best messaging.

Continue on for more in depth examples