A guide to experimentation and testing your product decisions

We all want to make good decisions. We can’t yet go back in time to fix our mistakes, so make informed decisions by testing your genius before going all in. Here are a few of my thoughts on approaching testing.

Craig Wattrus
13 min read · Aug 15, 2018
Credit to Lauren Okura https://twitter.com/OKyouRA for these awesome little vector humans.

Botching up a change can cost you time, money, or sanity. Let me give you some reasons to quiver in your boots when it comes to making huge irreversible changes. Your changes could:

  1. Drive away existing or potential customers
  2. Lead to spending too many resources building the wrong thing
  3. Drive up support interactions
  4. Cost your business goodwill in the industry
  5. Trigger a larger system breakdown

I think we can all agree that any one of these potential fallouts is worth losing sleep over. This is where testing comes in. Testing alleviates these fears and allows me to rest soundly.

I will demonstrate and recommend a 4-step approach to testing. The 4 steps are:

  1. The anatomy of a good test — Locking down the basics
  2. Apply a structured approach to deciding how you’ll test
  3. Settle on what success or failure might look like and how you’ll know either way
  4. Choose your toolset carefully, set them up properly, follow conventions

Step 1: The anatomy of a good test — Locking down the basics

Testing is important. Testing the right thing is vital.

By making sure you’re testing the right thing you can:

  1. Feel more confident about your changes
  2. Increase the accuracy of your results
  3. Have more business impact with well thought out tests
  4. Get to testing faster
  5. Avoid adding additional work and cost to your experimentation

The best tests will have the following properties:

  1. A clear measurable goal — Always start with a well-defined solution for your problem or a strong hypothesis. More importantly, give yourself a way to actually measure that success. If you need to implement analytics or collect different data, get that in place before you start down this path.
  2. Can stand on their own — You need to be able to trust the results of your test. Don’t test two similar things at the same time. If you change your pricing model and an important page about premium features at the same time, for example, you’ll have a hard time knowing which one drove success or failure in conversion. A good test stands on its own.
  3. A clear ROI — Even quick tests require time and effort to put into place. Make sure your time investment is worthwhile. A successful test can pay for itself many times over, either monetarily or with key learnings.
  4. Statistical significance — Pay attention to how many users you need in your experiment to get a trustworthy result. A test in which you watch real users interact with your experiment needs fewer participants than an automated test measured through analytics. Tools like Optimizely will tell you when you’ve reached statistical significance or how many users you still need to get there (a rough sample-size sketch follows this list).
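
To make the sample-size question concrete, here’s a back-of-the-envelope calculation for a two-variant conversion test, written as a minimal Python sketch. The baseline rate, minimum detectable lift, significance level, and power below are all illustrative assumptions; tools like Optimizely run this math for you.

```python
from statistics import NormalDist

# Back-of-the-envelope sample size per variant for a two-proportion z-test.
# Every number below is an illustrative assumption, not a recommendation.
baseline = 0.10            # current conversion rate (10%)
lift = 0.02                # smallest lift worth detecting (10% -> 12%)
alpha, power = 0.05, 0.80  # conventional significance level and power

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
z_beta = NormalDist().inv_cdf(power)

p1, p2 = baseline, baseline + lift
p_bar = (p1 + p2) / 2

# Standard approximation for comparing two proportions
n = ((z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
      + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2) / lift ** 2

print(f"~{n:.0f} users per variant")  # roughly 3,800 with these numbers
```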

Step 2: A structured approach to testing changes

Once you’re set on what to test, you’ll need a clear way forward. I’ve done my best to distill some of my own learnings from this planning process.

In order to decide on the best way to test, you need to be able to answer (or approximate answers to) some questions. The answers will help you choose an approach that works best for your specific set of constraints.

Testing properties:

  1. Size: Is this a change to an existing implementation or a totally new feature/product?
  2. Audience: How many people will have an opportunity to interact with your test? Consider this relative to your current and potential user base.
  3. Urgency: Is your current setup performing extremely poorly or do you just think there’s room for improvement? If you’re trying something brand new, how important is it that this gets to market quickly?
  4. Business impact: If we get this wrong, will we lose customers, money, or goodwill?

I can’t tell you exactly how you’ll find these answers. Questions about audience and urgency can often be answered with the help of analytics tools. If you’re super familiar with your product and user base, it’s ok to base some answers purely on your gut instinct.

Using the questions you answered above, you can then pick one of these common testing approaches:

  • Big bang rollout: Roll out the change in production to live users, monitor it, and quickly release changes/fixes if anything goes wrong.
  • Controlled rollout: Roll out the change to a subset of users; monitor and potentially increase the size of the subset if we need more users to reach statistical significance (a rough bucketing sketch follows this list).
  • Internal user rollout: Apply either of the first two approaches, but limit the experiment to internal users only.
  • User testing: Use a test environment to expose a small group of users to the change to observe their behavior.
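
To give a feel for what a controlled rollout can look like under the hood, here’s a minimal sketch of deterministic percentage bucketing by user ID. The hashing scheme, experiment name, and 10% starting figure are assumptions for illustration; a vendor tool would handle this (plus targeting, ramping, and reporting) for you.

```python
import hashlib

def in_rollout(user_id: str, experiment: str, percent: float) -> bool:
    """Deterministically bucket a user into a rollout.

    The same user always gets the same answer for the same experiment,
    so you can grow `percent` over time without reshuffling everyone.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # a stable number in 0..9999
    return bucket < percent * 100          # percent is expressed as 0-100

# Start with 10% of users; widen later if you need more data for significance.
if in_rollout("user-42", "new-top-nav", 10):
    print("show the new navigation")
else:
    print("show the current navigation")
```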

While we could theoretically build a huge matrix matching up our testing properties and possible testing approaches, I prefer a simpler approach.

Start with the most drastic rollout plan (in this case: Big bang rollout), and see whether any of your testing properties can convince you out of it. Then do the same with the next most drastic option until you have considered each one individually. This might sound crazy, but allow me to show you how it works with an example:

Example 1: For our marketing site, we wanted to change the structure of the top navigation bar.

Testing Properties for the marketing site:

  • Size: medium
  • Audience: potentially millions of users
  • Urgency: High; the changes were becoming a necessity
  • Business impact: large; if we fail it’d reduce conversion and lead generation

Testing approach logic:

  • Big bang rollout: This is a significant change for such a large potential audience, with a high risk if we fail. We simply cannot close our eyes and press the big red button.
  • Controlled rollout: I feel more comfortable with this option, as we will be limiting the audience size. Our risk is lower and, since this is urgent, we stand to gather results more quickly from our public audience than if we limited it to internal users.

We can stop at this point because we cannot convince ourselves out of the Controlled rollout.

Example 2: For our developer docs and help center, we wanted to try a more drastic change for navigating between the two sites.

Testing Properties for the developer docs:

  • Size: large; this change included more than structural navigation changes
  • Audience: potentially hundreds of thousands of users
  • Urgency: this is an improvement and not urgent
  • Business impact: small (no immediate business impact)

Testing approach logic:

  • Big bang rollout: For a large, non-urgent change, it doesn’t seem worth letting an untested experience potentially reach everyone.
  • Controlled rollout: Although the business impact is low, the urgency is also low. So while this feels like a reasonable plan, the size of the change makes me unsure.
  • Internal user rollout: This feels better! We can quickly validate our idea and once we’re happy, we can move to a Controlled rollout to our customers.

And so we stop at this point because we cannot convince ourselves out of Internal rollout.

This approach might sound a little “feelings driven,” and you’re not wrong. That being said, my feelings here are based on an in-depth knowledge of the numbers and a lot of past tests we’ve done. If you don’t have this luxury, put on your lab coat and get some metrics.

Step 3 — Settle on what success or failure might look like and how you’ll know either way

Now that you know what you’ll test, you need to figure out what it will look like when you’ve achieved what you set out to do. From my experience, this part is either really hard or really easy.

OK sure, there’s probably a middle ground, but where’s the fun in that?

You can measure success in so many ways — sometimes it’s overwhelming to even get to one or two things you can concretely say would be a sign of success or failure. What I’ve found drives up failure with testing is either thinking too simplistically or unnecessarily complicating your measurements.

You’ll often see articles and advice saying to choose one metric and stick with it. That isn’t terrible advice, though it could cause you to ignore other key signs of failure or success. In the same way, getting too caught up in the myriad of possible measures means that you could miss a simple measure that’s more valuable than all the others. My advice here is to measure as many things as possible, form a loosely held opinion about their usefulness, and let the data guide you to what is most important during and after your test.

Example: We recently tested a new pricing page on our marketing site. Our goal for this page was to display some new content while maintaining or improving the conversion percentage. When the results came in, we had kept our conversion percentage the same, so we shipped the changes. It wasn’t until later that we realized that we’d missed the nasty increase in bounce rate on this page.

If I’d applied my own advice to this experiment, we probably wouldn’t have changed our conversion hypotheses or metric collection, but we’d most likely have considered bounce rate before calling it a success. We got caught up on a single metric without letting the data guide us to a sure sign of failure.

An approach to finding success metrics:

  1. Capture your basic metrics for every project, work these out once, and make sure to look at them as part of your analysis for every test. For a website, this could be bounce rate, retention, and/or time on page. For a payment method, this might be authorizations, failures, and/or transaction times.
  2. Think about and list out an extensive set of things that might be affected by your change.
  3. Prioritize a few of these as your main metrics; try to break out any multi-part metrics. For example, let’s say “Number of completed application forms” is your metric. Think about whether there are other metrics that could impact that number, such as “Clicked signup” or “Validation error”. Capture these as new metrics to track (a small sketch of this breakdown follows this list).
  4. If one or more of your main metrics come out in an unexpected way, dig into some of the seemingly less important metrics where the answer to the surprise might exist.
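
To make the multi-part metric idea in point 3 concrete, here’s a small sketch that breaks the hypothetical application-form metric into step-level rates. The event names and counts are made up for illustration.

```python
# Hypothetical event counts pulled from your analytics tool for one variant.
events = {
    "viewed_application": 5000,
    "clicked_signup": 2000,
    "validation_error": 600,
    "completed_application": 1100,
}

# "Completed applications" on its own hides where people drop off,
# so report each step as a rate of the step before it.
steps = ["viewed_application", "clicked_signup", "completed_application"]
for prev, curr in zip(steps, steps[1:]):
    print(f"{prev} -> {curr}: {events[curr] / events[prev]:.0%}")

# Keep an eye on supporting metrics too, e.g. how often validation bites.
print(f"validation errors per signup click: "
      f"{events['validation_error'] / events['clicked_signup']:.0%}")
```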

Metrics differ between every company, industry, and team. Here are a few if you need some inspiration!

Some popular metrics:

  1. Usage: Do people use the thing you made? Did changing it increase or decrease the people using the thing? Sometimes you want people to use something more, sometimes you want people to use it less.
  2. Conversion: Do you have a specific activity or task that must be completed? Does your change impact a part of the journey to complete this activity? Examples include things like completing a transaction, adding an item to a cart, clicking a signup button, or completing an application. Be aware that conversion is often not linear! Being more upfront about the requirements for applicants to your program might mean fewer people click to sign up, but more people submit quality applications. Ensure your conversion numbers are comprehensive; just tracking signups would lead you to a false negative in such a situation.
  3. Retention: Does your change increase or decrease the likelihood of the user coming back to your product/site/app?
  4. Page analytics: Bounce rate, exit rates, time on page, and session length are all popular page metrics. Keeping track of a few of these is usually recommended. Again, acknowledge the complexity of some of them! Whether a value is good or bad depends on the situation. For example, the bounce rate on a page which is supposed to be used in a flow is important to keep low. A page like a help center article, on the other hand, which users might get to from a Google search and then leave once they have the information, should have a relatively high bounce rate (depending on how your chosen analytics tool defines and implements bounce rate).

Step 4 — Choose your toolset, set it up properly, follow conventions

The three previous steps lay a strong foundation for structuring your test and knowing how to measure it. Now you will need a way to implement your test. Some teams I’ve worked with roll their own tools for testing; others use a single existing product. Some will use a plethora of tools all strung together to achieve more complex goals. No matter where you’re starting, it’s not hard to get something out there with minimal effort.

I’m reasonably familiar with the following situations:

  1. No tools in place to run even a simple test
  2. Custom tooling in place to run tests
  3. A single vendor provided tool in place to run tests
  4. Multiple vendor/internal tools in place to run tests

It’s important to point out these common situations because your test approach will differ slightly between them. I have some suggestions for things to watch out for in each case:

No tools in place to run even a simple test

You’re just starting out; this is AWESOME. You get to do this properly from the beginning! You will need to consider budget, time, and technical resources when you pick your approach. Rolling your own way to do feature flags might be the only option if you’re tight on budget but have technical resources. That being said, I’d strongly recommend using a vendor, as you will have the benefit of their massive collective experience and testing IP at your fingertips. The main tool I’ve worked with is Optimizely — they have tools for everything and exceptional support if you can afford their enterprise-level service. There are many other tools around, and a simple Google search will give you plenty of options. Come up with your short- and long-term requirements, and make a decision that will serve you now and in the future.
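
If you do end up rolling your own feature flags, the smallest useful version can be tiny. Here’s a rough sketch assuming a JSON config deployed with the app; the file format and flag names are invented for illustration, and a vendor tool adds targeting, scheduling, and results analysis on top of something like this.

```python
import json

# flags.json, deployed alongside the app (an invented format):
# {"new_pricing_page": {"enabled": true, "allow_user_ids": ["u1", "u2"]}}

def load_flags(path="flags.json") -> dict:
    with open(path) as f:
        return json.load(f)

def flag_enabled(flags: dict, name: str, user_id=None) -> bool:
    flag = flags.get(name, {})
    if not flag.get("enabled", False):
        return False
    allow = flag.get("allow_user_ids")
    return allow is None or user_id in allow  # no allowlist means on for everyone

flags = load_flags()
if flag_enabled(flags, "new_pricing_page", user_id="u1"):
    print("render the new pricing page")
else:
    print("render the current pricing page")
```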

Custom tooling in place to run tests

There are a few instances in which you’d have your own tools. Sometimes legal restrictions, regulation, or privacy concerns don’t allow you to use vendor tools. Or maybe you just have a strong opinion about how your tests should be implemented. Either way, you have your own testing tools. It’s likely you don’t have as many testing options as someone using Optimizely or similar, so you’ll need to make sure you tailor your testing approach to fit your unique constraints. For example, your internal tooling might not support limiting test users to a set of IP addresses, meaning you cannot limit a test to a set of internal users only. Adjust accordingly. Consider moving to an experimentation tooling vendor if you need more power and don’t have any constraints.

A single vendor provided tool in place to run tests

This is the simplest position to be in. You will use your tool in the way it was designed and not face too much complexity. Consider adding to your toolset where it makes sense. What I’ve found with a tool like Optimizely is that I end up wanting to connect my test results into Mixpanel so that I can measure the wider-reaching impact of a test.

Multiple vendor/internal tools in place to run tests

If you’re in this situation, you’re poised to make the best possible experiments. You have flexibility, but you also have complexity. Make sure you understand how your tools work with each other. For example, if you are using feature flags in your application in addition to a vendor-provided tool to toggle those flags and an analytics tool to get more robust data, you’ve got a system with three potential points of failure. Test and test again.
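
To illustrate the kind of wiring I mean, here’s a rough sketch in which hypothetical stand-in classes represent your flag tool and your analytics tool. The one idea it shows is recording the assignment wherever you check a flag, so the experiment data and the behavioral data can be joined later.

```python
# Hypothetical stand-ins for a flag vendor client and an analytics client.
class FlagClient:
    def variation(self, flag: str, user_id: str) -> str:
        return "treatment"  # the real client would decide this

class AnalyticsClient:
    def track(self, user_id: str, event: str, properties: dict) -> None:
        print(user_id, event, properties)  # the real client would send this

flags, analytics = FlagClient(), AnalyticsClient()

def assigned_variation(user_id: str, flag: str) -> str:
    variation = flags.variation(flag, user_id)
    # Record the assignment in the analytics tool as well, so experiment
    # results and behavioral data can be joined on the same user and flag.
    analytics.track(user_id, "experiment_assigned",
                    {"flag": flag, "variation": variation})
    return variation

if assigned_variation("user-42", "new-docs-nav") == "treatment":
    print("render the new docs navigation")
```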

For every situation you’re in, you need to consider one important thing: You must be able to measure the outcome of your test.

Setting up your tests properly by testing them

It’s happened to me a few times: we’ve set up an awesome test, run it for a few weeks or even months, and then checked in on the analytics only to learn that we had missed tracking a key event. While we could still get some value in these scenarios, ideally I’d have tested more thoroughly against my testing plan and picked up on these issues earlier.

Run through your test in as many different scenarios as possible. Try different browsers or try taking strange routes through your user flow — basically, be the most annoying user you could possibly imagine! Check that for each test you run, you were able to see every important behavior you need to get a robust result in your analytics. This might drive out additional events and metrics you need to track. If it does, celebrate that you caught it now and not a few days/weeks/months into your test.
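
One lightweight way to “test the test” is to script a check of the events you actually captured against the events your testing plan requires. A rough sketch, assuming you can export or capture the tracked events from a dry run; the event names are illustrative.

```python
# Events the testing plan says this experiment must emit (illustrative names).
REQUIRED_EVENTS = {"experiment_assigned", "pricing_page_viewed", "clicked_signup"}

def missing_events(recorded: list) -> set:
    """Return any required events that never showed up during a dry run."""
    seen = {entry["event"] for entry in recorded}
    return REQUIRED_EVENTS - seen

# `recorded` would come from a debug export of your analytics tool after
# clicking through the flow in a few browsers and in a few odd orders.
recorded = [{"event": "experiment_assigned"}, {"event": "pricing_page_viewed"}]
gaps = missing_events(recorded)
if gaps:
    print("Fix tracking before launch; missing:", ", ".join(sorted(gaps)))
```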

Use conventions

There’s a good chance your team or someone in your company is using a convention when tracking metrics and setting up tests. If not, now’s the time to start! If so, follow those conventions.

In a company I worked with, we consolidated the Optimizely accounts for all the teams across the company, making all our tests visible to everyone. This helped us keep all our experiments, variations and events named in similar ways. We also created a joint analytics project in Mixpanel so that we could stay honest when naming events.

Not convinced about the value of conforming? Good on you, you rebel. But… there are times to conform, and this is one of them. Think about a few months/years in the future when your CFO wants some very important numbers and your data team needs some behavioral data from your app/site to match to some data from accounts. If you’ve decided to call your customers “users” and the account team calls them “accounts,” you now have to map “users” to “accounts” or vice versa! Not fun.

Want to run an experiment that spans your marketing site and your user dashboard, but page views are tracked as “page view” on the marketing site and “pageview” on the user dashboard? Now you’ve added a layer of complexity for anyone using your data, and for yourself when you want to build reports.
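
One cheap way to enforce a naming convention is to keep event names in a single shared module that every surface imports, rather than typing strings by hand. A small sketch; the module and names are just examples.

```python
# events.py: the single source of truth for event names, imported by the
# marketing site, the user dashboard, and anything else that tracks behavior.
PAGE_VIEW = "page_view"
SIGNUP_CLICKED = "signup_clicked"
APPLICATION_COMPLETED = "application_completed"

# Elsewhere in the codebase:
#   from events import PAGE_VIEW
#   analytics.track(user_id, PAGE_VIEW, {"path": "/pricing"})
```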

We recently created a working group of all the major users of web analytics in our company and called it the Behavioral Analytics team. We casually meet up once every few weeks and maintain a wiki space to share our conventions, knowledge, and discoveries. It’s driven us to learn more, create more exciting experiments, and allowed us to further our data-driven agenda into new areas of the company.

In conclusion

I’ve tried to outline some of the thinking I’ve done on analytics and experimentation above. It is by no means exhaustive, or even necessarily right for your team, product, or company. I do, however, hope that it will spark some conversation or debate, and maybe, just maybe, it’ll be insightful for somebody.

Thanks to Grace Greenwood for teaching me about commas and making many many edits :).
