Experiments: good, bad and confusing examples

As a follow-up to searching for value, this post is a short trip down memory lane: how we took our first steps towards becoming at least fairly skilled at finding value as a team.

We never had doubts about why we wanted to be good at experimenting; for us, the joy of experimenting and learning was most of the value. The frustration when our first attempts weren't successful was extremely high. We had tools that helped us run A/B tests, but the process was slow and awkward and didn't really yield the insights we wanted, and looking for gold is not that fun if you don't recognize it when you see it. We figured out how to improve the way we gathered metrics, but we weren't really successful at evaluating multiple metrics together. We started building dashboards fed from Google Analytics and BigQuery, but they only gave vague hints of direction, not strong indications of how users reacted to our tests. We knew the ambition level was high, and when we discussed compromises such as only testing the important changes, frustration built again.

Eventually two engineers planted the seed of the tool that we have put a lot of effort into (and still do) and that gradually became our testing platform. Without turning this into an attempt to beat Beowulf (the epic hero story), it's fair to say that building it required a very particular set of skills; the drawings on the whiteboards and the discussions were hard to follow. The tool was once summarized as a problem at the same complexity level as the problem we were trying to help our users with. So was it worth it, and is it still? Yes, oh yes, because it allowed the whole team to look at the tests. It's not trivial and we still need to help each other, but at least we don't need to be expert data analysts on the team.

Enthusiastic about our new capability, we started testing all kinds of things, and it turned out that even once we had a machine we trusted, we still had a lot to learn and a lot of frustrating test results to discuss. We ran some tests just to calibrate and make sure that we wouldn't find value where there wasn't any (i.e. null tests, where no change is introduced and no significant difference should show up).
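For illustration, here is a minimal sketch of how a null test on a single conversion metric could be evaluated; the function name and the counts are hypothetical examples, not taken from our platform.

```python
# Minimal sketch of evaluating an A/A (null) test on one conversion metric.
# Both groups get the same experience, so the p-value should usually be large;
# frequent "significant" results would hint at a broken setup.
from scipy.stats import chi2_contingency

def null_test_check(conversions_a, users_a, conversions_b, users_b, alpha=0.05):
    # 2x2 contingency table: converted vs. not converted per group
    table = [
        [conversions_a, users_a - conversions_a],
        [conversions_b, users_b - conversions_b],
    ]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value, p_value < alpha  # True here should be a rare false alarm

# Hypothetical counts just to show the call
p, significant = null_test_check(412, 10_000, 398, 10_000)
print(f"p={p:.3f}, significant={significant}")
```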

In the early days we did a fair number of ablation tests, where we simply removed a feature we had built earlier to see if it was actually valuable. Since we had an application built up over a few years, this was fast and easy, although perhaps not that fulfilling. We did find things that were of very low value, but considering how little A/B testing had been done before, it was surprising to see that these tests were often negative. At some point we ran several of them in parallel and probably overdid it, creating combinations of tests where almost no user experience remained.
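An ablation test can be as simple as hiding an existing feature for one group. Below is a minimal sketch of that idea; the assignment helper, experiment name and feature are hypothetical and not the actual API of our platform.

```python
# Minimal sketch of an ablation test behind a deterministic variant assignment.
import hashlib

def variant_for(user_id: str, experiment: str, variants=("control", "ablated")) -> str:
    # Hash user + experiment so a user always sees the same variant
    # for the lifetime of the test.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def show_recommendations(user_id: str) -> bool:
    # In the "ablated" variant the existing feature is simply hidden;
    # downstream metrics tell us whether anyone missed it.
    return variant_for(user_id, "ablate-recommendations") == "control"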

Technical performance of the system was also A/B tested; the assumption is that faster is always better, so let's see how much better! However, we quickly realized that while optimizing specific features indeed leads to increased attraction and usage, and is intuitively good, it can steal too much focus from other things, so a balanced approach is needed. And of course we tried the opposite, artificially slowing things down, which mostly did what we expected and drove users away from those features. The key learning was that everything must be looked at in its context; there are very few things that are certain.
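A sketch of the "artificially slow it down" variant for a single feature might look like this; the delay value and the search handler are made-up illustrations, and the point is only that the slow path is identical except for an added, configurable latency.

```python
# Minimal sketch of a deliberate-slowdown variant for one feature.
import time

ADDED_LATENCY_SECONDS = 0.4  # hypothetical delay for the slow variant

def perform_search(query: str) -> list:
    # Placeholder for the real search implementation
    return [f"result for {query}"]

def handle_search(query: str, variant: str) -> list:
    if variant == "slow":
        time.sleep(ADDED_LATENCY_SECONDS)  # degrade only this feature
    # Same search logic for both variants
    return perform_search(query)
```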

For many tests we started out with fairly good ideas and ended up with confusing results. A common outcome of those tests was that we needed additional metrics. So we started growing the number of metrics in our tests, from 20 to 30, then 40. There was a temporary fear that the number of metrics would snowball out of proportion, but lately we have seen diminishing returns from additional metrics, and the pace at which we add them is decreasing.

After happily testing and either keeping or discarding each change in our code base, we have slowly started to use initial test results to spawn follow-up tests. This has already proven powerful: initial ideas that feel good shouldn't be dismissed too quickly. If the first test is bad in some way, can the result hint at ways to improve it and remove the bad side effects to extract the good value? Once we get better at this, it will save a lot of time if we can consistently create very small, cheap experiments and use them to figure out where and how to invest effort to find high value.

We have found our fair share of bugs too: a bunch of tests were cancelled after a single day of running, simply because the results were obviously bad. On closer inspection, bugs were found or basic assumptions had to be challenged.

We've also had quite a few discussions about what is worth testing and about effort spent on bad ideas. But as tests become cheaper and cheaper, we have reached the point where we settle for testing regardless, and we more frequently have the more interesting thought: "if this bad idea tests successfully, what do we do then?" As we are early in the product cycle, it's still more important to test something quickly and iterate on interesting things than to spend a lot of time on what might have a big impact. Eventually we will probably develop a fairly good gut feeling for potential value, but for now the wise thing is not to trust our biased minds and to test instead.

Despite being as data-driven as we dare, there are still "other" things that must be done: strategic decisions beyond our scope, in which we are but a piece of the picture. There are some basics that are simply expected and "must be there", and there is of course a ton of engineering we must do to give ourselves a great platform on which to run our experiments. It's hard to say what the future holds, but regardless of the product goals, what would it be like to run the most awesome experimentation platform ever? A place where people could pitch their wildest and craziest ideas, and we could turn them into tests without delay?

Be humble, we don't know, let's experiment!
