6 Things That Can Go Wrong In Firebase A/B Testing

A/B testing is a powerful tool for mobile app developers who want to optimize their user experience and increase conversions. Firebase, Google's mobile and web app development platform, makes it easy to run A/B tests on your app and measure their impact on user behavior. However, there are several key things to remember when running A/B tests with Firebase to ensure that your results are accurate and meaningful.

Dima Kaysin · 8 min read

1. Activation events matter

When setting up an experiment in Firebase, targeting is one of the key parameters.
Targeting fields and Activation events can limit your experiment to a particular subset of users.

With targeting, users who do not satisfy selected criteria are excluded from the experiment and see no experiment variants.

The Activation event works differently: users are exposed to experiment variants whether or not they have triggered the selected activation event, but only users who triggered it are counted by Firebase when presenting results.

Therefore, if your Activation event is rare, it should not be surprising that many more users will be exposed to the experiment than counted towards the result in the Firebase console.

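For example, if you want to serve a paywall with a 90% price discount only to users who have invited more than 90 friends to the app, setting the Activation event to "invited_90_friends" is probably not a good idea: every user in the experiment would still be served the discounted paywall, and only the few who triggered the event would show up in the results. Restricting the audience through targeting is the better fit here.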
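The gap between exposed and counted users is simple arithmetic. The sketch below is only an illustration: the function name and the 0.2% activation rate are hypothetical, not figures from Firebase.

```python
def exposure_vs_counted(exposed_users: int, activation_rate: float) -> dict:
    """Split exposed users into those Firebase would count and those it would not.

    activation_rate is the (assumed) share of exposed users who trigger
    the activation event at some point during the experiment.
    """
    if not 0.0 <= activation_rate <= 1.0:
        raise ValueError("activation_rate must be between 0 and 1")
    counted = round(exposed_users * activation_rate)
    return {
        "exposed": exposed_users,
        "counted": counted,
        "exposed_not_counted": exposed_users - counted,
    }


# Hypothetical numbers: 100,000 users receive a variant,
# but only 0.2% ever trigger the rare activation event.
summary = exposure_vs_counted(100_000, 0.002)
print(summary)
```

With a rare event, almost everyone who sees the variant is invisible in the console results.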


2. Remote config may not load when you think it does

Experiments in Firebase are powered by remote config.

A specific implementation of fetching and activating the remote config within the app can directly impact the experiments since a user enters the experiment only after the remote config is fetched and activated.

There are several remote config loading strategies. For A/B testing, however, it is usually best to have the most up-to-date config fetched and activated at app startup.

One common issue related to this arises when the Activation event happens before the remote config is fetched and activated. In this case, the variant config is still being served to the user, but they are not taken into account by Firebase when presenting results.

For example, this will happen if you try to target new users by setting the Activation event to "first_open".
First open is an automatically collected event that is generated when a newly installed app contacts the Firebase server for the first time.

In this case, the first_open event will most likely be triggered before your app can load any remote config parameters, so you won’t be able to gather a sizable sample for your experiment.
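The timing rule described in this section can be sketched as a toy model. This is not Firebase's internal logic, only an illustration of the behavior described above; the class, field names, and timestamps are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserTimeline:
    """Hypothetical per-user timeline, in seconds since first launch."""
    config_activated_at: Optional[float]  # None if the config was never activated
    activation_event_times: List[float] = field(default_factory=list)


def is_counted(user: UserTimeline) -> bool:
    """Assumed rule: a user is counted only if the activation event
    fires at or after the moment the remote config was activated."""
    if user.config_activated_at is None:
        return False
    return any(t >= user.config_activated_at for t in user.activation_event_times)


# first_open fires at launch, before the config has been fetched and activated.
first_open_user = UserTimeline(config_activated_at=1.5, activation_event_times=[0.0])

# An onboarding-start event that fires after the config is active.
onboarding_user = UserTimeline(config_activated_at=1.5, activation_event_times=[3.0])

print(is_counted(first_open_user), is_counted(onboarding_user))
```

Under this model, a one-off event that fires before activation can never bring the user into the results, no matter how many users receive the variant.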

For filtering new users, as an alternative to first_open, you can set the activation event to the start of onboarding (assuming the remote config is loaded by this point) or target by first open time.


3. Mind the A/B test-related user experience

If you run price experiments or experiments with different paid packages (e.g., monthly/yearly subscriptions vs. weekly/quarterly), there is a good chance that some of your users will see different offers at different points in time. This can lead to user confusion or even frustration.

To avoid that, you can frame the alternative variants in experiments in a manner that is consistent with the base variant.

For example, if you simply drop the price from $9.99 to $8.99 in an experiment variant, new users who have only ever seen the $8.99 offer are set up for a surprising price hike once the experiment ends.
In this case, it makes sense to present the $8.99 price as a limited-time 10% discount on the regular $9.99 price.


4. What the hell is “Modeled Data”?

Firebase automatically analyzes the experiment and presents results in two panels – Observed and Modeled data.

While Observed data is pretty self-explanatory, Modeled data is more opaque since it is based on Bayesian modeling.
When it comes to statistics, most people are more familiar with the frequentist approach, which relies on concepts such as hypothesis testing, confidence intervals, and p-values.

Firebase mainly uses the Bayesian approach because frequentist interpretation of experimental results is less intuitive than Bayesian interpretation IF you want to be precise in your statements.
For example, when applying the frequentist approach to the comparison of conversion rates between two variants, the resulting p-value is NOT the probability that conversion is the same between variants.

It is also NOT the probability that the observed conversion rates happened purely by chance.
Correct frequentist interpretation gets a bit technical and can be easily misinterpreted.

Bayesian inference, which Firebase uses, on the other hand, provides a direct answer to the question: “What is the probability that variant X is better than variant Y?”
This probability metric can be used as a measure of confidence in the experiment results.

Put differently, if you run sufficiently many experiments and in each one pick a winning variant that has a 90% probability of beating the base, you can expect, on average, to have picked the truly better variant 9 out of 10 times.
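To make the "probability of beating the base" more concrete, here is a minimal sketch of one common Bayesian approach: a Beta-Binomial model with a uniform prior, evaluated by Monte Carlo sampling. It is a generic illustration, not the exact model Firebase uses, and the conversion counts are made up.

```python
import random


def probability_to_beat_base(
    base_conversions: int,
    base_users: int,
    variant_conversions: int,
    variant_users: int,
    draws: int = 20_000,
    seed: int = 0,
) -> float:
    """Estimate P(variant conversion rate > base conversion rate).

    Each conversion rate gets a Beta(1 + conversions, 1 + non-conversions)
    posterior, i.e. a uniform Beta(1, 1) prior updated with the observed data.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        base_rate = rng.betavariate(
            1 + base_conversions, 1 + base_users - base_conversions
        )
        variant_rate = rng.betavariate(
            1 + variant_conversions, 1 + variant_users - variant_conversions
        )
        if variant_rate > base_rate:
            wins += 1
    return wins / draws


# Made-up data: 100 of 1,000 users convert on the base, 120 of 1,000 on the variant.
print(probability_to_beat_base(100, 1000, 120, 1000))
```

The returned number reads directly as "how likely is the variant to be better", which is the kind of statement a p-value cannot give you.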


5. When Firebase console is not enough, integration with BigQuery can give you more details

Firebase / Google Analytics has strong integration with BigQuery, which is Google's data warehouse solution, allowing you to save raw events to BigQuery.
A free tier permits exporting up to 1 million events per day.

With a smart approach to event setup and limited export to the most relevant event types, this quota can go a long way, even for medium-sized apps. However, beware of additional charges you may incur on the BigQuery side.

The data from A/B testing can be easily integrated into the data saved in BigQuery. Having detailed experiment data in BigQuery is excellent if you want a more granular view of your users' behavior.

You can add extra dimensions to your analysis, such as comparing the performance of your variations by state/city, device model, or day of the week. Additionally, with raw data in BigQuery, you can dissect each step of your funnel, which provides in-depth insight into user behavior and more ideas for your next experiments.

6. Deal with BigQuery integration BEFORE running experiments

It's important to note that the export is not retroactive: you cannot save data from previous experiments in Firebase to BigQuery, so it's necessary to set up data export beforehand.

Exported raw events data provides many otherwise unavailable possibilities, but using it requires additional effort.

The exported data only includes raw events, so nothing is pre-computed, including conversion rates, probabilities, and model outputs.

You need to build your own methods for deriving insights from raw data, which requires understanding the Firebase event schema and writing SQL queries to obtain even the most basic experiment data.
Looker Studio is useful for analyzing raw events data, particularly because of its native BigQuery connector.
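As a starting point, here is a sketch of a helper that builds a basic per-variant query. It assumes the export stores experiment membership as a user property named firebase_exp_<experiment id> whose string value is the variant index. The table name, experiment id, and event name are placeholders, and the helper only builds the SQL text: running it would require a BigQuery client or the BigQuery console.

```python
import re


def build_variant_conversion_query(
    table: str, experiment_id: int, conversion_event: str
) -> str:
    """Build a BigQuery SQL string counting users and converters per variant.

    Assumption: each exported event row carries a user property
    'firebase_exp_<experiment_id>' holding the variant index as a string.
    """
    # Validate inputs before interpolating them into the SQL text.
    if not re.fullmatch(r"[A-Za-z0-9_.\-*]+", table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not re.fullmatch(r"[A-Za-z0-9_]+", conversion_event):
        raise ValueError(f"unexpected event name: {conversion_event!r}")
    experiment_key = f"firebase_exp_{int(experiment_id)}"
    return f"""
SELECT
  up.value.string_value AS variant,
  COUNT(DISTINCT user_pseudo_id) AS users,
  COUNT(DISTINCT IF(event_name = '{conversion_event}', user_pseudo_id, NULL)) AS converters
FROM `{table}`, UNNEST(user_properties) AS up
WHERE up.key = '{experiment_key}'
GROUP BY variant
ORDER BY variant
""".strip()


# Placeholder project, dataset, experiment id, and event name.
print(build_variant_conversion_query("my-project.analytics_123456.events_*", 7, "purchase"))
```

Dividing converters by users gives the observed conversion rate per variant. Extra dimensions such as device model or day of the week can be added to the SELECT and GROUP BY clauses.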

Conclusion

Here at BlueThrone, we believe that running A/B tests is a must if you want to truly maximize your app’s potential.
However, doing it right can be a challenging and time-consuming task that requires knowing your users and having experience with the best practices and tools.
We’ve addressed a few of the gotchas of Firebase A/B testing, but there is so much more to it, so we will continue looking into A/B testing in more detail in the upcoming articles.

Dima Kaysin


I’m deeply engaged in and passionate about driving app growth and decision-making through data, and I build robust systems and tools at BlueThrone. My expertise in product analytics, data engineering, and financial modeling, coupled with my background in project management and investment analysis, fuels the development of innovative, data-driven solutions in the thriving app industry.

