Simpson’s paradox occurs when we observe a certain trend in the aggregate data but not in the underlying segments that comprise the data. In the A/B testing domain, Simpson’s Paradox can occur when the overall mean conversion rate and/or average order value of the experiences tested point to a result different from the mean conversion rates and/or average order values of the underlying segments.
Let me illustrate this with an example from a blog post by Georgi Georgiev, an instructor at CXL. Suppose you run an A/B test between Page A and Page B and see the following results:
[Image: Aggregate A/B test results]
Looking at the average conversion rate, it looks like you have a conclusive test with B beating A (assuming the sample size requirements and other conditions such as statistical significance and power were met). But before you take that victory lap around the office, you see something completely unexpected. When you segment the data by the different traffic sources, you see that A has outperformed B for each traffic source!
[Image: A/B test results broken down by traffic source]
What does this mean? How is this even possible? This is a classic example of Simpson’s Paradox.
Simpson’s paradox is essentially caused by weighted averages. In the example above, when we combine the results by traffic sources, the dominant traffic source for each of the variants heavily influences the aggregate conversion rates, thereby switching the direction of the results. In other words, the following two things happen:
- Page A’s conversion rate (5.6%) is heavily influenced by the conversion rate of Traffic Source 1 (5%), which accounts for 75% of its traffic.
- Page B’s conversion rate (7.3%) is heavily influenced by the conversion rate of Traffic Source 3 (8%), which accounts for over 80% of its traffic.

The traffic source volume in this case is called a “lurking” or confounding variable. It is unevenly distributed between the experiences and is in fact responsible for the observed results. This can easily move our test dangerously close to comparing apples to oranges.
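To make the weighted-average arithmetic concrete, here is a small Python sketch. The segment-level counts from the article’s table are not reproduced in this text, so the numbers below are hypothetical, chosen only to reproduce the pattern: A wins in every segment, yet B wins in aggregate because each page receives a very different traffic mix.

```python
# Hypothetical (visitors, conversions) per traffic source and page.
segments = {
    "Traffic Source 1": {"A": (7500, 375),   # 5.0%
                         "B": (1000, 40)},   # 4.0%
    "Traffic Source 3": {"A": (2500, 225),   # 9.0%
                         "B": (9000, 720)},  # 8.0%
}

# Per-segment conversion rates: Page A beats Page B in every segment.
for name, pages in segments.items():
    rates = {p: c / v for p, (v, c) in pages.items()}
    print(name, {p: f"{r:.1%}" for p, r in rates.items()})

# Aggregate rates: weighting by each page's own traffic mix flips the result.
for page in ("A", "B"):
    visitors = sum(pages[page][0] for pages in segments.values())
    conversions = sum(pages[page][1] for pages in segments.values())
    print(f"Overall {page}: {conversions / visitors:.1%}")
```

With these made-up counts, Page A wins each segment (5.0% vs. 4.0% and 9.0% vs. 8.0%) but loses overall (6.0% vs. 7.6%), purely because 75% of A’s traffic comes from the weaker source while most of B’s comes from the stronger one.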
Another way Simpson’s Paradox can creep into A/B testing is through what is known as “ramping up”, which occurs when the traffic allocation between experiences is changed while the test is running.
Ronny Kohavi from Microsoft shared an example wherein a website got one million daily visitors on both Friday and Saturday. On Friday, 1% of the traffic was assigned to the treatment (i.e. the variation), and on Saturday that percentage was raised to 50%.
Even though the treatment had a higher conversion rate than the Control on both Friday (2.30% vs. 2.02%) and Saturday (1.2% vs. 1.00%), when the data was combined over the two days, the treatment seemed to underperform (1.20% vs. 1.68%).
This is again because we are dealing with weighted averages. The data from Saturday, a day with an overall worse conversion rate, impacted the treatment far more than Friday’s data did, because almost all of the treatment’s traffic arrived on Saturday.
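The combined numbers can be reproduced directly from the figures quoted above. The sketch below derives visitor and conversion counts from the stated daily rates and traffic splits; the pooled treatment rate comes out near 1.2%, well below the pooled control rate of about 1.68%, even though the treatment wins on both days.

```python
# (visitors, conversions) per day and arm, derived from the rates quoted above.
data = {
    "Friday":   {"control":   (990_000, 20_000),  # ~2.02%
                 "treatment": (10_000, 230)},      # 2.30%
    "Saturday": {"control":   (500_000, 5_000),    # 1.00%
                 "treatment": (500_000, 6_000)},   # 1.20%
}

for arm in ("control", "treatment"):
    visitors = sum(day[arm][0] for day in data.values())
    conversions = sum(day[arm][1] for day in data.values())
    print(f"Combined {arm}: {conversions / visitors:.2%}")

# Almost all of the treatment's traffic comes from Saturday, the weaker day,
# so the weighted average drags the treatment's combined rate below the control's.
```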
[Image: Simpson’s Paradox due to a change in traffic allocation between experiences]
Make sure that the samples are completely randomized and free from bias, which means a visitor coming to the page is equally likely to see any of the experiences. This will ensure that the distribution of visitors from different traffic sources, browsers, etc. is comparable across the experiences, and that the underlying differences in conversion rates do not impact one experience more than the other.
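One common way to get this kind of unbiased assignment is deterministic hash-based bucketing, so that every visitor has the same chance of landing in each experience regardless of traffic source, device or browser. A minimal sketch, assuming a hypothetical visitor_id and experiment name:

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a visitor to a variant via a hash of the
    visitor id and experiment name, giving each variant equal probability."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("visitor-123", "landing-page-test"))  # same variant every time for this visitor
```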
Make sure to send the test data to your web analytics tool (Google Analytics, etc.). This is not only important for post-hoc segmentation but can also give you a way to spot such bias early in the test. Segment the experiences based on traffic sources, devices, browsers, etc. to make sure that there are no confounding factors at play.
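Once the raw test data is in an analytics tool, a quick per-segment breakdown is enough to surface an uneven traffic mix. A rough sketch with pandas, using hypothetical exported columns (variant, traffic_source, converted):

```python
import pandas as pd

# Hypothetical visitor-level export from a web analytics tool.
df = pd.DataFrame({
    "variant":        ["A", "A", "B", "B", "A", "B"],
    "traffic_source": ["organic", "paid", "organic", "paid", "paid", "organic"],
    "converted":      [1, 0, 0, 1, 1, 0],
})

# Visitor counts and conversion rates per (traffic source, variant):
# large imbalances in visitor counts across segments are an early warning sign.
summary = (df.groupby(["traffic_source", "variant"])["converted"]
             .agg(visitors="count", conversion_rate="mean"))
print(summary)
```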
As Georgi Georgiev puts it: “It’s [Simpson’s paradox] a most startling example of what failure to segment by meaningful dimensions can lead to. ‘Segment, segment, segment!’ is what this paradox teaches us.”
If you are concerned about the impact of the test on website conversions, instead of changing the traffic allocation between the experiences after starting the test, you may want to allocate a lower percentage of traffic to the test to start with. Based on the stability and performance of the test, you can then increase the traffic to 100%. Should you absolutely need to run with a different traffic allocation between experiences for any reason, start a new test when you are actually ready rather than changing the allocation mid-test.
Use stratified sampling, which is the process of dividing members of the population into homogeneous and mutually exclusive subgroups before sampling. However, most testing tools do not offer this.
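If you do control assignment yourself, stratification can be approximated by randomizing separately within each subgroup. A minimal sketch, assuming the stratum (e.g. traffic source) is known at assignment time; the visitor list and strata below are hypothetical:

```python
import random
from collections import defaultdict
from itertools import cycle

def stratified_assignment(visitors, variants=("A", "B"), seed=42):
    """Assign variants separately within each stratum so that every
    experience receives the same mix of strata."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for visitor_id, stratum in visitors:
        by_stratum[stratum].append(visitor_id)

    assignment = {}
    for stratum, ids in by_stratum.items():
        rng.shuffle(ids)                      # randomize order within the stratum
        for visitor_id, variant in zip(ids, cycle(variants)):
            assignment[visitor_id] = variant  # alternate variants within the stratum
    return assignment

visitors = [("v1", "organic"), ("v2", "organic"), ("v3", "paid"), ("v4", "paid")]
print(stratified_assignment(visitors))
```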
As per Georgi Georgiev, if we are already in such a situation,
“the decision on whether to act on the aggregate or on the by-segment data is up to the story behind the numbers, not the numbers themselves.”
He suggests qualitatively evaluating each pair of confounding variable and experience. For example, we may end up retaining both landing pages because each performs better for different traffic sources (based on seasonality, etc.).
In order to do this in a data-driven manner, we could treat each pair as a separate experience and perform some additional testing until we reach the desired statistically significant result for each pair (currently we do not have significant results pair-wise).
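For the per-pair comparisons, a standard pooled two-proportion z-test is one way to check each traffic-source segment on its own. A rough sketch, using the hypothetical per-segment counts from the earlier example (not the article’s actual data):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical (conversions, visitors) for pages A and B in each segment.
segments = {
    "Traffic Source 1": ((375, 7500), (40, 1000)),
    "Traffic Source 3": ((225, 2500), (720, 9000)),
}
for name, ((ca, na), (cb, nb)) in segments.items():
    z, p = two_proportion_z_test(ca, na, cb, nb)
    print(f"{name}: z = {z:.2f}, p = {p:.3f}")
```

Segments whose p-value stays above the chosen significance threshold would need more traffic (or a dedicated follow-up test) before a per-segment decision can be made.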
Simpson’s Paradox by minutephysics
Are University Admissions Biased? | Simpson’s Paradox Part 2 by minutephysics
Simpson’s paradox on Wikipedia
Segmenting Data for Web Analytics — The Simpson’s Paradox by Georgi Georgiev
Seven Pitfalls to Avoid when Running Controlled Experiments on the Web — Ron Kohavi, Microsoft
The top 3 mistakes that make your A/B test results invalid — Widerfunnel Blog
Validity Threats to Your AB Test and How to Minimize Them — Invespcro Blog
Originally published at: https://medium.com/swlh/how-simpsons-paradox-could-impact-a-b-tests-4d00a95b989b