Interpret Module
The xpyrment.interpret module contains submodules and functions for interpreting experiment results.
interpret
Experimental result interpretation, practical significance, and product launch decision-making.
This package provides a suite of high-level diagnostic and decision tools to help experimenters go beyond basic p-values, mapping raw statistical estimations directly to business utility and product decisions.
Submodules:
- decision: Integrates statistical and economic constraints to output structured launch recommendations.
- effect_size: Computes standardized scale-free effect sizes (e.g., Cohen's d).
- hte: Evaluates Heterogeneous Treatment Effects (HTE) and executes subgroup interaction screening.
- significance: Assesses whether observed lifts meet Minimum Valuable Effect (MVE) business targets.
| MODULE | DESCRIPTION |
|---|---|
| decision | Automated product decision logic and feature launch recommendations. |
| effect_size | Standardized effect size computation, focusing on scale-free difference metrics. |
| hte | Heterogeneous Treatment Effect (HTE) discovery and subgroup diagnostics. |
| significance | Practical vs. statistical significance evaluations and MVE boundaries. |
| FUNCTION | DESCRIPTION |
|---|---|
| generate_launch_recommendation | Generates automated ship, no-ship, or inconclusive launch recommendations based on statistical and economic bounds. |
| compute_cohens_d | Computes a standardized effect size using Cohen's d formula. |
| scan_subgroups_for_hte | Scans demographics/segments to detect Heterogeneous Treatment Effects (HTE) across cohorts. |
| check_practical_significance | Verifies if the measured lift satisfies the minimal valuable business effect (MVE). |
generate_launch_recommendation
generate_launch_recommendation(
p_value: float,
relative_lift: float,
cost_threshold: float = 0.0,
) -> str
Generates automated ship, no-ship, or inconclusive launch recommendations based on statistical and economic bounds.
Translates statistical estimates and uncertainty intervals into actionable product decisions. Importantly, a statistically significant positive effect is not always sufficient to justify a product launch. Every feature introduces operational overhead, maintenance costs, and potential technical debt. Therefore, the decision engine incorporates an economic cost threshold (\(C\)) to evaluate profitability.
Mathematical Decision Boundaries
Let \(\hat{\theta}\) be the estimated relative treatment effect (relative_lift), let \([\theta_{\text{lower}}, \ \theta_{\text{upper}}]\) be the \(1 - \alpha\) confidence interval, and let \(C\) be the minimum acceptable relative improvement (cost_threshold) required to offset the feature's operational costs.
The recommendation engine maps these boundaries to three distinct decision states:

1. SHIP: The treatment effect is statistically significant (\(p < \alpha\)), and the estimated lift exceeds the cost threshold: $$ \hat{\theta} > C \quad \text{and} \quad p < \alpha $$ (For a more conservative strategy, require that the worst-case benefit covers the costs: \(\theta_{\text{lower}} \ge C\).)
2. NO-SHIP (Uneconomic): The treatment effect is statistically significant (\(p < \alpha\)), but the benefit is too small to justify the operational overhead: $$ \hat{\theta} \le C \quad \text{and} \quad p < \alpha $$
3. INCONCLUSIVE (Underpowered): There is insufficient statistical evidence to reject the null hypothesis of no effect: $$ p \ge \alpha $$ This occurs when the sample size was too small to resolve the treatment effect, or when the true treatment effect is actually zero.
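The decision states above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `sketch_launch_decision`, the `alpha` parameter, and the returned labels are hypothetical (the documented signature takes only p_value, relative_lift, and cost_threshold).

```python
def sketch_launch_decision(
    p_value: float,
    relative_lift: float,
    cost_threshold: float = 0.0,
    alpha: float = 0.05,  # assumed significance level; not in the documented signature
) -> str:
    """Map (p, lift, C) to one of the decision states described above."""
    if p_value >= alpha:
        # Not enough evidence to reject the null hypothesis.
        return "INCONCLUSIVE"
    if relative_lift > cost_threshold:
        # Significant, and the point estimate clears the economic bar.
        return "SHIP"
    # Significant, but the lift does not offset the operational cost.
    return "NO-SHIP"


print(sketch_launch_decision(0.01, 0.04, cost_threshold=0.02))  # SHIP
print(sketch_launch_decision(0.01, 0.01, cost_threshold=0.02))  # NO-SHIP
print(sketch_launch_decision(0.20, 0.04, cost_threshold=0.02))  # INCONCLUSIVE
```

Checking the p-value first means a large but noisy point estimate can never produce a ship decision on its own.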
| PARAMETER | DESCRIPTION |
|---|---|
| p_value | The calculated p-value of the primary metric test. TYPE: float |
| relative_lift | The estimated relative difference between treatment and control, \((\bar{Y}_T - \bar{Y}_C) / \bar{Y}_C\). TYPE: float |
| cost_threshold | The minimum relative benefit required to warrant deployment (\(C\)). Defaults to 0.0. TYPE: float |
| RETURNS | DESCRIPTION |
|---|---|
| str | A structured recommendation string outlining the statistical and economic rationale. |
Source code in src\xpyrment\interpret\decision.py
compute_cohens_d
Computes a standardized effect size using Cohen's d formula.
Cohen's d (Cohen, 1988) is a standardized, scale-free effect size measure representing the difference between two group means in terms of standard deviation units. While p-values measure the statistical evidence against a null hypothesis (and are heavily dependent on sample size), Cohen's d measures the practical magnitude of the treatment effect, making it comparable across entirely different metrics or experiments.
Mathematical Formulation
Let \(N_A\), \(N_B\) be sample sizes, let \(\bar{X}_A\), \(\bar{X}_B\) be sample means, and let \(s_A^2\), \(s_B^2\) be the unbiased sample variances of the two experimental groups (Control A and Treatment B respectively).
The pooled sample standard deviation \(s_{\text{pooled}}\) is defined as: $$ s_{\text{pooled}} = \sqrt{\frac{(N_A - 1)s_A^2 + (N_B - 1)s_B^2}{N_A + N_B - 2}} $$ The Cohen's d statistic is computed as: $$ d = \frac{\bar{X}_B - \bar{X}_A}{s_{\text{pooled}}} $$
Standard Classification Heuristics:

- \(|d| < 0.2\): Negligible effect size.
- \(0.2 \le |d| < 0.5\): Small effect size (e.g., most successful digital A/B tests).
- \(0.5 \le |d| < 0.8\): Medium effect size.
- \(|d| \ge 0.8\): Large effect size (indicates highly impactful, structural interventions).
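As a worked illustration of the pooled-variance formula, here is a minimal standard-library sketch. The name `cohens_d_sketch` and the toy data are assumptions for this example; the library's own compute_cohens_d may differ in input handling and validation.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d_sketch(group_a, group_b):
    """Cohen's d with a pooled standard deviation (A = control, B = treatment)."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance uses the unbiased (n - 1) denominator.
    var_a, var_b = variance(group_a), variance(group_b)
    s_pooled = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean(group_b) - mean(group_a)) / s_pooled


control = [2.0, 4.0, 6.0]    # mean 4, sample variance 4
treatment = [3.0, 5.0, 7.0]  # mean 5, sample variance 4
d = cohens_d_sketch(control, treatment)
print(d)  # 0.5: pooled SD is 2 and the mean difference is 1
```

With these toy values the result sits at the lower edge of the medium band in the heuristics above.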
| PARAMETER | DESCRIPTION |
|---|---|
| group_a | 1D array of outcomes for Control (Group A). |
| group_b | 1D array of outcomes for Treatment (Group B). |
| RETURNS | DESCRIPTION |
|---|---|
| float | The calculated Cohen's d statistic. |
Source code in src\xpyrment\interpret\effect_size.py
scan_subgroups_for_hte
scan_subgroups_for_hte(
df: DataFrame,
treatment_col: str,
metric_col: str,
segment_cols: list,
) -> dict
Scans demographics/segments to detect Heterogeneous Treatment Effects (HTE) across cohorts.
The Average Treatment Effect (ATE) can often be misleading if different user subgroups respond in opposite directions. For instance, a feature might increase engagement for new users but severely degrade it for power users. Identifying these Heterogeneous Treatment Effects (HTE) is critical for personalized targeting and risk mitigation.
The Statistical Threat of Naive Subgroup Sweeping
A common mistake is to perform independent t-tests across numerous segments (e.g., checking 20 different countries). Doing so dramatically inflates the probability of false positives due to multiple testing. For \(g\) independent tests, the family-wise error rate is: $$ \text{FWER} = 1 - (1 - \alpha)^g $$ where \(g\) is the number of subgroups. If \(g=20\) and \(\alpha=0.05\), there is a \(64\%\) chance of detecting at least one "significant" subgroup effect purely by random chance.
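The inflation is easy to check numerically. The helper below is a hypothetical illustration (it is not part of the module) and assumes the tests are independent, as the formula does.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


for g in (1, 5, 20):
    print(g, round(family_wise_error_rate(0.05, g), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```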
To prevent false discoveries, this module implements a two-stage diagnostic framework:

1. Global Interaction Filtering: Rather than running isolated tests on individual subgroups, we fit an integrated regression model containing an interaction term between the treatment assignment indicator \(T\) and the subgroup variable \(S\): $$ Y_i = \beta_0 + \beta_1 T_i + \beta_2 S_i + \beta_3 (T_i \times S_i) + \varepsilon_i $$ We only report subgroup-specific effects if the joint interaction coefficient \(\beta_3\) is statistically significant (\(p < 0.05\)).
2. Causal Partitioning (Advanced): Uses algorithmic techniques (such as Causal Trees or Forests, Wager and Athey 2018) that recursively split the covariate space to maximize the difference in treatment effects between leaves, using sample-splitting to prevent overfitting and ensure honest confidence intervals.
Pseudocode for Subgroup HTE Sweeping
    function scan_subgroups_for_hte(DataFrame df, String treatment_col, String metric_col, List segment_cols):
        Initialize hte_results = {}
        For each segment in segment_cols:
            - Fit OLS: metric_col ~ treatment_col * segment
            - Compute F-test for the significance of the interaction term.
            - If interaction p-value < 0.05:
                - Calculate specific treatment lifts and confidence intervals within each level of the segment.
                - Add results to hte_results[segment]
        Return hte_results
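For intuition about the interaction step, here is a minimal standard-library sketch restricted to a binary treatment and a single binary segment. It is not the module's implementation: `interaction_test_2x2` is a hypothetical name, and instead of an OLS fit with an F-test it uses the fact that, in this 2x2 case, the interaction coefficient \(\beta_3\) equals a difference-in-differences of cell means, with a large-sample normal approximation for the p-value. A real implementation would more likely rely on a regression library and handle multi-level segments.

```python
from math import sqrt
from statistics import NormalDist, mean


def interaction_test_2x2(rows):
    """rows: iterable of (treated, in_segment, outcome) with 0/1 flags.

    Returns (beta3, p_value) for the interaction in Y ~ T + S + T*S.
    The p-value uses a normal approximation (an assumption of this sketch).
    """
    cells = {(t, s): [] for t in (0, 1) for s in (0, 1)}
    for t, s, y in rows:
        cells[(t, s)].append(y)

    means = {key: mean(values) for key, values in cells.items()}
    # Treatment effect inside the segment minus treatment effect outside it.
    beta3 = (means[(1, 1)] - means[(0, 1)]) - (means[(1, 0)] - means[(0, 0)])

    # Residual variance of the saturated model: pooled within-cell deviations.
    n_total = sum(len(values) for values in cells.values())
    rss = sum((y - means[key]) ** 2 for key, values in cells.items() for y in values)
    sigma2 = rss / (n_total - 4)
    std_err = sqrt(sigma2 * sum(1.0 / len(values) for values in cells.values()))

    z = beta3 / std_err
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return beta3, p_value


# Toy data: the treatment adds 1 outside the segment and 4 inside it.
rows = (
    [(0, 0, y) for y in (1.0, 2.0, 3.0)]
    + [(1, 0, y) for y in (2.0, 3.0, 4.0)]
    + [(0, 1, y) for y in (1.0, 2.0, 3.0)]
    + [(1, 1, y) for y in (5.0, 6.0, 7.0)]
)
beta3, p_value = interaction_test_2x2(rows)
print(beta3)           # 3.0
print(p_value < 0.05)  # True
```

The toy sample is far too small for the normal approximation to be trustworthy; it is only meant to make the arithmetic easy to follow.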
| PARAMETER | DESCRIPTION |
|---|---|
| df | The experimental dataset. TYPE: DataFrame |
| treatment_col | Column containing treatment assignments. TYPE: str |
| metric_col | The outcome metric column. TYPE: str |
| segment_cols | List of categorical columns representing user segments (e.g. TYPE: list |
| RETURNS | DESCRIPTION |
|---|---|
| dict | A dictionary of detected heterogeneous treatment effects, including interaction p-values, segment-specific lifts, and confidence intervals. |
Source code in src\xpyrment\interpret\hte.py
check_practical_significance
Verifies if the measured lift satisfies the minimal valuable business effect (MVE).
Online experimentation platforms often have massive sample sizes, which gives their tests very high statistical power. As a result, minuscule differences (e.g., a \(0.05\%\) improvement in page load time) can yield highly significant p-values (\(p < 0.01\)). However, such a minor improvement may be practically or economically irrelevant, failing to justify the ongoing maintenance overhead of the new code. This function evaluates whether the observed effect size meets or exceeds a pre-defined practical threshold.
Statistical Significance vs. Practical Significance
Let \(\hat{\theta}\) be the estimated treatment effect, let \([\theta_{\text{lower}}, \ \theta_{\text{upper}}]\) be its confidence interval, and let \(\delta_{\text{MVE}}\) be the Minimum Valuable Effect (MVE).
Three scenarios can occur when evaluating significance:

1. Statistically Significant and Practically Significant: The null hypothesis is rejected (\(p < \alpha\)), and the estimated lift meets or exceeds the MVE: $$ \hat{\theta} \ge \delta_{\text{MVE}} \quad \text{and} \quad p < \alpha $$ (Ideally, to be highly confident, we require the entire confidence interval to lie at or above the threshold: \(\theta_{\text{lower}} \ge \delta_{\text{MVE}}\).)
2. Statistically Significant but NOT Practically Significant: The null hypothesis is rejected (\(p < \alpha\)), but the magnitude is trivial: $$ \hat{\theta} < \delta_{\text{MVE}} \quad \text{and} \quad p < \alpha $$ In this case, the feature should generally be rejected despite its "significant" p-value.
3. Practically Significant but NOT Statistically Significant: The point estimate is large (\(\hat{\theta} \ge \delta_{\text{MVE}}\)), but we fail to reject the null hypothesis (\(p \ge \alpha\)). This indicates an underpowered experiment; the sample size was too small to confirm whether the apparent large effect is real or due to noise.
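The scenarios above can be expressed as a small classifier. This is a sketch under stated assumptions: `classify_significance`, its `alpha` parameter, and the returned labels are hypothetical, and the final "neither" branch is added only so the function covers every input. The documented check_practical_significance returns just the boolean comparison of the lift against the MVE.

```python
def classify_significance(
    p_value: float,
    relative_lift: float,
    min_valuable_effect: float,
    alpha: float = 0.05,  # assumed significance level
) -> str:
    """Label a result by statistical and practical significance."""
    statistically_significant = p_value < alpha
    practically_significant = relative_lift >= min_valuable_effect

    if statistically_significant and practically_significant:
        return "statistically and practically significant"
    if statistically_significant:
        return "statistically significant only"
    if practically_significant:
        return "practically significant only (likely underpowered)"
    return "neither"


print(classify_significance(0.001, 0.0005, min_valuable_effect=0.01))
# statistically significant only
print(classify_significance(0.30, 0.05, min_valuable_effect=0.01))
# practically significant only (likely underpowered)
```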
| PARAMETER | DESCRIPTION |
|---|---|
| relative_lift | The estimated relative difference between treatment and control (point estimate). |
| min_valuable_effect | The minimum relative lift required to be practically valuable (\(\delta_{\text{MVE}}\)). |
| RETURNS | DESCRIPTION |
|---|---|
| bool | True if the estimated relative lift meets or exceeds the minimum valuable effect threshold. |