Interpret Module

The xpyrment.interpret module contains submodules and functions for interpreting experiment results.

interpret

Experimental result interpretation, practical significance, and product launch decision-making.

This package provides high-level diagnostic and decision tools that help experimenters go beyond basic p-values by mapping statistical estimates directly to business utility and product decisions.

Submodules:
- decision: Integrates statistical and economic constraints to output structured launch recommendations.
- effect_size: Computes standardized scale-free effect sizes (e.g., Cohen's d).
- hte: Evaluates Heterogeneous Treatment Effects (HTE) and executes subgroup interaction screening.
- significance: Assesses whether observed lifts meet Minimum Valuable Effect (MVE) business targets.

MODULE	DESCRIPTION
`decision`	Automated product decision logic and feature launch recommendations.
`effect_size`	Standardized effect size computation, focusing on scale-free difference metrics.
`hte`	Heterogeneous Treatment Effect (HTE) discovery and subgroup diagnostics.
`significance`	Practical vs. statistical significance evaluations and MVE boundaries.

FUNCTION	DESCRIPTION
`generate_launch_recommendation`	Generates automated ship, no-ship, or inconclusive launch recommendations based on statistical and economic bounds.
`compute_cohens_d`	Computes a standardized effect size using Cohen's d.
`scan_subgroups_for_hte`	Scans demographics/segments to detect Heterogeneous Treatment Effects (HTE) across cohorts.
`check_practical_significance`	Verifies if the measured lift satisfies the minimal valuable business effect (MVE).

generate_launch_recommendation

generate_launch_recommendation(
    p_value: float,
    relative_lift: float,
    cost_threshold: float = 0.0,
) -> str

Generates automated ship, no-ship, or inconclusive launch recommendations based on statistical and economic bounds.

Translates statistical estimates and uncertainty intervals into actionable product decisions. Importantly, a statistically significant positive effect is not always sufficient to justify a product launch. Every feature introduces operational overhead, maintenance costs, and potential technical debt. Therefore, the decision engine incorporates an economic cost threshold ($C$) to evaluate profitability.

Mathematical Decision Boundaries

Let $\hat{\theta}$ be the estimated relative treatment effect (relative_lift), let $[\theta_{\text{lower}}, \ \theta_{\text{upper}}]$ be the $1 - \alpha$ confidence interval, and let $C$ be the minimum acceptable relative improvement (cost_threshold) required to offset the feature's operational costs.

The recommendation engine maps these boundaries to three distinct decision states:

1. SHIP: The treatment effect is statistically significant ($p < \alpha$), and the estimated lift exceeds the cost threshold:
   $$ \hat{\theta} > C \quad \text{and} \quad p < \alpha $$
   (A more conservative strategy additionally requires the worst-case benefit to cover costs: $\theta_{\text{lower}} \ge C$.)
2. NO-SHIP (Uneconomic): The treatment effect is statistically significant ($p < \alpha$), but the benefit is too small to justify the operational overhead:
   $$ \hat{\theta} \le C \quad \text{and} \quad p < \alpha $$
3. INCONCLUSIVE (Underpowered): There is insufficient statistical evidence to reject the null hypothesis of no effect:
   $$ p \ge \alpha $$
   This occurs when the sample size is too small to resolve the treatment effect, or when the true treatment effect is actually zero.

The implementation listed below fixes $\alpha = 0.05$ and does not take the confidence interval as input, so the conservative check on $\theta_{\text{lower}}$ is not applied.

PARAMETER	DESCRIPTION
`p_value`	The calculated p-value of the primary metric test. TYPE: `float`
`relative_lift`	The estimated relative difference between treatment and control, $(\bar{Y}_T - \bar{Y}_C) / \bar{Y}_C$. TYPE: `float`
`cost_threshold`	The minimum relative benefit required to warrant deployment ($C$). Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`str`	A structured recommendation string outlining the statistical and economic rationale. TYPE: `str`
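
The decision states and the optional conservative check described above can be sketched as follows. This is an illustrative sketch, not the library's implementation: `launch_state` and `passes_conservative_check` are hypothetical names, and exposing `alpha` as a parameter is an assumption (the source listing hard-codes 0.05).

```python
def launch_state(
    p_value: float,
    relative_lift: float,
    cost_threshold: float = 0.0,
    alpha: float = 0.05,  # assumption: the documented function fixes this at 0.05
) -> str:
    """Hypothetical helper returning the decision state as a short label."""
    if p_value >= alpha:
        return "INCONCLUSIVE"
    # Significant result: compare the point estimate with the cost threshold C.
    return "SHIP" if relative_lift > cost_threshold else "NO-SHIP"


def passes_conservative_check(ci_lower: float, cost_threshold: float = 0.0) -> bool:
    """Hypothetical stricter rule: the CI lower bound must also cover C."""
    return ci_lower >= cost_threshold


# A significant 3% lift against a 2% cost threshold, with a CI of [0.5%, 5.5%].
state = launch_state(p_value=0.01, relative_lift=0.03, cost_threshold=0.02)
conservative_ok = passes_conservative_check(ci_lower=0.005, cost_threshold=0.02)
print(state, conservative_ok)
```

In this example the point estimate clears the threshold, but the lower bound of the interval does not, which is exactly the situation the conservative rule is meant to flag.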

Source code in src\xpyrment\interpret\decision.py

def generate_launch_recommendation(p_value: float, relative_lift: float, cost_threshold: float = 0.0) -> str:
    r"""Generates automated ship, no-ship, or inconclusive launch recommendations based on statistical and economic bounds.

    Translates statistical estimates and uncertainty intervals into actionable product decisions.
    Importantly, a statistically significant positive effect is not always sufficient to justify a product launch.
    Every feature introduces operational overhead, maintenance costs, and potential technical debt.
    Therefore, the decision engine incorporates an economic cost threshold ($C$) to evaluate profitability.

    Mathematical Decision Boundaries:
        Let $\\hat{\\theta}$ be the estimated relative treatment effect (relative_lift), let $[\\theta_{\\text{lower}}, \\ \\theta_{\\text{upper}}]$
        be the $1 - \\alpha$ confidence interval, and let $C$ be the minimum acceptable relative improvement (cost_threshold)
        required to offset the feature's operational costs.

        The recommendation engine maps these boundaries to three distinct decision states:
        1. **SHIP**:
           The treatment effect is statistically significant ($p < \\alpha$), and the estimated lift exceeds the cost threshold:
           $$
           \\hat{\\theta} > C \\quad \\text{and} \\quad p < \\alpha
           $$
           (For a highly conservative strategy, we can assert that the worst-case benefit exceeds costs: $\\theta_{\\text{lower}} \\ge C$).
        2. **NO-SHIP (Uneconomic)**:
           The treatment effect is statistically significant ($p < \\alpha$), but the benefit is too small to justify the
           operational overhead:
           $$
           \\hat{\\theta} \\le C \\quad \\text{and} \\quad p < \\alpha
           $$
        3. **INCONCLUSIVE (Underpowered)**:
           There is insufficient statistical evidence to reject the null hypothesis of no effect:
           $$
           p \\ge \\alpha
           $$
           This occurs when the sample size was too small to resolve the treatment effect, or if the true treatment effect
           is actually zero.

    Args:
        p_value (float): The calculated p-value of the primary metric test.
        relative_lift (float): The estimated relative difference between treatment and control, $(\\bar{Y}_T - \\bar{Y}_C) / \\bar{Y}_C$.
        cost_threshold (float): The minimum relative benefit required to warrant deployment ($C$). Defaults to 0.0.

    Returns:
        str: A structured recommendation string outlining the statistical and economic rationale.
    """
    if p_value < 0.05:
        if relative_lift > cost_threshold:
            return "SHIP: Lifts are statistically significant and exceed deployment costs."
        else:
            return "NO-SHIP: Statistically significant but falls below economic margins."
    return "INCONCLUSIVE: No statistical evidence to assert a positive lift."

compute_cohens_d

compute_cohens_d(
    group_a: ndarray, group_b: ndarray
) -> float

Computes a standardized effect size using Cohen's d.

Cohen's d (Cohen, 1988) is a standardized, scale-free effect size measure representing the difference between two group means in terms of standard deviation units. While p-values measure the statistical evidence against a null hypothesis (and are heavily dependent on sample size), Cohen's d measures the practical magnitude of the treatment effect, making it comparable across entirely different metrics or experiments.

Mathematical Formulation

Let $N_A$, $N_B$ be sample sizes, let $\bar{X}_A$, $\bar{X}_B$ be sample means, and let $s_A^2$, $s_B^2$ be the unbiased sample variances of the two experimental groups (Control A and Treatment B respectively).

The pooled sample standard deviation $s_{\text{pooled}}$ is defined as:
$$ s_{\text{pooled}} = \sqrt{\frac{(N_A - 1)s_A^2 + (N_B - 1)s_B^2}{N_A + N_B - 2}} $$
The Cohen's d statistic is computed as:
$$ d = \frac{\bar{X}_B - \bar{X}_A}{s_{\text{pooled}}} $$

Standard Classification Heuristics:
- $|d| < 0.2$: Negligible effect size.
- $0.2 \le |d| < 0.5$: Small effect size (e.g., most successful digital A/B tests).
- $0.5 \le |d| < 0.8$: Medium effect size.
- $|d| \ge 0.8$: Large effect size (indicates highly impactful, structural interventions).

PARAMETER	DESCRIPTION
`group_a`	1D array of outcomes for Control (Group A). TYPE: `ndarray`
`group_b`	1D array of outcomes for Treatment (Group B). TYPE: `ndarray`

RETURNS	DESCRIPTION
`float`	The calculated Cohen's d statistic. TYPE: `float`
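
As a worked example of the formula and the classification heuristics, the sketch below computes Cohen's d for two small arrays. It is a minimal illustration under stated assumptions: `cohens_d` re-implements the pooled formula locally so the block is self-contained, and `classify_effect_size` is a hypothetical helper that is not part of the documented API.

```python
import numpy as np


def cohens_d(group_a, group_b) -> float:
    """Cohen's d with a pooled standard deviation, following the formula above."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = a.size, b.size
    # ddof=1 yields the unbiased sample variances s_A^2 and s_B^2.
    var_a, var_b = a.var(ddof=1), b.var(ddof=1)
    pooled_std = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return float((b.mean() - a.mean()) / pooled_std)


def classify_effect_size(d: float) -> str:
    """Hypothetical helper mapping |d| onto the heuristic labels."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


control = [1.0, 2.0, 3.0, 4.0, 5.0]    # mean 3, sample variance 2.5
treatment = [2.0, 3.0, 4.0, 5.0, 6.0]  # mean 4, sample variance 2.5
d = cohens_d(control, treatment)
print(round(d, 4), classify_effect_size(d))  # 0.6325 medium
```

Both groups have a sample variance of 2.5, so the pooled standard deviation is $\sqrt{2.5} \approx 1.581$ and $d = 1 / 1.581 \approx 0.63$.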

Source code in src\xpyrment\interpret\effect_size.py

def compute_cohens_d(group_a: np.ndarray, group_b: np.ndarray) -> float:
    r"""Computes a standardized effect size using Cohen's d.

    Cohen's d (Cohen, 1988) is a standardized, scale-free effect size measure representing the difference
    between two group means in terms of standard deviation units. While p-values measure the statistical
    evidence against a null hypothesis (and are heavily dependent on sample size), Cohen's d measures the
    *practical magnitude* of the treatment effect, making it comparable across entirely different metrics or experiments.

    Mathematical Formulation:
        Let $N_A$, $N_B$ be sample sizes, let $\\bar{X}_A$, $\\bar{X}_B$ be sample means, and let $s_A^2$, $s_B^2$ be the
        unbiased sample variances of the two experimental groups (Control A and Treatment B respectively).

        The pooled sample standard deviation $s_{\\text{pooled}}$ is defined as:
        $$
        s_{\\text{pooled}} = \\sqrt{\\frac{(N_A - 1)s_A^2 + (N_B - 1)s_B^2}{N_A + N_B - 2}}
        $$
        The Cohen's d statistic is computed as:
        $$
        d = \\frac{\\bar{X}_B - \\bar{X}_A}{s_{\\text{pooled}}}
        $$
    Standard Classification Heuristics:
        - $|d| < 0.2$: Negligible effect size.
        - $0.2 \\le |d| < 0.5$: Small effect size (e.g., most successful digital A/B tests).
        - $0.5 \\le |d| < 0.8$: Medium effect size.
        - $|d| \\ge 0.8$: Large effect size (indicates highly impactful, structural interventions).

    Args:
        group_a (np.ndarray): 1D array of outcomes for Control (Group A).
        group_b (np.ndarray): 1D array of outcomes for Treatment (Group B).

    Returns:
        float: The calculated Cohen's d statistic.
    """
    mean_a, mean_b = np.mean(group_a), np.mean(group_b)
    var_a, var_b = np.var(group_a, ddof=1), np.var(group_b, ddof=1)
    n_a, n_b = len(group_a), len(group_b)

    pooled_std = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled_std > 0:
        return (mean_b - mean_a) / pooled_std
    return 0.0

scan_subgroups_for_hte

scan_subgroups_for_hte(
    df: DataFrame,
    treatment_col: str,
    metric_col: str,
    segment_cols: list,
) -> dict

Scans demographics/segments to detect Heterogeneous Treatment Effects (HTE) across cohorts.

The Average Treatment Effect (ATE) can often be misleading if different user subgroups respond in opposite directions. For instance, a feature might increase engagement for new users but severely degrade it for power users. Identifying these Heterogeneous Treatment Effects (HTE) is critical for personalized targeting and risk mitigation.

The Statistical Threat of Naive Subgroup Sweeping

A common mistake is to perform independent t-tests across numerous segments (e.g., checking 20 different countries). Doing so dramatically inflates the probability of false positives due to multiple testing:
$$ \text{FWER} = 1 - (1 - \alpha)^g $$
where $g$ is the number of subgroups. If $g=20$ and $\alpha=0.05$, there is a $64\%$ chance of detecting at least one "significant" subgroup effect purely by random chance.
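
The arithmetic behind the 64% figure can be checked with a short sketch. The function names are hypothetical, and `per_test_alpha` is only the algebraic inverse of the same formula (valid under the same independence assumption), not something this module is documented to provide.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """FWER = 1 - (1 - alpha)^g for g independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def per_test_alpha(target_fwer: float, n_tests: int) -> float:
    """Inverse of the formula: the per-test level that yields the target FWER."""
    return 1.0 - (1.0 - target_fwer) ** (1.0 / n_tests)


# FWER grows quickly with the number of independent subgroup tests.
for g in (1, 5, 20):
    print(g, round(family_wise_error_rate(0.05, g), 3))

# Per-test level needed to keep the family-wise rate at 5% across 20 tests.
print(per_test_alpha(0.05, 20))
```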

To prevent false discoveries, this module implements a two-stage diagnostic framework:

1. Global Interaction Filtering: Rather than running isolated tests on individual subgroups, we fit an integrated regression model containing an interaction term between the treatment assignment indicator $T$ and the subgroup variable $S$:
   $$ Y_i = \beta_0 + \beta_1 T_i + \beta_2 S_i + \beta_3 (T_i \times S_i) + \varepsilon_i $$
   We only report subgroup-specific effects if the joint interaction coefficient $\beta_3$ is statistically significant ($p < 0.05$).
2. Causal Partitioning (Advanced): Uses algorithmic techniques (such as Causal Trees or Forests, Wager and Athey 2018) that recursively split the covariate space to maximize the difference in treatment effects between leaves, using sample-splitting to prevent overfitting and ensure honest confidence intervals.

Pseudocode for Subgroup HTE Sweeping

function scan_subgroups_for_hte(DataFrame df, String treatment_col, String metric_col, List segment_cols):
    Initialize hte_results = {}
    For each segment in segment_cols:
        - Fit OLS: metric_col ~ treatment_col * segment
        - Compute F-test for the significance of the interaction term.
        - If interaction p-value < 0.05:
            - Calculate specific treatment lifts and confidence intervals within each level of the segment.
            - Add results to hte_results[segment]
    Return hte_results
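
The F-test step in the pseudocode can be sketched with statsmodels by comparing the full interaction model against a main-effects-only model. This is one possible way to obtain a joint p-value, shown under stated assumptions: `joint_interaction_p_value` is a hypothetical helper, the column names and synthetic data are invented for illustration, and the source listing for this function inspects individual coefficient p-values rather than calling this test.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def joint_interaction_p_value(df, treatment_col, metric_col, segment):
    """Joint F-test p-value for all treatment-by-segment interaction terms."""
    # Full model: main effects plus every treatment x segment interaction.
    full = smf.ols(f"{metric_col} ~ {treatment_col} * C({segment})", data=df).fit()
    # Restricted model: main effects only (all interaction terms set to zero).
    restricted = smf.ols(f"{metric_col} ~ {treatment_col} + C({segment})", data=df).fit()
    # compare_f_test returns (F statistic, p-value, difference in degrees of freedom).
    _, p_value, _ = full.compare_f_test(restricted)
    return float(p_value)


# Synthetic data (illustrative only): treatment helps on "ios", hurts on "android".
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, size=n),
    "platform": rng.choice(["ios", "android", "web"], size=n),
})
effect = df["platform"].map({"ios": 1.0, "android": -1.0, "web": 0.0})
df["metric"] = 10.0 + effect * df["treated"] + rng.normal(0.0, 1.0, size=n)

p_joint = joint_interaction_p_value(df, "treated", "metric", "platform")
print(f"joint interaction p-value: {p_joint:.3g}")
```

A single joint test per segment column yields one p-value regardless of how many levels the segment has, which avoids re-introducing a multiple-testing problem across levels.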

PARAMETER	DESCRIPTION
`df`	The experimental dataset. TYPE: `DataFrame`
`treatment_col`	Column containing treatment assignments. TYPE: `str`
`metric_col`	The outcome metric column. TYPE: `str`
`segment_cols`	List of categorical columns representing user segments (e.g. `["platform", "country"]`). TYPE: `list`

RETURNS	DESCRIPTION
`dict`	A dictionary keyed by segment column. Each entry, present only when the interaction is significant, holds the smallest interaction p-value (`interaction_p_value`) and the relative lift within each segment level (`subgroup_lifts`). TYPE: `dict`

Source code in src\xpyrment\interpret\hte.py

def scan_subgroups_for_hte(df: pd.DataFrame, treatment_col: str, metric_col: str, segment_cols: list) -> dict:
    r"""Scans demographics/segments to detect Heterogeneous Treatment Effects (HTE) across cohorts.

    The Average Treatment Effect (ATE) can often be misleading if different user subgroups respond in opposite directions.
    For instance, a feature might increase engagement for new users but severely degrade it for power users.
    Identifying these Heterogeneous Treatment Effects (HTE) is critical for personalized targeting and risk mitigation.

    The Statistical Threat of Naive Subgroup Sweeping:
        A common mistake is to perform independent t-tests across numerous segments (e.g., checking 20 different countries).
        Doing so dramatically inflates the probability of false positives due to multiple testing:
        $$
        \\text{FWER} = 1 - (1 - \\alpha)^g
        $$
        where $g$ is the number of subgroups. If $g=20$ and $\\alpha=0.05$, there is a $64\\%$ chance of detecting a
        "significant" subgroup effect purely by random chance.

    To prevent false discoveries, this module implements a two-stage diagnostic framework:
        1. **Global Interaction Filtering**: Rather than running isolated tests on individual subgroups, we fit an integrated
           regression model containing an interaction term between the treatment assignment indicator $T$ and the subgroup variable $S$:
           $$
           Y_i = \\beta_0 + \\beta_1 T_i + \\beta_2 S_i + \\beta_3 (T_i \\times S_i) + \\varepsilon_i
           $$
           We only report subgroup-specific effects if the joint interaction coefficient $\\beta_3$ is statistically significant ($p < 0.05$).
        2. **Causal Partitioning** (Advanced): Uses algorithmic techniques (such as Causal Trees or Forests, Wager and Athey 2018)
           that recursively split the covariate space to maximize the difference in treatment effects between leaves, using
           sample-splitting to prevent overfitting and ensure honest confidence intervals.

    Pseudocode for Subgroup HTE Sweeping:
        ```text
        function scan_subgroups_for_hte(DataFrame df, String treatment_col, String metric_col, List segment_cols):
            Initialize hte_results = {}
            For each segment in segment_cols:
                - Fit OLS: metric_col ~ treatment_col * segment
                - Compute F-test for the significance of the interaction term.
                - If interaction p-value < 0.05:
                    - Calculate specific treatment lifts and confidence intervals within each level of the segment.
                    - Add results to hte_results[segment]
            Return hte_results
        ```

    Args:
        df (pd.DataFrame): The experimental dataset.
        treatment_col (str): Column containing treatment assignments.
        metric_col (str): The outcome metric column.
        segment_cols (list): List of categorical columns representing user segments (e.g. `["platform", "country"]`).

    Returns:
        dict: A dictionary keyed by segment column. Each entry, present only when the interaction is significant,
            holds the smallest interaction p-value (`interaction_p_value`) and the relative lift within each
            segment level (`subgroup_lifts`).
    """
    import statsmodels.formula.api as smf
    import numpy as np

    hte_results = {}

    for segment in segment_cols:
        # Fit OLS: Y ~ T * S
        formula = f"{metric_col} ~ {treatment_col} * {segment}"
        try:
            model = smf.ols(formula, data=df).fit()

            # Interaction terms are the coefficients whose names contain ":";
            # a categorical segment typically contributes one per non-reference level.
            interaction_terms = [c for c in model.pvalues.index if ":" in c]

            # Flag the segment when any single interaction coefficient has p < 0.05.
            significant_interaction = any(model.pvalues[term] < 0.05 for term in interaction_terms)

            if significant_interaction:
                # Calculate lifts within each level of the segment
                segment_levels = df[segment].unique()
                lifts = {}
                for level in segment_levels:
                    sub_df = df[df[segment] == level]
                    ctrl = sub_df[sub_df[treatment_col] == 0][metric_col]
                    trt = sub_df[sub_df[treatment_col] == 1][metric_col]

                    if len(ctrl) > 1 and len(trt) > 1:
                        mean_ctrl = np.mean(ctrl)
                        mean_trt = np.mean(trt)
                        lift = (mean_trt - mean_ctrl) / mean_ctrl if mean_ctrl != 0 else 0.0
                        lifts[str(level)] = lift

                hte_results[segment] = {
                    "interaction_p_value": min(model.pvalues[interaction_terms]),
                    "subgroup_lifts": lifts
                }
        except Exception:
            continue

    return hte_results

check_practical_significance

check_practical_significance(
    relative_lift: float, min_valuable_effect: float
) -> bool

Verifies if the measured lift satisfies the minimal valuable business effect (MVE).

Online experimentation platforms often have massive sample sizes, which makes their tests highly powered. As a result, microscopic differences (e.g., a $0.05\%$ lift in page load times) can yield highly significant p-values ($p < 0.01$). However, such a minor improvement may be practically or economically irrelevant, failing to justify the ongoing maintenance overhead of the new code. This function evaluates whether the observed effect size meets or exceeds a pre-defined practical threshold.

Statistical Significance vs. Practical Significance

Let $\hat{\theta}$ be the estimated treatment effect, let $[\theta_{\text{lower}}, \ \theta_{\text{upper}}]$ be its confidence interval, and let $\delta_{\text{MVE}}$ be the Minimum Valuable Effect (MVE).

Three scenarios can occur when evaluating significance:

1. Statistically Significant and Practically Significant: The null hypothesis is rejected ($p < \alpha$), and the estimated lift meets or exceeds the MVE:
   $$ \hat{\theta} \ge \delta_{\text{MVE}} \quad \text{and} \quad p < \alpha $$
   (Ideally, to be highly confident, we require the entire confidence interval to exceed the threshold: $\theta_{\text{lower}} \ge \delta_{\text{MVE}}$.)
2. Statistically Significant but NOT Practically Significant: The null hypothesis is rejected ($p < \alpha$), but the magnitude is trivial:
   $$ \hat{\theta} < \delta_{\text{MVE}} \quad \text{and} \quad p < \alpha $$
   In this case, the feature should generally be rejected despite its "significant" p-value.
3. Practically Significant but NOT Statistically Significant: The point estimate is large ($\hat{\theta} \ge \delta_{\text{MVE}}$), but we fail to reject the null hypothesis ($p \ge \alpha$). This indicates an underpowered experiment; the sample size was too small to confirm whether the apparent large effect is real or due to noise.

PARAMETER	DESCRIPTION
`relative_lift`	The estimated relative difference between treatment and control (point estimate). TYPE: `float`
`min_valuable_effect`	The minimum relative lift required to be practically valuable ($\delta_{\text{MVE}}$). TYPE: `float`

RETURNS	DESCRIPTION
`bool`	True if the estimated relative lift meets or exceeds the minimum valuable effect threshold. TYPE: `bool`
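
The three scenarios above combine a p-value with the practical threshold, whereas the documented function only checks the threshold. The sketch below shows one way to combine the two; `classify_significance` and its `alpha` parameter are hypothetical and not part of the documented API.

```python
def classify_significance(
    p_value: float,
    relative_lift: float,
    min_valuable_effect: float,
    alpha: float = 0.05,  # assumption: conventional significance level
) -> str:
    """Hypothetical helper labelling a result by statistical and practical significance."""
    statistically = p_value < alpha
    practically = relative_lift >= min_valuable_effect  # same rule as the documented check
    if statistically and practically:
        return "statistically and practically significant"
    if statistically:
        return "statistically significant only"
    if practically:
        return "practically significant only (possibly underpowered)"
    return "neither"


# A tiny but highly significant lift falls short of a 1% MVE.
print(classify_significance(p_value=0.001, relative_lift=0.0005, min_valuable_effect=0.01))
# A large point estimate with a non-significant p-value.
print(classify_significance(p_value=0.20, relative_lift=0.04, min_valuable_effect=0.01))
```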

Source code in src\xpyrment\interpret\significance.py

def check_practical_significance(relative_lift: float, min_valuable_effect: float) -> bool:
    r"""Verifies if the measured lift satisfies the minimal valuable business effect (MVE).

    Online experimentation platforms often have massive sample sizes, which makes them highly powered.
    As a result, extremely microscopic differences (e.g., a $0.05\\%$ lift in page load times) can yield highly
    significant p-values ($p < 0.01$). However, such a minor improvement may be practically or economically
    irrelevant, failing to justify the ongoing maintenance overhead of the new code.
    This function evaluates whether the observed effect size meets or exceeds a pre-defined practical threshold.

    ??? mathbox "Statistical Significance vs. Practical Significance"

        Let $\\hat{\\theta}$ be the estimated treatment effect, let $[\\theta_{\\text{lower}}, \\ \\theta_{\\text{upper}}]$
        be its confidence interval, and let $\\delta_{\\text{MVE}}$ be the Minimum Valuable Effect (MVE).

        Three scenarios can occur when evaluating significance:
        1. **Statistically Significant and Practically Significant**:
           The null hypothesis is rejected ($p < \\alpha$), and the estimated lift exceeds the MVE:
           $$
           \\hat{\\theta} \\ge \\delta_{\\text{MVE}} \\quad \\text{and} \\quad p < \\alpha
           $$
           (Ideally, to be highly confident, we require the entire confidence interval to exceed the threshold: $\\theta_{\\text{lower}} \\ge \\delta_{\\text{MVE}}$).
        2. **Statistically Significant but NOT Practically Significant**:
           The null hypothesis is rejected ($p < \\alpha$), but the magnitude is trivial:
           $$
           \\hat{\\theta} < \\delta_{\\text{MVE}} \\quad \\text{and} \\quad p < \\alpha
           $$
           In this case, the feature should generally be rejected despite its "significant" p-value.
        3. **Practically Significant but NOT Statistically Significant**:
           The point estimate is large ($\\hat{\\theta} \\ge \\delta_{\\text{MVE}}$), but we fail to reject the null hypothesis ($p \\ge \\alpha$).
           This indicates an underpowered experiment; the sample size was too small to confirm whether the apparent large effect is real or due to noise.

    Args:
        relative_lift (float): The estimated relative difference between treatment and control (point estimate).
        min_valuable_effect (float): The minimum relative lift required to be practically valuable ($\\delta_{\\text{MVE}}$).

    Returns:
        bool: True if the estimated relative lift meets or exceeds the minimum valuable effect threshold.
    """
    return relative_lift >= min_valuable_effect