Regression

regression

Interaction regression modeling and Likelihood Ratio Testing (LRT).

This module provides functions to test covariate-treatment interactions by fitting nested linear regression models and performing Likelihood Ratio Tests.

FUNCTION	DESCRIPTION
`check_treatment_covariate_interaction`	Computes a Likelihood Ratio Test (LRT) to check if a covariate significantly interacts with the treatment split.

check_treatment_covariate_interaction

check_treatment_covariate_interaction(
    df: DataFrame,
    treatment_col: str,
    covariate_col: str,
    target_col: str,
) -> float

Computes a Likelihood Ratio Test (LRT) to check if a covariate significantly interacts with the treatment split.

Evaluates whether the treatment effect varies across different values of a pre-period covariate. To determine if the interaction term is statistically necessary (rather than just overfitting the sample), we fit nested regression models and perform a classical Likelihood Ratio Test.

Mathematical Formulation of Nested Models

We define two models representing competing hypotheses:

1. Restricted Null Model ($M_{\text{null}}$) (additive, assuming no interaction):
   $$
   Y_i = \beta_0 + \beta_1 T_i + \beta_2 C_i + \varepsilon_i
   $$
2. Unrestricted Alternative Model ($M_{\text{alt}}$) (interactive, assuming interaction):
   $$
   Y_i = \beta_0 + \beta_1 T_i + \beta_2 C_i + \beta_3 (T_i \times C_i) + \varepsilon_i
   $$

where:

- $Y_i$: The target outcome metric (`target_col`) for unit $i$.
- $T_i$: The treatment group indicator (`treatment_col`, e.g., $0$ or $1$).
- $C_i$: The pre-period covariate (`covariate_col`, e.g., device type or baseline revenue).
- $T_i \times C_i$: The interaction/product term.
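The two nested models above can be sketched with statsmodels formulas as follows. This is a minimal illustration on simulated data, not the module's own test fixture: the column names `treatment`, `baseline`, and `outcome` and all coefficient values are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration only; column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, size=n),  # T_i in {0, 1}
    "baseline": rng.normal(size=n),           # C_i, a numeric covariate
})
# Outcome simulated with a non-zero interaction coefficient (beta_3 = 0.5).
df["outcome"] = (
    1.0
    + 0.3 * df["treatment"]
    + 0.8 * df["baseline"]
    + 0.5 * df["treatment"] * df["baseline"]
    + rng.normal(size=n)
)

# M_null: additive model, no interaction term.
m_null = smf.ols("outcome ~ treatment + baseline", data=df).fit()
# M_alt: `a * b` expands to `a + b + a:b`, adding the product term.
m_alt = smf.ols("outcome ~ treatment * baseline", data=df).fit()

print(m_null.params.index.tolist())
print(m_alt.params.index.tolist())
```

The alternative model carries one additional coefficient, labeled `treatment:baseline`, which plays the role of $\beta_3$.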

The Likelihood Ratio Test (LRT): Let $\ln L(M_{\text{null}})$ and $\ln L(M_{\text{alt}})$ be the maximized log-likelihood values of the nested models. The test statistic $D$ is computed as:
$$
D = 2 \left( \ln L(M_{\text{alt}}) - \ln L(M_{\text{null}}) \right)
$$
Under the null hypothesis $H_0: \beta_3 = 0$ (no interaction), the test statistic $D$ asymptotically follows a Chi-square distribution with degrees of freedom equal to the difference in the number of parameters:
$$
D \sim \chi^2_{df_{\text{alt}} - df_{\text{null}}} = \chi^2_1
$$
The difference is 1 when $C_i$ is a single numeric or binary column, because exactly one interaction parameter, $\beta_3$, is added. A categorical covariate with $k$ levels is expanded into $k - 1$ indicator columns, so its interaction adds $k - 1$ parameters and the $\chi^2_1$ reference distribution no longer applies.

The resulting p-value is calculated as:
$$
p = 1 - F_{\chi^2_1}(D)
$$
where $F$ is the cumulative distribution function of the Chi-square distribution with 1 degree of freedom.
If $p < 0.05$, the null hypothesis of no interaction is rejected at the 5% level and the alternative model is preferred, which is evidence that the treatment effect varies across levels of the covariate.
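As a worked example of the statistic and p-value above, the sketch below uses only the standard library. It relies on the identity that for one degree of freedom the Chi-square survival function equals $\operatorname{erfc}(\sqrt{D/2})$. The helper name `lrt_p_value_df1` and the two log-likelihood values are hypothetical and chosen purely for illustration.

```python
import math


def lrt_p_value_df1(llf_null: float, llf_alt: float) -> tuple[float, float]:
    """Return (D, p) for a likelihood ratio test with 1 degree of freedom."""
    d = 2.0 * (llf_alt - llf_null)
    # Nested models give D >= 0 in theory; clamp tiny negative rounding errors.
    d = max(d, 0.0)
    # For chi-square with 1 df: P(X > d) = P(|Z| > sqrt(d)) = erfc(sqrt(d / 2)).
    p = math.erfc(math.sqrt(d / 2.0))
    return d, p


# Hypothetical maximized log-likelihoods, for illustration only.
d_stat, p_val = lrt_p_value_df1(llf_null=-1520.3, llf_alt=-1517.1)
print(f"D = {d_stat:.3f}, p = {p_val:.4f}")
```

The closed form holds only for one degree of freedom; with more degrees of freedom a general Chi-square survival function such as `scipy.stats.chi2.sf` is needed.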

PARAMETER	DESCRIPTION
`df`	The experimental dataset. TYPE: `DataFrame`
`treatment_col`	Column containing treatment assignments. TYPE: `str`
`covariate_col`	Column containing the target pre-period covariate under evaluation. TYPE: `str`
`target_col`	Column containing the outcome response variable ($Y$). TYPE: `str`

RETURNS	DESCRIPTION
`float`	The calculated p-value of the Likelihood Ratio Test. A value $< 0.05$ indicates a significant interaction. TYPE: `float`
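When the covariate is categorical, such as the device type mentioned above, the interaction adds more than one parameter. The sketch below shows one way to read the degrees of freedom off the fitted models instead of assuming 1. It is an illustration under stated assumptions: the data are simulated and the column names `treatment`, `device`, and `outcome` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Simulated data for illustration only; column names are hypothetical.
rng = np.random.default_rng(1)
n = 600
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, size=n),
    "device": rng.choice(["desktop", "mobile", "tablet"], size=n),
})
df["outcome"] = 1.0 + 0.2 * df["treatment"] + rng.normal(size=n)

# C(...) marks the column as categorical; 3 levels become 2 indicator columns.
m_null = smf.ols("outcome ~ treatment + C(device)", data=df).fit()
m_alt = smf.ols("outcome ~ treatment * C(device)", data=df).fit()

# df_model counts regressors excluding the intercept, so the difference
# is the number of interaction parameters added by the alternative model.
extra_params = int(m_alt.df_model - m_null.df_model)
lr_stat = 2 * (m_alt.llf - m_null.llf)
p_value = float(chi2.sf(lr_stat, df=extra_params))

print(extra_params, round(p_value, 4))
```

With three device levels the interaction contributes two parameters, so the statistic is compared against $\chi^2_2$ rather than $\chi^2_1$.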

Source code in src\xpyrment\interactions\regression.py

def check_treatment_covariate_interaction(df: pd.DataFrame, treatment_col: str, covariate_col: str, target_col: str) -> float:
    r"""Computes a Likelihood Ratio Test (LRT) to check if a covariate significantly interacts with the treatment split.

    Evaluates whether the treatment effect varies across different values of a pre-period covariate.
    To determine if the interaction term is statistically necessary (rather than just overfitting the sample),
    we fit nested regression models and perform a classical Likelihood Ratio Test.

    Mathematical Formulation of Nested Models:
        We define two models representing competing hypotheses:
        1. **Restricted Null Model ($M_{\text{null}}$)** (additive, assuming no interaction):
           $$
           Y_i = \beta_0 + \beta_1 T_i + \beta_2 C_i + \varepsilon_i
           $$
        2. **Unrestricted Alternative Model ($M_{\text{alt}}$)** (interactive, assuming interaction):
           $$
           Y_i = \beta_0 + \beta_1 T_i + \beta_2 C_i + \beta_3 (T_i \times C_i) + \varepsilon_i
           $$
        where:
        - $Y_i$: The target outcome metric (`target_col`) for unit $i$.
        - $T_i$: The treatment group indicator (`treatment_col`, e.g., $0$ or $1$).
        - $C_i$: The pre-period covariate (`covariate_col`, e.g., device type or baseline revenue).
        - $T_i \times C_i$: The interaction/product term.

    The Likelihood Ratio Test (LRT):
        Let $\ln L(M_{\text{null}})$ and $\ln L(M_{\text{alt}})$ be the maximized log-likelihood values of the nested models.
        The test statistic $D$ is computed as:
        $$
        D = 2 \left( \ln L(M_{\text{alt}}) - \ln L(M_{\text{null}}) \right)
        $$
        Under the null hypothesis $H_0: \beta_3 = 0$ (no interaction), the test statistic $D$ asymptotically follows a
        Chi-square distribution with degrees of freedom equal to the difference in the number of parameters:
        $$
        D \sim \chi^2_{df_{\text{alt}} - df_{\text{null}}} = \chi^2_1
        $$
        The difference is 1 when $C_i$ is a single numeric or binary column, because exactly one interaction
        parameter, $\beta_3$, is added. A categorical covariate with $k$ levels adds $k - 1$ interaction parameters.

        The resulting p-value is calculated as:
        $$
        p = 1 - F_{\chi^2_1}(D)
        $$
        where $F$ is the cumulative distribution function of the Chi-square distribution with 1 degree of freedom.
        If $p < 0.05$, the null hypothesis of no interaction is rejected at the 5% level and the alternative model
        is preferred, which is evidence that the treatment effect varies across levels of the covariate.

    Args:
        df (pd.DataFrame): The experimental dataset.
        treatment_col (str): Column containing treatment assignments.
        covariate_col (str): Column containing the target pre-period covariate under evaluation.
        target_col (str): Column containing the outcome response variable ($Y$).

    Returns:
        float: The calculated p-value of the Likelihood Ratio Test. A value $< 0.05$ indicates a significant interaction.
    """
    import statsmodels.formula.api as smf
    from scipy.stats import chi2

    # Model 1: Additive (Restricted)
    formula_null = f"{target_col} ~ {treatment_col} + {covariate_col}"
    model_null = smf.ols(formula_null, data=df).fit()

    # Model 2: Interactive (Unrestricted)
    formula_alt = f"{target_col} ~ {treatment_col} * {covariate_col}"
    model_alt = smf.ols(formula_alt, data=df).fit()

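    # Likelihood Ratio Test
    # For OLS, LRT = n * ln(RSS_null / RSS_alt), which equals 2 * (llf_alt - llf_null)
    # when computed from the statsmodels .llf attribute (maximized log-likelihood).
    llf_null = model_null.llf
    llf_alt = model_alt.llf

    lr_stat = 2 * (llf_alt - llf_null)
    # Degrees of freedom = number of interaction parameters added: 1 for a numeric
    # or binary covariate, k - 1 for a categorical covariate with k levels.
    df_diff = int(model_alt.df_model - model_null.df_model)
    p_value = chi2.sf(lr_stat, df=df_diff)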

    return float(p_value)
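The source comment notes that for OLS the statistic can also be written as $n \ln(\mathrm{RSS}_{\text{null}} / \mathrm{RSS}_{\text{alt}})$. The sketch below checks that identity numerically with NumPy alone, using the Gaussian log-likelihood evaluated at the maximum-likelihood variance estimate $\mathrm{RSS}/n$. The simulated data and the helper names `rss` and `gaussian_llf` are assumptions for illustration and are not part of the module.

```python
import numpy as np

# Simulated data for illustration only.
rng = np.random.default_rng(2)
n = 400
t = rng.integers(0, 2, size=n).astype(float)  # treatment indicator
c = rng.normal(size=n)                        # numeric covariate
y = 1.0 + 0.3 * t + 0.8 * c + 0.4 * t * c + rng.normal(size=n)


def rss(X: np.ndarray, y: np.ndarray) -> float:
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def gaussian_llf(rss_value: float, n_obs: int) -> float:
    """Maximized Gaussian log-likelihood, with variance estimated as RSS / n."""
    return -0.5 * n_obs * (np.log(2 * np.pi) + np.log(rss_value / n_obs) + 1.0)


ones = np.ones(n)
X_null = np.column_stack([ones, t, c])         # additive design matrix
X_alt = np.column_stack([ones, t, c, t * c])   # adds the product column

rss_null, rss_alt = rss(X_null, y), rss(X_alt, y)
d_from_rss = n * np.log(rss_null / rss_alt)
d_from_llf = 2 * (gaussian_llf(rss_alt, n) - gaussian_llf(rss_null, n))

print(d_from_rss, d_from_llf)
```

The two values agree up to floating-point error because the constant terms of the log-likelihood cancel in the difference, leaving only the ratio of residual sums of squares.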