Validate Module

The xpyrment.validate module contains the submodules and functions used to validate experiments.

validate

Experiment diagnostics, sanity checks, and validation engines.

This package houses the diagnostic layer of xpyrment. It provides automated safeguards to validate experiment execution, ensuring that results are not corrupted by assignment imbalances, system bugs, or temporary behavioral anomalies.

Submodules:
  • srm: Detects Sample Ratio Mismatch (SRM) using Pearson Chi-Square Goodness-of-Fit tests.
  • aa_test: Simulates A/A tests and validates Type I error rate (\(\alpha\)) uniformity.
  • balance: Computes Standardized Mean Differences (SMD) to evaluate pre-period covariate balance.
  • novelty: Identifies novelty and primacy effects using temporal interaction models.

MODULE DESCRIPTION
aa_test

A/A test simulations and false-positive rate validation.

balance

Covariate balance checking and standardized mean differences (SMD).

clean

Unified input cleaning, missing value filtering, and collinearity check safeguards (Block 57).

novelty

Novelty and primacy effect diagnostics using temporal interaction models.

srm

Sample Ratio Mismatch (SRM) validation using Pearson Chi-Square Goodness-of-Fit tests.

FUNCTION DESCRIPTION
run_aa_test_validation

Runs an A/A test validation check, asserting that identical splits exhibit no treatment effect.

check_covariate_balance

Computes Normalized Differences and t-tests to evaluate balance of pre-period covariates.

check_novelty_effects

Detects novelty or primacy effects by tracking treatment effect size evolution over time.

check_srm

Calculates the Chi-square p-value to check for Sample Ratio Mismatch (SRM).

run_aa_test_validation

run_aa_test_validation(
    df: DataFrame,
    treatment_col: str,
    metric_col: str,
    num_simulations: int = 100,
    seed: int = 42,
) -> dict

Runs an A/A test validation check, asserting that identical splits exhibit no treatment effect.

An A/A test compares two groups that receive the exact same experience. The objective is to validate the statistical pipeline and confirm that the empirical false positive rate matches theoretical expectations.
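As a rough illustration of that objective, the sketch below repeatedly shuffles one metric sample into two mock arms and counts how often a two-sided test rejects at \(\alpha = 0.05\). It uses only the standard library and a normal (z) approximation in place of the Welch t-test used in the source. The helper name simulate_aa_rejection_rate and the synthetic data are hypothetical and not part of xpyrment.

```python
import random
from statistics import NormalDist, fmean, variance


def simulate_aa_rejection_rate(values, n_first, num_simulations=200, alpha=0.05, seed=42):
    """Share of random two-way splits whose z-test p-value falls below alpha."""
    rng = random.Random(seed)
    pool = list(values)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(num_simulations):
        rng.shuffle(pool)  # a fresh random permutation defines the two mock arms
        arm_a, arm_b = pool[:n_first], pool[n_first:]
        std_err = (variance(arm_a) / len(arm_a) + variance(arm_b) / len(arm_b)) ** 0.5
        if std_err == 0.0:
            continue  # constant metric: nothing to reject
        z = (fmean(arm_b) - fmean(arm_a)) / std_err
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1
    return rejections / num_simulations


# Synthetic "historical control" metric (assumption for illustration only)
data_rng = random.Random(0)
metric = [data_rng.gauss(100.0, 15.0) for _ in range(400)]
rate = simulate_aa_rejection_rate(metric, n_first=200)
print(f"empirical rejection rate at alpha=0.05: {rate:.3f}")
```

With a healthy pipeline the printed rate should sit near 0.05, up to simulation noise; a rate far above that would suggest the test is miscalibrated for the metric.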

PARAMETER DESCRIPTION
df

The historical control dataset.

TYPE: DataFrame

treatment_col

Column name representing the mock or actual assignments.

TYPE: str

metric_col

Column name containing the numeric values under test.

TYPE: str

num_simulations

Number of permutation splits to simulate. Defaults to 100.

TYPE: int DEFAULT: 100

seed

Seed for random generator to guarantee reproducibility. Defaults to 42.

TYPE: int DEFAULT: 42

RETURNS DESCRIPTION
dict

A dictionary containing:
  • ks_pvalue: The Kolmogorov-Smirnov test p-value indicating goodness-of-fit to a Uniform(0, 1) distribution.
  • empirical_alpha_05: The raw empirical rejection rate at alpha=0.05.
  • fdr_alpha_05: The rejection rate after applying Benjamini-Hochberg False Discovery Rate control.

TYPE: dict
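To make the difference between empirical_alpha_05 and fdr_alpha_05 concrete, here is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment. The source delegates this step to scipy.stats.false_discovery_control; the helper benjamini_hochberg and the four p-values below are hypothetical and only illustrate the arithmetic.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)

raw_rate = sum(p < 0.05 for p in raw) / len(raw)            # raw rejection rate
fdr_rate = sum(p < 0.05 for p in adjusted) / len(adjusted)  # rate after adjustment
print(raw_rate, fdr_rate)
```

In this toy input three of four raw p-values fall below 0.05, but only one survives the adjustment, which is why the FDR-controlled rate is never larger than the raw rate.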

Source code in src\xpyrment\validate\aa_test.py
def run_aa_test_validation(
    df: pd.DataFrame,
    treatment_col: str,
    metric_col: str,
    num_simulations: int = 100,
    seed: int = 42
) -> dict:
    r"""Runs an A/A test validation check, asserting that identical splits exhibit no treatment effect.

    An A/A test compares two groups that receive the exact same experience. The objective is to validate
    the statistical pipeline and confirm that the empirical false positive rate matches theoretical expectations.

    Args:
        df (pd.DataFrame): The historical control dataset.
        treatment_col (str): Column name representing the mock or actual assignments.
        metric_col (str): Column name containing the numeric values under test.
        num_simulations (int): Number of permutation splits to simulate. Defaults to 100.
        seed (int): Seed for random generator to guarantee reproducibility. Defaults to 42.

    Returns:
        dict: A dictionary containing:
            - ks_pvalue: The Kolmogorov-Smirnov test p-value indicating goodness-of-fit to a Uniform(0, 1) distribution.
            - empirical_alpha_05: The raw empirical rejection rate at alpha=0.05.
            - fdr_alpha_05: The rejection rate after applying Benjamini-Hochberg False Discovery Rate control.
    """
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed)

    p_values = []

    # Clean the metric array to avoid NaNs interfering
    clean_df = df[[treatment_col, metric_col]].dropna()
    if len(clean_df) < 4:
        return {"ks_pvalue": 1.0, "empirical_alpha_05": 0.0, "fdr_alpha_05": 0.0}

    treatment_vals = clean_df[treatment_col].values
    metric_vals = clean_df[metric_col].values

    unique_vals = np.unique(treatment_vals)
    if len(unique_vals) < 2:
        raise ValueError(f"A/A test requires at least 2 distinct groups in '{treatment_col}'. Found {len(unique_vals)}.")

    n1 = int(np.sum(treatment_vals == unique_vals[0]))
    n2 = len(treatment_vals) - n1
    N = len(treatment_vals)

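    # Cap each chunk at roughly 10 million sampled values so memory use stays bounded for large N
    chunk_size = max(1, min(5000, 10_000_000 // N))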
    for i in range(0, num_simulations, chunk_size):
        current_chunk = min(chunk_size, num_simulations - i)

        # Generate random permutations using argsort of random uniform
        rand_idx = rng.random((current_chunk, N)).argsort(axis=1)

        idx1 = rand_idx[:, :n1]
        idx2 = rand_idx[:, n1:]

        A1 = metric_vals[idx1] 
        A2 = metric_vals[idx2] 

        mean1 = A1.mean(axis=1)
        mean2 = A2.mean(axis=1)

        var1 = A1.var(axis=1, ddof=1)
        var2 = A2.var(axis=1, ddof=1)

        vn1 = var1 / n1
        vn2 = var2 / n2

        with np.errstate(divide='ignore', invalid='ignore'):
            t_stat = (mean2 - mean1) / np.sqrt(vn1 + vn2)
            df_stat = (vn1 + vn2)**2 / ( (vn1**2)/(n1-1) + (vn2**2)/(n2-1) )

            p_vals = 2 * stats.t.sf(np.abs(t_stat), df_stat)

        p_vals = np.nan_to_num(p_vals, nan=1.0)
        p_values.extend(p_vals)

    # Perform Kolmogorov-Smirnov goodness-of-fit test against a continuous Uniform(0, 1) CDF
    ks_res = stats.kstest(p_values, "uniform")

    # Raw rejection rate at alpha = 0.05
    p_values = np.array(p_values)
    empirical_alpha = np.mean(p_values < 0.05)

    # Rejection rate after Benjamini-Hochberg False Discovery Rate (FDR) adjustment
    fdr_pvals = stats.false_discovery_control(p_values)
    fdr_alpha = np.mean(fdr_pvals < 0.05)

    return {
        "ks_pvalue": float(ks_res.pvalue),
        "empirical_alpha_05": float(empirical_alpha),
        "fdr_alpha_05": float(fdr_alpha)
    }

check_covariate_balance

check_covariate_balance(
    df: DataFrame, treatment_col: str, covariate_cols: list
) -> dict

Computes Normalized Differences and t-tests to evaluate balance of pre-period covariates.

Verifies that pre-period characteristics are distributed similarly across treatment arms. Simple t-tests can be used, but they are oversensitive in online datasets: with large sample sizes, even tiny, practically negligible differences yield statistically significant p-values (\(p < 0.05\)). We therefore compute Standardized Mean Differences (SMD) as the primary effect size metric.
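The sensitivity argument can be checked with a little arithmetic. Assuming equal arm sizes \(n\), equal variances, and a normal approximation to the t-test (all simplifying assumptions, not the exact Welch test in the source), the test statistic is \(z = \text{SMD} \times \sqrt{n / 2}\), so a fixed, negligible SMD becomes "significant" once \(n\) is large enough. The helper two_sided_p_from_smd is hypothetical.

```python
import math


def two_sided_p_from_smd(smd, n_per_arm):
    """Approximate two-sided p-value for a given SMD and per-arm sample size."""
    # With equal arm sizes and equal variances, z = SMD * sqrt(n / 2)
    z = smd * math.sqrt(n_per_arm / 2.0)
    # Two-sided normal tail probability: P(|Z| > z) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2.0))


smd = 0.01  # a practically negligible imbalance
p_small = two_sided_p_from_smd(smd, 1_000)
p_large = two_sided_p_from_smd(smd, 1_000_000)
print(f"n=1,000 per arm:     p = {p_small:.3f}")
print(f"n=1,000,000 per arm: p = {p_large:.2e}")
```

The imbalance is identical in both cases; only the sample size changes, which is why the SMD is the more stable balance diagnostic.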

Mathematical Representation
  1. Standardized Mean Difference (SMD) for continuous covariates: Let \(\bar{X}_T\) and \(\bar{X}_C\) be the sample means of a covariate \(X\) in the treatment and control groups, and let \(s_T^2\) and \(s_C^2\) be their sample variances. $$ \text{SMD} = \frac{\bar{X}_T - \bar{X}_C}{\sqrt{\frac{s_T^2 + s_C^2}{2}}} $$
  2. Pearson Chi-Square Test for Independence for categorical covariates: Evaluates whether the proportion of units in each category is independent of treatment.
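A minimal sketch of the SMD formula above, using only the standard library. The function name standardized_mean_difference and the sample values are hypothetical; the zero-variance guard mirrors the one in the source listing.

```python
from statistics import fmean, variance


def standardized_mean_difference(treatment, control):
    """SMD = (mean_T - mean_C) / sqrt((s_T^2 + s_C^2) / 2), with sample variances."""
    pooled_sd = ((variance(treatment) + variance(control)) / 2.0) ** 0.5
    if pooled_sd == 0.0:
        return 0.0  # both arms are constant: report no imbalance
    return (fmean(treatment) - fmean(control)) / pooled_sd


treatment = [10.0, 12.0, 14.0, 16.0]
control = [9.0, 11.0, 13.0, 15.0]
smd = standardized_mean_difference(treatment, control)
print(round(smd, 4))  # 0.3873
```

Both arms here have sample variance 20/3, so the pooled standard deviation is about 2.582 and a mean difference of 1 gives an SMD of about 0.387.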
PARAMETER DESCRIPTION
df

The experimental dataset containing units, treatment assignments, and covariates.

TYPE: DataFrame

treatment_col

Column name identifying experimental groups/arms.

TYPE: str

covariate_cols

List of column names representing categorical or continuous pre-experiment covariates.

TYPE: list

RETURNS DESCRIPTION
dict

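A dictionary mapping each covariate name to a diagnostic sub-dictionary. Numeric covariates report type, smd, p_value (Welch's t-test), ks_statistic, and ks_p_value; categorical covariates report type and p_value (Pearson chi-square test). When enough complete rows exist for the numeric covariates, a _multivariate entry adds mahalanobis_distance and n_covariates.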

TYPE: dict

Source code in src\xpyrment\validate\balance.py
def check_covariate_balance(df: pd.DataFrame, treatment_col: str, covariate_cols: list) -> dict:
    r"""Computes Normalized Differences and t-tests to evaluate balance of pre-period covariates.

    Verifies that pre-period characteristics are distributed symmetrically across treatment arms.
    While simple t-tests can be used, they are highly sensitive in online datasets: with large footprints,
    extremely tiny, practically negligible differences will yield highly significant p-values ($p < 0.05$).
    Therefore, we compute **Standardized Mean Differences (SMD)** as the primary effect size metric.

    ??? mathbox "Mathematical Representation"

        1. **Standardized Mean Difference (SMD)** for continuous covariates:
           Let $\bar{X}_T$ and $\bar{X}_C$ be the sample means of a covariate $X$ in the treatment and control groups,
           and let $s_T^2$ and $s_C^2$ be their sample variances.
           $$
           \text{SMD} = \frac{\bar{X}_T - \bar{X}_C}{\sqrt{\frac{s_T^2 + s_C^2}{2}}}
           $$
        2. **Pearson Chi-Square Test for Independence** for categorical covariates:
           Evaluates whether the proportion of units in each category is independent of treatment.

    Args:
        df (pd.DataFrame): The experimental dataset containing units, treatment assignments, and covariates.
        treatment_col (str): Column name identifying experimental groups/arms.
        covariate_cols (list): List of column names representing categorical or continuous pre-experiment covariates.

    Returns:
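        dict: A dictionary mapping each covariate name to a diagnostic sub-dictionary. Numeric covariates
            report `type`, `smd`, `p_value` (Welch's t-test), `ks_statistic`, and `ks_p_value`; categorical
            covariates report `type` and `p_value` (Pearson chi-square test). When enough complete rows exist,
            a `_multivariate` entry adds `mahalanobis_distance` and `n_covariates`.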
    """
    import numpy as np
    from scipy import stats

    # Ignore missing assignments so NaN is never counted (or sorted) as a group
    groups = df[treatment_col].dropna().unique()
    if len(groups) < 2:
        raise ValueError(f"Balance check requires at least 2 distinct groups in '{treatment_col}'. Found {len(groups)}.")

    # Sort groups to be deterministic: first group is control (group 0), second is treatment (group 1)
    groups = sorted(groups)
    grp_0 = df[df[treatment_col] == groups[0]]
    grp_1 = df[df[treatment_col] == groups[1]]

    results = {}

    for cov in covariate_cols:
        if cov not in df.columns:
            raise KeyError(f"Covariate column '{cov}' not found in DataFrame.")

        # Determine type: check if column is numeric
        if pd.api.types.is_numeric_dtype(df[cov]):
            val_0 = grp_0[cov].dropna()
            val_1 = grp_1[cov].dropna()

            mean_0 = val_0.mean()
            mean_1 = val_1.mean()
            var_0 = val_0.var(ddof=1)
            var_1 = val_1.var(ddof=1)

            # Compute Standardized Mean Difference (SMD)
            pooled_sd = np.sqrt((var_0 + var_1) / 2.0)
            if pooled_sd == 0.0:
                smd = 0.0
            else:
                smd = (mean_1 - mean_0) / pooled_sd

            # Welch's t-test (unequal variances assumed)
            if len(val_0) > 0 and len(val_1) > 0:
                _, p_val = stats.ttest_ind(val_1, val_0, equal_var=False)
                # Kolmogorov-Smirnov test for distribution shape alignment
                ks_stat, ks_p = stats.ks_2samp(val_1, val_0)
            else:
                p_val = 1.0
                ks_stat, ks_p = 0.0, 1.0

            results[cov] = {
                "type": "numeric",
                "smd": float(smd),
                "p_value": float(p_val),
                "ks_statistic": float(ks_stat),
                "ks_p_value": float(ks_p)
            }
        else:
            # Categorical covariate: build crosstab contingency table
            contingency_table = pd.crosstab(df[cov], df[treatment_col])

            if contingency_table.shape[0] > 0 and contingency_table.shape[1] > 0:
                # Pearson's chi-square test of independence
                chi2_res = stats.chi2_contingency(contingency_table)
                p_val = chi2_res.pvalue
            else:
                p_val = 1.0

            results[cov] = {
                "type": "categorical",
                "p_value": float(p_val)
            }

    # Integrate Mahalanobis distance multivariate covariance balance tests
    numeric_covs = [cov for cov in covariate_cols if pd.api.types.is_numeric_dtype(df[cov])]
    if len(numeric_covs) >= 1:
        data_0 = grp_0[numeric_covs].dropna()
        data_1 = grp_1[numeric_covs].dropna()

        if len(data_0) > len(numeric_covs) and len(data_1) > len(numeric_covs):
            mean_0 = data_0.mean().values
            mean_1 = data_1.mean().values

            # Pooled covariance matrix
            cov_0 = data_0.cov().values
            cov_1 = data_1.cov().values
            pooled_cov = (cov_0 + cov_1) / 2.0

            try:
                inv_pooled_cov = np.linalg.pinv(pooled_cov)
                diff = mean_1 - mean_0
                mahalanobis_dist = np.sqrt(np.dot(np.dot(diff, inv_pooled_cov), diff))
                results["_multivariate"] = {
                    "mahalanobis_distance": float(mahalanobis_dist),
                    "n_covariates": len(numeric_covs)
                }
            except np.linalg.LinAlgError:
                pass

    return results

check_novelty_effects

check_novelty_effects(
    df: DataFrame,
    treatment_col: str,
    metric_col: str,
    time_col: str,
) -> dict

Detects novelty or primacy effects by tracking treatment effect size evolution over time.

In online user testing, two common behavioral biases can distort short-term results:
  • Novelty Effect: Users are initially drawn to a redesigned feature, leading to a temporary surge in engagement that decays back to baseline.
  • Primacy (or Learning) Effect: Users are initially slowed down, causing a temporary dip in conversion that recovers once they adapt to the change.

Mathematical Representation and Regression Detection

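We fit an ordinary least squares (OLS) regression model with an interaction term between treatment \(T_i \in \{0, 1\}\) and elapsed time \(t_i\): $$ Y_i = \beta_0 + \beta_1 T_i + \beta_2 t_i + \beta_3 (T_i \times t_i) + \varepsilon_i $$ Here \(\beta_1\) is the treatment effect at \(t = 0\) and \(\beta_3\) is the change in that effect per unit of time. When \(\beta_3\) is significant (\(p < 0.05\)), \(\beta_1 > 0\) with \(\beta_3 < 0\) is classified as a novelty effect and \(\beta_1 < 0\) with \(\beta_3 > 0\) as a primacy effect; otherwise the treatment effect is reported as stable.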
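For intuition about the coefficients, the point estimates of this interaction model coincide with fitting one straight line per arm: \(\beta_0\) and \(\beta_2\) are the control intercept and slope, while \(\beta_1\) and \(\beta_3\) are the treatment-minus-control differences in intercept and slope. The sketch below uses that equivalence with the standard library only. It omits the standard errors and the \(p < 0.05\) check that the source applies, and the helper names and noise-free data are hypothetical.

```python
from statistics import fmean


def fit_line(x, y):
    """Least-squares intercept and slope for a single arm."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def interaction_coefficients(days, control_y, treatment_y):
    """Point estimates of beta_0..beta_3 from two per-arm line fits."""
    c_intercept, c_slope = fit_line(days, control_y)
    t_intercept, t_slope = fit_line(days, treatment_y)
    return {
        "beta_0": c_intercept,                # control level at t = 0
        "beta_1": t_intercept - c_intercept,  # treatment effect at t = 0
        "beta_2": c_slope,                    # control time trend
        "beta_3": t_slope - c_slope,          # change in effect per day
    }


days = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
control_y = [10.0 for _ in days]               # flat baseline
treatment_y = [14.0 - 0.5 * d for d in days]   # initial lift that decays
coefs = interaction_coefficients(days, control_y, treatment_y)

# Sign pattern only; the source additionally requires a significant beta_3
looks_like_novelty = coefs["beta_1"] > 0.0 and coefs["beta_3"] < 0.0
print(coefs, looks_like_novelty)
```

Here the treatment starts 4 units above control and loses 0.5 units per day, the sign pattern associated with a novelty effect.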

PARAMETER DESCRIPTION
df

The experimental dataset.

TYPE: DataFrame

treatment_col

Column name identifying experimental groups/arms.

TYPE: str

metric_col

Column containing the evaluated metric (continuous or rates).

TYPE: str

time_col

Column name representing the timestamp or elapsed date index.

TYPE: str

RETURNS DESCRIPTION
dict

A dictionary containing estimated interaction coefficients, standard errors, p-values, and behavioral bias classifications.

TYPE: dict

Source code in src\xpyrment\validate\novelty.py
def check_novelty_effects(df: pd.DataFrame, treatment_col: str, metric_col: str, time_col: str) -> dict:
    r"""Detects novelty or primacy effects by tracking treatment effect size evolution over time.

    In online user testing, two common behavioral biases can distort short-term results:
    - **Novelty Effect**: Users are initially drawn to a redesigned feature, leading to a temporary
      surge in engagement that decays back to baseline.
    - **Primacy (or Learning) Effect**: Users are initially slowed down, causing a temporary dip
      in conversion that recovers once they adapt to the change.

    ??? mathbox "Mathematical Representation and Regression Detection"

        We fit an ordinary least squares (OLS) regression model with an interaction term
        between treatment $T_i \in \{0, 1\}$ and elapsed time $t_i$:
        $$
        Y_i = \beta_0 + \beta_1 T_i + \beta_2 t_i + \beta_3 (T_i \times t_i) + \varepsilon_i
        $$

    Args:
        df (pd.DataFrame): The experimental dataset.
        treatment_col (str): Column name identifying experimental groups/arms.
        metric_col (str): Column containing the evaluated metric (continuous or rates).
        time_col (str): Column name representing the timestamp or elapsed date index.

    Returns:
        dict: A dictionary containing estimated interaction coefficients, standard errors, p-values,
            and behavioral bias classifications.
    """
    import numpy as np
    from scipy import stats

    clean_df = df[[treatment_col, metric_col, time_col]].dropna().copy()
    n = len(clean_df)
    if n < 5:
        raise ValueError("At least 5 samples are required to fit the OLS novelty interaction model.")

    # Convert treatment_col to binary indicator (0 = control, 1 = treatment)
    groups = sorted(clean_df[treatment_col].unique())
    if len(groups) < 2:
        raise ValueError("At least 2 groups are required in treatment_col.")

    clean_df["T"] = clean_df[treatment_col].map({groups[0]: 0.0, groups[1]: 1.0}).fillna(0.0)

    # Convert time_col to numerical index
    if pd.api.types.is_datetime64_any_dtype(clean_df[time_col]):
        t_min = clean_df[time_col].min()
        clean_df["t"] = (clean_df[time_col] - t_min).dt.total_seconds() / (24 * 3600)
    else:
        clean_df["t"] = clean_df[time_col].astype(float)

    # Compute interaction column
    clean_df["T_x_t"] = clean_df["T"] * clean_df["t"]

    # Build OLS design matrix X and response y
    X = np.column_stack([
        np.ones(n),
        clean_df["T"].values,
        clean_df["t"].values,
        clean_df["T_x_t"].values
    ])
    y = clean_df[metric_col].values

    # Fit OLS: beta = (X.T X)^-1 X.T y
    XTX = np.dot(X.T, X)
    try:
        inv_XTX = np.linalg.inv(XTX)
    except np.linalg.LinAlgError as exc:
        raise ValueError(
            "Design matrix is singular. Ensure there is variance in treatment, time, and their interaction."
        ) from exc

    beta = np.dot(inv_XTX, np.dot(X.T, y))

    # Calculate residuals, residual variance, and standard errors of coefficients
    residuals = y - np.dot(X, beta)
    rss = np.sum(residuals**2)
    p_params = 4

    sigma_sq = rss / (n - p_params)
    cov_beta = sigma_sq * inv_XTX
    se = np.sqrt(np.diag(cov_beta))

    # Compute t-statistics and two-tailed p-values
    t_stats = beta / se
    # Use the survival function to avoid precision loss for very small p-values
    p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n - p_params)

    beta_0, beta_1, beta_2, beta_3 = beta
    se_0, se_1, se_2, se_3 = se
    p_0, p_1, p_2, p_3 = p_vals

    classification = "Stable Treatment Effect"
    if p_3 < 0.05:
        if beta_1 > 0.0 and beta_3 < 0.0:
            classification = "Novelty Effect Detected"
        elif beta_1 < 0.0 and beta_3 > 0.0:
            classification = "Primacy Effect Detected"

    return {
        "intercept": {"coef": float(beta_0), "std_err": float(se_0), "p_value": float(p_0)},
        "treatment": {"coef": float(beta_1), "std_err": float(se_1), "p_value": float(p_1)},
        "time": {"coef": float(beta_2), "std_err": float(se_2), "p_value": float(p_2)},
        "interaction": {"coef": float(beta_3), "std_err": float(se_3), "p_value": float(p_3)},
        "classification": classification
    }

check_srm

check_srm(
    observed_counts: List[int], expected_ratios: List[float]
) -> float

Calculates the Chi-square p-value to check for Sample Ratio Mismatch (SRM).

Sample Ratio Mismatch (SRM) is one of the most critical diagnostic flags in web and system experimentation. It indicates that the observed sample allocation counts deviate from the planned/designed allocation ratios. This method performs a Pearson Chi-square goodness-of-fit test to determine whether the observed counts are statistically compatible with the expected ratios.

Mathematical Formulation

Let \(k\) be the number of variants, let \(O_i\) be the observed count of units in variant \(i\) (\(i \in \{1, \dots, k\}\)), and let \(r_i\) be the planned allocation ratio for variant \(i\). The total observed sample size is:
$$ N = \sum_{i=1}^{k} O_i $$
The expected sample count \(E_i\) for variant \(i\) is calculated as:
$$ E_i = N \times \frac{r_i}{\sum_{j=1}^{k} r_j} $$
The Pearson Chi-square test statistic is computed as:
$$ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} $$
Under the null hypothesis \(H_0\) (there is no SRM, and the assignment mechanism is unbiased):
$$ \chi^2 \sim \chi^2_{k-1} $$
where \(k-1\) is the degrees of freedom of the distribution. The p-value is calculated as:
$$ p = 1 - F_{\chi^2_{k-1}}(\chi^2_{\text{calc}}) $$
where \(F\) is the cumulative distribution function of the Chi-square distribution.

Interpretation Threshold
  • If \(p < 0.001\) (\(0.1\%\) significance): The null hypothesis that assignment follows the planned ratios is rejected. An SRM is highly likely, typically signaling a telemetry or system bug that invalidates downstream causal inferences.
  • Common causes of SRM: browser-specific treatment crashes, asymmetric page-redirection delays, bot filters interacting with treatment flags, or mid-experiment changes in allocation rates.
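The formulation can be worked through by hand for the two-variant case, where the Chi-square distribution has one degree of freedom and its upper tail reduces to \(p = \operatorname{erfc}(\sqrt{\chi^2 / 2})\). The sketch below is a standard-library illustration restricted to \(k = 2\); the helper srm_p_value_two_variants is hypothetical, whereas the source uses scipy.stats.chisquare for any \(k\).

```python
import math


def srm_p_value_two_variants(observed, ratios):
    """Chi-square statistic and p-value for a two-variant SRM check (1 d.o.f.)."""
    if len(observed) != 2 or len(ratios) != 2:
        raise ValueError("This sketch only covers the two-variant case.")
    total = sum(observed)
    ratio_sum = sum(ratios)
    expected = [total * r / ratio_sum for r in ratios]  # E_i = N * r_i / sum(r)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For one degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2))
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


chi2_ok, p_ok = srm_p_value_two_variants([5012, 4988], [0.5, 0.5])
chi2_bad, p_bad = srm_p_value_two_variants([4500, 5500], [0.5, 0.5])

print(f"balanced:   chi2 = {chi2_ok:.4f}, p = {p_ok:.2f}")
print(f"mismatched: chi2 = {chi2_bad:.1f}, p < 0.001 is {p_bad < 0.001}")
```

For the balanced split each arm is 12 units away from the expected 5000, giving \(\chi^2 = 2 \times 144 / 5000 = 0.0576\); for the mismatched split each arm is 500 units away, giving \(\chi^2 = 100\) and a p-value far below the 0.001 threshold.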
PARAMETER DESCRIPTION
observed_counts

The actual recorded sample sizes allocated to each variant (e.g., [50122, 49878]).

TYPE: List[int]

expected_ratios

The target allocation proportions or relative weights (e.g., [0.5, 0.5]).

TYPE: List[float]

RETURNS DESCRIPTION
float

The calculated p-value of the goodness-of-fit test.

TYPE: float

RAISES DESCRIPTION
SRMError

If the computed p-value is strictly less than 0.001, indicating a severe, non-random mismatch.

Examples:

Example
>>> # Balanced allocation (p ~ 0.81)
>>> round(float(check_srm([5012, 4988], [0.5, 0.5])), 2)
0.81
>>> # Severe mismatch (triggers SRMError)
>>> try:
...     check_srm([4500, 5500], [0.5, 0.5])
... except SRMError as e:
...     print("Error detected!")
Error detected!
Source code in src\xpyrment\validate\srm.py
def check_srm(observed_counts: List[int], expected_ratios: List[float]) -> float:
    r"""Calculates the Chi-square p-value to check for Sample Ratio Mismatch (SRM).

    Sample Ratio Mismatch (SRM) is one of the most critical diagnostic flags in web and system experimentation.
    It indicates that the observed sample allocation counts deviate from the planned/designed allocation ratios.
    This method performs a Pearson Chi-square goodness-of-fit test to determine whether the observed counts
    are statistically compatible with the expected ratios.

    ??? mathbox "Mathematical Formulation"

        Let $k$ be the number of variants, let $O_i$ be the observed count of units in variant $i$ ($i \in \{1, \dots, k\}$),
        and let $r_i$ be the planned allocation ratio for variant $i$.
        The total observed sample size is:
        $$
        N = \sum_{i=1}^{k} O_i
        $$
        The expected sample count $E_i$ for variant $i$ is calculated as:
        $$
        E_i = N \times \frac{r_i}{\sum_{j=1}^{k} r_j}
        $$
        The Pearson Chi-square test statistic is computed as:
        $$
        \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
        $$
        Under the null hypothesis $H_0$ (there is no SRM, and the assignment mechanism is unbiased):
        $$
        \chi^2 \sim \chi^2_{k-1}
        $$
        where $k-1$ is the degrees of freedom of the distribution. The p-value is calculated as:
        $$
        p = 1 - F_{\chi^2_{k-1}}(\chi^2_{\text{calc}})
        $$
        where $F$ is the cumulative distribution function of the Chi-square distribution.

    ### Interpretation Threshold

    - If $p < 0.001$ ($0.1\%$ significance): The null hypothesis of perfect assignment is rejected. An SRM is
      highly likely, signaling a telemetry or system bug that invalidates downstream causal inferences.
    - Common causes of SRM: browser-specific treatment crashes, asymmetric page-redirection delays, bot filters
      interacting with treatment flags, or mid-experiment changes in allocation rates.

    Args:
        observed_counts (List[int]): The actual recorded sample sizes allocated to each variant (e.g., `[50122, 49878]`).
        expected_ratios (List[float]): The target allocation proportions or relative weights (e.g., `[0.5, 0.5]`).

    Returns:
        float: The calculated p-value of the goodness-of-fit test.

    Raises:
        SRMError: If the computed p-value is strictly less than 0.001, indicating a severe, non-random mismatch.

    Examples:
        ??? example "Example"

            ```python
            >>> # Balanced allocation (p ~ 0.81)
            >>> round(float(check_srm([5012, 4988], [0.5, 0.5])), 2)
            0.81
            >>> # Severe mismatch (triggers SRMError)
            >>> try:
            ...     check_srm([4500, 5500], [0.5, 0.5])
            ... except SRMError as e:
            ...     print("Error detected!")
            Error detected!
            ```
    """
    if len(observed_counts) != len(expected_ratios):
        raise ValueError("Length of observed_counts and expected_ratios must be equal.")

    if any(c < 0 for c in observed_counts):
        raise ValueError("All elements in observed_counts must be non-negative.")

    if any(r <= 0 for r in expected_ratios):
        raise ValueError("All elements in expected_ratios must be strictly positive.")

    if any(not math.isfinite(float(c)) for c in observed_counts):
        raise ValueError("All elements in observed_counts must be finite (no NaN or infinity).")

    if any(not math.isfinite(float(r)) for r in expected_ratios):
        raise ValueError("All elements in expected_ratios must be finite (no NaN or infinity).")

    total_observed = sum(observed_counts)
    if total_observed == 0:
        logger.warning("SRM check bypassed: total observed counts is 0.")
        return 1.0

    sum_ratios = sum(expected_ratios)
    expected_counts = [ratio * total_observed / sum_ratios for ratio in expected_ratios]

    if any(e < 5 for e in expected_counts):
        logger.warning("SRM chi-square approximation may be invalid because some expected counts are < 5.")

    # Perform chi-square goodness-of-fit test
    _, p_value = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)

    if p_value < 0.001:
        raise SRMError(
            f"Sample Ratio Mismatch detected (p={p_value:.4e}). "
            f"Observed counts: {observed_counts}, Expected ratios: {expected_ratios}"
        )

    # Convert the NumPy scalar so the return type matches the annotated float
    return float(p_value)