Validate Module
The xpyrment.validate module contains the submodules and functions used for experiment validation.
validate
Experiment diagnostics, sanity checks, and validation engines.
This package houses the diagnostic layer of xpyrment. It provides automated safeguards to
validate experiment execution, ensuring that results are not corrupted by assignment imbalances,
system bugs, or temporary behavioral anomalies.
Submodules:
- srm: Detects Sample Ratio Mismatch (SRM) using Pearson Chi-Square Goodness-of-Fit tests.
- aa_test: Simulates A/A tests and validates Type I error rate (\(\alpha\)) uniformity.
- balance: Computes Standardized Mean Differences (SMD) to evaluate pre-period covariate balance.
- novelty: Identifies novelty and primacy effects using temporal interaction models.
| MODULE | DESCRIPTION |
|---|---|
| aa_test | A/A test simulations and false-positive rate validation. |
| balance | Covariate balance checking and standardized mean differences (SMD). |
| clean | Unified input cleaning, missing value filtering, and collinearity check safeguards (Block 57). |
| novelty | Novelty and primacy effect diagnostics using temporal interaction models. |
| srm | Sample Ratio Mismatch (SRM) validation using Pearson Chi-Square Goodness-of-Fit tests. |
| FUNCTION | DESCRIPTION |
|---|---|
| run_aa_test_validation | Runs an A/A test validation check, asserting that identical splits exhibit no treatment effect. |
| check_covariate_balance | Computes Normalized Differences and t-tests to evaluate balance of pre-period covariates. |
| check_novelty_effects | Detects novelty or primacy effects by tracking treatment effect size evolution over time. |
| check_srm | Calculates the Chi-square p-value to check for Sample Ratio Mismatch (SRM). |
run_aa_test_validation
run_aa_test_validation(
    df: DataFrame,
    treatment_col: str,
    metric_col: str,
    num_simulations: int = 100,
    seed: int = 42,
) -> dict
Runs an A/A test validation check, asserting that identical splits exhibit no treatment effect.
An A/A test compares two groups that receive the exact same experience. The objective is to validate the statistical pipeline and confirm that the empirical false positive rate matches theoretical expectations.
| PARAMETER | DESCRIPTION |
|---|---|
| df | The historical control dataset. TYPE: `DataFrame` |
| treatment_col | Column name representing the mock or actual assignments. TYPE: `str` |
| metric_col | Column name containing the numeric values under test. TYPE: `str` |
| num_simulations | Number of permutation splits to simulate. Defaults to 100. TYPE: `int` |
| seed | Seed for the random generator to guarantee reproducibility. Defaults to 42. TYPE: `int` |
| RETURNS | DESCRIPTION |
|---|---|
| dict | A dictionary containing the keys listed below. |

- ks_pvalue: The Kolmogorov-Smirnov test p-value indicating goodness-of-fit to a Uniform(0, 1) distribution.
- empirical_alpha_05: The raw empirical rejection rate at alpha=0.05.
- fdr_alpha_05: The rejection rate after applying Benjamini-Hochberg False Discovery Rate control.
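The idea behind the A/A check can be illustrated with a minimal, standard-library sketch. It is not the code in aa_test.py: the helper names (`two_sided_z_pvalue`, `simulate_aa_pvalues`, `ks_uniform_statistic`, `benjamini_hochberg_rejections`) are hypothetical, the per-split test is assumed to be a normal-approximation difference-in-means test, and only the KS distance is computed (converting it to a p-value would need a statistics library such as SciPy's `scipy.stats.kstest`).

```python
import math
import random
import statistics


def two_sided_z_pvalue(group_a, group_b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    z = (statistics.fmean(group_a) - statistics.fmean(group_b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_pvalues(values, num_simulations=100, seed=42):
    """Randomly split `values` into two halves many times; collect one p-value per split."""
    rng = random.Random(seed)
    pvalues = []
    for _ in range(num_simulations):
        shuffled = list(values)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        pvalues.append(two_sided_z_pvalue(shuffled[:half], shuffled[half:]))
    return pvalues


def ks_uniform_statistic(pvalues):
    """KS distance between the empirical CDF of the p-values and the Uniform(0, 1) CDF."""
    xs = sorted(pvalues)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def benjamini_hochberg_rejections(pvalues, alpha=0.05):
    """Number of hypotheses rejected by the Benjamini-Hochberg step-up procedure."""
    xs = sorted(pvalues)
    m = len(xs)
    rejected = 0
    for rank, p in enumerate(xs, start=1):
        if p <= alpha * rank / m:
            rejected = rank  # the largest rank passing its threshold wins
    return rejected


# Synthetic "historical control" metric: every unit gets the same experience.
data_rng = random.Random(0)
metric = [data_rng.gauss(10.0, 2.0) for _ in range(400)]

pvals = simulate_aa_pvalues(metric, num_simulations=200, seed=42)
raw_rate = sum(p <= 0.05 for p in pvals) / len(pvals)
bh_rate = benjamini_hochberg_rejections(pvals, alpha=0.05) / len(pvals)
print("KS distance:", ks_uniform_statistic(pvals))
print("raw rejection rate:", raw_rate, "BH rejection rate:", bh_rate)
```

With a healthy pipeline the raw rejection rate should sit near 0.05 and the KS distance should be small, while the Benjamini-Hochberg rate can never exceed the raw rate.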
Source code in src\xpyrment\validate\aa_test.py
check_covariate_balance
Computes Normalized Differences and t-tests to evaluate balance of pre-period covariates.
Verifies that pre-period characteristics are balanced across treatment arms. Simple t-tests can be used, but they are overly sensitive in online datasets: with very large sample sizes, tiny and practically negligible differences still yield statistically significant p-values (\(p < 0.05\)). Standardized Mean Differences (SMD) are therefore computed as the primary effect size metric.
Mathematical Representation
- Standardized Mean Difference (SMD) for continuous covariates: Let \(\bar{X}_T\) and \(\bar{X}_C\) be the sample means of a covariate \(X\) in the treatment and control groups, and let \(s_T^2\) and \(s_C^2\) be their sample variances.
  $$ \text{SMD} = \frac{\bar{X}_T - \bar{X}_C}{\sqrt{\frac{s_T^2 + s_C^2}{2}}} $$
- Pearson Chi-Square Test for Independence for categorical covariates: Evaluates whether the proportion of units in each category is independent of treatment.
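The SMD formula above can be sketched in a few lines of standard-library Python. The function name `standardized_mean_difference` and the 0.1 cutoff are illustrative assumptions: 0.1 is a commonly cited rule of thumb for "negligible imbalance", not a threshold confirmed by this module.

```python
import math
import statistics


def standardized_mean_difference(treatment, control):
    """SMD = (mean_T - mean_C) / sqrt((var_T + var_C) / 2), using sample variances."""
    pooled_sd = math.sqrt(
        (statistics.variance(treatment) + statistics.variance(control)) / 2
    )
    return (statistics.fmean(treatment) - statistics.fmean(control)) / pooled_sd


# Toy pre-period covariate values for each arm.
treatment_values = [2, 4, 6]
control_values = [1, 3, 5]

smd = standardized_mean_difference(treatment_values, control_values)
ASSUMED_THRESHOLD = 0.1  # assumption: rule-of-thumb cutoff, not taken from the library
print("SMD:", smd, "balanced:", abs(smd) < ASSUMED_THRESHOLD)
```

Unlike a t-test p-value, the SMD does not shrink toward "significant" as the sample grows, which is why it serves as the primary balance metric.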
| PARAMETER | DESCRIPTION |
|---|---|
| df | The experimental dataset containing units, treatment assignments, and covariates. |
| treatment_col | Column name identifying experimental groups/arms. |
| covariate_cols | List of column names representing categorical or continuous pre-experiment covariates. |
| RETURNS | DESCRIPTION |
|---|---|
| dict | A dictionary mapping each covariate name to a diagnostic sub-dictionary containing SMD, p-values, and balance classification tags. |
Source code in src\xpyrment\validate\balance.py
check_novelty_effects
Detects novelty or primacy effects by tracking treatment effect size evolution over time.
In online user testing, two common behavioral biases can distort short-term results:
- Novelty Effect: Users are initially drawn to a redesigned feature, leading to a temporary surge in engagement that decays back to baseline.
- Primacy (or Learning) Effect: Users are initially slowed down, causing a temporary dip in conversion that recovers once they adapt to the change.
Mathematical Representation and Regression Detection
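We fit an ordinary least squares (OLS) regression model with an interaction term between treatment \(T_i \in \{0, 1\}\) and elapsed time \(t_i\):
$$ Y_i = \beta_0 + \beta_1 T_i + \beta_2 t_i + \beta_3 (T_i \times t_i) + \varepsilon_i $$
In this model the treatment effect at time \(t\) is \(\beta_1 + \beta_3 t\), so a statistically significant \(\beta_3\) indicates that the effect changes over time.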
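As a rough illustration of the interaction model, the sketch below fits the four coefficients by solving the normal equations with plain Python. The helpers `solve_linear_system` and `fit_interaction_ols` are hypothetical names, and the sketch returns point estimates only; it does not compute the standard errors or p-values that the documented function reports. Reading a positive \(\beta_1\) with a negative \(\beta_3\) as a novelty pattern (and the reverse as a primacy pattern) is an interpretation assumed here, not a rule stated by the library.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_interaction_ols(y, treatment, elapsed):
    """Return [b0, b1, b2, b3] for Y = b0 + b1*T + b2*t + b3*(T*t)."""
    rows = [
        [1.0, float(g), float(t), float(g) * float(t)]
        for g, t in zip(treatment, elapsed)
    ]
    k = 4
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y_i for r, y_i in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Noise-free toy data: the treatment lift starts at +2.0 and decays by 0.3 per day.
treatment = [0] * 7 + [1] * 7
elapsed = list(range(7)) * 2
y = [1.0 + 2.0 * g + 0.5 * t - 0.3 * g * t for g, t in zip(treatment, elapsed)]

beta = fit_interaction_ols(y, treatment, elapsed)
print("b0, b1, b2, b3 =", [round(b, 6) for b in beta])
```

In practice a regression library would be used so that the interaction coefficient comes with a standard error and a p-value.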
| PARAMETER | DESCRIPTION |
|---|---|
| df | The experimental dataset. TYPE: `pd.DataFrame` |
| treatment_col | Column name identifying experimental groups/arms. TYPE: `str` |
| metric_col | Column containing the evaluated metric (continuous or rates). TYPE: `str` |
| time_col | Column name representing the timestamp or elapsed date index. TYPE: `str` |
| RETURNS | DESCRIPTION |
|---|---|
| dict | A dictionary containing estimated interaction coefficients, standard errors, p-values, and behavioral bias classifications. |
Source code in src\xpyrment\validate\novelty.py
check_srm
Calculates the Chi-square p-value to check for Sample Ratio Mismatch (SRM).
Sample Ratio Mismatch (SRM) is one of the most critical diagnostic flags in web and system experimentation. It indicates that the observed sample allocation counts deviate from the planned/designed allocation ratios. This method performs a Pearson Chi-square goodness-of-fit test to determine whether the observed counts are statistically compatible with the expected ratios.
Mathematical Formulation
Let \(k\) be the number of variants, let \(O_i\) be the observed count of units in variant \(i\) (\(i \in \{1, \dots, k\}\)), and let \(r_i\) be the planned allocation ratio for variant \(i\). The total observed sample size is:

$$ N = \sum_{i=1}^{k} O_i $$

The expected sample count \(E_i\) for variant \(i\) is calculated as:

$$ E_i = N \times \frac{r_i}{\sum_{j=1}^{k} r_j} $$

The Pearson Chi-square test statistic is computed as:

$$ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} $$

Under the null hypothesis \(H_0\) (there is no SRM, and the assignment mechanism is unbiased):

$$ \chi^2 \sim \chi^2_{k-1} $$

where \(k-1\) is the degrees of freedom of the distribution. The p-value is calculated as:

$$ p = 1 - F_{\chi^2_{k-1}}(\chi^2_{\text{calc}}) $$

where \(\chi^2_{\text{calc}}\) is the computed value of the test statistic and \(F\) is the cumulative distribution function of the Chi-square distribution.
Interpretation Threshold
- If \(p < 0.001\) (\(0.1\%\) significance): The null hypothesis of perfect assignment is rejected. An SRM is highly likely, signaling a telemetry or system bug that invalidates downstream causal inferences.
- Common causes of SRM: browser-specific treatment crashes, asymmetric page-redirection delays, bot filters interacting with treatment flags, or mid-experiment changes in allocation rates.
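The formulation and threshold above can be traced with a small standard-library sketch. It is an illustration under stated assumptions, not the contents of srm.py: `chi2_sf` and `srm_p_value` are hypothetical helpers, the survival function uses the closed-form recurrence for integer degrees of freedom, and the sketch returns the p-value instead of raising `SRMError`.

```python
import math


def chi2_sf(x, df):
    """Survival function 1 - F(x) of a chi-square distribution, integer df >= 1."""
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)  # Q(1, h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))  # Q(1/2, h)
    # Recurrence: Q(a + 1, h) = Q(a, h) + h^a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        if h > 0:
            q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


def srm_p_value(observed_counts, expected_ratios):
    """Pearson chi-square goodness-of-fit p-value for the observed allocation."""
    n = sum(observed_counts)
    total_ratio = sum(expected_ratios)
    expected = [n * r / total_ratio for r in expected_ratios]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected))
    return chi2_sf(chi2, len(observed_counts) - 1)


SRM_THRESHOLD = 0.001  # threshold stated in the interpretation section above

for counts in ([5050, 4950], [5200, 4800]):
    p = srm_p_value(counts, [0.5, 0.5])
    print(counts, "p-value:", p, "SRM flagged:", p < SRM_THRESHOLD)
```

For a 50/50 design, a 5050 vs. 4950 split gives \(\chi^2 = 1\) and is unremarkable, whereas 5200 vs. 4800 gives \(\chi^2 = 16\) and falls well below the 0.001 threshold.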
| PARAMETER | DESCRIPTION |
|---|---|
| observed_counts | The actual recorded sample sizes allocated to each variant (e.g., |
| expected_ratios | The target allocation proportions or relative weights (e.g., |
| RETURNS | DESCRIPTION |
|---|---|
| float | The calculated p-value of the goodness-of-fit test. |
| RAISES | DESCRIPTION |
|---|---|
| SRMError | If the computed p-value is strictly less than 0.001, indicating a severe, non-random mismatch. |
Source code in src\xpyrment\validate\srm.py