Stratification
stratification
Stratified and clustered randomization engines for balanced covariate distributions.
This module provides stratified randomization algorithms. Stratification ensures that critical pre-experiment covariates (e.g., country, customer segment, historical purchase volume) are balanced proportionally across all treatment arms, reducing pre-experiment bias and enhancing statistical power.
| FUNCTION | DESCRIPTION |
|---|---|
| `stratified_randomization` | Performs stratified randomization to ensure balance on continuous/categorical covariates. |
stratified_randomization
stratified_randomization(
df: DataFrame,
strata_cols: list,
variants: Optional[List[str]] = None,
treatment_col: str = "variant",
random_state: Optional[int] = None,
) -> DataFrame
Performs stratified randomization to ensure balance on continuous/categorical covariates.
In simple randomization, small sample sizes or highly variable covariates can leave the treatment arms imbalanced by chance, which can distort the comparison between arms. Stratified randomization addresses this by dividing the population into mutually exclusive, homogeneous subgroups (strata) based on the provided covariates, and then randomizing independently within each stratum.
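To make the chance-imbalance point concrete, the following minimal sketch simulates simple randomization inside one small stratum. It assumes each unit is assigned by an independent fair coin flip between two arms; the stratum size of 4 and the simulation count are illustrative choices, not values taken from this module.

```python
from math import comb

import numpy as np

stratum_size = 4      # hypothetical small stratum
n_sims = 10_000       # number of simulated experiments

rng = np.random.default_rng(0)
# Each row is one experiment; each entry is 1 if that unit drew "treatment".
draws = rng.integers(0, 2, size=(n_sims, stratum_size))
treated = draws.sum(axis=1)

# Simulated shares of exactly even splits and of completely one-sided splits.
share_even = float(np.mean(treated == stratum_size // 2))
share_one_sided = float(np.mean((treated == 0) | (treated == stratum_size)))

# Exact binomial values for comparison.
exact_even = comb(stratum_size, stratum_size // 2) / 2**stratum_size
exact_one_sided = 2 / 2**stratum_size

print(f"even split: simulated {share_even:.3f}, exact {exact_even:.3f}")
print(f"one-sided:  simulated {share_one_sided:.3f}, exact {exact_one_sided:.3f}")
```

Under these assumptions the exact probability of a 2/2 split is 6/16 = 0.375, and all four units land in the same arm with probability 2/16 = 0.125. Randomizing within the stratum removes that outcome entirely.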
Mathematical and Algorithmic Background
Let \(D\) be the dataset of size \(N\), and let \(C = \{c_1, c_2, \dots, c_p\}\) be the set of stratification columns.

1. Strata Construction: We partition \(D\) into \(K\) disjoint subsets (strata), \(\{D_1, D_2, \dots, D_K\}\), such that within each subset \(D_j\), all units share identical values for all stratification columns \(C\):

$$ D = \bigcup_{j=1}^{K} D_j \quad \text{where} \quad D_a \cap D_b = \emptyset \ \ \forall \ a \neq b $$

2. Intra-Stratum Randomization: For each stratum \(D_j\), units are randomly permuted and assigned to treatment arms. If the treatment arm proportions are \(\{w_1, w_2, \dots, w_m\}\), where \(m\) is the number of arms (distinct from the number of strata \(K\)), then within every stratum \(D_j\) the assignment counts match the target proportions up to rounding:

$$ n_{j, \text{arm } i} \approx w_i \times |D_j| $$

This reduces the variance of the treatment effect estimator by removing the between-stratum variation attributable to the stratification covariates.
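The two steps above can be sketched in a few lines of pandas. This is an illustrative stand-in, not the module's actual implementation: the name `stratified_assign_sketch`, the fallback labels `["control", "treatment"]`, equal arm proportions, and the round-robin dealing with a random starting arm are all assumptions made for the example.

```python
from typing import List, Optional

import numpy as np
import pandas as pd


def stratified_assign_sketch(
    df: pd.DataFrame,
    strata_cols: List[str],
    variants: Optional[List[str]] = None,
    treatment_col: str = "variant",
    random_state: Optional[int] = None,
) -> pd.DataFrame:
    """Hypothetical sketch of stratified assignment with equal arm weights."""
    # Assumed fallback labels; the module's real default is not shown here.
    labels = variants if variants is not None else ["control", "treatment"]
    rng = np.random.default_rng(random_state)
    assignment = pd.Series(index=df.index, dtype=object)

    # Step 1: partition rows into strata that share identical covariate values.
    # Assumes a unique index and no missing values in the strata columns.
    for _, stratum in df.groupby(strata_cols, sort=True):
        # Step 2: permute the stratum, then deal labels out round-robin.
        shuffled = rng.permutation(stratum.index.to_numpy())
        # A random starting arm spreads leftover units across arms.
        offset = int(rng.integers(len(labels)))
        dealt = [labels[(i + offset) % len(labels)] for i in range(len(shuffled))]
        assignment.loc[shuffled] = dealt

    out = df.copy()
    out[treatment_col] = assignment
    return out


demo = pd.DataFrame({
    "user_id": range(10),
    "country": ["US"] * 6 + ["DE"] * 4,
})
result = stratified_assign_sketch(demo, ["country"], random_state=42)
counts = pd.crosstab(result["country"], result["variant"])
print(counts)
```

Because both demo strata have an even number of units and there are two arms, each stratum splits exactly in half (3/3 for `US`, 2/2 for `DE`) regardless of the seed. With odd-sized strata the arm counts within a stratum differ by at most one.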
| PARAMETER | DESCRIPTION |
|---|---|
| `df` | The input DataFrame containing experimental units and their covariate values. TYPE: `DataFrame` |
| `strata_cols` | List of column names in `df` that define the strata. TYPE: `list` |
| `variants` | Ordered list of variant labels. Defaults to `None`. TYPE: `Optional[List[str]]` |
| `treatment_col` | Column name where the assigned variant label will be written. Defaults to `"variant"`. TYPE: `str` |
| `random_state` | Integer seed used to initialize a local random state generator for reproducibility. TYPE: `Optional[int]` |
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | A new DataFrame with treatment assignments balanced across the specified strata. |
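One way to check the balance property of a returned DataFrame is to tabulate arm counts per stratum. The helper below, `balance_table`, is a hypothetical name introduced for this sketch. The small hand-built frame stands in for the function's output, since the module's import path is not shown on this page.

```python
import pandas as pd


def balance_table(assigned: pd.DataFrame, strata_cols, treatment_col="variant"):
    """Hypothetical helper: arm counts per stratum plus the largest gap."""
    counts = (
        assigned.groupby(list(strata_cols) + [treatment_col])
        .size()
        .unstack(treatment_col, fill_value=0)
    )
    # Difference between the largest and smallest arm within each stratum.
    counts["max_gap"] = counts.max(axis=1) - counts.min(axis=1)
    return counts


# Hand-built stand-in for an assigned DataFrame (illustrative data only).
assigned = pd.DataFrame({
    "country": ["US", "US", "US", "US", "DE", "DE", "DE"],
    "variant": ["control", "treatment", "control", "treatment",
                "control", "treatment", "control"],
})
table = balance_table(assigned, ["country"])
print(table)
```

With equal arm proportions, a `max_gap` of 0 or 1 in every stratum is what within-stratum randomization would be expected to produce. Larger gaps suggest the assignment was not balanced on those columns.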