Simulation
simulation
Synthetic data generators for validation, profiling, and unit testing.
This module provides the generate_ab_data utility, which generates realistic experimental datasets containing continuous,
binary, and ratio metrics across pre-period and post-period windows, with a configurable correlation between pre-period and post-period revenue.
| CLASS | DESCRIPTION |
|---|---|
| `ExperimentSimulator` | Runs extensive Monte Carlo simulations to validate experimental designs and algorithms. |
| FUNCTION | DESCRIPTION |
|---|---|
| `generate_ab_data` | Generates synthetic A/B test data simulating continuous, binary, and ratio metrics. |
ExperimentSimulator
Runs extensive Monte Carlo simulations to validate experimental designs and algorithms.
TODO: Add Instrumental Variables (LATE/CACE) corrections for treatment-effect estimation under non-compliance in the synthetic panel.
TODO: Support synthetic network topology configurations to simulate clustered, graph-based interference spillovers.
| METHOD | DESCRIPTION |
|---|---|
| `generate_synthetic_panel` | Generates a synthetic panel with potential non-compliance and spillover. |
| `run_monte_carlo` | Runs repeated Monte Carlo trials to compute empirical power, bias, and MSE of a standard T-test. |
Source code in src\xpyrment\simulation.py
generate_synthetic_panel
generate_synthetic_panel(
n_samples: int = 1000,
baseline_mean: float = 10.0,
treatment_effect: float = 1.0,
non_compliance_rate: float = 0.0,
spillover_effect: float = 0.0,
) -> DataFrame
Generates a synthetic panel with potential non-compliance and spillover.
| PARAMETER | DESCRIPTION |
|---|---|
| `n_samples` | Number of experimental units. TYPE: `int` |
| `baseline_mean` | Baseline outcome intercept. TYPE: `float` |
| `treatment_effect` | Treatment causal impact (CATE). TYPE: `float` |
| `non_compliance_rate` | Probability of failing to comply with assignment. TYPE: `float` |
| `spillover_effect` | Causal impact spilled over onto control units from treatment. TYPE: `float` |
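As a rough illustration of what these parameters could mean together, the sketch below builds a small panel. It is not the library's implementation: the function name `sketch_synthetic_panel`, the `seed` argument, the unit-variance noise, the column names, and the choice of one-sided non-compliance (only units assigned to treatment can fail to take it up) are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def sketch_synthetic_panel(
    n_samples: int = 1000,
    baseline_mean: float = 10.0,
    treatment_effect: float = 1.0,
    non_compliance_rate: float = 0.0,
    spillover_effect: float = 0.0,
    seed: int = 0,  # hypothetical: the documented signature has no seed
) -> pd.DataFrame:
    """Illustrative panel generator; not the xpyrment implementation."""
    rng = np.random.default_rng(seed)

    # Random assignment to control (0) or treatment (1).
    assigned = rng.integers(0, 2, size=n_samples)

    # Assumption: one-sided non-compliance. A unit assigned to treatment
    # fails to receive it with probability non_compliance_rate.
    complies = rng.random(n_samples) >= non_compliance_rate
    received = assigned * complies

    # Assumption: spillover is a constant shift on control-assigned units.
    outcome = (
        baseline_mean
        + treatment_effect * received
        + spillover_effect * (1 - assigned)
        + rng.normal(0.0, 1.0, size=n_samples)  # assumed unit-variance noise
    )
    return pd.DataFrame(
        {"assigned": assigned, "received": received, "outcome": outcome}
    )


panel = sketch_synthetic_panel(n_samples=2000)
by_arm = panel.groupby("assigned")["outcome"].mean()
naive_effect = by_arm[1] - by_arm[0]
```

Under these assumptions, both non-compliance and spillover pull the naive difference in means toward zero, which is the kind of bias a simulator like this is meant to expose.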
run_monte_carlo
run_monte_carlo(
n_simulations: int = 50,
n_samples: int = 500,
baseline_mean: float = 10.0,
treatment_effect: float = 1.0,
non_compliance_rate: float = 0.0,
spillover_effect: float = 0.0,
alpha: float = 0.05,
) -> Dict[str, Any]
Runs repeated Monte Carlo trials to compute empirical power, bias, and MSE of a standard T-test.
| RETURNS | DESCRIPTION |
|---|---|
| `Dict[str, Any]` | Calculated statistical performance metrics. |
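The sketch below shows one way empirical power, bias, and MSE can be computed from repeated trials. It is a simplified stand-in, not the method's actual code: the function name `sketch_monte_carlo`, the `seed` argument, the unit-variance normal outcomes, the use of Welch's t-test, and the dictionary keys are assumptions, and non-compliance and spillover are left out for brevity.

```python
from typing import Any, Dict

import numpy as np
from scipy import stats


def sketch_monte_carlo(
    n_simulations: int = 50,
    n_samples: int = 500,
    baseline_mean: float = 10.0,
    treatment_effect: float = 1.0,
    alpha: float = 0.05,
    seed: int = 0,  # hypothetical: the documented signature has no seed
) -> Dict[str, Any]:
    """Illustrative Monte Carlo loop; not the xpyrment implementation."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_simulations)
    rejections = np.empty(n_simulations, dtype=bool)
    half = n_samples // 2

    for i in range(n_simulations):
        # Assumed outcome model: normal with unit variance in both arms.
        control = rng.normal(baseline_mean, 1.0, size=half)
        treated = rng.normal(
            baseline_mean + treatment_effect, 1.0, size=n_samples - half
        )
        estimates[i] = treated.mean() - control.mean()
        # Welch's two-sample t-test (assumed variant of the "standard T-test").
        _, p_value = stats.ttest_ind(treated, control, equal_var=False)
        rejections[i] = p_value < alpha

    errors = estimates - treatment_effect
    return {
        "power": float(rejections.mean()),  # share of trials rejecting H0
        "bias": float(errors.mean()),       # mean(estimate) - true effect
        "mse": float(np.mean(errors**2)),   # mean squared estimation error
    }


metrics = sketch_monte_carlo(n_simulations=200)
null_metrics = sketch_monte_carlo(n_simulations=200, treatment_effect=0.0)
```

Running the same loop with `treatment_effect=0.0` turns the "power" entry into an empirical Type I error rate, which should sit near `alpha` when the test is well calibrated.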
generate_ab_data
generate_ab_data(
n_samples: int = 10000,
treatment_fraction: float = 0.5,
baseline_revenue: float = 10.0,
treatment_effect_revenue: float = 0.5,
baseline_conversion: float = 0.15,
treatment_effect_conversion: float = 0.02,
baseline_clicks_mean: float = 5.0,
baseline_impressions_mean: float = 100.0,
treatment_effect_clicks: float = 0.3,
pre_period_correlation: float = 0.7,
random_seed: int = 42,
) -> DataFrame
Generates synthetic A/B test data simulating continuous, binary, and ratio metrics.
Constructs a high-fidelity synthetic evaluation dataset. This generator is crucial for testing the validity of the statistical engines, validating Type I / Type II error rates, and profiling variance reduction (CUPED) performance. It generates both pre-period and post-period metrics to support covariate-adjustment modeling.
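To make the CUPED use case concrete, here is a minimal sketch of the adjustment applied to correlated pre/post data. It does not call the library: the data is drawn directly with NumPy, and the standard deviation `sigma`, the helper name `cuped_adjust`, and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, mu, rho = 10_000, 10.0, 0.7
sigma = 2.0  # hypothetical standard deviation; not specified in this page

# Draw correlated (pre, post) pairs with correlation rho.
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
pre, post = rng.multivariate_normal([mu, mu], cov, size=n).T


def cuped_adjust(post: np.ndarray, pre: np.ndarray) -> np.ndarray:
    """Subtract the part of `post` that is linearly explained by `pre`."""
    theta = np.cov(pre, post, ddof=1)[0, 1] / np.var(pre, ddof=1)
    return post - theta * (pre - pre.mean())


adjusted = cuped_adjust(post, pre)

# The mean is preserved while the variance shrinks by roughly 1 - rho**2.
variance_ratio = adjusted.var(ddof=1) / post.var(ddof=1)
```

With a pre/post correlation of 0.7, the adjusted metric should retain about half of the original variance, which is why a generator with a controllable correlation is useful for profiling this technique.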
Mathematical and Generative Specifications
- Continuous Metric (Revenue) with Pre/Post Covariance: Revenue is modeled using a bivariate normal distribution to inject a pre-defined correlation (\(\rho\)) between pre-period (covariate) and post-period (outcome) performance.
    - Let \(Y_i = [Y_{i, \text{pre}}, Y_{i, \text{post}}]^T\) be the revenue vector for unit \(i\).
    - Under the Control variant, the mean vector is \(\boldsymbol{\mu}_C = [\mu_{\text{baseline}}, \mu_{\text{baseline}}]^T\).
    - Under the Treatment variant, the mean vector is \(\boldsymbol{\mu}_T = [\mu_{\text{baseline}}, \mu_{\text{baseline}} + \delta_{\text{rev}}]^T\).
    - The covariance matrix \(\boldsymbol{\Sigma}\) is configured using standard deviation \(\sigma\) and target correlation \(\rho\): $$ \boldsymbol{\Sigma} = \begin{bmatrix} \sigma^2 & \rho \sigma^2 \\ \rho \sigma^2 & \sigma^2 \end{bmatrix} $$
    - We sample \(Y_i \sim \mathcal{N}_2(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})\) and apply a non-negative floor: $$ Y_{i} \leftarrow \max(Y_i, 0) $$
- Binary Rate Metric (Conversions): Conversions are modeled as independent Bernoulli trials:
    - For Control: \(\text{Converted}_i \sim \text{Bernoulli}(p_C)\) where \(p_C = p_{\text{baseline}}\).
    - For Treatment: \(\text{Converted}_i \sim \text{Bernoulli}(p_T)\) where \(p_T = \min(\max(p_{\text{baseline}} + \delta_{\text{conv}}, 0), 1)\).
- Ratio Metric (Clicks and Impressions for Click-Through Rate): Simulates CTR stochastically, introducing user-level heterogeneity and a positive skew:
    - Post-period impressions follow a Poisson distribution: $$ \text{Impressions}_i \sim \text{Poisson}(\lambda_{\text{baseline_impressions}}) $$ with a minimum threshold of \(1\) to prevent divisions by zero.
    - Click probabilities for each user follow a Beta distribution to model user variance (Beta-Binomial stochastics): $$ p_{i, \text{CTR}} \sim \text{Beta}(a_k, b_k) $$ where the shape parameters \(a_k, b_k\) are derived to match the expected CTR of the respective group: $$ a_k = \text{CTR}_k \times 10, \quad b_k = (1 - \text{CTR}_k) \times 10 $$
    - Finally, individual clicks are simulated using Binomial trials: $$ \text{Clicks}_i \sim \text{Binomial}(\text{Impressions}_i, \ p_{i, \text{CTR}}) $$
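The three generative steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the function's source: the standard deviation `sigma`, the column names, and the definition of the group CTR as expected clicks divided by expected impressions are not given in this page and are chosen here only to make the example concrete.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples = 10_000
is_treatment = rng.random(n_samples) < 0.5  # treatment_fraction = 0.5

# 1. Revenue: bivariate normal (pre, post) with correlation rho.
sigma = 2.0  # hypothetical standard deviation
rho = 0.7
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
revenue = rng.multivariate_normal([10.0, 10.0], cov, size=n_samples)
revenue[:, 1] += 0.5 * is_treatment  # delta_rev shifts only the post-period mean
revenue = np.maximum(revenue, 0.0)   # non-negative floor

# 2. Conversions: Bernoulli draws with a clipped treatment probability.
p_conv = np.clip(0.15 + 0.02 * is_treatment, 0.0, 1.0)
converted = rng.binomial(1, p_conv)

# 3. Clicks and impressions: Poisson impressions, Beta click probability,
#    then Binomial clicks.
impressions = np.maximum(rng.poisson(100.0, size=n_samples), 1)
# Assumption: group CTR = expected clicks / expected impressions.
ctr = (5.0 + 0.3 * is_treatment) / 100.0
p_click = rng.beta(ctr * 10, (1 - ctr) * 10)
clicks = rng.binomial(impressions, p_click)

df = pd.DataFrame(
    {
        # Column names are hypothetical.
        "variant": np.where(is_treatment, "treatment", "control"),
        "revenue_pre": revenue[:, 0],
        "revenue_post": revenue[:, 1],
        "converted": converted,
        "impressions": impressions,
        "clicks": clicks,
    }
)
```

Because a Beta distribution with shapes \(a_k, b_k\) has mean \(a_k / (a_k + b_k)\), the choice above keeps the expected click probability equal to \(\text{CTR}_k\), while the fixed sum \(a_k + b_k = 10\) sets how widely users spread around it.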
| PARAMETER | DESCRIPTION |
|---|---|
| `n_samples` | The total number of experimental units (users) to simulate. Defaults to 10000. TYPE: `int` |
| `treatment_fraction` | The probability of assignment to the Treatment group. Defaults to 0.5. TYPE: `float` |
| `baseline_revenue` | The baseline mean revenue (\(\mu_{\text{baseline}}\)). Defaults to 10.0. TYPE: `float` |
| `treatment_effect_revenue` | The absolute revenue lift in Treatment (\(\delta_{\text{rev}}\)). Defaults to 0.5. TYPE: `float` |
| `baseline_conversion` | The baseline conversion rate (\(p_{\text{baseline}}\)). Defaults to 0.15. TYPE: `float` |
| `treatment_effect_conversion` | The absolute conversion lift in Treatment (\(\delta_{\text{conv}}\)). Defaults to 0.02. TYPE: `float` |
| `baseline_clicks_mean` | Baseline expected clicks. Defaults to 5.0. TYPE: `float` |
| `baseline_impressions_mean` | Baseline expected impressions (\(\lambda\)). Defaults to 100.0. TYPE: `float` |
| `treatment_effect_clicks` | Incremental clicks in Treatment. Defaults to 0.3. TYPE: `float` |
| `pre_period_correlation` | The target correlation coefficient (\(\rho\)) between pre-period and post-period continuous revenue. Defaults to 0.7. TYPE: `float` |
| `random_seed` | Pseudo-random seed to guarantee reproducibility. Defaults to 42. TYPE: `int` |
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | A DataFrame containing simulated user IDs, variant assignments, and pre/post metrics. |