Inference Module

The xpyrment.analyze.inference module contains submodules and components for inference.

inference

Statistical inference engines, frameworks, and decision-making systems.

This package provides standard and state-of-the-art inferential frameworks for computing treatment effects, confidence bounds, and decision probabilities.

Submodules: - frequentist: Standard hypothesis tests including Welch's t-test and Mann-Whitney U rank sums. - bayesian: Conjugate posterior models (Beta-Binomial, Normal-Normal) and expected loss decisions. - sequential: Always-Valid Confidence Intervals (AVCIs) derived from mSPRT boundaries. - bootstrap: Non-parametric percentile and BCa resampling estimators. - router: Automated matching between metric structures, designs, and calculation engines.

MODULE	DESCRIPTION
`bayesian`	Bayesian conjugate models and decision-making parameters.
`bootstrap`	Non-parametric bootstrap resampling and confidence interval estimation (Block 54).
`frequentist`	Frequentist parametric and non-parametric statistical tests.
`router`	Intelligent statistical inference engine router.
`sequential`	Sequential inference, continuous peeking, and always-valid confidence intervals (AVCI).

CLASS	DESCRIPTION
`BayesianInference`	Computes Bayesian posterior parameters, probability of being best, expected loss, and ROPE.
`SequentialInference`	Computes sequential monitoring bounds and always-valid confidence intervals (AVCIs).

FUNCTION	DESCRIPTION
`run_bootstrap_ci`	Computes non-parametric bootstrap confidence intervals for arbitrary complex metrics.
`run_block_bootstrap_ci`	Computes non-parametric block bootstrap confidence intervals for dependent/autocorrelated data.
`run_welch_t_test`	Performs Welch's t-test for difference of means with unequal variances.
`run_mann_whitney_u`	Performs nonparametric Mann-Whitney U test for ordinal or non-normal continuous data.
`route_inference_engine`	Routes to the correct statistical engine based on metric and design types.

BayesianInference

BayesianInference(model_type: str = 'beta_binomial')

Computes Bayesian posterior parameters, probability of being best, expected loss, and ROPE.

Bayesian inference offers a direct, probabilistic interpretation of treatment effects, avoiding the complex and frequently misunderstood reasoning of frequentist p-values. It provides answers to intuitive questions like: "What is the probability that Treatment B is superior to Control A?" or "What is the expected loss if I ship Treatment B?"

Mathematical Formulation of Conjugate Models

Conjugate models allow the analytical calculation of posterior distributions without requiring expensive Markov Chain Monte Carlo (MCMC) sampling:

Beta-Binomial Model (for binary conversion rates, $p \in [0, 1]$):
Prior: $p \sim \text{Beta}(\alpha_0, \beta_0)$ (e.g., $\text{Beta}(1, 1)$ for a flat, uniform prior).
Likelihood: Binomial ($k$ conversions out of $n$ trials).
Posterior: $$ p|k, n \sim \text{Beta}(\alpha_0 + k, \ \beta_0 + n - k) $$
Normal-Normal Model (for continuous averages, $\mu \in \mathbb{R}$, with known variance $\sigma^2$):
Prior: $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$.
Likelihood: Normal ($N$ observations with sample mean $\bar{Y}$ and variance $\sigma^2$).
Posterior: $$ \mu| \bar{Y} \sim \mathcal{N}(\mu_N, \sigma_N^2) $$ where the posterior precision ($1/\sigma_N^2$) and posterior mean ($\mu_N$) are calculated as: $$ \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \quad \text{and} \quad \mu_N = \sigma_N^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{N\bar{Y}}{\sigma^2} \right) $$
Gamma-Poisson Model (for discrete counts/rates, $\lambda \in [0, \infty)$):
Prior: $\lambda \sim \text{Gamma}(\alpha, \beta)$ (shape $\alpha$, rate $\beta$).
Likelihood: Poisson ($k$ total counts across $n$ units).
Posterior: $$ \lambda|k, n \sim \text{Gamma}(\alpha + k, \beta + n) $$

Decision-Making Criteria and Analytics

Probability of Being Best (PBB): The probability that the treatment parameter $\theta_T$ is strictly greater than the control parameter $\theta_C$: $$ \text{PBB} = P(\theta_T > \theta_C) = \int_{-\infty}^{\infty} f_C(\theta_C) [1 - F_T(\theta_C)] \, d\theta_C $$
Expected Loss ($L$): The expected metric drop if the treatment is shipped but is actually inferior: $$ L(T) = \mathbb{E}[\max(\theta_C - \theta_T, 0)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\theta_C} (\theta_C - \theta_T) f_T(\theta_T) f_C(\theta_C) \, d\theta_T \, d\theta_C $$

ATTRIBUTE	DESCRIPTION
`model_type`	Conjugate model pairing label. Options: `"beta_binomial"`, `"normal_normal"`, `"gamma_poisson"`. Defaults to `"beta_binomial"`. TYPE: `str`

PARAMETER	DESCRIPTION
`model_type`	Conjugate model to use (`"beta_binomial"`, `"normal_normal"`, or `"gamma_poisson"`). Defaults to `"beta_binomial"`. TYPE: `str` DEFAULT: `'beta_binomial'`

METHOD	DESCRIPTION
`estimate_posterior`	Estimates conjugate posterior distributions based on prior settings and raw observations.

Source code in src\xpyrment\analyze\inference\bayesian.py

def __init__(self, model_type: str = "beta_binomial"):
    """Initializes a BayesianInference.

    Args:
        model_type (str): Conjugate model to use (`"beta_binomial"`, `"normal_normal"`, or `"gamma_poisson"`).
            Defaults to `"beta_binomial"`.
    """
    self.model_type = model_type

estimate_posterior

estimate_posterior(
    prior_params: dict,
    observed_data: dict,
    exact: bool = True,
) -> dict

Estimates conjugate posterior distributions based on prior settings and raw observations.

Performs the analytical conjugate update formulas, then computes PBB, Expected Loss, and credible intervals.

PARAMETER	DESCRIPTION
`prior_params`	Prior parameters (e.g., `{"alpha": 1, "beta": 1}` or `{"mean": 0, "variance": 1}`). TYPE: `dict`
`observed_data`	Observed outcomes (conversions and counts, or means and variances). TYPE: `dict`
`exact`	Whether to use numerical integration for exact PBB and Expected Loss. If False, uses Monte Carlo sampling (20k samples). Defaults to True. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`dict`	Posterior distribution parameters, credible intervals, and decision metrics. TYPE: `dict`

Source code in src\xpyrment\analyze\inference\bayesian.py

def estimate_posterior(self, prior_params: dict, observed_data: dict, exact: bool = True) -> dict:
    """Estimates conjugate posterior distributions based on prior settings and raw observations.

    Performs the analytical conjugate update formulas, then computes PBB, Expected Loss, and
    credible intervals.

    Args:
        prior_params (dict): Prior parameters (e.g., `{"alpha": 1, "beta": 1}` or `{"mean": 0, "variance": 1}`).
        observed_data (dict): Observed outcomes (conversions and counts, or means and variances).
        exact (bool): Whether to use numerical integration for exact PBB and Expected Loss.
            If False, uses Monte Carlo sampling (20k samples). Defaults to True.

    Returns:
        dict: Posterior distribution parameters, credible intervals, and decision metrics.
    """
    import numpy as np
    from scipy import stats, integrate

    if self.model_type == "beta_binomial":
        # Prior parameters
        a0 = prior_params.get("alpha", 1.0)
        b0 = prior_params.get("beta", 1.0)

        # Observed data
        k_c = observed_data.get("control_successes", observed_data.get("k_c", 0))
        n_c = observed_data.get("control_trials", observed_data.get("n_c", 1))
        k_t = observed_data.get("treatment_successes", observed_data.get("k_t", 0))
        n_t = observed_data.get("treatment_trials", observed_data.get("n_t", 1))

        # Posterior parameters
        dist_c = stats.beta(a0 + k_c, b0 + n_c - k_c)
        dist_t = stats.beta(a0 + k_t, b0 + n_t - k_t)

        res = {
            "control_posterior": {"param1": float(dist_c.args[0]), "param2": float(dist_c.args[1])},
            "treatment_posterior": {"param1": float(dist_t.args[0]), "param2": float(dist_t.args[1])}
        }

    elif self.model_type == "normal_normal":
        # Prior parameters
        mu_0 = prior_params.get("mean", prior_params.get("mu_0", 0.0))
        var_0 = prior_params.get("variance", prior_params.get("sigma_0_sq", 1.0))

        # Observed data
        mean_c = observed_data.get("control_mean", observed_data.get("mean_c", 0.0))
        var_c = observed_data.get("control_variance", observed_data.get("var_c", 1.0))
        n_c = observed_data.get("control_n", observed_data.get("n_c", 1))

        mean_t = observed_data.get("treatment_mean", observed_data.get("mean_t", 0.0))
        var_t = observed_data.get("treatment_variance", observed_data.get("var_t", 1.0))
        n_t = observed_data.get("treatment_n", observed_data.get("n_t", 1))

        # Posterior calculation
        prec_0 = 1.0 / var_0

        prec_c = prec_0 + n_c / var_c
        var_c_post = 1.0 / prec_c
        mu_c_post = var_c_post * (mu_0 * prec_0 + n_c * mean_c / var_c)

        prec_t = prec_0 + n_t / var_t
        var_t_post = 1.0 / prec_t
        mu_t_post = var_t_post * (mu_0 * prec_0 + n_t * mean_t / var_t)

        dist_c = stats.norm(mu_c_post, np.sqrt(var_c_post))
        dist_t = stats.norm(mu_t_post, np.sqrt(var_t_post))

        res = {
            "control_posterior": {"param1": float(mu_c_post), "param2": float(var_c_post)},
            "treatment_posterior": {"param1": float(mu_t_post), "param2": float(var_t_post)}
        }

    elif self.model_type == "gamma_poisson":
        # Prior parameters: shape alpha, rate beta (rate = 1/scale)
        a0 = prior_params.get("alpha", prior_params.get("shape", 1.0))
        b0 = prior_params.get("beta", prior_params.get("rate", 1.0))

        # Observed data: k counts in n exposure units
        k_c = observed_data.get("control_counts", observed_data.get("k_c", 0))
        n_c = observed_data.get("control_exposure", observed_data.get("n_c", 1))
        k_t = observed_data.get("treatment_counts", observed_data.get("k_t", 0))
        n_t = observed_data.get("treatment_exposure", observed_data.get("n_t", 1))

        # Posterior parameters: alpha_post = alpha + k, beta_post = beta + n
        # Scipy gamma uses scale = 1/beta
        dist_c = stats.gamma(a0 + k_c, scale=1.0 / (b0 + n_c))
        dist_t = stats.gamma(a0 + k_t, scale=1.0 / (b0 + n_t))

        res = {
            "control_posterior": {"param1": float(dist_c.args[0]), "param2": float(1.0 / dist_c.extra_args[0] if hasattr(dist_c, 'extra_args') else 1.0/dist_c.kwds['scale'])},
            "treatment_posterior": {"param1": float(dist_t.args[0]), "param2": float(1.0 / dist_t.extra_args[0] if hasattr(dist_t, 'extra_args') else 1.0/dist_t.kwds['scale'])}
        }
        # Fix param2 extraction for robustness
        res["control_posterior"]["param2"] = float(a0 + k_c) # wait no, param2 is rate
        res["control_posterior"]["param1"] = float(a0 + k_c)
        res["control_posterior"]["param2"] = float(b0 + n_c)
        res["treatment_posterior"]["param1"] = float(a0 + k_t)
        res["treatment_posterior"]["param2"] = float(b0 + n_t)

    else:
        raise ValueError(f"Unknown Bayesian model type: {self.model_type}")

    # Common metrics
    ci_c_lower, ci_c_upper = dist_c.ppf([0.025, 0.975])
    ci_t_lower, ci_t_upper = dist_t.ppf([0.025, 0.975])
    res["control_posterior"].update({"ci_lower": float(ci_c_lower), "ci_upper": float(ci_c_upper)})
    res["treatment_posterior"].update({"ci_lower": float(ci_t_lower), "ci_upper": float(ci_t_upper)})

    if exact:
        # Determine integration bounds based on the spread of the distributions
        # We use the [0.0001, 0.9999] quantile range to cover the meaningful parts of the PDF
        low_c, high_c = dist_c.ppf([0.0001, 0.9999])
        low_t, high_t = dist_t.ppf([0.0001, 0.9999])

        # Use the union of both ranges plus a small buffer
        lower = min(low_c, low_t)
        upper = max(high_c, high_t)

        # If the distribution is very narrow (e.g. at zero), ensure we have some width
        if upper <= lower:
            upper = lower + 1.0

        # Exact PBB: integral of f_c(x) * (1 - F_t(x))
        pbb, _ = integrate.quad(lambda x: dist_c.pdf(x) * (1 - dist_t.cdf(x)), lower, upper, limit=100)

        # Exact Expected Loss: integral of f_c(x) * integral_{-inf}^{x} (x - y) * f_t(y) dy
        # Which is: integral f_c(x) * [x * F_t(x) - integral_{-inf}^{x} y * f_t(y) dy]
        def loss_integrand(x):
            # Integral_{lower}^{x} y * f_t(y) dy is the partial mean
            inner_val, _ = integrate.quad(lambda y: y * dist_t.pdf(y), lower, x, limit=50)
            return dist_c.pdf(x) * (x * dist_t.cdf(x) - inner_val)

        expected_loss, _ = integrate.quad(loss_integrand, lower, upper, limit=100)
    else:
        # Monte Carlo fallback
        samples_c = dist_c.rvs(size=20000, random_state=42)
        samples_t = dist_t.rvs(size=20000, random_state=42)
        pbb = float(np.mean(samples_t > samples_c))
        expected_loss = float(np.mean(np.maximum(samples_c - samples_t, 0.0)))

    res.update({"pbb": float(pbb), "expected_loss": float(expected_loss)})
    return res

SequentialInference

Computes sequential monitoring bounds and always-valid confidence intervals (AVCIs).

In traditional experimentation, looking at confidence intervals before the test finishes (peeking) is statistically hazardous. Always-Valid Confidence Intervals (AVCIs) solve this by providing a sequence of intervals that cover the true parameter $\\theta$ at all steps $n \\ge 1$ simultaneously with a probability of at least $1 - \\alpha$.

Mathematical Definition and Boundary Formulas

Let $C_n$ be the confidence interval calculated at sample size $n$. For any nominal error rate $\\alpha \\in (0, 1)$: $$ \mathbb{P} \left( \forall n \ge 1, \ \theta \in C_n \right) \ge 1 - \alpha $$ This is an incredibly powerful property: it allows the experimenter to continuously plot the confidence interval over time, and if the interval does not contain zero at any point, the experiment can be stopped immediately with a guaranteed Type I error rate controlled at $\\alpha$.

AVCI Derivation from mSPRT: By inverting the mixture Sequential Probability Ratio Test (mSPRT) statistic for a normally distributed metric with unit baseline variance $\\sigma^2$ and mixing tuning parameter $\\tau^2$ (representing prior effect variance), the always-valid confidence interval at step $n$ is: $$ C_n = \left[ \bar{Y}_n - W_n, \ \bar{Y}_n + W_n \right] $$ where $\\bar{Y}_n$ is the observed sample mean difference, and the sequential margin of error $W_n$ is: $$ W_n = \sqrt{\frac{2\sigma^2(\sigma^2 + n\tau^2)}{n^2\tau^2} \ln\left( \frac{1}{\alpha} \sqrt{\frac{\sigma^2 + n\tau^2}{\sigma^2}} \right)} $$ Properties of $W_n$: - For very small $n$, $W_n$ is wider than the traditional fixed-sample Wald margin of error ($z_{1-\\alpha/2} \\sigma / \\sqrt{n}$), which mathematically penalizes and compensates for the continuous peeking. - As $n \\to \\infty$, the sequential margin $W_n$ asymptotically matches the rate of the traditional interval, ensuring zero long-term loss in statistical efficiency.

METHOD	DESCRIPTION
`calculate_always_valid_ci`	Computes continuous monitoring boundaries to prevent alpha inflation from peeking.

calculate_always_valid_ci

calculate_always_valid_ci(
    sample_size: int, alpha: float
) -> tuple

Computes continuous monitoring boundaries to prevent alpha inflation from peeking.

Calculates the exact sequential margin of error ($W_n$) at a given sample size, yielding always-valid confidence bounds around the observed effect size.

PARAMETER	DESCRIPTION
`sample_size`	The current accumulated number of units/trials ($n$). TYPE: `int`
`alpha`	The target false-positive rate ($\\alpha$). TYPE: `float`

RETURNS	DESCRIPTION
`tuple`	A tuple of floats `(lower_bound, upper_bound)` representing the always-valid interval bounds relative to the mean difference. TYPE: `tuple`

Source code in src\xpyrment\analyze\inference\sequential.py

def calculate_always_valid_ci(self, sample_size: int, alpha: float) -> tuple:
    r"""Computes continuous monitoring boundaries to prevent alpha inflation from peeking.

    Calculates the exact sequential margin of error ($W_n$) at a given sample size, yielding
    always-valid confidence bounds around the observed effect size.

    Args:
        sample_size (int): The current accumulated number of units/trials ($n$).
        alpha (float): The target false-positive rate ($\\alpha$).

    Returns:
        tuple: A tuple of floats `(lower_bound, upper_bound)` representing the always-valid interval bounds
            relative to the mean difference.
    """
    # TODO: Implement sequential boundary functions (mSPRT or Pocock)
    return (-1.0, 1.0)

run_bootstrap_ci

run_bootstrap_ci(
    data_group: ndarray,
    num_resamples: int = 2000,
    confidence_level: float = 0.95,
    method: str = "bca",
    random_seed: Optional[int] = None,
) -> tuple

Computes non-parametric bootstrap confidence intervals for arbitrary complex metrics.

Bootstrap resampling (Efron, 1979) is a non-parametric method used to estimate the standard error and confidence intervals of an estimator (such as means, medians, ratios, or quantiles). It is particularly valuable when the underlying metric distribution is highly non-normal (e.g., bi-modal, zero-inflated, or power-law) or when the estimator's mathematical variance cannot be easily derived analytically.

Mathematical and Algorithmic Formulation

Let \mathbf{x} = (x_1, x_2, \dots, x_n) be the observed sample of size $n$, and let \hat{\theta} = s(\mathbf{x}) be the point estimate of interest.

The bootstrap sampling distribution is constructed as follows: 1. Draw a bootstrap sample \mathbf{x}^{b} of size $n$ by sampling uniformly with replacement from the original sample \mathbf{x}$. 2. Calculate the bootstrap replication of the estimator: \\hat{\\theta}^{*b} = s(\\mathbf{x}^{*b})$. 3. Repeat steps 1-2 a large number of times $B$ (where $B = \\text{num\\_resamples}$, typically $B \\ge 2000$), generating a set of replicates: \{\hat{\theta}^{1}, \hat{\theta}^{2}, \dots, \hat{\theta}^{B}\}.

Confidence Interval Methods

Percentile Bootstrap (Simple and intuitive): Sorts the bootstrap replicates in ascending order: \hat{\theta}^{(1)} \le \hat{\theta}^{(2)} \le \dots \le \hat{\theta}^{(B)}. For a confidence level of $1 - \\alpha$ (e.g., $0.95$ with $\\alpha = 0.05$), the interval endpoints are the $\\alpha/2$ and $1 - \\alpha/2$ percentiles of the empirical bootstrap distribution: $$ \left[ \hat{\theta}^{(\lfloor B \cdot \alpha/2 \rfloor)}, \ \hat{\theta}^{*(\lfloor B \cdot (1 - \alpha/2) \rfloor)} \right] $$
Bias-Corrected and Accelerated (BCa) Bootstrap (Robust and accurate): Adjusts the percentile endpoints to correct for both median bias (displacement of the bootstrap distribution from the point estimate) and skewness (non-constant variance, represented by acceleration $a$).
The bias-correction factor $z_0$ is: $$ z_0 = \Phi^{-1} \left( \frac{\#\{\hat{\theta}^{*b} < \hat{\theta}\}}{B} \right) $$ where $\\Phi^{-1}$ is the inverse cumulative distribution function of the standard normal distribution.
The acceleration parameter $a$ is computed using jackknife (leave-one-out) estimators: $$ a = \frac{\sum_{i=1}^{n} (\bar{\theta}{(\cdot)} - \theta{(i)})^3}{6 \left[ \sum_{i=1}^{n} (\bar{\theta}{(\cdot)} - \theta{(i)})^2 \right]^{3/2}} $$ where $\\theta_{(i)}$ is the estimate of $\\theta$ calculated by omitting the $i$-th observation, and $\\bar{\\theta}_{(\\cdot)}$ is the average of these jackknife estimates.
Transformed confidence percentiles are then mapped back to the sorted replicates to construct the interval.

PARAMETER	DESCRIPTION
`data_group`	The raw 1D array of observed values. TYPE: `ndarray`
`num_resamples`	The number of bootstrap iterations ($B$). Defaults to 2000. TYPE: `int` DEFAULT: `2000`
`confidence_level`	The desired confidence interval width ($1 - \\alpha$). Defaults to 0.95. TYPE: `float` DEFAULT: `0.95`
`method`	The bootstrap method to utilize (`"percentile"` or `"bca"`). Defaults to `"bca"`. TYPE: `str` DEFAULT: `'bca'`
`random_seed`	Random seed for numpy generator reproducibility. Defaults to None. TYPE: `Optional[int]` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple`	A tuple of floats `(lower_bound, upper_bound)` representing the calculated interval. TYPE: `tuple`

Source code in src\xpyrment\analyze\inference\bootstrap.py

def run_bootstrap_ci(
    data_group: np.ndarray,
    num_resamples: int = 2000,
    confidence_level: float = 0.95,
    method: str = "bca",
    random_seed: Optional[int] = None,
) -> tuple:
    r"""Computes non-parametric bootstrap confidence intervals for arbitrary complex metrics.

    Bootstrap resampling (Efron, 1979) is a non-parametric method used to estimate the standard error and confidence
    intervals of an estimator (such as means, medians, ratios, or quantiles). It is particularly valuable when the
    underlying metric distribution is highly non-normal (e.g., bi-modal, zero-inflated, or power-law) or when the
    estimator's mathematical variance cannot be easily derived analytically.

    Mathematical and Algorithmic Formulation:
        Let \\mathbf{x} = (x_1, x_2, \\dots, x_n) be the observed sample of size $n$, and let \\hat{\\theta} = s(\\mathbf{x})
        be the point estimate of interest.

        The bootstrap sampling distribution is constructed as follows:
        1. Draw a bootstrap sample \\mathbf{x}^{*b} of size $n$ by sampling uniformly **with replacement** from the
           original sample \\mathbf{x}$.
        2. Calculate the bootstrap replication of the estimator: \\hat{\\theta}^{*b} = s(\\mathbf{x}^{*b})$.
        3. Repeat steps 1-2 a large number of times $B$ (where $B = \\text{num\\_resamples}$, typically $B \\ge 2000$),
           generating a set of replicates: \\{\\hat{\\theta}^{*1}, \\hat{\\theta}^{*2}, \\dots, \\hat{\\theta}^{*B}\\}.

    Confidence Interval Methods:
        1. **Percentile Bootstrap** (Simple and intuitive):
           Sorts the bootstrap replicates in ascending order: \\hat{\\theta}^{*(1)} \\le \\hat{\\theta}^{*(2)} \\le \\dots \\le \\hat{\\theta}^{*(B)}.
           For a confidence level of $1 - \\alpha$ (e.g., $0.95$ with $\\alpha = 0.05$), the interval endpoints are the
           $\\alpha/2$ and $1 - \\alpha/2$ percentiles of the empirical bootstrap distribution:
           $$
           \\left[ \\hat{\\theta}^{*(\\lfloor B \\cdot \\alpha/2 \\rfloor)}, \\ \\hat{\\theta}^{*(\\lfloor B \\cdot (1 - \\alpha/2) \\rfloor)} \\right]
           $$
        2. **Bias-Corrected and Accelerated (BCa) Bootstrap** (Robust and accurate):
           Adjusts the percentile endpoints to correct for both median bias (displacement of the bootstrap distribution
           from the point estimate) and skewness (non-constant variance, represented by acceleration $a$).
           - The bias-correction factor $z_0$ is:
             $$
             z_0 = \\Phi^{-1} \\left( \\frac{\\#\\{\\hat{\\theta}^{*b} < \\hat{\\theta}\\}}{B} \\right)
             $$
             where $\\Phi^{-1}$ is the inverse cumulative distribution function of the standard normal distribution.
           - The acceleration parameter $a$ is computed using jackknife (leave-one-out) estimators:
             $$
             a = \\frac{\\sum_{i=1}^{n} (\\bar{\\theta}_{(\\cdot)} - \\theta_{(i)})^3}{6 \\left[ \\sum_{i=1}^{n} (\\bar{\\theta}_{(\\cdot)} - \\theta_{(i)})^2 \\right]^{3/2}}
             $$
             where $\\theta_{(i)}$ is the estimate of $\\theta$ calculated by omitting the $i$-th observation, and
             $\\bar{\\theta}_{(\\cdot)}$ is the average of these jackknife estimates.
           - Transformed confidence percentiles are then mapped back to the sorted replicates to construct the interval.

    Args:
        data_group (np.ndarray): The raw 1D array of observed values.
        num_resamples (int): The number of bootstrap iterations ($B$). Defaults to 2000.
        confidence_level (float): The desired confidence interval width ($1 - \\alpha$). Defaults to 0.95.
        method (str): The bootstrap method to utilize (`"percentile"` or `"bca"`). Defaults to `"bca"`.
        random_seed (Optional[int]): Random seed for numpy generator reproducibility. Defaults to None.

    Returns:
        tuple: A tuple of floats `(lower_bound, upper_bound)` representing the calculated interval.
    """
    n = len(data_group)
    if n == 0:
        raise ValueError("Cannot run bootstrap on empty data group.")

    if method not in ("percentile", "bca"):
        raise ValueError(f"Unknown bootstrap method: {method}. Choose 'percentile' or 'bca'.")

    # Standard point estimate
    point_est = float(np.mean(data_group))

    # Guard against zero-variance or constant arrays (robust fallback)
    if np.var(data_group) < 1e-12 or np.all(data_group == data_group[0]):
        return (point_est, point_est)

    rng = np.random.default_rng(random_seed)

    # Prevent memory leaks or OOM exceptions via smart batch/chunk size calculation
    max_elements = 10_000_000
    total_elements = num_resamples * n

    replicates = np.zeros(num_resamples)

    if total_elements <= max_elements:
        # Fully vectorized execution in a single fast NumPy operation
        indices = rng.choice(n, size=(num_resamples, n), replace=True)
        replicates = np.mean(data_group[indices], axis=1)
    else:
        # Chunked vectorized evaluation to prevent memory spike
        chunk_size = max(1, max_elements // n)
        for start_idx in range(0, num_resamples, chunk_size):
            end_idx = min(start_idx + chunk_size, num_resamples)
            current_chunk_size = end_idx - start_idx
            indices = rng.choice(n, size=(current_chunk_size, n), replace=True)
            replicates[start_idx:end_idx] = np.mean(data_group[indices], axis=1)

    # Guard against perfectly degenerate bootstrap distributions
    if np.var(replicates) < 1e-12 or np.all(replicates == replicates[0]):
        return (point_est, point_est)

    alpha = 1.0 - confidence_level

    try:
        if method == "percentile":
            lower_pct = 100 * (alpha / 2.0)
            upper_pct = 100 * (1.0 - alpha / 2.0)
            lower_bound = float(np.percentile(replicates, lower_pct))
            upper_bound = float(np.percentile(replicates, upper_pct))
        elif method == "bca":
            # 1. Bias correction z0
            prop = np.mean(replicates < point_est)
            # Clip proportion to prevent norm.ppf boundary infinities
            prop = np.clip(prop, 1e-6, 1.0 - 1e-6)
            z0 = norm.ppf(prop)

            # 2. Acceleration parameter a using jackknife means
            if n > 1:
                sum_y = np.sum(data_group)
                jack_estimates = (sum_y - data_group) / (n - 1)
                mean_jack = np.mean(jack_estimates)
                diff = mean_jack - jack_estimates
                num = np.sum(diff ** 3)
                den = 6.0 * (np.sum(diff ** 2) ** 1.5)
                a = num / den if den > 1e-12 else 0.0
            else:
                a = 0.0

            # 3. Corrected percentiles
            z_alpha_2 = norm.ppf(alpha / 2.0)
            z_1_alpha_2 = norm.ppf(1.0 - alpha / 2.0)

            den1 = 1.0 - a * (z0 + z_alpha_2)
            den2 = 1.0 - a * (z0 + z_1_alpha_2)

            alpha_1 = norm.cdf(z0 + (z0 + z_alpha_2) / (den1 if abs(den1) > 1e-12 else 1e-12))
            alpha_2 = norm.cdf(z0 + (z0 + z_1_alpha_2) / (den2 if abs(den2) > 1e-12 else 1e-12))

            # Clamp back mapping boundaries
            alpha_1 = np.clip(alpha_1, 1e-6, 1.0 - 1e-6)
            alpha_2 = np.clip(alpha_2, 1e-6, 1.0 - 1e-6)

            lower_bound = float(np.percentile(replicates, alpha_1 * 100))
            upper_bound = float(np.percentile(replicates, alpha_2 * 100))
        else:
            raise ValueError(f"Unknown bootstrap method: {method}. Choose 'percentile' or 'bca'.")
    except Exception:
        # Graceful fallback to simple percentile under mathematical failures
        lower_pct = 100 * (alpha / 2.0)
        upper_pct = 100 * (1.0 - alpha / 2.0)
        lower_bound = float(np.percentile(replicates, lower_pct))
        upper_bound = float(np.percentile(replicates, upper_pct))

    # Final sanity bounds verification
    if np.isnan(lower_bound) or np.isnan(upper_bound) or np.isinf(lower_bound) or np.isinf(upper_bound):
        return (point_est, point_est)

    return (lower_bound, upper_bound)

run_block_bootstrap_ci

run_block_bootstrap_ci(
    data_group: ndarray,
    block_size: int,
    num_resamples: int = 2000,
    confidence_level: float = 0.95,
    bootstrap_method: str = "moving",
    ci_method: str = "percentile",
    random_seed: Optional[int] = None,
) -> tuple

Computes non-parametric block bootstrap confidence intervals for dependent/autocorrelated data.

Block bootstrap methods (Moving Block and Circular Block) resample contiguous blocks of data to preserve the temporal dependence structure within the time-series or sequence.

Mathematical Formulation of Block Bootstrap

Let $\mathbf{x} = (x_1, x_2, \dots, x_n)$ be a time series of length $n$, and let $b$ be the block size.

Moving Block Bootstrap (MBB): We define $N = n - b + 1$ overlapping blocks of length $b$: $$ B_i = (x_i, x_{i+1}, \dots, x_{i+b-1}), \quad \text{for } i = 1, \dots, N $$ We resample $k = \lceil n/b \rceil$ blocks with replacement from $\{B_1, \dots, B_N\}$, concatenate them, and truncate the final sequence to length $n$.
Circular Block Bootstrap (CBB): We define $n$ overlapping blocks of length $b$ by wrapping the time series circularly: $$ B_i = (x_i, x_{i+1}, \dots, x_{i+b-1}), \quad \text{for } i = 1, \dots, n $$ where index wrap-around uses modulo arithmetic: $x_j = x_{((j - 1) \bmod n) + 1}$. We resample $k = \lceil n/b \rceil$ blocks with replacement from $\{B_1, \dots, B_n\}$, concatenate them, and truncate the final sequence to length $n$.

PARAMETER	DESCRIPTION
`data_group`	The raw 1D array of observed values. TYPE: `ndarray`
`block_size`	The size of contiguous blocks ($b$). TYPE: `int`
`num_resamples`	The number of bootstrap iterations ($B$). Defaults to 2000. TYPE: `int` DEFAULT: `2000`
`confidence_level`	The desired confidence interval width ($1 - \alpha$). Defaults to 0.95. TYPE: `float` DEFAULT: `0.95`
`bootstrap_method`	The block bootstrap method (`"moving"` or `"circular"`). Defaults to `"moving"`. TYPE: `str` DEFAULT: `'moving'`
`ci_method`	The confidence interval estimation method (`"percentile"` or `"bca"`). Defaults to `"percentile"`. TYPE: `str` DEFAULT: `'percentile'`
`random_seed`	Random seed for numpy generator reproducibility. Defaults to None. TYPE: `Optional[int]` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple`	A tuple of floats `(lower_bound, upper_bound)` representing the calculated interval. TYPE: `tuple`

Source code in src\xpyrment\analyze\inference\bootstrap.py

def run_block_bootstrap_ci(
    data_group: np.ndarray,
    block_size: int,
    num_resamples: int = 2000,
    confidence_level: float = 0.95,
    bootstrap_method: str = "moving",
    ci_method: str = "percentile",
    random_seed: Optional[int] = None,
) -> tuple:
    r"""Computes non-parametric block bootstrap confidence intervals for dependent/autocorrelated data.

    Block bootstrap methods (Moving Block and Circular Block) resample contiguous blocks of data to preserve
    the temporal dependence structure within the time-series or sequence.

    ??? mathbox "Mathematical Formulation of Block Bootstrap"
        Let $\mathbf{x} = (x_1, x_2, \dots, x_n)$ be a time series of length $n$, and let $b$ be the block size.

        1. **Moving Block Bootstrap (MBB)**:
           We define $N = n - b + 1$ overlapping blocks of length $b$:
           $$
           B_i = (x_i, x_{i+1}, \dots, x_{i+b-1}), \quad \text{for } i = 1, \dots, N
           $$
           We resample $k = \lceil n/b \rceil$ blocks with replacement from $\{B_1, \dots, B_N\}$, concatenate them,
           and truncate the final sequence to length $n$.

        2. **Circular Block Bootstrap (CBB)**:
           We define $n$ overlapping blocks of length $b$ by wrapping the time series circularly:
           $$
           B_i = (x_i, x_{i+1}, \dots, x_{i+b-1}), \quad \text{for } i = 1, \dots, n
           $$
           where index wrap-around uses modulo arithmetic: $x_j = x_{((j - 1) \bmod n) + 1}$. We resample $k = \lceil n/b \rceil$
           blocks with replacement from $\{B_1, \dots, B_n\}$, concatenate them, and truncate the final sequence to length $n$.

    Args:
        data_group (np.ndarray): The raw 1D array of observed values.
        block_size (int): The size of contiguous blocks ($b$).
        num_resamples (int): The number of bootstrap iterations ($B$). Defaults to 2000.
        confidence_level (float): The desired confidence interval width ($1 - \alpha$). Defaults to 0.95.
        bootstrap_method (str): The block bootstrap method (`"moving"` or `"circular"`). Defaults to `"moving"`.
        ci_method (str): The confidence interval estimation method (`"percentile"` or `"bca"`). Defaults to `"percentile"`.
        random_seed (Optional[int]): Random seed for numpy generator reproducibility. Defaults to None.

    Returns:
        tuple: A tuple of floats `(lower_bound, upper_bound)` representing the calculated interval.
    """
    n = len(data_group)
    if n == 0:
        raise ValueError("Cannot run block bootstrap on empty data group.")

    if block_size <= 0:
        raise ValueError(f"Block size must be a positive integer, got {block_size}.")

    if block_size > n:
        raise ValueError(f"Block size {block_size} cannot exceed data length {n}.")

    if bootstrap_method not in ("moving", "circular"):
        raise ValueError(f"Unknown bootstrap method: {bootstrap_method}. Choose 'moving' or 'circular'.")

    if ci_method not in ("percentile", "bca"):
        raise ValueError(f"Unknown confidence interval method: {ci_method}. Choose 'percentile' or 'bca'.")

    # Point estimate
    point_est = float(np.mean(data_group))

    # Guard against zero-variance or constant arrays
    if np.var(data_group) < 1e-12 or np.all(data_group == data_group[0]):
        return (point_est, point_est)

    rng = np.random.default_rng(random_seed)

    # Number of blocks to concatenate to reach length >= n
    k = int(np.ceil(n / block_size))

    # Determine the number of starting indices available
    if bootstrap_method == "moving":
        num_blocks = n - block_size + 1
    else:  # circular
        num_blocks = n

    # Smart chunk size calculation to prevent OOM
    max_elements = 10_000_000
    total_elements = num_resamples * n

    replicates = np.zeros(num_resamples)
    block_offsets = np.arange(block_size)

    # Chunked loop
    chunk_size = max(1, max_elements // n)
    for start_idx in range(0, num_resamples, chunk_size):
        end_idx = min(start_idx + chunk_size, num_resamples)
        current_chunk_size = end_idx - start_idx

        # Choose start indices of shape (current_chunk_size, k)
        start_indices = rng.choice(num_blocks, size=(current_chunk_size, k), replace=True)

        if bootstrap_method == "moving":
            # indices_2d: shape (current_chunk_size, k, block_size)
            indices_2d = start_indices[:, :, np.newaxis] + block_offsets[np.newaxis, np.newaxis, :]
            indices_flat = indices_2d.reshape(current_chunk_size, -1)
            indices_truncated = indices_flat[:, :n]
        else:  # circular
            indices_2d = (start_indices[:, :, np.newaxis] + block_offsets[np.newaxis, np.newaxis, :]) % n
            indices_flat = indices_2d.reshape(current_chunk_size, -1)
            indices_truncated = indices_flat[:, :n]

        # Compute statistic of interest (the mean of the bootstrap sample)
        replicates[start_idx:end_idx] = np.mean(data_group[indices_truncated], axis=1)

    # Guard against perfectly degenerate bootstrap distributions
    if np.var(replicates) < 1e-12 or np.all(replicates == replicates[0]):
        return (point_est, point_est)

    alpha = 1.0 - confidence_level

    try:
        if ci_method == "percentile":
            lower_pct = 100 * (alpha / 2.0)
            upper_pct = 100 * (1.0 - alpha / 2.0)
            lower_bound = float(np.percentile(replicates, lower_pct))
            upper_bound = float(np.percentile(replicates, upper_pct))
        elif ci_method == "bca":
            # 1. Bias correction z0
            prop = np.mean(replicates < point_est)
            # Clip proportion to prevent norm.ppf boundary infinities
            prop = np.clip(prop, 1e-6, 1.0 - 1e-6)
            z0 = norm.ppf(prop)

            # 2. Acceleration parameter a using jackknife means
            if n > 1:
                sum_y = np.sum(data_group)
                jack_estimates = (sum_y - data_group) / (n - 1)
                mean_jack = np.mean(jack_estimates)
                diff = mean_jack - jack_estimates
                num = np.sum(diff ** 3)
                den = 6.0 * (np.sum(diff ** 2) ** 1.5)
                a = num / den if den > 1e-12 else 0.0
            else:
                a = 0.0

            # 3. Corrected percentiles
            z_alpha_2 = norm.ppf(alpha / 2.0)
            z_1_alpha_2 = norm.ppf(1.0 - alpha / 2.0)

            den1 = 1.0 - a * (z0 + z_alpha_2)
            den2 = 1.0 - a * (z0 + z_1_alpha_2)

            alpha_1 = norm.cdf(z0 + (z0 + z_alpha_2) / (den1 if abs(den1) > 1e-12 else 1e-12))
            alpha_2 = norm.cdf(z0 + (z0 + z_1_alpha_2) / (den2 if abs(den2) > 1e-12 else 1e-12))

            # Clamp back mapping boundaries
            alpha_1 = np.clip(alpha_1, 1e-6, 1.0 - 1e-6)
            alpha_2 = np.clip(alpha_2, 1e-6, 1.0 - 1e-6)

            lower_bound = float(np.percentile(replicates, alpha_1 * 100))
            upper_bound = float(np.percentile(replicates, alpha_2 * 100))
    except Exception:
        # Graceful fallback to simple percentile under mathematical failures
        lower_pct = 100 * (alpha / 2.0)
        upper_pct = 100 * (1.0 - alpha / 2.0)
        lower_bound = float(np.percentile(replicates, lower_pct))
        upper_bound = float(np.percentile(replicates, upper_pct))

    # Final sanity bounds verification
    if np.isnan(lower_bound) or np.isnan(upper_bound) or np.isinf(lower_bound) or np.isinf(upper_bound):
        return (point_est, point_est)

    return (lower_bound, upper_bound)

run_welch_t_test

run_welch_t_test(
    group_a: ndarray, group_b: ndarray
) -> dict

Performs Welch's t-test for difference of means with unequal variances.

Welch's t-test is a two-sample location test used to test the hypothesis that two populations have equal means ($H_0: \\mu_A = \\mu_B$). Unlike Student's t-test, Welch's t-test does not assume equal variances, making it the standard default for digital and scientific A/B testing.

PARAMETER	DESCRIPTION
`group_a`	Array of numeric outcomes for control (Group A). TYPE: `ndarray`
`group_b`	Array of numeric outcomes for treatment (Group B). TYPE: `ndarray`

RETURNS	DESCRIPTION
`dict`	A dictionary containing: - `"t_statistic"` (float): The calculated Welch's t-value. - `"p_value"` (float): The two-sided p-value. - `"df"` (float): The approximated Satterthwaite degrees of freedom. - `"difference"` (float): Absolute difference between means ($\\bar{X}_B - \\bar{X}_A$). TYPE: `dict`

Source code in src\xpyrment\analyze\inference\frequentist.py

def run_welch_t_test(group_a: np.ndarray, group_b: np.ndarray) -> dict:
    r"""Performs Welch's t-test for difference of means with unequal variances.

    Welch's t-test is a two-sample location test used to test the hypothesis that two populations have equal means
    ($H_0: \\mu_A = \\mu_B$). Unlike Student's t-test, Welch's t-test does not assume equal variances,
    making it the standard default for digital and scientific A/B testing.

    Args:
        group_a (np.ndarray): Array of numeric outcomes for control (Group A).
        group_b (np.ndarray): Array of numeric outcomes for treatment (Group B).

    Returns:
        dict: A dictionary containing:
            - `"t_statistic"` (float): The calculated Welch's t-value.
            - `"p_value"` (float): The two-sided p-value.
            - `"df"` (float): The approximated Satterthwaite degrees of freedom.
            - `"difference"` (float): Absolute difference between means ($\\bar{X}_B - \\bar{X}_A$).
    """
    from scipy import stats

    val_a = group_a[~np.isnan(group_a)]
    val_b = group_b[~np.isnan(group_b)]

    n_a = len(val_a)
    n_b = len(val_b)

    if n_a < 2 or n_b < 2:
        return {
            "t_statistic": 0.0,
            "p_value": 1.0,
            "df": float(n_a + n_b - 2),
            "difference": 0.0
        }

    mean_a = np.mean(val_a)
    mean_b = np.mean(val_b)
    var_a = np.var(val_a, ddof=1)
    var_b = np.var(val_b, ddof=1)

    se_diff = np.sqrt(var_a / n_a + var_b / n_b)
    diff = mean_b - mean_a

    if se_diff > 0.0:
        t_stat = diff / se_diff
        num = (var_a / n_a + var_b / n_b) ** 2
        den = ((var_a / n_a) ** 2) / (n_a - 1) + ((var_b / n_b) ** 2) / (n_b - 1)
        df = num / den if den > 0 else (n_a + n_b - 2)
        p_val = 2 * (1.0 - stats.t.cdf(np.abs(t_stat), df=df))
    else:
        t_stat = 0.0
        p_val = 1.0
        df = float(n_a + n_b - 2)

    return {
        "t_statistic": float(t_stat),
        "p_value": float(p_val),
        "df": float(df),
        "difference": float(diff)
    }

run_mann_whitney_u

run_mann_whitney_u(
    group_a: ndarray, group_b: ndarray
) -> dict

Performs nonparametric Mann-Whitney U test for ordinal or non-normal continuous data.

The Mann-Whitney U test evaluates the null hypothesis that the probability that a randomly drawn observation from Group B is larger than a randomly drawn observation from Group A is equal to 0.5. This test is non-parametric; it does not assume normality, making it extremely robust against extreme outliers.

PARAMETER	DESCRIPTION
`group_a`	Array of numeric outcomes for control (Group A). TYPE: `ndarray`
`group_b`	Array of numeric outcomes for treatment (Group B). TYPE: `ndarray`

RETURNS	DESCRIPTION
`dict`	A dictionary containing: - `"u_statistic"` (float): The calculated Mann-Whitney U-value. - `"p_value"` (float): The two-sided asymptotic p-value. TYPE: `dict`

Source code in src\xpyrment\analyze\inference\frequentist.py

def run_mann_whitney_u(group_a: np.ndarray, group_b: np.ndarray) -> dict:
    """Performs nonparametric Mann-Whitney U test for ordinal or non-normal continuous data.

    The Mann-Whitney U test evaluates the null hypothesis that the probability that a randomly drawn
    observation from Group B is larger than a randomly drawn observation from Group A is equal to 0.5.
    This test is non-parametric; it does not assume normality, making it extremely robust against extreme outliers.

    Args:
        group_a (np.ndarray): Array of numeric outcomes for control (Group A).
        group_b (np.ndarray): Array of numeric outcomes for treatment (Group B).

    Returns:
        dict: A dictionary containing:
            - `"u_statistic"` (float): The calculated Mann-Whitney U-value.
            - `"p_value"` (float): The two-sided asymptotic p-value.
    """
    from scipy import stats

    val_a = group_a[~np.isnan(group_a)]
    val_b = group_b[~np.isnan(group_b)]

    if len(val_a) == 0 or len(val_b) == 0:
        return {
            "u_statistic": 0.0,
            "p_value": 1.0
        }

    res = stats.mannwhitneyu(val_b, val_a, alternative="two-sided")

    return {
        "u_statistic": float(res.statistic),
        "p_value": float(res.pvalue)
    }

route_inference_engine

route_inference_engine(
    metric: BaseMetric, design_type: str
) -> str

Routes to the correct statistical engine based on metric and design types.

Selects the mathematically correct statistical test and variance estimator depending on the structural type of the metric (continuous mean, ratio of random variables, binary rate) and the user's desired inferential paradigm (Frequentist, Bayesian, Sequential, or Bootstrap).

Inference Routing Matrix

Pseudocode for Routing Logic

function route_inference_engine(metric, paradigm):
    1. Extract metric_class = metric.__class__.__name__ (MeanMetric, RatioMetric, etc.)
    2. Match paradigm:
         - "frequentist":
             If MeanMetric -> return "frequentist_welch_t_test"
             If RatioMetric -> return "frequentist_ratio_delta_test"
         - "bayesian":
             If BinaryMetric -> return "bayesian_beta_binomial"
             If MeanMetric -> return "bayesian_normal_normal"
         - "sequential":
             return "sequential_msprt"
         - "bootstrap":
             return "bootstrap_resampling"
    3. Raise error if the combination is mathematically invalid.

PARAMETER	DESCRIPTION
`metric`	The target metric object (MeanMetric, RatioMetric, etc.) under analysis. TYPE: `BaseMetric`
`design_type`	The desired statistical framework. Options: `"frequentist"`, `"bayesian"`, `"sequential"`, `"bootstrap"`. TYPE: `str`

RETURNS	DESCRIPTION
`str`	Engine label string identifying the specific low-level calculation function. TYPE: `str`

Source code in src\xpyrment\analyze\inference\router.py

def route_inference_engine(metric: BaseMetric, design_type: str) -> str:
    """Routes to the correct statistical engine based on metric and design types.

    Selects the mathematically correct statistical test and variance estimator depending on the structural type
    of the metric (continuous mean, ratio of random variables, binary rate) and the user's desired inferential paradigm
    (Frequentist, Bayesian, Sequential, or Bootstrap).

    Inference Routing Matrix:
        +------------------+-----------------------------+-----------------------------+-------------------------------+
        | Metric Type      | Frequentist                 | Bayesian                    | Sequential (mSPRT)            |
        +------------------+-----------------------------+-----------------------------+-------------------------------+
        | **MeanMetric**   | Welch's t-test (CLT)        | Normal-Normal Conjugate /   | Continuous Normal Likelihood  |
        |                  |                             | Gibbs Sampler               | Ratio Martingale              |
        +------------------+-----------------------------+-----------------------------+-------------------------------+
        | **RatioMetric**  | Delta Method t-test         | Normal-Normal Delta /       | Ratio-Delta Likelihood        |
        |                  | (Taylor approximation)      | MCMC Sampler                | Ratio Martingale              |
        +------------------+-----------------------------+-----------------------------+-------------------------------+
        | **BinaryMetric** | Pearson Z-test (Wald CI)    | Beta-Binomial Conjugate     | Bernoulli Likelihood          |
        |                  | / Fisher's Exact Test       | (Exact posterior)           | Ratio Martingale              |
        +------------------+-----------------------------+-----------------------------+-------------------------------+

    Pseudocode for Routing Logic:
        ```text
        function route_inference_engine(metric, paradigm):
            1. Extract metric_class = metric.__class__.__name__ (MeanMetric, RatioMetric, etc.)
            2. Match paradigm:
                 - "frequentist":
                     If MeanMetric -> return "frequentist_welch_t_test"
                     If RatioMetric -> return "frequentist_ratio_delta_test"
                 - "bayesian":
                     If BinaryMetric -> return "bayesian_beta_binomial"
                     If MeanMetric -> return "bayesian_normal_normal"
                 - "sequential":
                     return "sequential_msprt"
                 - "bootstrap":
                     return "bootstrap_resampling"
            3. Raise error if the combination is mathematically invalid.
        ```

    Args:
        metric (BaseMetric): The target metric object (MeanMetric, RatioMetric, etc.) under analysis.
        design_type (str): The desired statistical framework. Options: `"frequentist"`, `"bayesian"`,
            `"sequential"`, `"bootstrap"`.

    Returns:
        str: Engine label string identifying the specific low-level calculation function.
    """
    # TODO: Implement full intelligent router
    return "frequentist_t_test"