Uncertainty Quantification¶

Calibrated prediction intervals, quantile forecasts, and probabilistic evaluation — alternatives to (or augmentations of) our heuristic confidence score at processing.py:1041.

title: ETS/ARIMA prediction intervals tags: [uncertainty, prediction-intervals, parametric] applies_to: [tier_2, tier_3] data_needs: "Same as point forecast (≥24mo preferred for ETS/ARIMA); residual normality for analytic intervals to be calibrated" status: candidate

ETS/ARIMA prediction intervals¶

Source: Hyndman & Athanasopoulos, Forecasting: Principles and Practice (3rd ed.), ch.8 (ETS) and ch.9 (ARIMA) Link: https://otexts.com/fpp3/ Retrieved: 2026-05-15

What it is: Both ETS and ARIMA are state-space models whose forecast variance is derived analytically from the fitted parameters and innovation variance. For a horizon h, the variance grows in a model-specific way (e.g., for ARIMA(0,1,0), variance grows linearly with h; for ETS with multiplicative errors, intervals widen and skew). Under a normality assumption, analytic ±1.28σ / ±1.96σ bands give 80% / 95% intervals.

When to use: - You already have an ETS or ARIMA fit (we do — AutoCES, HW add/mul) and want "free" intervals from the same model. - Residuals are roughly stationary and approximately normal at the relevant scale. - You need a fast, deterministic interval at no extra compute cost.

Fit for our model: - ✅ Drop-in for our existing StatsForecast ensemble (processing.py:1984) — forecast(..., level=[80, 95]) returns lo/hi columns alongside the point forecast. - ⚠ Multi-model selection by SMAPE (processing.py:2016) picks the best point-forecast model; its intervals may be miscalibrated if residuals are non-normal (heavy tails are common for keyword volume). - ⚠ Doesn't apply to our Theta / heuristic fallbacks for short-history tiers. - 🔧 statsforecast.StatsForecast.forecast(level=[80, 95]) — already supported by AutoCES, HoltWinters, AutoARIMA. Pair with calibration plots to verify empirical coverage.

title: Bootstrap prediction intervals tags: [uncertainty, prediction-intervals, nonparametric, resampling] applies_to: [tier_2, tier_3] data_needs: "Fitted point-forecast model + ≥50 in-sample residuals to resample; assumes residuals are i.i.d. (or at least exchangeable)" status: candidate

Bootstrap prediction intervals¶

Source: Hyndman & Athanasopoulos, Forecasting: Principles and Practice (3rd ed.), §5.5 Link: https://otexts.com/fpp3/prediction-intervals.html Retrieved: 2026-05-15

What it is: Instead of assuming Gaussian errors, simulate many future paths by repeatedly sampling residuals (with replacement) from the in-sample fit and rolling them forward through the model's recursion. The empirical quantiles of the simulated paths (e.g., 10th and 90th percentiles) form a nonparametric prediction interval. Captures asymmetry and heavy tails the analytic intervals miss.

When to use: - Residuals are clearly non-normal (skewed, heavy-tailed, heteroskedastic at moderate degree). - You want a model-agnostic interval that wraps around an existing point forecaster (Theta, SES, custom). - You can afford 100-1000 simulations per series at forecast time.

Fit for our model: - ✅ Wraps any of our ensemble members (processing.py:1984) — including the Theta-style heuristics where analytic intervals don't exist. - ✅ Naturally handles the skewed-positive nature of keyword volumes (no negative-forecast clipping needed if residuals respect non-negativity). - ⚠ 1000 simulations × 200 shards × millions of keywords is a real CPU cost; may need to subsample or restrict to high-tier keywords. - 🔧 statsforecast.StatsForecast.forecast(..., level=[80,95], fitted=True) with bootstrapping support; or roll your own residual resampling around the chosen ensemble winner from processing.py:2016.

title: Conformal prediction tags: [uncertainty, distribution-free, calibration, prediction-intervals] applies_to: [tier_1, tier_2, tier_3] data_needs: "A held-out calibration set (e.g., last 6-12 months per keyword, or pooled across keywords); a base point forecaster" status: candidate

Conformal prediction¶

Source: Shafer & Vovk 2008, "A Tutorial on Conformal Prediction"; Stankevičiūtė, Alaa & van der Schaar 2021, "Conformal Time-Series Forecasting" Link: https://arxiv.org/abs/0706.3188 (Shafer & Vovk); https://arxiv.org/abs/2108.04618 (Stankevičiūtė et al.) Retrieved: 2026-05-15

What it is: A distribution-free method that turns any black-box point forecaster into a prediction interval with a finite-sample coverage guarantee. The idea: on a calibration set, measure the empirical nonconformity scores (e.g., |y − ŷ|), take the (1 − α)-quantile, and use it to widen the point forecast. Under exchangeability, the interval covers the truth with probability ≥ 1 − α — no parametric assumption needed. Time-series variants (Stankevičiūtė 2021, Xu & Xie 2023 EnbPI) address the non-exchangeability of forecasts.

When to use: - You don't trust the parametric residual distribution of your forecaster (skewed, heavy-tailed, heteroskedastic by tier). - You have many keywords with comparable error structure → pool calibration data for tight intervals. - You want a single, defensible coverage guarantee to publish alongside the volume estimate.

Fit for our model: - ✅ Pools naturally across keywords within a tier — directly addresses the "single calibrated number" gap in our heuristic confidence score (processing.py:1041). - ✅ Works on top of whatever model the SMAPE selection picks (processing.py:2016), including Theta / heuristic tiers. - ⚠ Coverage guarantee requires exchangeability; with drift (GT collapse, level shifts) the guarantee weakens. EnbPI / adaptive conformal (Gibbs & Candès 2021) handles this. - 🔧 mlforecast and statsforecast from Nixtla both expose conformal intervals via prediction_intervals=PredictionIntervals(...). See https://nixtlaverse.nixtla.io/.

title: Conformalized quantile regression (CQR) tags: [uncertainty, conformal, quantile-regression, calibration] applies_to: [tier_2, tier_3] data_needs: "Calibration set + a quantile regression model (e.g., GBM with quantile loss); enough data per series or pooled features" status: candidate

Conformalized quantile regression (CQR)¶

Source: Romano, Patterson & Candès 2019, "Conformalized Quantile Regression" Link: https://arxiv.org/abs/1905.03222 Retrieved: 2026-05-15

What it is: Hybrid of quantile regression and conformal prediction. First fit lower and upper quantile regressors (say τ = 0.05, 0.95) on the training set. Then, on a held-out calibration set, conformalize the gap: shift the interval bounds outward by the empirical quantile of the conformity score (max of how far the truth lies above the upper or below the lower predicted quantile). This corrects miscalibration of the quantile regressor while inheriting its ability to produce heteroskedastic (varying-width) intervals.

When to use: - Interval width should adapt to the keyword's regime (volatile vs stable) rather than be uniformly widened. - You have engineered features (lagged volumes, source counts, tier, seasonality strength) per keyword to drive a regression. - You care about the coverage guarantee, but pure conformal gives intervals that are too wide for stable series.

Fit for our model: - ✅ Heteroskedasticity matters for us: a high-volume stable keyword shouldn't share interval width with a sparse, bursty one (processing.py:1180). - ⚠ Needs feature engineering and a regression model — heavier than the analytic ETS intervals we'd get for free. - ⚠ Implementation lift: not built into StatsForecast; needs lightgbm/xgboost with quantile loss + custom calibration step. - 🔧 mapie (https://mapie.readthedocs.io/) implements CQR with MapieQuantileRegressor. Pair with lightgbm quantile objective.

title: Quantile regression forecasts (MQRNN, etc.) tags: [uncertainty, quantile-forecasts, deep-learning] applies_to: [tier_3] data_needs: "Long histories preferred (>=36mo); large training corpus across many series; GPU for training (CPU inference OK)" status: candidate

Quantile regression forecasts (MQRNN, etc.)¶

Source: Wen, Torkkola, Narayanaswamy & Madeka 2017, "A Multi-Horizon Quantile Recurrent Forecaster" (MQRNN) Link: https://arxiv.org/abs/1711.11053 Retrieved: 2026-05-15

What it is: Train a single neural network whose output layer emits a vector of quantiles (e.g., 0.1, 0.5, 0.9) at every horizon, optimized under pinball loss. The model directly produces the full predictive distribution without parametric assumption. MQRNN uses an RNN encoder + MLP decoder; later variants (TFT, N-BEATS-Q) keep the same idea.

When to use: - You have many similar series and want one global model to share statistical strength. - You need a full quantile picture (e.g., P10/P50/P90 each month) rather than just an interval. - You want a single neural net to handle both point and probabilistic forecasts.

Fit for our model: - ✅ Global model would let us share signal across keywords in a tier — useful for tier_1/tier_2 with short history. - ⚠ Major infra shift: GPU training, model versioning, batched inference per shard. Our current pipeline is CPU-bound statistical per shard. - ⚠ M4/M5 results: deep quantile models beat statistical baselines only with careful tuning and large datasets; gains over conformalized statsforecast are modest. - 🔧 gluonts.torch.model.mqf2 (MultiQuantileForecaster), darts.models.RNNModel(likelihood=QuantileRegression(...)), pytorch-forecasting.DeepAR/TFT with QuantileLoss. See TFT and DeepAR.

title: Probabilistic evaluation metrics (CRPS, pinball loss) tags: [uncertainty, metrics, calibration, scoring-rules] applies_to: [tier_1, tier_2, tier_3] data_needs: "A probabilistic forecast (samples or quantiles) and held-out actuals" status: candidate

Probabilistic evaluation metrics (CRPS, pinball loss)¶

Source: Gneiting & Raftery 2007, "Strictly Proper Scoring Rules, Prediction, and Estimation" Link: https://www.tandfonline.com/doi/abs/10.1198/016214506000001437 (paper); https://ts.gluon.ai/stable/api/gluonts/gluonts.evaluation.html (GluonTS metrics) Retrieved: 2026-05-15

What it is: Strictly proper scoring rules reward both calibration and sharpness. Pinball loss at quantile τ is max(τ(y − ŷτ), (τ − 1)(y − ŷτ)); averaging over τ ∈ (0,1) gives CRPS (Continuous Ranked Probability Score), which generalizes MAE to probabilistic forecasts and reduces to MAE for a point forecast. CRPS lets you compare a point-forecast model against a probabilistic one on a common scale.

When to use: - You're moving from point-only forecasts to probabilistic ones and need a single number to track in backtests. - You want to compare a conformal interval against ETS analytic intervals on the same series. - You want a metric that penalizes overconfident and under-confident intervals (SMAPE does neither).

Fit for our model: - ✅ Replacing/augmenting CV-normalized SMAPE (processing.py:1104) with CRPS would directly reward calibration in the forecast-reliance subscore of compute_confidence_score() (processing.py:1041). - ✅ Pinball at fixed τ is easy to compute and can be a regularizer if we move to quantile regression. - ⚠ Requires a probabilistic forecast to compute; if we keep a point forecast, CRPS reduces to MAE. - 🔧 gluonts.evaluation.MultivariateEvaluator (with quantiles=[0.1,...,0.9]), properscoring.crps_ensemble, sklearn.metrics.mean_pinball_loss.

title: Bayesian posterior predictive intervals tags: [uncertainty, bayesian, prior, posterior-predictive] applies_to: [tier_1, tier_2, tier_3] data_needs: "A specified Bayesian model (priors over level/trend/seasonality); MCMC or VI sampler; longer compute time" status: candidate

Bayesian posterior predictive intervals¶

Source: Gelman et al. Bayesian Data Analysis (3rd ed.); Salvatier, Wiecki & Fonnesbeck 2016 (PyMC); Phan, Pradhan & Jankowiak 2019 (NumPyro) Link: https://www.pymc.io/ ; https://num.pyro.ai/ ; https://orbit-ml.readthedocs.io/ (Uber Orbit) Retrieved: 2026-05-15

What it is: Specify a generative model (e.g., level + trend + seasonality + noise) with priors on each component, then sample the posterior over parameters via MCMC (NUTS) or VI. The posterior predictive distribution integrates over parameter uncertainty and future innovation noise to produce honest forecast quantiles. Naturally accommodates partial pooling (one prior across keyword groups), missing data, and structural shifts.

When to use: - You want to share information across similar keywords via hierarchical priors (Tier 1 short-history keywords benefit most). - The forecast feeds a downstream decision where calibration matters more than speed. - You can articulate informative priors (e.g., "trend rarely flips sign for branded queries").

Fit for our model: - ✅ Tightly cross-links to BSTS — same posterior-predictive machinery. - ✅ Best fit for high-value Tier 1 keywords where the heuristic at processing.py:1247 is weakest. - ⚠ MCMC is 10-100x slower than StatsForecast; would require selective application (top-N keywords or a daily job). - 🔧 pymc for full MCMC; numpyro for fast NUTS on JAX; orbit-ml for production-grade BSTS-style models. Use pm.sample_posterior_predictive (PyMC) or Predictive(...)(rng, *args) (NumPyro).

title: Calibration plots and reliability diagrams tags: [uncertainty, diagnostics, calibration, evaluation] applies_to: [tier_1, tier_2, tier_3] data_needs: "Held-out forecasts with interval bounds (or quantiles) and matching actuals; ideally many keywords for binning" status: candidate

Calibration plots and reliability diagrams¶

Source: Gneiting, Balabdaoui & Raftery 2007, "Probabilistic Forecasts, Calibration and Sharpness" Link: https://doi.org/10.1111/j.1467-9868.2007.00587.x Retrieved: 2026-05-15

What it is: A reliability diagram plots nominal coverage (e.g., 80%) against empirical coverage (fraction of actuals that fell inside the 80% interval) for many nominal levels. A perfectly calibrated forecaster lies on the diagonal. For quantile forecasts you can plot τ vs. empirical CDF P(y ≤ ŷτ). Combine with a sharpness measure (mean interval width) — a forecaster can be perfectly calibrated but useless if intervals are 0-to-infinity.

When to use: - After adopting any of ETS intervals / bootstrap / conformal / quantile methods — to verify they actually deliver the advertised coverage. - Monitoring: re-run weekly on a backtest window to detect drift in calibration over time. - Per-tier diagnostics: calibration may be good for Tier 3 but broken for Tier 1.

Fit for our model: - ✅ A reliability diagram per tier would directly let us audit a calibrated replacement for compute_confidence_score() at processing.py:1041. - ✅ Cheap to compute — just a binning + counting exercise on backtest output. - ⚠ Needs a meaningful backtest design; our current backtest is per-keyword SMAPE (processing.py:2016), not per-keyword-with-interval. - 🔧 properscoring.crps_quadrature, uncertainty_toolbox (https://github.com/uncertainty-toolbox/uncertainty-toolbox) for plot_calibration and mean_absolute_calibration_error. Combine with CRPS for a sharpness-aware summary.