Short-history methods

Forecasting techniques that work with <24 months of data — relevant to Tier 1 (<6mo) and Tier 2 (6-23mo) keywords that currently fall through to MoM/rolling-window heuristics in calculate_hybrid_growth().

title: Theta method
tags: [short-history, decomposition, baseline]
applies_to: [tier_2, tier_3]
data_needs: "≥4-8 observations; works on monthly with 12-24 points; seasonality handled via classical decomposition preprocessing"
status: candidate

Theta method

Source: Assimakopoulos & Nikolopoulos (2000), International Journal of Forecasting 16(4):521-530; winner of the M3 competition
Link: https://www.sciencedirect.com/science/article/abs/pii/S0169207000000662
Retrieved: 2026-05-15

What it is: A univariate forecaster that decomposes a series into two "theta lines": one with curvature removed (θ=0, the fitted linear trend) and one with curvature doubled (θ=2, which exaggerates short-term behavior). Each line is forecast separately (linear extrapolation for θ=0, simple exponential smoothing for θ=2) and the two forecasts are averaged. Hyndman & Billah (2003) showed this is equivalent to SES with a drift equal to half the slope of the fitted trend, but the decomposition framing makes it robust on short series, and it was the surprise winner of the M3 competition.
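The two-line construction above can be sketched in plain Python. This is a minimal illustration, not the statsforecast implementation: the function names are hypothetical, the smoothing parameter `alpha` is fixed here rather than optimized, and no seasonal adjustment is applied.

```python
from statistics import fmean


def ses_level(y, alpha):
    """Simple exponential smoothing; returns the final smoothed level."""
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level


def theta_forecast(y, horizon, alpha=0.5):
    """Classic two-line Theta forecast (hypothetical helper, alpha is an assumption)."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least 2 observations to fit a trend")
    t = list(range(n))
    t_bar, y_bar = fmean(t), fmean(y)
    # Ordinary least-squares line through (t, y): this is the theta=0 line.
    slope = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) / sum(
        (ti - t_bar) ** 2 for ti in t
    )
    intercept = y_bar - slope * t_bar
    # theta=2 line: doubles the deviation of y from the linear trend.
    theta2 = [2 * yi - (intercept + slope * ti) for ti, yi in zip(t, y)]
    level2 = ses_level(theta2, alpha)  # SES forecast is flat at the last level
    # Average the extrapolated trend with the flat SES forecast.
    return [
        0.5 * ((intercept + slope * (n - 1 + h)) + level2)
        for h in range(1, horizon + 1)
    ]
```

Because the SES forecast is flat and the trend line is weighted by one half, successive forecast steps differ by exactly half the fitted slope, which is the "SES with drift" reading of the method.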

When to use:
- Short monthly series where AutoCES / Holt-Winters either fail or overfit (10-24 observations).
- As a strong, cheap baseline before reaching for ML.
- When you want a single fast call with sensible defaults (no hyperparameter tuning).

Fit for our model:
- ✅ Drop-in upgrade for Tier 2 (6-23mo) in calculate_hybrid_growth() (processing.py:1247): replaces the rolling-window heuristic with a principled forecaster that still works at 12 observations.
- ✅ Already exposed by our existing stack: statsforecast.models.Theta, OptimizedTheta, AutoTheta. Could be added as a third (or fourth) member to the ensemble at processing.py:1984 with no new dependency.
- ⚠ The Theta method itself has no seasonal component; Halloween/Super-Bowl style peaks have to be handled by classical-decomposition preprocessing. The statsforecast models can do this internally via their season_length and decomposition_type arguments, but only when there is enough history to estimate a seasonal profile, so the seasonal alt-gate at processing.py:2054 would still be needed for narrow seasonals.
- ⚠ The _is_spiky_series heuristic (processing.py:1180) would still need to gate Theta on event-driven series; SES underneath is not designed for zeros.
- 🔧 from statsforecast.models import AutoTheta. It is already a statsforecast model and slots into the existing StatsForecast(models=[...]) ensemble.

title: Bayesian structural time series (BSTS)
tags: [short-history, bayesian, state-space, decomposition]
applies_to: [tier_2, tier_3]
data_needs: "≥12 observations preferred; handles short series via priors; supports regressors (e.g., GT signal, holidays)"
status: candidate

Bayesian structural time series (BSTS)

Source: Brodersen, Gallusser, Koehler, Remy, Scott (2015), "Inferring causal impact using Bayesian structural time-series models," Annals of Applied Statistics 9(1):247-274
Link: https://projecteuclid.org/journals/annals-of-applied-statistics/volume-9/issue-1/Inferring-causal-impact-using-Bayesian-structural-time-series-models/10.1214/14-AOAS788.full
Retrieved: 2026-05-15

What it is: Decomposes a series into latent state components — local level / local linear trend, seasonal cycle(s), regression on covariates — fitted via Bayesian inference (typically Kalman filter + Gibbs sampling or HMC). The Bayesian framing means priors regularize each component, so short series do not blow up the trend estimate, and posterior credible intervals come out of the box. Underpins Google's CausalImpact R package.
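To make the state-space idea concrete, here is a sketch of the simplest structural component, a local level, run through a Kalman filter. It covers only the filtering step: the variances are assumed known and the prior on the initial level is supplied by the caller, whereas a full BSTS fit would place priors on the variances and sample them. All function and parameter names are illustrative.

```python
import math


def local_level_filter(y, obs_var, level_var, prior_mean, prior_var):
    """Kalman filter for y_t = mu_t + eps_t, mu_t = mu_{t-1} + eta_t.

    Returns the filtered mean and variance of the level after the last point.
    obs_var and level_var are treated as known (an assumption of this sketch).
    """
    mean, var = prior_mean, prior_var
    for obs in y:
        var = var + level_var            # predict: the level follows a random walk
        gain = var / (var + obs_var)     # Kalman gain: how much to trust this point
        mean = mean + gain * (obs - mean)
        var = (1 - gain) * var
    return mean, var


def forecast_interval(mean, var, obs_var, level_var, h, z=1.96):
    """h-step-ahead forecast with a normal interval (z=1.96 gives about 95%)."""
    pred_var = var + h * level_var + obs_var
    half_width = z * math.sqrt(pred_var)
    return mean - half_width, mean, mean + half_width
```

With a tight `prior_var`, a single outlying month moves the level only part of the way toward the observation, which is the regularizing effect on short series described above; the interval widens with the horizon because level uncertainty accumulates.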

When to use:
- Short series where you also have covariates (e.g., a related keyword's GT series, a category aggregate, holiday indicators); BSTS can borrow strength from them.
- When you need calibrated uncertainty intervals (downstream P7 problem) alongside the point forecast.
- When you suspect a level shift / change point and want to fit it jointly with seasonality.

Fit for our model:
- ✅ Natively delivers credible intervals, which directly addresses P7 (confidence-as-point-estimate) referenced at processing.py:1041.
- ✅ Can ingest GT as a regressor for keywords with weak GSC/JS; could partially replace the max(JS, GSC) blend at processing.py:1763 with a principled fusion.
- ⚠ MCMC is slow, too slow to run per-keyword across shards; would need Variational Bayes (e.g., pymc.fit(method="advi")) or a pre-trained shared model.
- ⚠ Implementation is a step-change in complexity vs. our current statsforecast stack.
- 🔧 Python: statsmodels.tsa.statespace.structural.UnobservedComponents (fast, but fitted by MLE, so it has no priors and its intervals are frequentist rather than posterior), pybsts, orbit-ml (LGT/DLT models), or pymc for full Bayesian. Start with UnobservedComponents for a fast pilot: same SciPy stack as the rest of the pipeline.

title: Empirical Bayes priors (from similar keywords)
tags: [short-history, bayesian, pooling, priors]
applies_to: [tier_1, tier_2]
data_needs: "the keyword itself can have ≥1 observation; requires a large reference set of similar keywords with complete history"
status: candidate

Empirical Bayes priors (from similar keywords)

Source: Efron (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge University Press; classical James-Stein shrinkage
Link: https://efron.ckirby.su.domains/other/2010LSIexcerpt.pdf
Retrieved: 2026-05-15

What it is: Pool a prior distribution (mean, variance, seasonal amplitude, growth rate) from a large reference set of similar keywords, then use that prior to regularize the estimate for a new short-history keyword. Empirical Bayes estimates the prior's hyperparameters from the pooled data itself (rather than setting them subjectively), so a new keyword's forecast is shrunk toward the cohort average, and the weight on the keyword's own data grows as more keyword-specific evidence accrues.
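A minimal sketch of the Normal-Normal case, applied to a monthly growth rate. It assumes the observation noise variance is known and shared across keywords, and it estimates the cohort prior by method of moments; the function names and the example numbers are hypothetical. Plain Python is used here, and the same arithmetic vectorizes directly.

```python
from statistics import fmean, variance


def fit_cohort_prior(cohort_means, sampling_var):
    """Estimate the prior (mu, tau^2) from reference keywords' mean growth rates.

    The observed spread of the means equals tau^2 plus their sampling variance,
    so tau^2 is the difference, floored at zero (method-of-moments assumption).
    """
    mu = fmean(cohort_means)
    tau2 = max(variance(cohort_means) - sampling_var, 0.0)
    return mu, tau2


def shrink(observations, noise_var, mu, tau2):
    """Posterior mean of a keyword's growth rate under a Normal-Normal model."""
    n = len(observations)
    if n == 0 or tau2 == 0.0:
        return mu  # no own evidence, or a degenerate prior: fall back to the cohort
    # Weight on the keyword's own mean; it rises toward 1 as n grows.
    w = tau2 / (tau2 + noise_var / n)
    return w * fmean(observations) + (1 - w) * mu
```

For example, with a cohort prior of mu=0.02 and tau2=0.01 and noise variance 0.04, one month at 10% growth yields an estimate of 0.036, while five such months pull the estimate most of the way toward the keyword's own 10%.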

When to use:
- Tier 1 keywords with 1-5 months of data, where any pure-univariate forecaster will produce noise.
- When you can group keywords into reasonably homogeneous cohorts, e.g., (country, SERP feature, category), (country, brand-vs-generic), or via a k-NN on existing features.
- As a fast, non-MCMC alternative to a full hierarchical model.

Fit for our model:
- ✅ Directly addresses Tier 1 (<6mo) in calculate_hybrid_growth() (processing.py:1247): instead of an MoM heuristic, the keyword inherits a prior growth rate from its cohort and updates it Bayes-style as months accrue.
- ✅ Cheap to compute: pre-aggregate cohort priors once per shard run; the per-keyword update is closed-form Normal-Normal or Gamma-Poisson.
- ⚠ Cohort construction is the hard part, since bad cohorts produce biased priors. May need to reuse similarity signals we already compute (keyword embeddings, SERP overlap, category).
- ⚠ Empirical Bayes treats the estimated prior as known and so ignores cohort-level uncertainty; the hierarchical version is more correct but heavier.
- 🔧 Closed-form Normal-Normal shrinkage in pure NumPy, no library needed; for richer models use pymc or numpyro.

title: Hierarchical pooling across similar keywords
tags: [short-history, bayesian, hierarchical, pooling]
applies_to: [tier_1, tier_2]
data_needs: "≥1 observation per keyword; needs a defined hierarchy (e.g., keyword ∈ cohort ∈ category)"
status: candidate

Hierarchical pooling across similar keywords

Source: Gelman, Carlin, Stern, Dunson, Vehtari, Rubin, Bayesian Data Analysis (3rd ed.), ch. 5; canonical "eight schools" example
Link: https://sites.stat.columbia.edu/gelman/book/BDA3.pdf
Retrieved: 2026-05-15

What it is: A Bayesian multilevel model where keyword-level parameters (level, trend slope, seasonal amplitude) are drawn from a cohort-level prior, which is itself drawn from a global hyperprior. "Partial pooling" interpolates between two extremes: complete pooling (every keyword forecast as the cohort mean, which is too biased) and no pooling (each keyword forecast independently, which is too noisy for short series). The pooling weight is learned from the data rather than chosen by hand, so each keyword's estimate is shrunk toward the cohort mean only as much as the data warrants.

When to use:
- When you have a natural hierarchy: keyword → topic → category, or keyword → country.
- When you want the cohort uncertainty propagated into the per-keyword forecast (Empirical Bayes treats the cohort prior as known).
- When forecasting many short-history keywords at once: pooling lets a large cohort of short series inform the shared parameters much as one long series would.

Fit for our model:
- ✅ Cleanly handles Tier 1 (<6mo) at processing.py:1247: keywords with 1mo of data inherit roughly the cohort mean; with 6mo they start asserting their own signal.
- ✅ Cohort-level seasonal amplitude estimates could help the seasonality detector at processing.py:1130 (detect_seasonality() ACF-lag-12) on series too short to detect lag-12 ACF directly.
- ⚠ Heavier than Empirical Bayes: needs MCMC (Stan/PyMC/NumPyro) or VI, and per-shard runtime is non-trivial.
- ⚠ Hierarchy definition is a modeling choice; getting it wrong silently biases all short-history forecasts.
- 🔧 pymc or numpyro for full Bayesian; pymer4 (a Python wrapper around R's lme4) for a frequentist mixed-effects approximation. For production, train the hierarchical model offline daily and serve posterior means/variances as a lookup table.

title: Transfer learning / meta-learning for forecasting
tags: [short-history, meta-learning, ensemble-selection]
applies_to: [tier_1, tier_2, tier_3]
data_needs: "feature-extractable short history (≥4 observations); large historical pool to train the meta-learner on"
status: candidate

Transfer learning / meta-learning for forecasting

Source: Montero-Manso, Athanasopoulos, Hyndman, Talagala (2020), "FFORMA: Feature-based forecast model averaging," International Journal of Forecasting 36(1):86-92 (2nd place M4); Smyl (2020), "A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting" (ES-RNN, M4 winner)
Link: https://robjhyndman.com/papers/fforma.pdf
Retrieved: 2026-05-15

What it is: Two paradigms. FFORMA extracts a fixed vector of time-series features (spectral entropy, ACF lags, STL-based trend and seasonal strength, trend slope, etc.) from each short series and trains a meta-learner (gradient-boosted trees) on a large historical pool to predict the ensemble weights of base forecasters (ETS, ARIMA, Theta, …) that minimize forecast error for each series. ES-RNN is a hybrid that fits per-series exponential-smoothing components (level and seasonality) to normalize and deseasonalize each series, then trains a shared LSTM across all series on the normalized data, so trend and other dynamics learned from long series transfer to series of any length.
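The inference-time half of a FFORMA-style pipeline is small: the meta-learner emits one score per base forecaster, a softmax turns the scores into weights, and the final forecast is the weighted average. The sketch below shows only that combination step; the scores and base forecasts are placeholders that a trained meta-learner and fitted base models would supply, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores to non-negative weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(base_forecasts, scores):
    """Weighted average of base forecasts.

    base_forecasts: dict of model name -> list of forecasts, one per horizon step
    scores: dict of model name -> meta-learner score for this series (placeholder)
    """
    names = sorted(base_forecasts)
    weights = softmax([scores[name] for name in names])
    horizon = len(base_forecasts[names[0]])
    return [
        sum(w * base_forecasts[name][h] for w, name in zip(weights, names))
        for h in range(horizon)
    ]
```

Equal scores reduce this to a simple average of the base models, so an untrained or uninformative meta-learner degrades to a plain ensemble rather than to an arbitrary pick.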

When to use:
- When you have many time series and want one model to "learn" which forecaster works for which keyword shape.
- When short series share dynamics with a small set of long series ("transfer" the long series' learned trend to the short series).
- Especially attractive for a per-keyword pipeline like ours: train once per refresh on the full historical pool, apply per-keyword at inference.

Fit for our model:
- ✅ Per-shard, per-keyword runtime at inference is just feature extraction + a GBM scoring call, which is affordable inside calculate_hybrid_growth() (processing.py:1247).
- ✅ Could replace the brittle tier boundaries (tier_1<6, tier_2=6-23, tier_3≥24) with a smooth, learned choice of forecaster weights driven by the actual series properties.
- ⚠ Training the meta-learner needs a holdout-of-keywords + holdout-of-future-months setup; non-trivial offline ML pipeline to maintain.
- ⚠ ES-RNN training is much faster on a GPU (we have these on bigmac/kopi/oreo); inference is cheap.
- 🔧 statsforecast does not yet ship FFORMA, but the approach (2nd place in M4) can be reproduced with tsfeatures (feature extraction) + lightgbm as the meta-learner (the original used XGBoost); ES-RNN is in esrnn_torch / neuralforecast. See also Nixtla's TimeGPT as a foundation-model alternative (cross-link with the modern_ml.md page when present).

title: Naive seasonal + trend (sNaive baselines)
tags: [short-history, baseline, benchmark]
applies_to: [tier_1, tier_2, tier_3]
data_needs: "≥1 full season (12 months for sNaive); ≥2 observations for naive/drift"
status: candidate

Naive seasonal + trend (sNaive baselines)

Source: Hyndman & Athanasopoulos, Forecasting: Principles and Practice (3rd ed.), ch. 5.2
Link: https://otexts.com/fpp3/simple-methods.html
Retrieved: 2026-05-15

What it is: A family of trivially simple forecasters: naive (forecast = last observation), seasonal naive / sNaive (forecast for month M = last observation in month M, i.e., last year's value), and drift (last value + average per-period change × horizon). Crucial as benchmarks — any proposed forecaster that does not beat sNaive on seasonal data, or drift on trending data, is not worth deploying. For monthly volume data with 12-24 months of history, sNaive is often shockingly competitive.
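The three definitions above fit in a few lines each. This is a plain-Python sketch for illustration (the function names are ours); in the pipeline the statsforecast equivalents would be used instead.

```python
def naive(y, horizon):
    """Every forecast equals the last observation."""
    return [y[-1]] * horizon


def seasonal_naive(y, horizon, season_length=12):
    """Forecast for each month repeats the value from the same month last season."""
    if len(y) < season_length:
        raise ValueError("seasonal naive needs at least one full season")
    last_season = y[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


def drift(y, horizon):
    """Last value plus the average per-period change times the horizon."""
    if len(y) < 2:
        raise ValueError("drift needs at least 2 observations")
    # The average of all period-to-period changes telescopes to (last - first) / (n - 1).
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return [y[-1] + slope * h for h in range(1, horizon + 1)]
```

Note that drift depends only on the first and last observations, so on very short series a single unusual first month can dominate the slope.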

When to use:
- Always: as the lower bound that every new method must clear, computed and logged alongside the ensemble pick.
- When data is so short that any model with parameters overfits; sNaive has zero parameters.
- As the floor for the "frozen months" prior at processing.py:2122.

Fit for our model:
- ✅ Should be run at every tier as a benchmark: if the StatsForecast ensemble (processing.py:1984) cannot beat sNaive on backtest, prefer sNaive. The seasonal alt-gate (processing.py:2054) already implicitly trusts last year's shape; sNaive makes that trust explicit.
- ✅ For keywords with <12mo of history (all of Tier 1 and the short end of Tier 2), where sNaive cannot run, the drift method is the principled version of the MoM heuristic, which is exactly the kind of replacement the current Tier 1 path needs at processing.py:1247.
- ⚠ sNaive ignores trend; if the keyword is genuinely growing/declining, it will systematically lag.
- ⚠ Not a "method to use exclusively": its value is as a benchmark and as a low-quantile floor when other models are unreliable.
- 🔧 statsforecast.models.SeasonalNaive, Naive, RandomWalkWithDrift are already available in the existing stack with zero new dependencies. Add to the ensemble at processing.py:1984 and use as a fallback when the chosen model's backtest SMAPE is worse than sNaive's.

title: Croston/SBA as a short-history baseline
tags: [short-history, intermittent, baseline]
applies_to: [tier_1, tier_2]
data_needs: "≥1 non-zero observation; behaves sensibly even with many zeros"
status: candidate

Croston/SBA as a short-history baseline

Source: Croston (1972), Operational Research Quarterly 23(3):289-303; Syntetos & Boylan (2005) SBA bias correction. Full entries on the intermittent methods page.
Link: https://nixtlaverse.nixtla.io/statsforecast/docs/models/crostonsba.html
Retrieved: 2026-05-15

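What it is: Croston / SBA smooth the demand size and the inter-arrival time separately and forecast their ratio, so they degrade gracefully on series with many zero-months, including young keywords that have only just started receiving traffic. The SBA variant scales the forecast down to correct Croston's known positive bias and is generally preferred on intermittent series, although the correction is approximate and can slightly under-forecast when zeros are rare.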
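The separate smoothing of size and interval can be sketched as follows. This is an illustrative plain-Python version, not the statsforecast implementation: the function name, the fixed `alpha`, and the initialization (first non-zero value and the gap leading up to it) are assumptions, and SBA is applied with the textbook factor 1 - alpha/2.

```python
def croston(y, horizon, alpha=0.1, sba=False):
    """Croston's method; set sba=True for the Syntetos-Boylan approximation."""
    demand = None        # smoothed size of non-zero observations
    interval = None      # smoothed number of periods between non-zero observations
    periods_since = 1    # periods elapsed since the last non-zero observation
    for obs in y:
        if obs > 0:
            if demand is None:
                # Initialize from the first non-zero observation (an assumption).
                demand, interval = float(obs), float(periods_since)
            else:
                demand += alpha * (obs - demand)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1  # zero months only lengthen the current gap
    if demand is None:
        return [0.0] * horizon  # nothing but zeros so far
    rate = demand / interval    # expected volume per period
    if sba:
        rate *= 1 - alpha / 2   # SBA bias correction
    return [rate] * horizon
```

For a keyword with volume 4 in every second month, `croston([0, 4, 0, 4], 2)` gives a flat per-month rate of 2.0, and the SBA variant with the default `alpha` gives 1.9.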

When to use:
- Tier 1 / Tier 2 keywords that are young AND sparse, e.g., a keyword that started getting traffic 4 months ago but has only had non-zero volume in 2 of those months.
- As a baseline for any new keyword where SES/Holt-Winters would silently extrapolate noise.

Fit for our model:
- ✅ Slots into Tier 1 of calculate_hybrid_growth() (processing.py:1247) as a sane fallback for low-volume new keywords with zero-padded months.
- ✅ Same statsforecast stack, zero new dependencies, and pairs naturally with the _is_spiky_series gate at processing.py:1180: instead of suppressing growth on sparse series, we'd forecast it.
- ⚠ Output is a flat per-month rate, not a seasonal pattern, so it is only useful as the level component; pair with sNaive for the seasonal shape if needed.
- 🔧 See statsforecast.models.CrostonSBA. Full details and cross-method comparison on the intermittent methods page.