Modern ML and Foundation Models

Deep learning forecasters and pretrained time-series foundation models: candidate replacements or augmentations for the StatsForecast ensemble at processing.py:1984. Honest verdict up front: M4 (2018) showed that well-tuned statistical models with light ensembling are very hard to beat on monthly data with short-to-medium history, and M5 (2020), which used daily retail sales, was won by gradient-boosted trees rather than deep networks. Where deep methods such as N-BEATS and TFT do report wins, the margins are single-digit percentage points and depend on large training corpora. Foundation models change the calculus only if their zero-shot accuracy on our distribution beats our ensemble net of serving cost.

title: NeuralProphet
tags: [modern-ml, deep-learning, prophet, pytorch]
applies_to: [tier_2, tier_3]
data_needs: "Monthly history (works with ≥24mo per series); CPU acceptable for short series, GPU helpful for many series"
status: candidate

NeuralProphet

Source: Triebe, Hewamalage, Pilyugina, Laptev, Bergmeir & Rajagopal 2021, "NeuralProphet: Explainable Forecasting at Scale"
Link: https://arxiv.org/abs/2111.15397 ; https://neuralprophet.com/
Retrieved: 2026-05-15

What it is: A PyTorch-based reimplementation of Facebook Prophet that adds an autoregressive (AR-Net) component on top of Prophet's trend + seasonality + holiday + regressor decomposition. Keeps the interpretable additive structure of Prophet but lets the AR term learn lagged dependencies and the regressor terms be parameterized as small NNs. Supports per-series fitting (like Prophet) or global fitting across many series.
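To make the additive structure concrete, here is a minimal sketch of a one-step prediction as trend + seasonality + AR. It is an illustration only, not NeuralProphet's code: the function name, the plain linear trend, and the hand-set coefficients are all assumptions, and the holiday and regressor terms are left out.

```python
import math

def additive_forecast(t, lags, slope, intercept, fourier_coefs, ar_weights, period=12):
    """Hypothetical one-step prediction: trend + seasonality + AR, kept separate."""
    # Linear trend (an assumption; the real model uses a piecewise-linear trend).
    trend = intercept + slope * t
    # Fourier seasonality: one (sin, cos) coefficient pair per harmonic k.
    seasonality = sum(
        a * math.sin(2 * math.pi * k * t / period)
        + b * math.cos(2 * math.pi * k * t / period)
        for k, (a, b) in enumerate(fourier_coefs, start=1)
    )
    # AR term: weighted sum of the most recent observations (most recent first).
    ar = sum(w * y for w, y in zip(ar_weights, lags))
    return {"trend": trend, "seasonality": seasonality, "ar": ar,
            "yhat": trend + seasonality + ar}

parts = additive_forecast(t=3, lags=[4.0], slope=2.0, intercept=10.0,
                          fourier_coefs=[(1.0, 0.0)], ar_weights=[0.5])
```

Because every component is returned separately, the same dictionary that produces the forecast also provides the decomposition plot data, which is the interpretability argument in a nutshell.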

When to use:
- You like Prophet's interpretability (decomposition into trend/seasonality/holiday) but find that pure Prophet over-smooths recent dynamics.
- You want to add holiday/event regressors (cf. Prophet holiday/event regressors) for narrow-seasonal keywords (Halloween, Super Bowl).
- You want one library that handles both per-series and global training without leaving Python.

Fit for our model:
- ✅ Could replace the per-keyword fit half of our ensemble (processing.py:1984) with a model that better handles holiday spikes, which addresses the narrow-seasonal failure mode at processing.py:2054.
- ⚠ Empirically, NeuralProphet is similar to or worse than well-tuned ETS/ARIMA on monthly data with <60 points; AR-Net helps more on daily data.
- ⚠ Adds a PyTorch dependency to the pipeline; per-series fit cost is higher than StatsForecast.
- 🔧 from neuralprophet import NeuralProphet; quantiles are set on the constructor, e.g. NeuralProphet(quantiles=[0.05, 0.95]), and predict() then returns the quantile columns alongside the point forecast.

title: N-BEATS
tags: [modern-ml, deep-learning, global-model, pytorch]
applies_to: [tier_3]
data_needs: "Large training corpus across many series; ≥36mo per series helpful; GPU for training, CPU OK for inference"
status: candidate

N-BEATS

Source: Oreshkin, Carpov, Chapados & Bengio 2019, "N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting"
Link: https://arxiv.org/abs/1905.10437
Retrieved: 2026-05-15

What it is: A pure-MLP architecture (no RNN/CNN/attention) of stacked "blocks", each with a backward (backcast) and a forward (forecast) projection through a basis-function decomposition (trend = polynomial basis, seasonality = Fourier basis, or a generic learned basis). Trained globally across many series. N-BEATS did not compete in M4 (the winner was Smyl's hybrid ES-RNN); the paper reports that an N-BEATS ensemble, evaluated afterwards on the M4 data, beat that winner by a small margin (about 3%).
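The backcast/forecast mechanics are easier to see in code. The sketch below shows doubly residual stacking with a polynomial trend basis; the two "blocks" are hand-written toys standing in for the learned MLPs, and every function name here is hypothetical rather than taken from the paper's code.

```python
def trend_basis(thetas, horizon):
    """Polynomial basis: forecast[h] = sum_i thetas[i] * t**i with t = h / horizon."""
    return [sum(th * (h / horizon) ** i for i, th in enumerate(thetas))
            for h in range(horizon)]

def level_block(residual, horizon):
    """Toy block: explains the mean of its input."""
    mean = sum(residual) / len(residual)
    return [mean] * len(residual), [mean] * horizon

def slope_block(residual, horizon):
    """Toy block: explains a straight line through the endpoints of its input."""
    slope = (residual[-1] - residual[0]) / (len(residual) - 1)
    backcast = [residual[0] + slope * i for i in range(len(residual))]
    # Degree-1 trend basis continuing the line one step past the last input point.
    forecast = trend_basis([residual[-1] + slope, slope * horizon], horizon)
    return backcast, forecast

def run_stack(x, blocks, horizon):
    """Doubly residual stacking: subtract each backcast, sum each forecast."""
    residual, total = list(x), [0.0] * horizon
    for block in blocks:
        backcast, forecast = block(residual, horizon)
        residual = [r - b for r, b in zip(residual, backcast)]
        total = [t + f for t, f in zip(total, forecast)]
    return residual, total

residual, forecast = run_stack([1.0, 2.0, 3.0], [level_block, slope_block], horizon=2)
# forecast == [4.0, 5.0]; residual == [0.0, 0.0, 0.0]
```

Each block only sees what earlier blocks failed to explain, and the per-block forecasts (level, then slope) are exactly the pieces the interpretable variant exposes.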

When to use:
- Many series are available for joint training (we have millions of keywords, so this fits well).
- You want a global model that learns generic patterns rather than per-series fits.
- Interpretable decomposition matters and you want the trend/seasonal basis variant.

Fit for our model:
- ✅ Beat the M4 winner in a post-competition evaluation, which is the strongest published evidence among deep methods for monthly data.
- ⚠ Gains over a tuned statistical ensemble on M4 monthly were single-digit %; an ensemble of N-BEATS + ETS was best, suggesting it augments rather than replaces our ensemble (processing.py:1984).
- ⚠ Training a global model across our keyword distribution requires curation (language/tier/intent stratification); otherwise a single model may learn dominant-segment biases.
- 🔧 neuralforecast (Nixtla): from neuralforecast.models import NBEATS. darts: from darts.models import NBEATSModel.

title: N-HiTS
tags: [modern-ml, deep-learning, global-model, hierarchical, pytorch]
applies_to: [tier_3]
data_needs: "Large training corpus; ≥36mo per series; GPU for training"
status: candidate

N-HiTS

Source: Challu, Olivares, Oreshkin, Garza, Mergenthaler-Canseco & Dubrawski 2022, "N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting"
Link: https://arxiv.org/abs/2201.12886
Retrieved: 2026-05-15

What it is: Successor to N-BEATS that addresses N-BEATS's poor long-horizon scaling. Adds multi-rate signal sampling (max-pool the input at increasing scales) and hierarchical interpolation (each block forecasts at a different resolution, then interpolates up). Same MLP-block design. The roughly 50× compute reduction reported in the paper is measured against Transformer baselines, not N-BEATS; relative to N-BEATS the paper still reports lower cost and better accuracy on long horizons (h ≥ 24).
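The two additions can be sketched in a few lines. This is a simplified illustration, assuming non-overlapping pooling windows and linear interpolation; the helper names are hypothetical and the learned MLP that maps the pooled input to the coarse "knots" is omitted.

```python
def max_pool(x, kernel):
    """Multi-rate sampling: keep the max of each non-overlapping window."""
    return [max(x[i:i + kernel]) for i in range(0, len(x), kernel)]

def interpolate(knots, horizon):
    """Hierarchical interpolation: stretch a few coarse knots to `horizon` points."""
    if len(knots) == 1 or horizon == 1:
        return [knots[0]] * horizon
    out = []
    for h in range(horizon):
        # Position of step h on the knot axis, then linear blend of its neighbours.
        pos = h * (len(knots) - 1) / (horizon - 1)
        lo = min(int(pos), len(knots) - 2)
        frac = pos - lo
        out.append(knots[lo] * (1 - frac) + knots[lo + 1] * frac)
    return out

coarse_input = max_pool([1, 3, 2, 5, 4, 0], kernel=2)  # [3, 5, 4]
fine_forecast = interpolate([0.0, 4.0], horizon=5)     # [0.0, 1.0, 2.0, 3.0, 4.0]
```

A block with a large pooling kernel predicts only a couple of knots, so its parameter count and compute stay small even when the horizon is long; that is where the speed-up comes from.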

When to use:
- Long horizons matter (we forecast 12+ months ahead).
- You're already considering N-BEATS; per the paper, N-HiTS improves on it in both compute and long-horizon accuracy.
- You want the speed (faster training and inference make it practical to retrain monthly).

Fit for our model:
- ✅ Long-horizon focus aligns with our 12-mo+ forecast window, where the decay floor currently kicks in (processing.py:2100).
- ⚠ Same caveats as N-BEATS: global training across our distribution needs care.
- ⚠ Gains over a strong statistical ensemble on monthly data are still modest; published evidence is mostly on higher-frequency, long-horizon datasets.
- 🔧 neuralforecast (Nixtla): from neuralforecast.models import NHITS. Probabilistic output is available through quantile and distribution losses; recent neuralforecast releases also support conformal prediction intervals.

title: Temporal Fusion Transformer (TFT)
tags: [modern-ml, deep-learning, attention, transformer, global-model]
applies_to: [tier_3]
data_needs: "Large corpus; covariates (known-future and observed); ≥36mo per series; GPU"
status: candidate

Temporal Fusion Transformer (TFT)

Source: Lim, Arık, Loeff & Pfister 2021, "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting"
Link: https://arxiv.org/abs/1912.09363
Retrieved: 2026-05-15

What it is: Attention-based encoder-decoder that combines LSTM local processing with self-attention for long-range patterns. Distinguishes three covariate kinds: static (per-series metadata), known-future (calendar, planned events), and observed (lagged actuals). Outputs quantiles directly under pinball loss. Variable-selection networks provide per-step interpretability (which covariate mattered).
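Since the quantile outputs are trained under pinball loss, it helps to have the loss written out. The sketch below is a plain-Python version for reference; the function names and the dict-of-lists input layout are assumptions chosen for readability, not the layout any particular library uses.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for one observation at quantile level q."""
    diff = y_true - y_pred
    # Under-prediction is charged q per unit, over-prediction (1 - q) per unit.
    return max(q * diff, (q - 1) * diff)

def mean_pinball(y_true, quantile_preds):
    """Average over horizon steps and quantile levels.

    quantile_preds maps q -> list of predictions, one per horizon step.
    """
    losses = [pinball_loss(y, p, q)
              for q, preds in quantile_preds.items()
              for y, p in zip(y_true, preds)]
    return sum(losses) / len(losses)

under = pinball_loss(10.0, 8.0, q=0.9)   # forecast too low: penalised heavily
over = pinball_loss(10.0, 12.0, q=0.9)   # forecast too high: penalised lightly
```

The asymmetry is the point: at q = 0.9 an under-forecast costs nine times as much as an over-forecast of the same size, which pushes the output toward the 90th percentile.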

When to use:
- You have rich per-keyword metadata to use as static covariates (intent, vertical, language, tier).
- You want a single model that emits quantile forecasts and interpretability for free.
- You can build a known-future calendar (holidays, scheduled events) as exogenous input.

Fit for our model:
- ✅ Tier + language + intent as static covariates could let one model handle all our keyword segments, which addresses the per-tier heuristic split at processing.py:1247.
- ✅ Native quantile output integrates cleanly with CRPS / pinball evaluation.
- ⚠ Substantial training infra; per-series inference cost is non-trivial at 200-shard scale.
- ⚠ M5 and follow-up benchmarks: TFT wins on some tasks, ties on others; not a clear dominator over N-HiTS for monthly volume.
- 🔧 pytorch_forecasting.TemporalFusionTransformer (package: pytorch-forecasting), darts.models.TFTModel, gluonts.torch.model.tft.TemporalFusionTransformerEstimator.

title: DeepAR
tags: [modern-ml, deep-learning, rnn, autoregressive, probabilistic, global-model]
applies_to: [tier_3]
data_needs: "Large corpus of related series; ≥24mo per series; GPU for training"
status: candidate

DeepAR

Source: Salinas, Flunkert, Gasthaus & Januschowski 2017/2020, "DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks"
Link: https://arxiv.org/abs/1704.04110
Retrieved: 2026-05-15

What it is: A GRU/LSTM-based autoregressive model with parameters shared across all series. Predicts the parameters of a parametric distribution (Gaussian, Negative Binomial, Student-t) at each future step rather than a point. Trained by teacher-forcing on lagged actuals; forecasts by sampling many trajectories and reporting quantiles. The first widely-deployed deep probabilistic forecaster (AWS Forecast).
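The "sample many trajectories, then report quantiles" step can be sketched without any deep learning library. In the sketch below the trained RNN is replaced by a hypothetical `toy_step` function that returns Gaussian parameters; the names and the Gaussian choice are assumptions for illustration only.

```python
import random

def toy_step(window):
    """Stand-in for the trained network: mean of the last 3 values, fixed spread."""
    recent = window[-3:]
    return sum(recent) / len(recent), 1.0

def sample_paths(history, horizon, step_fn, n_paths=200, seed=0):
    """Ancestral sampling: each sampled value is fed back as the next input."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        window, path = list(history), []
        for _ in range(horizon):
            mu, sigma = step_fn(window)
            y = rng.gauss(mu, sigma)
            path.append(y)
            window.append(y)
        paths.append(path)
    return paths

def empirical_quantile(values, q):
    """Simple order-statistic quantile of a list of samples."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]

paths = sample_paths([3.0, 3.0, 3.0], horizon=4, step_fn=toy_step, n_paths=50, seed=1)
first_step = [p[0] for p in paths]
band = [empirical_quantile(first_step, q) for q in (0.05, 0.5, 0.95)]
```

Because uncertainty compounds as samples are fed back in, the quantile band typically widens with the horizon; swapping the Gaussian for a Negative Binomial only changes the sampling line.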

When to use:
- You have many related series and the cross-series patterns should transfer (e.g., same keyword family across languages).
- You explicitly want sample-based forecasts (e.g., Negative Binomial for count-valued keyword impressions).
- You can hand-pick the output distribution to match your data (positive-only, integer, heavy-tailed).

Fit for our model:
- ✅ Negative Binomial / lognormal output naturally fits keyword volumes (positive, over-dispersed, skewed).
- ✅ Cross-series sharing is most beneficial for short-history Tier 1/2 keywords (processing.py:1247).
- ⚠ M4/M5: DeepAR was outperformed by N-BEATS and statistical baselines on monthly data; it shines more on hourly/daily data with strong seasonality.
- ⚠ Older architecture; modern foundation models (Chronos, TimesFM) pursue the same global-model idea with Transformer backbones and much larger pretraining.
- 🔧 gluonts.torch.model.deepar.DeepAREstimator, pytorch_forecasting.DeepAR (package: pytorch-forecasting), darts.models.RNNModel(model='LSTM', likelihood=GaussianLikelihood()).

title: TimeGPT (Nixtla)
tags: [modern-ml, foundation-model, zero-shot, api, closed-source]
applies_to: [tier_2, tier_3]
data_needs: "≥24 data points per series (Nixtla's recommendation); API access; optional fine-tuning data"
status: candidate

TimeGPT (Nixtla)

Source: Garza & Mergenthaler-Canseco 2023, "TimeGPT-1"
Link: https://arxiv.org/abs/2310.03589 ; https://docs.nixtla.io/
Retrieved: 2026-05-15

What it is: Closed-source foundation model from Nixtla, accessible via an API (nixtla Python SDK). Pretrained on a large corpus of time-series; supports zero-shot forecasting with no per-series fitting required. Returns point forecasts and conformal prediction intervals. Supports fine-tuning on the user's data, exogenous variables, and anomaly detection in the same SDK.
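For readers unfamiliar with conformal intervals, the generic idea is sketched below: take absolute errors from a held-out calibration window and use one of their order statistics as the interval half-width. This is a textbook split-conformal sketch under that assumption, not Nixtla's implementation, and the function name is hypothetical.

```python
import math

def conformal_interval(point_forecast, calibration_errors, level=0.9):
    """Symmetric split-conformal interval from held-out forecast errors."""
    scores = sorted(abs(e) for e in calibration_errors)
    n = len(scores)
    # Finite-sample corrected rank; round() guards against float noise in the product.
    k = math.ceil(round((n + 1) * level, 9))
    if k > n:
        raise ValueError("not enough calibration points for this level")
    width = scores[k - 1]
    return [(f - width, f + width) for f in point_forecast]

errors = [1, -2, 3, -4, 5, -6, 7, -8, 9]   # forecast minus actual on a holdout
intervals = conformal_interval([100.0], errors, level=0.5)
# intervals == [(95.0, 105.0)]
```

The same wrapper works around any point forecaster, which makes it a convenient common denominator when comparing an API model against the in-house ensemble.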

When to use:
- You want to spin up a credible forecast on many series with no engineering.
- You're OK with sending data to a third-party API (consider PII / business-sensitivity for keyword data).
- You want a fast bake-off against your in-house ensemble before committing to deeper ML investment.

Fit for our model:
- ✅ Zero-shot evaluation on a sample of our keywords would be a cheap experiment to see if a foundation model helps at all.
- ⚠ API call cost × millions of keywords × monthly cadence is the main blocker; would need a local model for production.
- ⚠ Closed model = no audit of training data overlap with our distribution; coverage on long-tail / multilingual keywords unknown.
- 🔧 from nixtla import NixtlaClient; nixtla_client = NixtlaClient(api_key=...); nixtla_client.forecast(df, h=12, level=[80, 95]). For deeper integration, see https://docs.nixtla.io/.

title: Chronos (Amazon)
tags: [modern-ml, foundation-model, transformer, t5, pretrained, open-weights]
applies_to: [tier_2, tier_3]
data_needs: "Any series length (model is zero-shot); GPU strongly recommended for inference"
status: candidate

Chronos (Amazon)

Source: Ansari, Stella, Turkmen, Zhang, Mercado, Shen, Shchur, Rangapuram, Pineda Arango, Kapoor, Zschiegner, Maddix, Mahoney, Torkkola, Wilson, Bohlke-Schneider & Wang 2024, "Chronos: Learning the Language of Time Series"
Link: https://arxiv.org/abs/2403.07815 ; https://github.com/amazon-science/chronos-forecasting
Retrieved: 2026-05-15

What it is: Tokenizes a time series by scaling and quantizing its values into a fixed vocabulary, then feeds the tokens to a T5 encoder-decoder language model pretrained on a large public corpus of time series. At inference, it samples multiple completions and aggregates them into quantile forecasts. Open weights on HuggingFace in five sizes (tiny, mini, small, base, large); competitive with task-specific models in the paper's zero-shot benchmarks.
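The scale-then-quantize step is simple enough to sketch directly. The version below uses mean-absolute scaling and uniform bins; the bin count and range are small illustrative values chosen so the arithmetic is easy to follow, not the model's real vocabulary settings, and the function names are hypothetical.

```python
def tokenize(series, n_bins=10, low=-5.0, high=5.0):
    """Scale by mean absolute value, then map each value to a uniform bin id."""
    scale = sum(abs(v) for v in series) / len(series) or 1.0  # avoid divide-by-zero
    width = (high - low) / n_bins
    tokens = []
    for v in series:
        idx = int((v / scale - low) / width)
        tokens.append(min(max(idx, 0), n_bins - 1))  # clip to the vocabulary
    return tokens, scale

def detokenize(tokens, scale, n_bins=10, low=-5.0, high=5.0):
    """Map bin ids back to bin centres and undo the scaling."""
    width = (high - low) / n_bins
    return [(low + (t + 0.5) * width) * scale for t in tokens]

tokens, scale = tokenize([2.0, 4.0, 6.0])
# tokens == [5, 6, 6], scale == 4.0
approx = detokenize(tokens, scale)
# approx == [2.0, 6.0, 6.0]
```

The round trip is lossy (4.0 comes back as 6.0 here because it shares a bin with 6.0), which is why a large vocabulary matters; with these toy settings the quantization error is deliberately visible.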

When to use:
- You want a pretrained foundation model you can self-host (no third-party API).
- Zero-shot or light-touch fine-tuning fits your operational model better than per-keyword training.
- You're willing to run GPU inference (or accept slower CPU inference with the smaller variants).

Fit for our model:
- ✅ Open weights and self-hostable on the Ahrefs ML cluster, which addresses the data-privacy concern of TimeGPT.
- ✅ Native quantile output supports calibration evaluation via CRPS.
- ⚠ The Chronos paper's evaluation is mostly on hourly/daily benchmarks; published monthly evaluation is thinner. We need our own bake-off.
- ⚠ Inference cost is significantly higher than per-series StatsForecast (processing.py:1984) even on GPU; would likely be tier_3-only or used as an ensemble member.
- 🔧 pip install chronos-forecasting; from chronos import ChronosPipeline; pipeline = ChronosPipeline.from_pretrained("amazon/chronos-t5-small"); pipeline.predict(context, prediction_length=12, num_samples=100).

title: Moirai (Salesforce)
tags: [modern-ml, foundation-model, transformer, universal, open-weights]
applies_to: [tier_2, tier_3]
data_needs: "Any frequency; zero-shot; GPU recommended"
status: candidate

Moirai (Salesforce)

Source: Woo, Liu, Kumar, Xiong, Savarese & Sahoo 2024, "Unified Training of Universal Time Series Forecasting Transformers"
Link: https://arxiv.org/abs/2402.02592 ; https://github.com/SalesforceAIResearch/uni2ts
Retrieved: 2026-05-15

What it is: A "universal" transformer pretrained on the LOTSA corpus (~27B observations across many domains and frequencies). Handles variable input lengths, frequencies, and numbers of variates in one model via multiple patch-size projection layers (the patch size is chosen per frequency), any-variate attention, and a mixture-distribution output. Open weights in three sizes; supports zero-shot and fine-tuned modes.
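A mixture-distribution output means the forecast at each step is a weighted blend of several parametric distributions. The sketch below shows how samples, and from them quantiles, can be drawn from such a blend; the two components (a Gaussian and a log-normal) and their parameters are stand-ins picked for illustration, not the model's actual component set.

```python
import random

def sample_mixture(weights, samplers, n_samples, seed=0):
    """Draw from a mixture: pick a component by weight, then sample from it."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        component = rng.choices(samplers, weights=weights, k=1)[0]
        draws.append(component(rng))
    return draws

# Stand-in components (assumptions): a narrow Gaussian and a skewed log-normal.
components = [
    lambda rng: rng.gauss(100.0, 5.0),
    lambda rng: rng.lognormvariate(5.0, 0.5),
]
draws = sorted(sample_mixture([0.7, 0.3], components, n_samples=1000))
p10, p50, p90 = draws[100], draws[500], draws[900]
```

Sorted samples give empirical quantiles directly, which is the form a reliability diagram or CRPS computation needs.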

When to use:
- You want a frequency-agnostic foundation model (we're monthly today but may want daily/weekly later).
- You want one model for many keyword types rather than per-tier custom logic.
- You're already evaluating Chronos; Moirai is the natural second comparator.

Fit for our model:
- ✅ Frequency-agnostic design lets us reuse the same checkpoint for higher-cadence GSC daily aggregates without retraining.
- ✅ Mixture-distribution output gives probabilistic forecasts compatible with reliability diagrams.
- ⚠ Same compute caveat as Chronos: GPU inference is not free at our scale.
- ⚠ Independent published evaluation on monthly retail/web data is limited; rely on our own bake-off.
- 🔧 pip install uni2ts; from uni2ts.model.moirai import MoiraiForecast, MoiraiModule; model = MoiraiForecast(module=MoiraiModule.from_pretrained("Salesforce/moirai-1.0-R-small"), prediction_length=12, ...).

title: TimesFM (Google)
tags: [modern-ml, foundation-model, decoder-only, transformer, open-weights]
applies_to: [tier_2, tier_3]
data_needs: "Any length input (model uses up to 512-context); zero-shot; GPU recommended"
status: candidate

TimesFM (Google)

Source: Das, Kong, Sen & Zhou 2024, "A decoder-only foundation model for time-series forecasting"
Link: https://arxiv.org/abs/2310.10688 ; https://github.com/google-research/timesfm
Retrieved: 2026-05-15

What it is: Decoder-only transformer (200M parameters) pretrained on 100B time-points spanning Google Trends, Wikipedia traffic, electricity, etc. Patches contiguous time windows into tokens; predicts the next-patch in a GPT-style autoregressive setup. Open weights on HuggingFace; zero-shot competitive with task-specific models on Monash and Darts benchmarks.
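Patching and next-patch decoding can be sketched as follows. This is a simplification: it uses one patch length for both input and output and a hypothetical `predict_patch` callable in place of the transformer, so it shows the control flow only.

```python
def make_patches(series, patch_len):
    """Split into contiguous patches, dropping the oldest leftover values."""
    start = len(series) % patch_len
    return [series[i:i + patch_len] for i in range(start, len(series), patch_len)]

def decode(series, horizon, patch_len, predict_patch):
    """Autoregressive decoding: predict a patch, append it, repeat."""
    context, out = list(series), []
    while len(out) < horizon:
        next_patch = predict_patch(make_patches(context, patch_len))
        out.extend(next_patch)
        context.extend(next_patch)
    return out[:horizon]

# Toy "model" (assumption): repeat the most recent patch.
repeat_last = lambda patches: list(patches[-1])

patches = make_patches([1, 2, 3, 4, 5], patch_len=2)            # [[2, 3], [4, 5]]
forecast = decode([1, 2, 3, 4], horizon=5, patch_len=2,
                  predict_patch=repeat_last)                     # [3, 4, 3, 4, 3]
```

Predicting a whole patch per step means a 12-month horizon needs far fewer decoding steps than value-by-value generation, which is part of why inference stays cheap. The history must contain at least one full patch, otherwise the toy predictor has nothing to index.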

When to use:
- You want a compact foundation model (200M parameters fits in modest GPU memory).
- The training data overlap with web/search data (Google Trends in training) is appealing for our domain.
- You want a simple, GPT-style API (give me past tokens, get future tokens).

Fit for our model:
- ✅ Training corpus includes Google Trends, so it is likely the best in-distribution foundation model for our use case.
- ✅ Small enough to consider running on CPU for batch inference if GPU is constrained.
- ⚠ Point forecasts are the primary output; the quantile heads returned alongside them are labeled experimental, so pair with conformal for intervals.
- ⚠ Same general caveat as other foundation models: gains over a tuned statistical ensemble on monthly aggregates are an empirical question, not yet established for our distribution.
- 🔧 pip install timesfm; tfm = timesfm.TimesFm(...); tfm.load_from_checkpoint(repo_id="google/timesfm-1.0-200m"); tfm.forecast(inputs=[...], freq=[1]).

title: Lag-Llama
tags: [modern-ml, foundation-model, decoder-only, llama, open-weights, probabilistic]
applies_to: [tier_2, tier_3]
data_needs: "Any length; zero-shot; GPU recommended"
status: candidate

Lag-Llama

Source: Rasul, Ashok, Williams, Khorasani, Adamopoulos, Bhagwatkar, Biloš, Ghonia, Hassen, Schneider, Garg, Drouin, Chapados, Nevmyvaka & Rish 2023/2024, "Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting"
Link: https://arxiv.org/abs/2310.08278 ; https://github.com/time-series-foundation-models/lag-llama
Retrieved: 2026-05-15

What it is: Decoder-only Llama-style transformer that takes lagged values of the series as input tokens (not a contiguous patch). At each step, it outputs the parameters of a Student-t distribution; sampling many trajectories yields quantile forecasts. Pretrained on a corpus of open time-series datasets from several domains (including data from the Monash archive); open weights on HuggingFace.
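The difference between lag tokens and contiguous patches is easiest to see on a tiny example. The sketch below builds one input vector per time step from a chosen set of lags; the function name and the lag set are hypothetical, and the real model adds further features on top of the raw lags.

```python
def lag_tokens(series, lags):
    """One input vector per step t, holding series[t - lag] for each lag."""
    max_lag = max(lags)
    return [[series[t - lag] for lag in lags]
            for t in range(max_lag, len(series))]

history = [10, 11, 12, 13, 14]
tokens = lag_tokens(history, lags=[1, 3])
# tokens == [[12, 10], [13, 11]]  (inputs for predicting history[3] and history[4])
```

For monthly data, a lag set such as 1, 2, 3 and 12 would give each token both recent momentum and the same month one year earlier; the cost is that the first max(lags) observations cannot be used as prediction targets, which matters when series have fewer than 60 points.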

When to use:
- You want a foundation model that is natively probabilistic (Student-t output) without a separate conformal step.
- You're comparing across Chronos, TimesFM, Moirai, and Lag-Llama for a zero-shot bake-off; Lag-Llama is among the smallest and easiest to deploy.
- Student-t output (heavy-tailed) matches your data better than a Gaussian forecast head.

Fit for our model:
- ✅ Probabilistic by design, so it fits naturally into a CRPS-evaluated pipeline.
- ⚠ The training corpus is smaller than that of Chronos or TimesFM; reported zero-shot accuracy is competitive but not best-in-class.
- ⚠ Same compute concern: foundation-model inference in practice wants a GPU and is roughly an order of magnitude slower than per-series ETS at our scale.
- ⚠ For our monthly aggregates with mostly <60 points per series, the marginal accuracy gain over a tuned StatsForecast ensemble (processing.py:1984) is unlikely to justify the infra cost unless used selectively on top tier_3 keywords.
- 🔧 Install from the GitHub repository (clone it and install its requirements; check the README for the current route); load the checkpoint from time-series-foundation-models/Lag-Llama on HuggingFace; pair with the GluonTS predictor API.