Shape-based Forecast Evaluation
Metrics and strategies for evaluating forecasts whose value lies in matching the shape (peak timing, amplitude profile) rather than just point-wise magnitude — directly motivated by the seasonal alt-gate at processing.py:2054.
title: Why SMAPE fails on narrow seasonals
tags: [evaluation, smape, analysis]
applies_to: [tier_2, tier_3]
data_needs: "N/A — analytical note"
status: candidate
Why SMAPE fails on narrow seasonals
Source: Makridakis, Spiliotis & Assimakopoulos 2020, The M4 Competition: 100,000 time series and 61 forecasting methods; Hyndman & Koehler 2006
Link: https://www.sciencedirect.com/science/article/abs/pii/S0169207019301874
Retrieved: 2026-05-15
What it is: SMAPE (Symmetric Mean Absolute Percentage Error) is defined as SMAPE = mean(2·|f - a| / (|f| + |a|)), so each term lies between 0 and 2. The denominator is the sum of forecast and actual magnitudes, which is near zero whenever both are near zero (and the term is undefined, 0/0, when both are exactly zero). For narrow-seasonal series (e.g., "halloween costumes", "super bowl ads") the actual is ~0 for 10-11 months of the year. In those months any nonzero forecast against a zero actual saturates the term: a small absolute miss like f=50, a=0 yields 2·|50 - 0| / (50 + 0) = 2.0, the upper bound, while a perfect peak match f=10000, a=10000 contributes 0.0. The off-peak months therefore dominate the mean, and a shape-correct forecast looks bad. M4 noted SMAPE's asymmetric behavior in detail and used MASE alongside it, combining the two via OWA.
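The arithmetic above can be checked with a small sketch. The 12-month series and both forecasts are made-up illustrations, and treating a 0/0 term as 0 is a convention assumed here, not something the source prescribes.

```python
def smape_terms(forecast, actual):
    """Per-period terms 2*|f - a| / (|f| + |a|).

    Assumption: a 0/0 term (forecast and actual both zero) counts as 0.
    """
    terms = []
    for f, a in zip(forecast, actual):
        denom = abs(f) + abs(a)
        terms.append(0.0 if denom == 0 else 2.0 * abs(f - a) / denom)
    return terms


def smape(forecast, actual):
    terms = smape_terms(forecast, actual)
    return sum(terms) / len(terms)


# Hypothetical narrow-seasonal year: zero everywhere except an October peak.
actual = [0] * 9 + [10000] + [0] * 2
# Shape-correct forecast: exact peak, small positive noise off-peak.
shape_correct = [50] * 9 + [10000] + [50] * 2
# Flat-zero forecast: misses the peak entirely.
flat_zero = [0] * 12

smape_shape = smape(shape_correct, actual)  # 11 saturated months: 22/12, about 1.83
smape_flat = smape(flat_zero, actual)       # only the peak month is penalized: 2/12, about 0.17
print(smape_shape, smape_flat)
```

Under these assumptions the forecast that misses the peak entirely scores far better than the one that nails it, which is the failure mode this entry describes.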
When to use:
- Use this analysis to justify replacing or augmenting SMAPE for series where the actual values are heavy-tailed or have many near-zero entries.
- Reference when explaining why our Pearson r > 0.5 override at processing.py:2054 was introduced.
Fit for our model:
- ✅ Direct rationale for replacing the alt-gate at processing.py:2054 with a shape-aware metric like DTW or scale-free MASE.
- ✅ Documents why we shouldn't tune the SMAPE threshold up to fix narrow-seasonal rejection — the metric is structurally wrong for the regime, not just mis-calibrated.
- ⚠ SMAPE is still fine for the bulk of keywords (steady demand); replacement should be tier/regime-aware, not blanket.
- 🔧 No library — this is a design rationale entry. See MASE, DTW, Pearson and Spearman correlation for replacements.
title: DTW (Dynamic Time Warping)
tags: [evaluation, shape, alignment]
applies_to: [tier_2, tier_3]
data_needs: "Two equal-length-or-warpable series; windowed variants need a band parameter"
status: candidate
DTW (Dynamic Time Warping)
Source: Sakoe & Chiba 1978, Dynamic programming algorithm optimization for spoken word recognition
Link: https://tslearn.readthedocs.io/en/stable/user_guide/dtw.html
Retrieved: 2026-05-15
What it is: Dynamic-programming algorithm that finds the optimal non-linear alignment between two sequences by warping the time axis. Returns a cumulative distance along the optimal path. Unlike pointwise error metrics, DTW tolerates small phase shifts (a forecast that peaks one month early vs the actual is penalized lightly), making it a natural "shape match" score. The Sakoe-Chiba band constrains how far the alignment can stray from the diagonal, preventing degenerate warpings.
When to use:
- Series where peak alignment matters more than month-by-month magnitude.
- Comparing two seasonal patterns whose peaks may be shifted (e.g., Easter shifts by date each year, lunar calendar holidays).
- Clustering similar-shaped series.
Fit for our model:
- ✅ Strong candidate to replace the Pearson-correlation override at processing.py:2054: a small DTW distance directly measures shape similarity with tolerance for ±1 month phase error.
- ✅ Works well on the narrow-seasonal regime where SMAPE fails.
- ⚠ Quadratic time in series length per comparison; for 24-month backtests this is trivial, but per-keyword × millions adds up. Use a Sakoe-Chiba band (window=2) to bound it.
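- ⚠ Magnitude-sensitive on raw values: a forecast with perfect timing but the wrong scale still gets a large distance. Z-normalizing both series first isolates shape but makes DTW magnitude-blind, so pair it with a separate amplitude check (e.g., MASE on peak month only).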
- 🔧 tslearn.metrics.dtw(s1, s2, global_constraint='sakoe_chiba', sakoe_chiba_radius=2) or the faster dtaidistance.dtw.distance(s1, s2, window=2) (C implementation).
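To make the phase-tolerance argument concrete, here is a minimal pure-Python sketch of banded DTW. It is not the tslearn or dtaidistance implementation: the local cost is the absolute difference (an assumption made here so the numbers are easy to read by hand), and the six-point series are invented.

```python
import math


def dtw_distance(s1, s2, radius=None):
    """Classical DTW with |a - b| local cost.

    radius is the Sakoe-Chiba half-width in time steps; None means unconstrained.
    """
    n, m = len(s1), len(s2)
    if radius is None:
        radius = max(n, m)
    radius = max(radius, abs(n - m))  # the band must be able to reach cell (n, m)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells with |i - j| <= radius are filled; the rest stay at +inf.
        for j in range(max(1, i - radius), min(m, i + radius) + 1):
            d = abs(s1[i - 1] - s2[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


actual = [0, 0, 0, 10, 0, 0]
early_peak = [0, 0, 10, 0, 0, 0]  # right shape, peak one step early
flat = [0] * 6                    # no peak at all

banded = dtw_distance(actual, early_peak, radius=2)     # 0.0: the shift is absorbed
pointwise = dtw_distance(actual, early_peak, radius=0)  # 20.0: radius 0 is plain L1 error
flat_score = dtw_distance(actual, flat, radius=2)       # 10.0: the missing peak cannot be warped away
print(banded, pointwise, flat_score)
```

With radius=0 the band collapses to the diagonal and the score equals the summed pointwise error, which ranks the shifted-peak forecast below the flat one; widening the band reverses that ranking.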
title: Pearson and Spearman correlation
tags: [evaluation, shape, correlation]
applies_to: [tier_2, tier_3]
data_needs: "Two equal-length series; needs variance > 0 in both"
status: candidate
Pearson and Spearman correlation
Source: Standard statistics (Pearson 1895; Spearman 1904)
Link: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
Retrieved: 2026-05-15
What it is: Pearson r measures linear association between two series after centering; Spearman ρ is Pearson on the ranks (monotonic but not necessarily linear). Both are scale- and location-invariant: a forecast that is 2·actual + 5 everywhere scores r = 1.0. This is what makes them useful as shape-only metrics — and what makes them blind to level and amplitude errors.
When to use:
- You want a cheap shape-only sanity check that ignores both bias and scale.
- As a complementary metric in a multi-metric panel (with MASE for scale, DTW for alignment).
Fit for our model:
- ✅ This is what processing.py:2054 already uses (Pearson r > 0.5). The entry documents why it's a partial solution: catches the easy "same shape, wrong magnitude" cases, misses peak-misalignment and amplitude-error cases.
- ⚠ Threshold of 0.5 is unprincipled; sensitive to outliers (a single big peak can drive r up without the rest matching).
- ⚠ Spearman is more robust to outliers but loses information about amplitude rank gaps. Worth comparing both.
- 🔧 scipy.stats.pearsonr(forecast, actual).statistic, scipy.stats.spearmanr(forecast, actual).correlation. Consider switching the alt-gate to (r > 0.5) AND (DTW_norm < threshold) for a stricter, shape-aware combined gate.
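The invariance and the outlier sensitivity noted in this entry can both be seen in a small sketch. The functions below are plain-Python stand-ins for scipy.stats.pearsonr and scipy.stats.spearmanr, written out only to keep the example dependency-free; the series are invented.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # raises ZeroDivisionError if a series is constant


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    return pearson_r(average_ranks(x), average_ranks(y))


# Scale and location invariance: forecast = 2 * actual + 5 gives r = 1.
actual = [1, 3, 2, 8, 4, 2]
affine_r = pearson_r([2 * a + 5 for a in actual], actual)

# One shared peak dominates Pearson even though the off-peak months disagree.
peaky_actual = [3, 1, 2, 1, 3, 100, 2, 1]
peaky_forecast = [1, 3, 1, 2, 1, 90, 3, 2]
peak_r = pearson_r(peaky_forecast, peaky_actual)       # close to 1
peak_rho = spearman_rho(peaky_forecast, peaky_actual)  # slightly negative
print(affine_r, peak_r, peak_rho)
```

In the second pair the forecast would clear an r > 0.5 gate comfortably, while the rank correlation shows that nothing apart from the peak lines up.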
title: MASE (Mean Absolute Scaled Error)
tags: [evaluation, scale-free, magnitude]
applies_to: [tier_1, tier_2, tier_3]
data_needs: "Train series with ≥m+1 observations (m = seasonal period) for the in-sample naive denominator"
status: candidate
MASE (Mean Absolute Scaled Error)
Source: Hyndman & Koehler 2006, Another look at measures of forecast accuracy; fpp3 ch. 5.8
Link: https://otexts.com/fpp3/accuracy.html
Retrieved: 2026-05-15
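What it is: Scale-free error: MASE = MAE / mean(|y_t - y_{t-m}|), where the numerator is the forecast's MAE on the evaluation window and the denominator is the in-sample mean absolute error of a seasonal-naive baseline (period m; use m=1 for non-seasonal). MASE < 1 means the forecast's out-of-sample MAE is smaller than the seasonal-naive's in-sample MAE; MASE > 1 means it is larger. Unlike SMAPE, MASE does not saturate when actuals are near zero, and it is interpretable on a common scale across heterogeneous series. It is undefined only when the denominator is zero, i.e., when the training series repeats exactly from one season to the next (a constant history, for instance).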
When to use:
- Comparing forecast methods across keywords with very different volume scales (1k searches vs 10M searches).
- You want a single metric whose value of 1.0 has a clear meaning ("equal to naive").
- The series has any near-zero stretches that would explode percentage-based metrics.
Fit for our model:
- ✅ Drop-in replacement (or co-metric) for SMAPE in the model-selection step at processing.py:1984/processing.py:2054. With m=12, MASE explicitly penalizes a forecast that loses to the seasonal-naive on a seasonal series — exactly the regime where our HW models can collapse to a flat line.
- ✅ Cheap to compute; behaves well when actual = 0.
- ⚠ Still magnitude-based (won't fix narrow-seasonal-shape problem on its own — pair with DTW or Pearson).
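- 🔧 sktime ships MASE as sktime.performance_metrics.forecasting.MeanAbsoluteScaledError(sp=12). In the Nixtla stack the metric functions live, as far as we can tell, in the companion utilsforecast.losses module (mase) rather than in statsforecast itself; verify against the installed version.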
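The MASE definition in this entry can be sketched directly. This is a hand-rolled illustration, not the sktime implementation; the three-year narrow-seasonal series is invented, and raising an error on a zero denominator is a choice made for this sketch.

```python
def mase(actual, forecast, train, m=12):
    """Forecast MAE divided by the in-sample MAE of the seasonal-naive (lag m)."""
    if len(train) <= m:
        raise ValueError("need at least m + 1 training observations")
    diffs = [abs(train[t] - train[t - m]) for t in range(m, len(train))]
    scale = sum(diffs) / len(diffs)
    if scale == 0:
        raise ValueError("in-sample seasonal-naive error is zero; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


# Hypothetical narrow-seasonal keyword: one October peak per year.
year1 = [0] * 9 + [8000] + [0] * 2
year2 = [0] * 9 + [10000] + [0] * 2
train = year1 + year2
actual = [0] * 9 + [11000] + [0] * 2  # held-out third year

shape_correct = [50] * 9 + [11000] + [50] * 2
seasonal_naive = year2
flat_zero = [0] * 12

mase_shape = mase(actual, shape_correct, train)   # 550 / 2000 = 0.275
mase_naive = mase(actual, seasonal_naive, train)  # 1000 / 2000 = 0.5
mase_flat = mase(actual, flat_zero, train)        # 11000 / 2000 = 5.5
print(mase_shape, mase_naive, mase_flat)
```

On this toy series the ordering is shape-correct, then seasonal-naive, then flat, which is the ordering one would want and the opposite of what per-month SMAPE gives for the same kind of forecasts.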
title: OWA (Overall Weighted Average) — M4 metric
tags: [evaluation, composite, m4]
applies_to: [tier_2, tier_3]
data_needs: "Both SMAPE and MASE computable; reference benchmark (seasonal-naive) for normalization"
status: candidate
OWA (Overall Weighted Average) — M4 metric
Source: Makridakis, Spiliotis & Assimakopoulos 2020, The M4 Competition: 100,000 time series and 61 forecasting methods
Link: https://www.sciencedirect.com/science/article/abs/pii/S0169207019301874
Retrieved: 2026-05-15
What it is: Composite ranking metric used by the M4 competition. Normalize each of SMAPE and MASE by the corresponding metric for a reference baseline, then average: OWA = 0.5 · (SMAPE/SMAPE_naive + MASE/MASE_naive). M4 itself used Naïve 2 (naive on seasonally adjusted data) as the reference; this entry uses the seasonal-naive in that role. The reference scores OWA = 1.0 by construction; a method with OWA < 1 beats it on average across the two metrics, though not necessarily on each one. Used precisely because SMAPE and MASE measure different things and each has known failure modes.
When to use:
- Comparing forecast methods over a large heterogeneous pool of series, when you don't want any single metric's pathology (SMAPE's saturation on near-zero actuals; MASE's reliance on the naive denominator) to dominate.
- Reporting accuracy at the level of a benchmark/competition.
Fit for our model:
- ✅ Natural metric for the per-tier evaluation that informs which forecast variant we trust in the ensemble at processing.py:1984. Reporting OWA across keyword tiers (Tier 1/2/3 from processing.py:1247) would give a single comparable headline number per release.
- ⚠ Still inherits SMAPE's narrow-seasonal pathology in half its weight — for that regime add DTW or peak-match metrics separately. OWA is a fleet-wide aggregator, not a per-series gate.
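- 🔧 Compute SMAPE and MASE per series (e.g., with sktime's metric classes, or utilsforecast.losses in the Nixtla stack if available), average each over the eval set, divide by the naive's average for the same metric, then take the mean of the two ratios.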
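A minimal sketch of the aggregation described above, assuming per-series SMAPE and MASE have already been computed for both the model and the reference on the same evaluation set. All scores below are invented; the component ratios are returned alongside OWA because an average below 1 does not by itself say whether each ratio is below 1.

```python
def owa(smape_model, mase_model, smape_ref, mase_ref):
    """Each argument is a list of per-series scores over the same evaluation set.

    Returns (owa, relative_smape, relative_mase).
    """
    def mean(xs):
        return sum(xs) / len(xs)

    rel_smape = mean(smape_model) / mean(smape_ref)
    rel_mase = mean(mase_model) / mean(mase_ref)
    return 0.5 * (rel_smape + rel_mase), rel_smape, rel_mase


# Hypothetical scores for three series.
smape_model = [0.30, 0.20, 0.22]  # mean 0.24
smape_ref = [0.25, 0.15, 0.20]    # mean 0.20
mase_model = [0.5, 0.6, 0.7]      # mean 0.6
mase_ref = [1.0, 0.9, 1.1]        # mean 1.0

score, rel_smape, rel_mase = owa(smape_model, mase_model, smape_ref, mase_ref)
print(score, rel_smape, rel_mase)  # about 0.9, 1.2, 0.6
```

Here OWA is below 1 even though the model is worse than the reference on SMAPE, so reporting the two ratios next to the headline number is worth the extra column.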
title: Scaled CRPS (probabilistic)
tags: [evaluation, probabilistic, proper-scoring-rule]
applies_to: [tier_2, tier_3]
data_needs: "Forecast must be a distribution or quantile set, not a point"
status: candidate
Scaled CRPS (probabilistic)
Source: Gneiting & Raftery 2007, Strictly Proper Scoring Rules, Prediction, and Estimation; GluonTS Evaluator docs
Link: https://ts.gluon.ai/stable/tutorials/forecasting/extended_tutorial.html
Retrieved: 2026-05-15
What it is: Continuous Ranked Probability Score generalizes absolute error to probabilistic forecasts: CRPS(F, y) = ∫ (F(x) - 1{y ≤ x})² dx. It reduces to |ŷ - y| when the forecast is a point mass at ŷ, so averaged over a horizon it equals MAE for point forecasts. It is strictly proper: the expected CRPS is minimized only by the true distribution. The scaled CRPS (sCRPS or wQuantileLoss) divides the summed score by the sum of absolute actual values to make CRPS comparable across series with different scales. It is typically approximated from a finite set of forecast quantiles (commonly the 9 deciles).
When to use:
- The model emits a forecast distribution (e.g., ETS or ARIMA with prediction intervals, quantile regressor, ensemble samples).
- You want to evaluate calibration and sharpness with a single proper score.
- Comparing probabilistic forecasters where pinball loss at a single quantile would miss the full picture.
Fit for our model:
- ⚠ Our current ensemble at processing.py:1984 emits point forecasts only; no native distribution. Would require switching at least one ensemble member to a probabilistic output (e.g., ARIMA(...).predict(level=[80, 95])) or bootstrapping the ensemble.
- ✅ Directly relevant to P7 in problems.md: replacing the heuristic confidence score at processing.py:1041 with calibrated intervals would let downstream tools reason about uncertainty.
- 🔧 gluonts.evaluation.Evaluator(quantiles=[0.1, 0.2, ..., 0.9]) returns mean_wQuantileLoss (≈ sCRPS); statsforecast exposes prediction intervals from which one can compute CRPS via quantile approximation.
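The quantile approximation mentioned above can be sketched as follows. It relies on the identity CRPS = 2 · ∫ pinball_q dq, approximated by an average over the nine deciles; the function names and the numbers are illustrative and this is not the GluonTS code path.

```python
def pinball(y, z, q):
    """Pinball (quantile) loss of quantile forecast z at level q for outcome y."""
    return q * (y - z) if y >= z else (1.0 - q) * (z - y)


def scaled_crps(actuals, quantile_forecasts, levels):
    """Approximate sCRPS from a quantile grid.

    actuals: one outcome per time step.
    quantile_forecasts: per time step, one forecast value per level.
    The score is summed over time and divided by sum(|actual|).
    """
    total = 0.0
    for y, zs in zip(actuals, quantile_forecasts):
        total += sum(2.0 * pinball(y, z, q) for z, q in zip(zs, levels)) / len(levels)
    return total / sum(abs(y) for y in actuals)


levels = [i / 10 for i in range(1, 10)]  # the 9 deciles
actuals = [100, 200]

# A degenerate "distribution": every quantile equals the point forecast.
point_like = [[110] * 9, [180] * 9]
score_point = scaled_crps(actuals, point_like, levels)  # (10 + 20) / 300 = 0.1

# A spread-out forecast centred on the same points, for comparison.
spread = [[110 + 20 * (q - 0.5) for q in levels], [180 + 40 * (q - 0.5) for q in levels]]
score_spread = scaled_crps(actuals, spread, levels)
print(score_point, score_spread)
```

The point-like case recovers the scaled absolute error, which is the "CRPS equals MAE for a point mass" property stated in this entry.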
title: Time-series cross-validation strategies
tags: [evaluation, cross-validation, methodology]
applies_to: [tier_1, tier_2, tier_3]
data_needs: "Any length; minimum train set should support the model's parameter count"
status: candidate
Time-series cross-validation strategies
Source: Hyndman & Athanasopoulos, fpp3 ch. 5.10
Link: https://otexts.com/fpp3/tscv.html
Retrieved: 2026-05-15
What it is: Rolling-origin evaluation. Pick an initial train length; produce an h-step forecast; record the error; advance the forecast origin by one observation (or by step observations), re-fit, re-forecast. This yields many error observations from a single series, each using only data before its origin, so there is no leakage. Two flavors: expanding window (train grows with each origin, classic for stationary processes) and sliding window (fixed train length, classic when older data is irrelevant due to regime change).
When to use:
- Validating any new forecasting method against a baseline before deployment.
- Picking hyperparameters (e.g., the SMAPE threshold at processing.py:2054, the K in Fourier regression) without overfitting to one holdout.
- Reporting an honest accuracy distribution rather than a single split.
Fit for our model:
- ✅ Should be the default evaluation harness for any change to the ensemble at processing.py:1984. Currently the SMAPE gate and Pearson override at processing.py:2054 use a single 12-month holdout; rolling-origin with h=12, step=3 over 4 origins would give 4× the error observations from the same data, and the only extra cost is the 4 re-fits per series. Because the test windows overlap (step < h), the folds are correlated rather than independent.
- ✅ Inputs cleanly into OWA reporting: aggregate per-tier across all rolling-origin folds.
- ⚠ For short-history keywords (Tier 1/2 at processing.py:1247), you may have only one or zero usable origins; report sample sizes alongside the metric.
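- 🔧 statsforecast.StatsForecast.cross_validation(h=12, step_size=3, n_windows=4) does this in one call across a panel of series; sktime's ExpandingWindowSplitter is the generic equivalent (documented here under sktime.forecasting.model_selection; recent sktime releases appear to expose it under sktime.split, so check the installed version).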
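The rolling-origin procedure can be sketched as an index generator. This is a hypothetical helper, not the statsforecast or sktime API; it places the last test window at the end of the series and walks earlier origins back by `step`, which is one reasonable layout among several.

```python
def rolling_origin_splits(n_obs, h, step, n_windows, window=None, min_train=1):
    """Yield (train_range, test_range) index pairs, oldest origin first.

    window=None  -> expanding train set starting at index 0.
    window=k     -> sliding train set of exactly k observations.
    Origins with fewer than min_train training points are skipped.
    """
    for w in range(n_windows):
        test_end = n_obs - (n_windows - 1 - w) * step
        origin = test_end - h  # first index of the test window
        train_start = 0 if window is None else origin - window
        if train_start < 0 or origin - train_start < min_train:
            continue  # not enough history for this origin
        yield range(train_start, origin), range(origin, test_end)


# 48 months of history, h=12, step=3, 4 origins (expanding window).
expanding = list(rolling_origin_splits(48, h=12, step=3, n_windows=4))
for train, test in expanding:
    print(len(train), (test.start, test.stop))

# Same origins with a fixed 24-month training window.
sliding = list(rolling_origin_splits(48, h=12, step=3, n_windows=4, window=24))

# Short history: 24 months and a 12-month minimum train leave one usable origin.
short = list(rolling_origin_splits(24, h=12, step=3, n_windows=4, min_train=12))
print(len(short))
```

Counting the yielded folds per series gives the sample size that should be reported next to any aggregated metric, which matters most for the short-history case.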
title: Soft-DTW and shape-aware losses
tags: [evaluation, shape, differentiable, training-loss]
applies_to: [tier_2, tier_3]
data_needs: "Two series of equal length; gamma hyperparameter to tune smoothness"
status: candidate
Soft-DTW and shape-aware losses
Source: Cuturi & Blondel 2017, Soft-DTW: a Differentiable Loss Function for Time-Series (ICML)
Link: https://tslearn.readthedocs.io/en/stable/gen_modules/metrics/tslearn.metrics.soft_dtw.html
Retrieved: 2026-05-15
What it is: A smoothed, differentiable version of DTW that replaces the min over alignments with a soft-min (log-sum-exp) controlled by a temperature γ. As γ → 0, soft-DTW recovers classical DTW; as γ grows, the loss becomes smoother and gradients become non-degenerate. The differentiability is what makes it usable as a training loss for neural forecasters or as a smooth metric whose gradient w.r.t. forecast values is well-defined for optimization.
When to use:
- Training a neural forecaster (RNN, N-BEATS, TFT) where you want the loss to reward shape match rather than pointwise MSE.
- Optimizing a hyperparameter (e.g., a smoothing window) via gradient descent on a held-out series.
- Computing a smoother, less brittle shape-similarity score than hard DTW.
Fit for our model:
- ⚠ Our current ensemble at processing.py:1984 is non-neural; soft-DTW as a training loss only becomes relevant if/when we adopt a deep model (see methods/modern_ml.md).
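- ✅ As an evaluation metric (replacing or augmenting the gate at processing.py:2054), soft-DTW with moderate γ varies more smoothly than hard DTW when many off-peak months have identical zero-ish values and the optimal alignment is not unique. Note that soft-DTW is not a true distance: it can be negative and is generally nonzero for two identical series, so any threshold has to be calibrated for the chosen γ.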
- 🔧 tslearn.metrics.soft_dtw(s1, s2, gamma=1.0) or tslearn.metrics.SoftDTWLossPyTorch for training. The aeon toolkit (a fork of sktime) exposes shape-based distances via aeon.distances as well.
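For intuition about the soft-min recursion, here is a minimal pure-Python sketch. It is not the tslearn implementation; the squared-difference local cost follows the usual formulation, and the short series are invented.

```python
import math


def soft_min(values, gamma):
    """-gamma * log(sum(exp(-v / gamma))), computed stably; +inf entries are ignored."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    lo = min(finite)
    return lo - gamma * math.log(sum(math.exp(-(v - lo) / gamma) for v in finite))


def soft_dtw(s1, s2, gamma=1.0):
    """Soft-DTW value with squared-difference local cost."""
    n, m = len(s1), len(s2)
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (s1[i - 1] - s2[j - 1]) ** 2
            r[i][j] = d + soft_min([r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma)
    return r[n][m]


actual = [0, 0, 1, 0]
early_peak = [0, 1, 0, 0]

# Small gamma: close to hard DTW, which is 0 for this one-step shift.
near_hard = soft_dtw(actual, early_peak, gamma=1e-3)

# Moderate gamma on two identical series: the value is negative, not 0,
# because every low-cost alignment contributes to the soft-min.
self_score = soft_dtw(actual, actual, gamma=1.0)
print(near_hard, self_score)
```

The second value being below zero is why a raw soft-DTW score cannot be read as a distance without calibrating against the chosen γ.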