Multi-Source Blending¶

Methods for principled combination of our four noisy, partial sources — JS (Jumpshot clickstream, stale), GSC (Google Search Console impressions), GT (Google Trends 0-100 relative), GKP (Google Keyword Planner volume) — to replace the naive max(JS, GSC) per-month rule at processing.py:1763.

title: Kalman filter / Unobserved Components Model (UCM) tags: [blending, state-space, kalman, latent-state] applies_to: [tier_1, tier_2, tier_3] data_needs: "Multiple noisy observation streams; some calibration of per-source observation noise variance." status: candidate

Kalman filter / Unobserved Components Model (UCM)¶

Source: Durbin, J. & Koopman, S. J., "Time Series Analysis by State Space Methods" (2nd ed., 2012) Link: https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.structural.UnobservedComponents.html Retrieved: 2026-05-15

What it is: Model the true monthly volume as a latent state μ_t that evolves smoothly over time (e.g., local level + seasonal + drift). Each source — JS, GSC, GT, GKP — is treated as a noisy observation y_i,t = α_i · μ_t + ε_i,t where ε_i,t ~ N(0, σ_i²). The Kalman filter recursively combines all available observations, weighting each inversely to its noise variance and to its calibration α_i, and produces a smoothed posterior over the latent volume. Naturally handles missing months (skip the observation update for that source) and propagates uncertainty.

When to use: - You have multiple sources observing the same underlying quantity with different noise characteristics. - You need missing-data handling and uncertainty propagation out of the box. - The latent series has structural components (trend, seasonality) that should be modelled jointly.

Fit for our model: - ✅ Best principled replacement for the max(JS, GSC) blend at processing.py:1763. JS-as-stale, GSC-with-bot-noise, GT-as-relative, GKP-as-banded all fit naturally as different observation equations. - ✅ Calibrated σ_i per source maps directly onto the data_freshness / coverage subscores at processing.py:1041 — confidence becomes the posterior variance of μ_t. - ✅ Plays well with the completeness-weighted current-month at processing.py:1787 (alpha = completeness²): low-completeness observations get higher σ_i and contribute less. - ⚠ Needs α_i calibration — GT scale isn't comparable to JS/GSC; fit α per keyword or per cluster as part of model setup. - ⚠ Gaussian noise is an approximation; log-transform first (log(1+y)) to handle the heavy right tail and zero-inflation. - 🔧 Python: statsmodels.tsa.statespace.structural.UnobservedComponents(endog=multivariate, ...) — multivariate version supports K observation series. For more flexibility (heteroscedastic noise, non-Gaussian), step up to Dynamic linear models with multiple observations in PyMC.

title: Bayesian sensor fusion tags: [blending, bayesian, sensor-fusion, posterior-update] applies_to: [tier_2, tier_3] data_needs: "Per-source likelihood + a prior; benefits from a calibration set." status: candidate

Bayesian sensor fusion¶

Source: Conceptual framework — see Bishop, "Pattern Recognition and Machine Learning" §3.3, and Koks & Challa, "An Introduction to Bayesian and Dempster-Shafer Data Fusion" (2003). PyMC implementation reference: PyMC Discourse "Bayesian data fusion" threads. Link: https://discourse.pymc.io/t/how-to-create-bayesian-data-fusion-in-python-with-pymc3/10558 Retrieved: 2026-05-15

What it is: Place a prior p(μ) on the true monthly volume μ, then sequentially update with each source's likelihood p(y_i | μ). Because likelihoods multiply, the posterior naturally combines sources, weighting each by its precision (inverse variance). When sources are mutually independent, the order of update doesn't matter; conjugate priors (Normal-Normal) give closed-form updates equivalent to inverse-variance weighting; non-conjugate cases need MCMC / variational inference. The principled generalization of Bates-Granger, with first-class uncertainty.

When to use: - Sources have heterogeneous noise distributions (some Gaussian, some Poisson, some heavy-tailed). - You want a posterior over volume, not a point estimate — useful for confidence intervals. - You have prior knowledge worth encoding (e.g., similar-keyword empirical Bayes priors).

Fit for our model: - ✅ A natural generalization of the multi-source blend at processing.py:1763 that handles GT's non-Gaussian noise, GKP's banded values, and GSC's zero-inflation with different likelihoods. - ✅ Posterior variance feeds the agreement subscore at processing.py:1041 directly — high disagreement between sources widens the posterior, lowering confidence. - ✅ Stale JS (presence-only signal) can enter as a weak prior rather than a noisy observation, matching the existing freshness logic. - ⚠ Full Bayesian inference per keyword is expensive — use closed-form Normal-Normal where possible and only escalate to MCMC for diagnostic / top-N validation. - ⚠ Risk of double-counting if sources are correlated (GT and GSC both reflect search demand); model correlation explicitly or down-weight. - 🔧 Python: closed-form Normal-Normal in NumPy for production; pymc for prototyping and diagnostic comparison.

title: Bates-Granger optimal forecast combination tags: [blending, combination, inverse-variance, classical] applies_to: [tier_1, tier_2, tier_3] data_needs: "Out-of-sample error history for each source/forecast (≥ a few months)." status: candidate

Bates-Granger optimal forecast combination¶

Source: Bates, J. M. & Granger, C. W. J., "The Combination of Forecasts" (Operational Research Quarterly, 1969) Link: https://www.jstor.org/stable/3008764 Retrieved: 2026-05-15

What it is: Combine K forecasts (or K source-observations) with weights w_i ∝ 1/σ_i², where σ_i² is the historical mean-squared error of source i. Weights sum to one. Under Gaussian, independent errors this is the minimum-variance unbiased combination — the foundational result for forecast combination. Half a century of subsequent work (Granger-Ramanathan, Diebold-Pauly, Newbold-Granger…) extends this with bias correction, shrinkage, and covariance terms.

When to use: - Cheap, fast, transparent: drop-in replacement for hand-tuned weights. - You have a validation history per source/forecast to estimate σ_i². - A safe default before investing in something heavier like Kalman/UCM.

Fit for our model: - ✅ Minimum viable upgrade to processing.py:1763: instead of max(JS, GSC), weight each source by inverse of its historical squared error against a hold-out target. Requires only an offline calibration pass per source (or per source × tier). - ✅ Maps neatly onto the source-quality intuition already encoded across the freshness and coverage subscores in compute_confidence_score() (processing.py:1041). - ✅ A natural building block: use Bates-Granger weights as the point estimate, then bolt on Conformal intervals. - ⚠ Classical caveat — "the optimal weights of Bates-Granger do not appear to work well in practice" (Smith & Wallis 2009) when error variances are estimated rather than known; consider shrinkage to equal weights for short calibration windows. - ⚠ Ignores correlations between sources; if GT and GSC have correlated errors, inverse-variance overweights them jointly. Use MinT or the full covariance form. - 🔧 Python: trivial in NumPy (10 lines). For a packaged version with shrinkage, see R's ForecastComb (no first-class Python port; reproduce manually).

title: MinT (Minimum Trace) hierarchical reconciliation tags: [blending, reconciliation, hierarchical, optimal] applies_to: [tier_2, tier_3] data_needs: "A hierarchy (e.g., keyword → country → global) and per-node forecasts; residual covariance matrix." status: candidate

MinT (Minimum Trace) hierarchical reconciliation¶

Source: Wickramasuriya, S. L., Athanasopoulos, G. & Hyndman, R. J., "Optimal Forecast Reconciliation for Hierarchical and Grouped Time Series Through Trace Minimization" (JASA, 2019) Link: https://robjhyndman.com/papers/mint.pdf Retrieved: 2026-05-15

What it is: When you have forecasts at multiple levels of a hierarchy (keyword totals, per-country, per-device), they generally won't add up — child forecasts don't sum to parents. MinT projects all base forecasts onto the space of coherent forecasts (those that satisfy the aggregation constraint) such that the trace of the reconciled forecast-error covariance matrix is minimised. The projection uses the residual covariance matrix W of the base forecasts; choices include OLS (W = I), WLS (diagonal W from variances), and MinT(shrink) (shrunk full covariance).

When to use: - Your forecasts have natural hierarchical or grouped structure that must be coherent. - You want to share strength across the hierarchy — top-level forecasts benefit from bottom-level structure and vice versa.

Fit for our model: - ⚠ Less directly applicable to the max(JS, GSC) problem at processing.py:1763 — MinT reconciles forecasts at different aggregations, not noisy observations of the same target. - ✅ But: if we reframe sources as different aggregations (GKP gives quarterly avg, GSC gives monthly impressions, JS gives historical totals) and reconcile across them, MinT could enforce that monthly blended values sum to the GKP quarterly value. Speculative. - ✅ Strong fit for cross-keyword aggregation — e.g., reconciling forecasts for python (head term) with python tutorial, python pandas etc. so they cohere with a topic-level total. - ⚠ Requires estimating the full residual covariance matrix W; MinT(shrink) (Schäfer-Strimmer shrinkage) is the practical default. - 🔧 Python: hierarchicalforecast (Nixtla) — from hierarchicalforecast.methods import MinTrace; MinTrace(method='mint_shrink'). Docs: https://nixtlaverse.nixtla.io/hierarchicalforecast/

title: OLS / WLS forecast combination tags: [blending, combination, regression, classical] applies_to: [tier_2, tier_3] data_needs: "Historical base forecasts and realised values for ≥ a few dozen periods." status: candidate

OLS / WLS forecast combination¶

Source: Hyndman & Athanasopoulos, "Forecasting: Principles and Practice" (3rd ed.) §13.4 (combinations); Granger & Ramanathan, "Improved methods of combining forecasts" (J. Forecasting, 1984). Link: https://otexts.com/fpp3/combinations.html Retrieved: 2026-05-15

What it is: Treat each base forecast (or per-source observation) as a regressor and the realised value as the target. Fit a linear regression y_t = β_0 + Σ β_i ŷ_i,t + ε_t over the calibration period. OLS gives equal weight to all residuals; WLS down-weights periods you trust less (e.g., older months, periods with known data issues). Adds a free intercept (bias correction) compared to Bates-Granger's inverse-variance recipe. With the constraint Σ β_i = 1 it reduces to a constrained combination weight; without, it can shrink poor sources toward zero entirely.

When to use: - You have enough out-of-sample history to estimate K+1 regression coefficients reliably. - Sources have systematic bias (not just variance) that you want to correct. - You want one of {OLS, constrained, WLS} as a baseline before fancier Bayesian methods.

Fit for our model: - ✅ Strong baseline to compare against the current max(JS, GSC) at processing.py:1763. Easy to implement and audit. - ✅ WLS variant naturally encodes recency (recent months weight higher), aligning with our freshness scoring (processing.py:1041). - ✅ Compatible with the held-out year already used at processing.py:1984 — that's exactly the calibration sample needed. - ⚠ Overfits with K sources and small training windows; regularize (Ridge / shrinkage to equal weights) for short histories. - ⚠ Doesn't propagate per-month uncertainty; pair with prediction intervals or step up to Kalman/UCM. - 🔧 Python: statsmodels.regression.linear_model.OLS / WLS; sklearn.linear_model.Ridge; or use the darts.models.RegressionEnsembleModel for an end-to-end version.

title: Stacking / meta-learners for forecast combination tags: [blending, ml, meta-learning, fforma] applies_to: [tier_2, tier_3] data_needs: "Large cross-keyword corpus with realised values and base-forecast errors; engineered series features." status: candidate

Stacking / meta-learners for forecast combination¶

Source: Montero-Manso, P., Athanasopoulos, G., Hyndman, R. J. & Talagala, T. S., "FFORMA: Feature-based forecast model averaging" (Int. J. Forecasting, 2020) Link: https://robjhyndman.com/publications/fforma/ Retrieved: 2026-05-15

What it is: Train a meta-learner (gradient-boosted tree, e.g., LightGBM/XGBoost) whose features are time-series features of the keyword series (entropy, trend strength, seasonal strength, ACF/PACF features…) and whose target is the weights over base models/sources that minimise the weighted forecast loss. At inference time, extract features for a new keyword and the meta-learner produces a custom weighting on the base set. FFORMA placed 2nd in the M4 competition with this design.

When to use: - You have a catalogue of base methods (we have JS, GSC, GT, GKP as sources, plus AutoCES, HW-add, HW-mul as forecast models) and want per-series weighting. - You have a large training set of series with realised values to learn the mapping from features → optimal weights. - You're willing to take on the ML pipeline (feature extraction, training, serving).

Fit for our model: - ✅ Conceptually the most powerful upgrade for both processing.py:1763 (source blend) and processing.py:1984 (model ensemble): one stacker decides source weights and model weights per keyword. - ✅ Per-tier weight choices already implied by calculate_hybrid_growth() (processing.py:1247) become learned, not hand-coded. - ✅ The cross-keyword corpus already produced per-shard could be reused as training data with minimal additional pipeline. - ⚠ Engineering complexity (training pipeline, drift monitoring, weights versioning) — not a quick win. - ⚠ Risk of overfitting to historical regimes; train on rolling windows and revalidate. - ⚠ Black-box vs the hand-tuned weights — audit harder; pair with SHAP for explanations. - 🔧 Python: tsfeatures (Nixtla) for features + LightGBM / XGBoost for the meta-model. There's a fforma reference implementation by the authors; the Nixtla mlforecast ecosystem covers the engineering side.

title: Particle filtering for non-Gaussian state space tags: [blending, state-space, particle-filter, sequential-monte-carlo] applies_to: [tier_3] data_needs: "Likelihood / transition model per source; computational budget for ≥1000 particles per keyword." status: candidate

Particle filtering for non-Gaussian state space¶

Source: Doucet, A., de Freitas, N. & Gordon, N., "Sequential Monte Carlo Methods in Practice" (Springer, 2001) Link: https://www.stats.ox.ac.uk/~doucet/smc_resources.html Retrieved: 2026-05-15

What it is: The Kalman filter's generalisation when the state transition and/or observation equations are non-linear or non-Gaussian. Represent the posterior over the latent state by a population of N weighted particles (Monte Carlo samples). At each step: propagate particles forward through the transition, weight them by the observation likelihood, and resample to avoid weight degeneracy. As N → ∞ this approximates the true posterior. Handles arbitrary noise distributions, multimodal posteriors, and discrete states.

When to use: - Your observation noise is clearly non-Gaussian — zero-inflated counts, heavy tails, discrete buckets (like GKP's bands). - Multiple plausible regimes (post-shift, pre-shift) need to be tracked simultaneously rather than collapsed to a single mean. - You can afford O(N · T) computation per keyword.

Fit for our model: - ✅ Could model GKP-bands as a discrete observation likelihood, GSC as Poisson, GT as Beta-on-[0,100], JS as a stale prior — all simultaneously feeding into the same latent volume. Strictly more expressive than Kalman/UCM. - ⚠ 200-shard daily processing makes per-keyword particle filtering expensive — likely viable only for top-N high-value keywords or as offline backfills. - ⚠ Particle degeneracy and resampling noise need careful handling. - 🔧 Python: particles (Chopin et al.), pyfilter, or roll your own — the algorithm is ~50 lines. For multi-output state space with non-Gaussian obs, also see numpyro's SMC sampler.

title: Dynamic linear models with multiple observations tags: [blending, bayesian, state-space, dlm] applies_to: [tier_2, tier_3] data_needs: "Multiple observation streams; flexible noise/structure specification." status: candidate

Dynamic linear models with multiple observations¶

Source: West, M. & Harrison, J., "Bayesian Forecasting and Dynamic Models" (2nd ed., Springer, 1997) Link: https://link.springer.com/book/10.1007/b98971 Retrieved: 2026-05-15

What it is: A general class of Bayesian state-space models. The latent state evolves as θ_t = G_t θ_{t-1} + w_t with w_t ~ N(0, W_t); observations come from y_t = F_t θ_t + v_t with v_t ~ N(0, V_t). The matrices F_t, G_t, W_t, V_t can be time-varying (which generalises the Kalman filter) and easily support multiple observation equations stacked into F_t. West & Harrison's textbook formalism makes it natural to add components (trend, seasonal, regression) as building blocks and to do discount-based dynamic updating without specifying all variances. Modern Python re-incarnations (PyMC, Orbit by Uber) make this practical.

When to use: - You want full Bayesian inference with custom structure per keyword (trend + seasonal + regression on covariates). - You want time-varying noise — high-noise periods (GT collapse, GSC outage) should automatically receive less weight. - You're prototyping; the DLM formalism is the most flexible state-space framework.

Fit for our model: - ✅ Strict generalisation of Kalman/UCM (processing.py:1763 blend candidate) — start with UCM in statsmodels, escalate to DLM if you need bespoke structure per tier. - ✅ Discount factor parameterisation of West-Harrison is a clean way to down-weight stale history (e.g., pre-Google-update months); replaces brittle "is older than 12 months" cutoffs in the freshness subscore (processing.py:1041). - ✅ The framework integrates with the prior-trend shadow / frozen months logic at processing.py:2122 — frozen months become tight-prior observations. - ⚠ Heavier than Kalman/UCM; only go here if UCM is genuinely insufficient. - ⚠ Orbit and PyMC each have learning curves; not a drop-in replacement. - 🔧 Python: pymc (full Bayesian, custom likelihoods, pymc.GaussianRandomWalk building blocks), uber/orbit (DLM-based with a high-level API focused on marketing time series), pydlm (West-Harrison closer to the book, less actively maintained).