GSC (Google Search Console)

Code: kvprocessor.cpp: PullFreshGSCData() / ProcessGSCData(). Per-keyword validity gate: processing.py:KB-ANCHOR:is-legit-impressions-3mo. Last validated: 2026-05-21

What it carries

Per-keyword × domain × device × search-type × date — impressions, clicks, CTR, position. The richest and freshest of the four raw sources; the only one with intra-day granularity and per-domain/device/search-type breakdowns. Everything customer-facing that says "monthly volume" or "trend" leans on GSC as the primary signal once a keyword has enough history.

Where it lives and how it gets here

The GSC daemon in backend/megaindex/gsc/ is an OCaml daemon that fetches data through direct GSC API calls, not from S3 (see memory). It writes into gsc.gsc_main3_local and gsc.gsc_bykeyword_local on every laksa shard. Materialised views aggregate those into gsc.aggregated_keywords_local (per-shard) and then re-shard by keyword hash into gsc.aggregated_keywords_global (the lookup table kvprocessor reads). kvprocessor.cpp pulls from aggregated_keywords_global on each peer shard, deduplicates, rolls up to monthly, and writes the 16 columns above into keywords.keywords_data_local.
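The dedup and monthly roll-up steps can be illustrated with a minimal Python sketch. This is not the kvprocessor.cpp implementation: the GscRow shape, the dedup key, the "last row wins" tie-break, and the choice to sum impressions and clicks are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GscRow:
    # Hypothetical row shape; field names are illustrative, not the real schema.
    keyword: str
    domain: str
    device: str
    search_type: str
    day: date
    impressions: int
    clicks: int


def dedup_rows(rows):
    """Keep one row per (keyword, domain, device, search_type, day).

    Assumption: when the same key arrives from several peer shards, the
    last row seen wins. The real tie-break rule is not documented here.
    """
    by_key = {}
    for row in rows:
        key = (row.keyword, row.domain, row.device, row.search_type, row.day)
        by_key[key] = row
    return list(by_key.values())


def monthly_rollup(rows):
    """Sum impressions and clicks per (keyword, 'YYYY-MM') after dedup."""
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for row in dedup_rows(rows):
        month = row.day.strftime("%Y-%m")
        totals[(row.keyword, month)]["impressions"] += row.impressions
        totals[(row.keyword, month)]["clicks"] += row.clicks
    return dict(totals)
```

Deduplicating before summing matters here: a row that arrives from two peer shards would otherwise be counted twice in the monthly total.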

The freshness filter on aggregated_keywords_global is metrics_imported_at_max > DATE_SUB(curdate(), 45) — a 45-day window.
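As a rough illustration of what that predicate keeps and drops, here is a Python equivalent. It assumes the comparison happens at date granularity and that "today" is passed in explicitly; the function name is_fresh is hypothetical and does not exist in the codebase.

```python
from datetime import date, timedelta

FRESHNESS_WINDOW_DAYS = 45  # the window size stated in the filter


def is_fresh(metrics_imported_at_max: date, today: date) -> bool:
    """Approximate mirror of `metrics_imported_at_max > DATE_SUB(curdate(), 45)`.

    Assumption: date-level comparison. The inequality is strict, so a row
    last imported exactly 45 days ago is excluded.
    """
    cutoff = today - timedelta(days=FRESHNESS_WINDOW_DAYS)
    return metrics_imported_at_max > cutoff
```

Because the inequality is strict, the boundary day itself falls outside the window, which is worth remembering when a keyword seems to vanish one day earlier than expected.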

Why the 45-day window

The 45-day window is an operational performance knob, not a quality threshold. Older rows are filtered out at source-table read time to keep the cross-shard pull cheap enough that kvprocessor can complete in the hourly loop without blowing out the laksa server workers. It is not expressing "data older than 45 days is wrong". Keep this distinction in mind when debugging "where did my old data go?" — older GSC history is still in the upstream gsc_main3_local raw table, just not in the kvprocessor read path.

Validity gates in processing.py

  • Single-month staleness — a keyword whose GSC data is single-month and >3 months old is treated as not-present (processing.py:KB-ANCHOR:is-legit-impressions-3mo).
  • Spike corroboration — uncorroborated GSC impression spikes (e.g. bot-driven surges) are repaired in-place using GT/GKP as the truth signal. See the Decisions section for the spike-detection rules.
  • Outer gate downstream — the customer-facing trend display gate processed_keyword_volume > 200 (memory: project_prod_trend_display_gate) is computed after the GSC blend, so unhealthy GSC doesn't sneak through.
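The single-month staleness rule in the first bullet can be sketched as below. This is a guess at the shape of the check, not the code behind the KB anchor: the function names, the (year, month) input format, measuring age in whole calendar months, and treating "no data at all" as not-present are all assumptions.

```python
from datetime import date


def months_between(older, today: date) -> int:
    """Whole calendar months from an (year, month) tuple to today's month."""
    year, month = older
    return (today.year - year) * 12 + (today.month - month)


def has_legit_gsc_history(months_with_data, today: date, max_age_months: int = 3) -> bool:
    """Return False when GSC data should be treated as not-present.

    months_with_data: iterable of (year, month) tuples that have impressions.
    Assumption: a keyword with exactly one month of data that is more than
    `max_age_months` old fails the gate; multi-month history always passes.
    """
    distinct = set(months_with_data)
    if not distinct:
        return False  # assumption: no data at all counts as not-present
    if len(distinct) == 1:
        only_month = next(iter(distinct))
        return months_between(only_month, today) <= max_age_months
    return True
```

For example, with today set to 2026-05-21, a keyword whose only data month is January 2026 fails the gate under these assumptions, while one whose only data month is February 2026 passes.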

See also

Reference docs (external):

Internal:

  • Central hub table — where the columns above land (keywords.keywords_data_local)
  • kvprocessor — the program that pulls GSC
  • Decisions — bot/spike detection, freshness gates
  • gsc-debugging Claude skill — how to drill into GSC issues for a specific keyword
  • Column-level reference — TBD: a future per-column page lives under ../fields/