Architecture

Status: placeholder. The detailed prose for each stage will be filled in during Phase 2, alongside the source / process / field nodes. The high-level Mermaid diagram below is a working draft based on the exploration audit; names and edges will be refined as Phase 2 lands.

End-to-end view of the keyword-volume pipeline, from raw external feeds to customer-visible metrics.

Pipeline at a glance

flowchart LR
  subgraph Sources["External data sources"]
    GSC[GSC]
    GT[Google Trends]
    GKP[GKP]
    JS[Jumpshot]
  end

  subgraph Ingest["Stage 1 · Ingestion"]
    KVP[kvprocessor]
  end

  CENTRAL[(keywords_data_local<br/>200 shards × 256 ranges)]

  subgraph Process["Stage 2 · Processing"]
    PROC[processing.py]
  end

  subgraph PE["Stage 3 · PE update"]
    PEU[pe_update.py]
  end

  subgraph Upload["Stage 4 · Upload"]
    UCH[CH metrics]
    UES[Elasticsearch]
  end

  subgraph Consumers["Downstream consumers"]
    API[Customer API]
    DEMO[Demo dashboard]
    PESYS[Position Explorer]
  end

  subgraph Refresh["Refresh extraction"]
    EGT[extract_gt]
    EGKP[extract_gkp]
  end

  GSC --> KVP
  GT --> KVP
  GKP --> KVP
  JS --> KVP
  KVP --> CENTRAL
  CENTRAL <--> PROC
  CENTRAL <--> PEU
  CENTRAL --> UCH
  CENTRAL --> UES
  UES --> API
  CENTRAL --> DEMO
  UCH --> PESYS

  CENTRAL -.signals.-> EGT
  CENTRAL -.signals.-> EGKP
  EGT -.refresh.-> CENTRAL
  EGKP -.refresh.-> CENTRAL

Legend

Node	What it carries / does
GSC	Google Search Console — impressions, clicks, CTR, position per keyword × domain × device × day
Google Trends	Relative trend index 0–100 by month, since 2014
GKP	Google Keyword Planner — monthly volume + CPC + top-of-page bids
Jumpshot	One-time historical volume + organic click %; stale — never refreshes
kvprocessor	C++ ingestion: pulls from each source, dedups, rolls up to monthly, writes raw arrays
keywords_data_local	The hub: 200 laksa shards × 256 ranges; raw `_array_store` arrays and derived `processed_*` columns both live here
processing.py	Slurm job (one Python process per shard) — blend, smooth, classify trend type, forecast, score confidence
pe_update.py	Decides the top/bottom 400k keywords for Position Explorer tracking
CH metrics / Elasticsearch	Stage-4 uploads of the `processed_*` columns to downstream stores
extract_gt / extract_gkp	Signal-driven refresh extractors (independent cadence) — see refresh logic
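
The edges in the diagram can also be written down as a plain adjacency map, which makes questions such as "which consumers does a given source eventually feed?" mechanical to answer. The following is a minimal sketch: the node labels are copied from the diagram, while the `EDGES` dict, the `CONSUMERS` set, and the `reachable` helper are illustrative names that do not exist in the pipeline's code. The bidirectional edges between the central table and processing.py / pe_update.py are listed in both directions.

```python
from collections import deque

# Directed edges transcribed from the flowchart (solid and dotted alike).
# This dict is an illustration, not something the pipeline itself defines.
EDGES = {
    "GSC": ["kvprocessor"],
    "Google Trends": ["kvprocessor"],
    "GKP": ["kvprocessor"],
    "Jumpshot": ["kvprocessor"],
    "kvprocessor": ["keywords_data_local"],
    "keywords_data_local": [
        "processing.py", "pe_update.py", "CH metrics", "Elasticsearch",
        "Demo dashboard", "extract_gt", "extract_gkp",
    ],
    "processing.py": ["keywords_data_local"],
    "pe_update.py": ["keywords_data_local"],
    "CH metrics": ["Position Explorer"],
    "Elasticsearch": ["Customer API"],
    "extract_gt": ["keywords_data_local"],
    "extract_gkp": ["keywords_data_local"],
}

# The three downstream consumers named in the diagram.
CONSUMERS = {"Customer API", "Demo dashboard", "Position Explorer"}


def reachable(start):
    """Return every node reachable from `start` via breadth-first search."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For example, `reachable("Jumpshot") & CONSUMERS` contains all three consumers, because every source passes through kvprocessor and the central table before fanning out.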

The four stages

1. Ingestion (kvprocessor.cpp): pulls fresh data from each remote ClickHouse host (isog, laksa) and consolidates it into the central table keywords.keywords_data_local, writing raw time-series arrays per source.
2. Processing (processing.py): runs as a Slurm job across the 200 laksa shards. It reads the raw arrays, computes derived metrics (volume blend, trend merge, growth classification, forecast, confidence), and writes the processed_* columns back to the same table.
3. Position Explorer update (pe_update.py): decides which keywords belong in PE tracking, based on the freshly processed volume and growth.
4. Backend upload (upload_backend.sh): distributes the processed columns to the customer-facing ClickHouse (keywords_metrics_local) and to the Elasticsearch indices.
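
For a sense of scale, the 200 shards × 256 ranges layout of the central table implies 51,200 (shard, range) cells. The sketch below enumerates them and groups them per shard, mirroring the one-process-per-shard arrangement described for processing.py. It assumes that ranges are nested inside each shard and that both are zero-indexed; the source states neither, and `work_units` and `units_for_shard` are hypothetical names used only for illustration.

```python
from itertools import product

N_SHARDS = 200   # from the diagram: 200 laksa shards
N_RANGES = 256   # from the diagram: 256 ranges (assumed to be per shard)


def work_units():
    """Yield every (shard, range) pair; zero-based indexing is an assumption."""
    return product(range(N_SHARDS), range(N_RANGES))


def units_for_shard(shard):
    """Return the (shard, range) pairs a single per-shard process would cover."""
    if not 0 <= shard < N_SHARDS:
        raise ValueError(f"shard must be in [0, {N_SHARDS}), got {shard}")
    return [(shard, r) for r in range(N_RANGES)]


# Count the cells of the full grid: 200 * 256.
total = sum(1 for _ in work_units())
```

Under these assumptions each per-shard process is responsible for 256 of the 51,200 cells.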

Refresh extraction (extract_gt.py, extract_gkp.py) runs on a separate cadence, driven by signals stored in processed_volume_trend_meta.

Orchestration

The four stages are chained hourly by keep_running.sh, which runs in the screen session process_keyword on data@isog. Refresh extraction runs independently, on a daily/hourly schedule, via extract_gkp_gt.sh.
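
The contents of keep_running.sh are not reproduced on this page. As a rough Python sketch of the general pattern (run the stages in a fixed order and stop at the first failure), the block below assumes that a failed stage should block the later ones, which the source does not confirm. The `run_chain` function and the stage callables are stand-ins, not the real entry points.

```python
from typing import Callable, List, Tuple

# (name, function returning True on success); a hypothetical shape for a stage.
Stage = Tuple[str, Callable[[], bool]]


def run_chain(stages: List[Stage]) -> List[str]:
    """Run stages in order and return the names of those that succeeded.

    Stops at the first stage that reports failure, so later stages do not
    run on incomplete output. This stop-on-failure policy is an assumption.
    """
    completed = []
    for name, run in stages:
        if not run():
            break
        completed.append(name)
    return completed


# Stand-in stages named after the four pipeline stages; the lambdas are placeholders.
stages = [
    ("ingestion", lambda: True),
    ("processing", lambda: True),
    ("pe_update", lambda: False),  # simulate a failure in stage 3
    ("upload", lambda: True),
]
result = run_chain(stages)
```

With the simulated failure in stage 3, `result` holds only the first two stage names and the upload stage never runs.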

Where to dig deeper

Each node in the diagram has its own page (or will, once Phase 2 lands):

Sources → System / Sources in the sidebar
The central table → System / Tables
Each script/stage → System / Processes
Each processed_* column → System / Fields
Each algorithmic choice (~28 of them) → System / Decisions
Each customer-facing metric → System / Metrics
Each downstream consumer → System / Consumers

For symptom-first debugging ("why is this keyword's volume 0?"), see Problems.