Architecture
Status: placeholder. The detailed prose for each stage will be filled in during Phase 2, alongside the source, process, and field nodes. The high-level Mermaid diagram below is a working draft based on the exploration audit; names and edges will be refined as Phase 2 lands.
End-to-end view of the keyword-volume pipeline, from raw external feeds to customer-visible metrics.
Pipeline at a glance
flowchart LR
subgraph Sources["External data sources"]
GSC[GSC]
GT[Google Trends]
GKP[GKP]
JS[Jumpshot]
end
subgraph Ingest["Stage 1 · Ingestion"]
KVP[kvprocessor]
end
CENTRAL[(keywords_data_local<br/>200 shards × 256 ranges)]
subgraph Process["Stage 2 · Processing"]
PROC[processing.py]
end
subgraph PE["Stage 3 · PE update"]
PEU[pe_update.py]
end
subgraph Upload["Stage 4 · Upload"]
UCH[CH metrics]
UES[Elasticsearch]
end
subgraph Consumers["Downstream consumers"]
API[Customer API]
DEMO[Demo dashboard]
PESYS[Position Explorer]
end
subgraph Refresh["Refresh extraction"]
EGT[extract_gt]
EGKP[extract_gkp]
end
GSC --> KVP
GT --> KVP
GKP --> KVP
JS --> KVP
KVP --> CENTRAL
CENTRAL <--> PROC
CENTRAL <--> PEU
CENTRAL --> UCH
CENTRAL --> UES
UES --> API
CENTRAL --> DEMO
UCH --> PESYS
CENTRAL -.signals.-> EGT
CENTRAL -.signals.-> EGKP
EGT -.refresh.-> CENTRAL
EGKP -.refresh.-> CENTRAL
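For questions such as "which consumers are affected if one feed breaks?", the solid edges of the draft diagram can be encoded as an adjacency map and walked. The following is a minimal Python sketch, not part of the pipeline code: the node IDs are the ones used in the diagram, while `EDGES`, `reachable`, and `consumers_fed_by` are illustrative names. The dashed refresh edges are left out, and each bidirectional hub edge is recorded as two directed edges.

```python
from collections import deque

# Solid edges copied from the draft diagram. A bidirectional edge
# (for example CENTRAL <--> PROC) is stored as two directed edges.
EDGES = {
    "GSC": ["KVP"],
    "GT": ["KVP"],
    "GKP": ["KVP"],
    "JS": ["KVP"],
    "KVP": ["CENTRAL"],
    "CENTRAL": ["PROC", "PEU", "UCH", "UES", "DEMO"],
    "PROC": ["CENTRAL"],
    "PEU": ["CENTRAL"],
    "UCH": ["PESYS"],
    "UES": ["API"],
}

# The "Downstream consumers" subgraph of the diagram.
CONSUMERS = {"API", "DEMO", "PESYS"}


def reachable(start):
    """Return every node reachable from `start`, excluding `start` itself."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in EDGES.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def consumers_fed_by(node):
    """Return the downstream consumers that depend on `node`."""
    return reachable(node) & CONSUMERS


print(sorted(consumers_fed_by("JS")))   # ['API', 'DEMO', 'PESYS']
print(sorted(consumers_fed_by("UES")))  # ['API']
```

Because every source funnels through the central table, each feed reaches all three consumers in this encoding; the map becomes more useful once Phase 2 refines the edges.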
Legend
| Node | What it carries / does |
|---|---|
| GSC | Google Search Console — impressions, clicks, CTR, position per keyword × domain × device × day |
| Google Trends | Relative trend index 0–100 by month, since 2014 |
| GKP | Google Keyword Planner — monthly volume + CPC + top-of-page bids |
| Jumpshot | One-time historical volume + organic click %; stale — never refreshes |
| kvprocessor | C++ ingestion: pulls from each source, dedups, rolls up to monthly, writes raw arrays |
| keywords_data_local | The hub — 200 laksa shards × 256 ranges; raw *_array_store arrays and derived processed_* columns live here |
| processing.py | Slurm job (one Python process per shard) — blend, smooth, classify trend type, forecast, score confidence |
| pe_update.py | Decides the top/bottom 400k keywords for Position Explorer tracking |
| CH metrics / Elasticsearch | Stage-4 uploads of the processed_* columns to downstream stores |
| extract_gt / extract_gkp | Signal-driven refresh extractors (independent cadence) — see refresh logic |
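The legend describes kvprocessor as deduplicating rows and rolling them up to monthly granularity. The real implementation is C++ and is not shown on this page; the Python sketch below only illustrates the idea under stated assumptions: daily rows are keyed by (keyword, day), the last duplicate wins, and the monthly value is a plain sum. The row layout and the `rollup_monthly` name are hypothetical.

```python
from collections import defaultdict
from datetime import date


def rollup_monthly(rows):
    """rows: iterable of (keyword, day: date, impressions: int).

    Returns {keyword: {(year, month): total_impressions}}.
    """
    # Dedup: keep one value per (keyword, day); the last row seen wins
    # (an assumption, the real dedup rule is not documented here).
    latest = {}
    for keyword, day, impressions in rows:
        latest[(keyword, day)] = impressions

    # Roll up: sum the deduplicated daily values into calendar months.
    monthly = defaultdict(lambda: defaultdict(int))
    for (keyword, day), impressions in latest.items():
        monthly[keyword][(day.year, day.month)] += impressions
    return {kw: dict(months) for kw, months in monthly.items()}


# Made-up sample rows, for illustration only.
rows = [
    ("shoes", date(2024, 1, 5), 10),
    ("shoes", date(2024, 1, 5), 12),  # duplicate day, replaces the 10
    ("shoes", date(2024, 1, 20), 3),
    ("shoes", date(2024, 2, 1), 7),
]
print(rollup_monthly(rows))  # {'shoes': {(2024, 1): 15, (2024, 2): 7}}
```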
The four stages
- Ingestion (kvprocessor.cpp): pulls fresh data from each remote ClickHouse host (isog, laksa) and consolidates it into the central table keywords.keywords_data_local. Writes raw time-series arrays per source.
- Processing (processing.py): runs as a Slurm job across the 200 laksa shards. Reads the raw arrays, computes derived metrics (volume blend, trend merge, growth classification, forecast, confidence), and writes the processed_* columns back to the same table.
- Position Explorer update (pe_update.py): decides which keywords belong in PE tracking, based on the freshly processed volume and growth.
- Backend upload (upload_backend.sh): distributes the processed columns to the customer-facing ClickHouse (keywords_metrics_local) and to Elasticsearch indices.
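Stage 2 runs one Python process per shard, which suggests a fan-out along the lines of the sketch below. It assumes the Slurm job is submitted as a job array, so that each task can read its shard index from `SLURM_ARRAY_TASK_ID`, and that shards and ranges are numbered from zero; neither detail is confirmed by this page. `shard_from_env` and `work_units` are hypothetical names.

```python
import os

N_SHARDS = 200  # from the diagram: 200 shards
N_RANGES = 256  # from the diagram: 256 ranges per shard


def shard_from_env(env=None):
    """Read this task's shard index from the Slurm array task ID (assumed setup)."""
    env = os.environ if env is None else env
    shard = int(env["SLURM_ARRAY_TASK_ID"])
    if not 0 <= shard < N_SHARDS:
        raise ValueError(f"shard index {shard} outside 0..{N_SHARDS - 1}")
    return shard


def work_units(shard):
    """Yield every (shard, range) pair this process is responsible for."""
    for range_id in range(N_RANGES):
        yield shard, range_id


if __name__ == "__main__":
    # Simulated environment so the sketch also runs outside Slurm.
    shard = shard_from_env({"SLURM_ARRAY_TASK_ID": "7"})
    units = list(work_units(shard))
    print(shard, len(units), N_SHARDS * N_RANGES)  # 7 256 51200
```

Under these assumptions the full job covers 200 × 256 = 51,200 (shard, range) units, 256 per process.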
Refresh extraction (extract_gt.py, extract_gkp.py) runs on a separate cadence, driven by signals stored in processed_volume_trend_meta.
Orchestration
The four stages are chained hourly by keep_running.sh, which runs under the screen session process_keyword on data@isog. Refresh extraction runs independently, on a daily/hourly schedule, via extract_gkp_gt.sh.
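The chaining behaviour can be pictured with the sketch below. It is not a transcription of keep_running.sh, whose contents are not shown on this page: it assumes the four stages run strictly in order and that a failing stage stops the rest of that cycle. The stage callables and the `run_cycle` name are placeholders.

```python
def run_cycle(stages):
    """Run (name, callable) stages in order; stop at the first failure.

    Returns the list of stage names that completed successfully.
    """
    completed = []
    for name, stage in stages:
        try:
            stage()
        except Exception as exc:  # a real runner would log and alert here
            print(f"{name} failed: {exc}; skipping the rest of this cycle")
            break
        completed.append(name)
    return completed


def failing_stage():
    raise RuntimeError("shard 12 timed out")  # made-up failure for the demo


# Placeholder callables stand in for the real scripts.
stages = [
    ("ingestion", lambda: None),
    ("processing", failing_stage),
    ("pe_update", lambda: None),
    ("upload", lambda: None),
]
print(run_cycle(stages))  # ['ingestion']
```

An hourly driver would call `run_cycle` once per cycle; whether the real script retries, skips, or aborts after a failed stage is not documented here.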
Where to dig deeper
Each node in the diagram has its own page (or will, once Phase 2 lands):
- Sources → System / Sources in the sidebar
- The central table → System / Tables
- Each script/stage → System / Processes
- Each processed_* column → System / Fields
- Each algorithmic choice (~28 of them) → System / Decisions
- Each customer-facing metric → System / Metrics
- Each downstream consumer → System / Consumers
For symptom-first debugging ("why is this keyword's volume 0?"), see Problems.