keywords.keywords_data_local — the central hub¶
Schema: gsc.sql:L89–L193. Engine: ReplicatedAggregatingMergeTree. Order: (dimension_keyword_hash, country).
Last validated: 2026-05-21
What it is¶
The pipeline's single shared table. Every raw source array lands here (kvprocessor writes), every derived column is written back in place (processing.py), PE tracking flags are mutated here (pe_update.py), and external services (ML team's ml_orc) write their own columns to the same rows without coordinating with the keyword-volume team. The downstream upload reads from here and pushes to the customer-facing stores.
Roughly 110 columns split across ~8 logical groups. Identified by (dimension_keyword_hash, country) — a single keyword in a single country is one row.
Sharding & range scheme¶
| Layer | Value | Meaning |
|---|---|---|
| Shards | 200 | One row lives on exactly one of laksa000…laksa199, selected by farmFingerprint64(keyword) % 200. |
| Ranges per shard | 256 | Each shard's hash space is divided into 256 bands of 2^55 each. Range R covers hashes in [R·2^55, (R+1)·2^55 − 1]. |
| Total partitions | 51 200 | 200 × 256. Each (shard, range) is the unit of work for processing.py / pe_update.py. |
Why this geometry
- 200 shards matches the laksa fleet — one shard per host, so reads from a shard go to its host directly.
- 256 ranges per shard is sized to per-range memory + wall-clock budget. Each Slurm worker processes one range; the default
RANGE_DEADLINE_SECONDS = 5400(90 min) is sized for a typical range's keyword count. Smaller ranges = more Slurm overhead; larger = harder to fit in memory.
Cluster names
- DDL-time:
ON CLUSTER L0L1_R_P0P1— creates the table on all 200 shards atomically. Used bygsc.sqlonly. - Query-time:
laksa_replicated_prata_shard{NNN}— targets a single shard for reads/writes. Used bykvprocessor.cpp,processing.py,pe_update.py.
Engine: AggregatingMergeTree with per-column anyLast¶
Almost every column is declared as SimpleAggregateFunction(anyLast, Nullable(<T>)). Two consequences worth understanding before touching this table:
- Idempotent writes — repeated runs of
kvprocessororprocessing.pyare safe. Re-writing the same(hash, country)overwrites only the columns you supplied; the latest value per column wins on merge. - Multi-writer flexibility — different people / different processes can mutate different non-identity columns of the same row without coordinating.
processing.pywritesprocessed_*;pe_update.pywritesprocessed_pe_*andactual_pe_*; ML team'sml_orcwritesprocessed_attrs*andkeyword_embedding_store. None of them stomp the others because the merge engine takes the latest value per column.
Compare with the sibling
keywords.keywords_metrics_local, which usesReplicatedReplacingMergeTree— row-level replacement, appropriate because the upload writes the whole row as a single unit of truth.
Column groups¶
A short reference grouped by writer. Detailed column lists live on the source pages linked below; this page only enumerates the groups.
| Group | Writer | Source page |
|---|---|---|
Identity / keys (int_shard_id, dimension_keyword_hash, country, keyword) |
kvprocessor | — |
GSC raw (16 cols: gsc_*_array_store, gsc_top_domain_*, gsc_first/last_seen, gsc_last_*_days_impressions) |
kvprocessor | GSC |
GT raw (3: gt_month_array_store, gt_trend_array_store, gt_last_update) |
kvprocessor + extract_gt.py |
GT |
GKP raw (6: gkp_month_array_store, gkp_trend_array_store, gkp_low/high_top_of_page_bid, gkp_cpc, gkp_last_update) |
kvprocessor + extract_gkp.py |
GKP |
JS raw (~12: js_predicted_volumes_store, js_months_store, js_organic_p, js_users, js_clicks_*, js_searches_*, js_first_seen, js_last_update) |
kvprocessor (one-time) | Jumpshot |
Derived / processed (processed_keyword_volume, processed_global_*, processed_volume_trend, processed_growth, processed_cpc, processed_organic_p, processed_*_distribution, processed_ke_approved, processed_first_seen, processed_date, processed_volume_trend_meta) |
processing.py |
Fields (Phase 2) |
PE tracking — processed_pe_* (what pe_update.py decided this run) and actual_pe_* (what the PE system actually applied downstream) |
pe_update.py writes processed_pe_*; actual_pe_* is updated by the PE consumer |
pe-update (Phase 2) |
Attrs & embeddings (processed_attrs, processed_attrs_date, processed_attrs_version, processed_attrs_entities, keyword_embedding_store) |
ml_orc (FastAPI service on devkopi:9134 in ~/mimi/machinelearning/serve/ml_orc/) — not owned by this team |
external |
The _array_store columns are JSON-serialised arrays for portable storage; matching *_array ALIAS columns decode them back to typed arrays at query time.
Sibling: keywords.keywords_metrics_local¶
The stage-4 upload destination. Schema in databases/clickhouse/keywords/keywords_metrics_local.sql (in the monorepo). Different engine (ReplicatedReplacingMergeTree), different cluster (L0L1_R_P0P1_R_H0H1_R_C0C1_R_F0F1_R_CH0CH1), partitioned by (keyword_search_engine, keyword_search_platform). upload_keywords_metrics_ch.slurm extracts the relevant processed_* columns from keywords_data_local and writes a denormalised row per (keyword, country, search_engine, platform) here for downstream consumption.
Quirks and gotchas¶
_localvs_global—keywords_data_localis the per-shard hub. Source-side GSC tables follow the same pattern:gsc.aggregated_keywords_local(per-shard) → re-sharded intogsc.aggregated_keywords_global(single hash-sharded copy) for kvprocessor to read across shards. The naming is a ClickHouse convention.- ALIAS columns — many
*_arraycolumns are not stored; they're decoded at query time from the*_array_storeJSON string viaCAST(..., 'Array(...)'). Be careful: writing to an ALIAS column is an error; you write to_storeonly. - Replication — each laksa shard is paired with a prata replica. Queries target
laksa{NNN}.int.ahrefsby default withprata{NNN}as fallback. - Per-range deadlines —
processing.pyenforces a per-range wall-clock cap (envPROCESSING_RANGE_DEADLINE_SECONDS, default 5400s = 90 min) to keep one runaway range from holding up the hourly loop. - Top-domain arrays —
gsc_top_domain_array_storeandgsc_top_domain_position_array_storeare parallel arrays — indexiof the position array corresponds to indexiof the domain array.
See also¶
Internal:
- Architecture — full pipeline diagram
- GSC · GT · GKP · Jumpshot — raw-source detail
- Processes — kvprocessor, processing.py, pe_update, upload-backend (Phase 2)
- Fields — per-column reference for
processed_*(Phase 2) - Decisions — the load-bearing logic that mutates this table's
processed_*columns (Phase 3) gsc.sql(in this project root) — full DDL, including rawgsc_*source tables, materialised views, and the central hub itself.