keywords.keywords_data_local — the central hub¶

Schema: gsc.sql:L89–L193. Engine: ReplicatedAggregatingMergeTree. Order: (dimension_keyword_hash, country). Last validated: 2026-05-21

What it is¶

The pipeline's single shared table. Every raw source array lands here (kvprocessor writes), every derived column is written back in place (processing.py), PE tracking flags are mutated here (pe_update.py), and external services (ML team's ml_orc) write their own columns to the same rows without coordinating with the keyword-volume team. The downstream upload reads from here and pushes to the customer-facing stores.

Roughly 110 columns split across ~8 logical groups. Identified by (dimension_keyword_hash, country) — a single keyword in a single country is one row.

Sharding & range scheme¶

Layer	Value	Meaning
Shards	200	One row lives on exactly one of `laksa000…laksa199`, selected by `farmFingerprint64(keyword) % 200`.
Ranges per shard	256	Each shard's hash space is divided into 256 bands of `2^55` each. Range `R` covers hashes in `[R·2^55, (R+1)·2^55 − 1]`.
Total partitions	51 200	200 × 256. Each `(shard, range)` is the unit of work for `processing.py` / `pe_update.py`.

Why this geometry

200 shards matches the laksa fleet — one shard per host, so reads from a shard go to its host directly.
256 ranges per shard is sized to per-range memory + wall-clock budget. Each Slurm worker processes one range; the default RANGE_DEADLINE_SECONDS = 5400 (90 min) is sized for a typical range's keyword count. Smaller ranges = more Slurm overhead; larger = harder to fit in memory.

Cluster names

DDL-time: ON CLUSTER L0L1_R_P0P1 — creates the table on all 200 shards atomically. Used by gsc.sql only.
Query-time: laksa_replicated_prata_shard{NNN} — targets a single shard for reads/writes. Used by kvprocessor.cpp, processing.py, pe_update.py.

Engine: `AggregatingMergeTree` with per-column `anyLast`¶

Almost every column is declared as SimpleAggregateFunction(anyLast, Nullable(<T>)). Two consequences worth understanding before touching this table:

Idempotent writes — repeated runs of kvprocessor or processing.py are safe. Re-writing the same (hash, country) overwrites only the columns you supplied; the latest value per column wins on merge.
Multi-writer flexibility — different people / different processes can mutate different non-identity columns of the same row without coordinating. processing.py writes processed_*; pe_update.py writes processed_pe_* and actual_pe_*; ML team's ml_orc writes processed_attrs* and keyword_embedding_store. None of them stomp the others because the merge engine takes the latest value per column.

Compare with the sibling keywords.keywords_metrics_local, which uses ReplicatedReplacingMergeTree — row-level replacement, appropriate because the upload writes the whole row as a single unit of truth.

Column groups¶

A short reference grouped by writer. Detailed column lists live on the source pages linked below; this page only enumerates the groups.

Group	Writer	Source page
Identity / keys (`int_shard_id`, `dimension_keyword_hash`, `country`, `keyword`)	kvprocessor	—
GSC raw (16 cols: `gsc__array_store`, `gsc_top_domain_`, `gsc_first/last_seen`, `gsc_last_*_days_impressions`)	kvprocessor	GSC
GT raw (3: `gt_month_array_store`, `gt_trend_array_store`, `gt_last_update`)	kvprocessor + `extract_gt.py`	GT
GKP raw (6: `gkp_month_array_store`, `gkp_trend_array_store`, `gkp_low/high_top_of_page_bid`, `gkp_cpc`, `gkp_last_update`)	kvprocessor + `extract_gkp.py`	GKP
JS raw (~12: `js_predicted_volumes_store`, `js_months_store`, `js_organic_p`, `js_users`, `js_clicks_`, `js_searches_`, `js_first_seen`, `js_last_update`)	kvprocessor (one-time)	Jumpshot
Derived / processed (`processed_keyword_volume`, `processed_global_`, `processed_volume_trend`, `processed_growth`, `processed_cpc`, `processed_organic_p`, `processed__distribution`, `processed_ke_approved`, `processed_first_seen`, `processed_date`, `processed_volume_trend_meta`)	`processing.py`	Fields (Phase 2)
PE tracking — `processed_pe_` (what `pe_update.py` decided* this run) and `actual_pe_` (what the PE system actually applied* downstream)	`pe_update.py` writes `processed_pe_`; `actual_pe_` is updated by the PE consumer	pe-update (Phase 2)
Attrs & embeddings (`processed_attrs`, `processed_attrs_date`, `processed_attrs_version`, `processed_attrs_entities`, `keyword_embedding_store`)	`ml_orc` (FastAPI service on `devkopi:9134` in `~/mimi/machinelearning/serve/ml_orc/`) — not owned by this team	external

The _array_store columns are JSON-serialised arrays for portable storage; matching *_array ALIAS columns decode them back to typed arrays at query time.

Sibling: `keywords.keywords_metrics_local`¶

The stage-4 upload destination. Schema in databases/clickhouse/keywords/keywords_metrics_local.sql (in the monorepo). Different engine (ReplicatedReplacingMergeTree), different cluster (L0L1_R_P0P1_R_H0H1_R_C0C1_R_F0F1_R_CH0CH1), partitioned by (keyword_search_engine, keyword_search_platform). upload_keywords_metrics_ch.slurm extracts the relevant processed_* columns from keywords_data_local and writes a denormalised row per (keyword, country, search_engine, platform) here for downstream consumption.

Quirks and gotchas¶

_local vs _global — keywords_data_local is the per-shard hub. Source-side GSC tables follow the same pattern: gsc.aggregated_keywords_local (per-shard) → re-sharded into gsc.aggregated_keywords_global (single hash-sharded copy) for kvprocessor to read across shards. The naming is a ClickHouse convention.
ALIAS columns — many *_array columns are not stored; they're decoded at query time from the *_array_store JSON string via CAST(..., 'Array(...)'). Be careful: writing to an ALIAS column is an error; you write to _store only.
Replication — each laksa shard is paired with a prata replica. Queries target laksa{NNN}.int.ahrefs by default with prata{NNN} as fallback.
Per-range deadlines — processing.py enforces a per-range wall-clock cap (env PROCESSING_RANGE_DEADLINE_SECONDS, default 5400s = 90 min) to keep one runaway range from holding up the hourly loop.
Top-domain arrays — gsc_top_domain_array_store and gsc_top_domain_position_array_store are parallel arrays — index i of the position array corresponds to index i of the domain array.