Skip to content

keywords.keywords_data_local — the central hub

Schema: gsc.sql:L89–L193. Engine: ReplicatedAggregatingMergeTree. Order: (dimension_keyword_hash, country). Last validated: 2026-05-21

What it is

The pipeline's single shared table. Every raw source array lands here (kvprocessor writes), every derived column is written back in place (processing.py), PE tracking flags are mutated here (pe_update.py), and external services (ML team's ml_orc) write their own columns to the same rows without coordinating with the keyword-volume team. The downstream upload reads from here and pushes to the customer-facing stores.

Roughly 110 columns split across ~8 logical groups. Identified by (dimension_keyword_hash, country) — a single keyword in a single country is one row.

Sharding & range scheme

Layer Value Meaning
Shards 200 One row lives on exactly one of laksa000…laksa199, selected by farmFingerprint64(keyword) % 200.
Ranges per shard 256 Each shard's hash space is divided into 256 bands of 2^55 each. Range R covers hashes in [R·2^55, (R+1)·2^55 − 1].
Total partitions 51 200 200 × 256. Each (shard, range) is the unit of work for processing.py / pe_update.py.

Why this geometry

  • 200 shards matches the laksa fleet — one shard per host, so reads from a shard go to its host directly.
  • 256 ranges per shard is sized to per-range memory + wall-clock budget. Each Slurm worker processes one range; the default RANGE_DEADLINE_SECONDS = 5400 (90 min) is sized for a typical range's keyword count. Smaller ranges = more Slurm overhead; larger = harder to fit in memory.

Cluster names

  • DDL-time: ON CLUSTER L0L1_R_P0P1 — creates the table on all 200 shards atomically. Used by gsc.sql only.
  • Query-time: laksa_replicated_prata_shard{NNN} — targets a single shard for reads/writes. Used by kvprocessor.cpp, processing.py, pe_update.py.

Engine: AggregatingMergeTree with per-column anyLast

Almost every column is declared as SimpleAggregateFunction(anyLast, Nullable(<T>)). Two consequences worth understanding before touching this table:

  1. Idempotent writes — repeated runs of kvprocessor or processing.py are safe. Re-writing the same (hash, country) overwrites only the columns you supplied; the latest value per column wins on merge.
  2. Multi-writer flexibilitydifferent people / different processes can mutate different non-identity columns of the same row without coordinating. processing.py writes processed_*; pe_update.py writes processed_pe_* and actual_pe_*; ML team's ml_orc writes processed_attrs* and keyword_embedding_store. None of them stomp the others because the merge engine takes the latest value per column.

Compare with the sibling keywords.keywords_metrics_local, which uses ReplicatedReplacingMergeTree — row-level replacement, appropriate because the upload writes the whole row as a single unit of truth.

Column groups

A short reference grouped by writer. Detailed column lists live on the source pages linked below; this page only enumerates the groups.

Group Writer Source page
Identity / keys (int_shard_id, dimension_keyword_hash, country, keyword) kvprocessor
GSC raw (16 cols: gsc_*_array_store, gsc_top_domain_*, gsc_first/last_seen, gsc_last_*_days_impressions) kvprocessor GSC
GT raw (3: gt_month_array_store, gt_trend_array_store, gt_last_update) kvprocessor + extract_gt.py GT
GKP raw (6: gkp_month_array_store, gkp_trend_array_store, gkp_low/high_top_of_page_bid, gkp_cpc, gkp_last_update) kvprocessor + extract_gkp.py GKP
JS raw (~12: js_predicted_volumes_store, js_months_store, js_organic_p, js_users, js_clicks_*, js_searches_*, js_first_seen, js_last_update) kvprocessor (one-time) Jumpshot
Derived / processed (processed_keyword_volume, processed_global_*, processed_volume_trend, processed_growth, processed_cpc, processed_organic_p, processed_*_distribution, processed_ke_approved, processed_first_seen, processed_date, processed_volume_trend_meta) processing.py Fields (Phase 2)
PE tracking — processed_pe_* (what pe_update.py decided this run) and actual_pe_* (what the PE system actually applied downstream) pe_update.py writes processed_pe_*; actual_pe_* is updated by the PE consumer pe-update (Phase 2)
Attrs & embeddings (processed_attrs, processed_attrs_date, processed_attrs_version, processed_attrs_entities, keyword_embedding_store) ml_orc (FastAPI service on devkopi:9134 in ~/mimi/machinelearning/serve/ml_orc/) — not owned by this team external

The _array_store columns are JSON-serialised arrays for portable storage; matching *_array ALIAS columns decode them back to typed arrays at query time.

Sibling: keywords.keywords_metrics_local

The stage-4 upload destination. Schema in databases/clickhouse/keywords/keywords_metrics_local.sql (in the monorepo). Different engine (ReplicatedReplacingMergeTree), different cluster (L0L1_R_P0P1_R_H0H1_R_C0C1_R_F0F1_R_CH0CH1), partitioned by (keyword_search_engine, keyword_search_platform). upload_keywords_metrics_ch.slurm extracts the relevant processed_* columns from keywords_data_local and writes a denormalised row per (keyword, country, search_engine, platform) here for downstream consumption.

Quirks and gotchas

  • _local vs _globalkeywords_data_local is the per-shard hub. Source-side GSC tables follow the same pattern: gsc.aggregated_keywords_local (per-shard) → re-sharded into gsc.aggregated_keywords_global (single hash-sharded copy) for kvprocessor to read across shards. The naming is a ClickHouse convention.
  • ALIAS columns — many *_array columns are not stored; they're decoded at query time from the *_array_store JSON string via CAST(..., 'Array(...)'). Be careful: writing to an ALIAS column is an error; you write to _store only.
  • Replication — each laksa shard is paired with a prata replica. Queries target laksa{NNN}.int.ahrefs by default with prata{NNN} as fallback.
  • Per-range deadlinesprocessing.py enforces a per-range wall-clock cap (env PROCESSING_RANGE_DEADLINE_SECONDS, default 5400s = 90 min) to keep one runaway range from holding up the hourly loop.
  • Top-domain arraysgsc_top_domain_array_store and gsc_top_domain_position_array_store are parallel arrays — index i of the position array corresponds to index i of the domain array.

See also

Internal:

  • Architecture — full pipeline diagram
  • GSC · GT · GKP · Jumpshot — raw-source detail
  • Processes — kvprocessor, processing.py, pe_update, upload-backend (Phase 2)
  • Fields — per-column reference for processed_* (Phase 2)
  • Decisions — the load-bearing logic that mutates this table's processed_* columns (Phase 3)
  • gsc.sql (in this project root) — full DDL, including raw gsc_* source tables, materialised views, and the central hub itself.