processing.py: Stage 2 derivation
Code: processing.py (the whole file). Entry: python processing.py <server_id> <range_id>.
Last validated: 2026-05-21
What it does
The heart of the pipeline. For each (shard, range) pair, reads the raw arrays kvprocessor just wrote, computes every derived column (processed_keyword_volume, processed_volume_trend, processed_growth, processed_cpc, processed_organic_p, processed_*_distribution, processed_ke_approved, processed_volume_trend_meta, …), and writes them back to the same table. This is where the ~28 load-bearing algorithmic decisions live — see Decisions and the project README.md for the KB-ANCHOR drift-defense system.
How it runs
- Slurm: process_keywords.slurm launches srun across 200 nodes (one per shard), each spawning 4 parallel xargs workers. Since each shard has 256 ranges, that's 64 ranges per worker per node. Workers stagger their launch by 0–10 s to avoid a connection storm against ClickHouse.
- Per-keyword loop: an infinite while True loop reads ~25K keywords at a time from the shard using a last_hash cursor, derives the columns, writes back via INSERT INTO FUNCTION cluster(...), and advances the cursor.
- Preflight: the Slurm wrapper runs a CH health check before launching workers.
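The cursor loop described above can be sketched as follows. This is a minimal illustration, not the code in processing.py: the in-memory `rows` list stands in for the ClickHouse read, `fetch_batch` and `process_range` are hypothetical names, the upper-casing step is a placeholder for the real derivation, and hashes are assumed to be non-negative integers.

```python
Row = tuple[int, str]  # (hash, keyword); hypothetical row shape


def fetch_batch(rows: list[Row], last_hash: int, batch_size: int) -> list[Row]:
    """Stand-in for the ClickHouse read: rows with hash > last_hash, ordered by hash."""
    newer = [row for row in sorted(rows) if row[0] > last_hash]
    return newer[:batch_size]


def process_range(rows: list[Row], batch_size: int = 25_000) -> list[Row]:
    """Walk a range in batches, advancing the cursor after each write."""
    written: list[Row] = []
    last_hash = -1  # assumption: every real hash is >= 0
    while True:
        batch = fetch_batch(rows, last_hash, batch_size)
        if not batch:
            break  # the cursor has passed the last keyword in the range
        derived = [(h, kw.upper()) for h, kw in batch]  # placeholder for derivation
        written.extend(derived)  # placeholder for the INSERT back to the table
        last_hash = batch[-1][0]  # advance the cursor to the last hash seen
    return written
```

Because the cursor only moves forward on the ordered hash, each keyword is read once per pass regardless of batch size.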
Per-range deadline
Anchor: processing.py:KB-ANCHOR:range-deadline-config.
A wall-clock cap (env PROCESSING_RANGE_DEADLINE_SECONDS, default 5400 s = 90 min) prevents one runaway range from blocking the hourly loop. check_deadline() is called at the top of every keyword batch and after every ClickHouse retry; once elapsed time exceeds the cap, the script exits with code 2.
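A deadline check of this kind might look like the sketch below. The environment variable name, the 5400 s default, and exit code 2 come from the text above; the `started_at` parameter, the `deadline_exceeded` helper, and the use of `time.monotonic()` are assumptions made for illustration, and the real `check_deadline()` may be structured differently.

```python
import os
import sys
import time

# Cap in seconds; defaults to 5400 s (90 min) when the variable is unset.
DEADLINE_SECONDS = float(os.environ.get("PROCESSING_RANGE_DEADLINE_SECONDS", "5400"))
EXIT_DEADLINE = 2  # exit code used when the cap is exceeded


def deadline_exceeded(started_at: float, now: float,
                      limit: float = DEADLINE_SECONDS) -> bool:
    """True once elapsed time is strictly greater than the cap."""
    return (now - started_at) > limit


def check_deadline(started_at: float) -> None:
    """Hypothetical signature: exit with code 2 if the range has run too long."""
    if deadline_exceeded(started_at, time.monotonic()):
        sys.exit(EXIT_DEADLINE)
```

In this sketch the caller would record `started_at = time.monotonic()` once per range and call `check_deadline(started_at)` at the top of each batch and after each retry.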
Checkpoint mechanism (code exists, not used in production)
Anchor: processing.py:KB-ANCHOR:checkpoint-resume.
load_checkpoint() / save_checkpoint() persist the last successfully-written hash to $PROCESSING_CHECKPOINT_DIR/checkpoint_{server:03d}_{range:03d}.txt. The checkpoint is written after every successful batch.
Important: resumption only matters if some downstream automation re-runs a range whose previous run hit the deadline, and in production today no such retry is wired up. The recovery script (process_keywords_recovery.slurm) exists only in ~/keyword-volume/ and is not deployed in the hourly rotation (see memory project_processing_recovery_script). In practice, a range that exceeds its deadline drops the keywords it did not reach for that hour. The upload to ES still proceeds with whatever did complete, and the next hourly cycle picks up the missed keywords on its own (the inputs will not have changed, but the keyword distribution across batches will).
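For reference, a checkpoint pair matching the file-name pattern above could be sketched like this. The signatures, the explicit `directory` argument (in place of reading $PROCESSING_CHECKPOINT_DIR), and the write-then-rename step are assumptions for illustration; the functions in processing.py may differ.

```python
import os
from pathlib import Path
from typing import Optional


def checkpoint_path(directory: str, server: int, range_id: int) -> Path:
    """Build checkpoint_{server:03d}_{range:03d}.txt inside the given directory."""
    return Path(directory) / f"checkpoint_{server:03d}_{range_id:03d}.txt"


def save_checkpoint(directory: str, server: int, range_id: int, last_hash: str) -> None:
    """Persist the last successfully written hash."""
    path = checkpoint_path(directory, server, range_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(last_hash)
    os.replace(tmp, path)  # rename so a crash never leaves a half-written file


def load_checkpoint(directory: str, server: int, range_id: int) -> Optional[str]:
    """Return the stored hash, or None when no checkpoint exists yet."""
    try:
        return checkpoint_path(directory, server, range_id).read_text().strip() or None
    except FileNotFoundError:
        return None
```

A resumed run would seed its last_hash cursor from `load_checkpoint(...)` instead of starting from the beginning of the range.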
ClickHouse insert retries
Anchor: processing.py:KB-ANCHOR:ch-insert-retry-count.
insert_data_into_clickhouse(server_id, data, max_retries=20) retries failed CH inserts up to 20 times with jitter. After exhausting retries the script exits with code 1. Per-keyword exceptions during derivation (bad JSON in a trend array, etc.) are caught and the keyword is skipped without crashing the worker.
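The retry behaviour can be illustrated with the sketch below. It is not the body of insert_data_into_clickhouse: `insert_with_retry`, the `do_insert` callable, `base_delay`, and the jitter formula are hypothetical, and the sketch assumes that max_retries counts total attempts, which the text above does not specify.

```python
import random
import sys
import time
from typing import Callable


def insert_with_retry(do_insert: Callable[[], None],
                      max_retries: int = 20,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Run do_insert until it succeeds; return the number of attempts used.

    Exits the process with code 1 once max_retries attempts have failed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            do_insert()
            return attempt
        except Exception:  # a real worker would catch the client's error types
            if attempt == max_retries:
                break
            # Jitter: a random extra delay so workers do not retry in lockstep.
            sleep(base_delay + random.uniform(0, base_delay))
    sys.exit(1)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; production code would normally leave the default.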
Failure modes summary
| Symptom | Cause | Behaviour |
|---|---|---|
| Per-keyword exception | Bad JSON in stored array, divide-by-zero in growth, etc. | Log + skip to next keyword |
| CH temporary unavailability | Replica restart, network blip | Up to 20 retries with jitter → if still failing, exit(1) |
| Wall-clock deadline reached | Range too large, CH slow, runaway forecast | exit(2) mid-range; no retry/resume in production; remaining keywords wait for next hourly cycle |
| Slurm node failure | Hardware/OS-level | Slurm reports the failure; that shard's range goes unprocessed; next cycle re-tries |
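The first row of the table (log and skip on a per-keyword exception) follows a common pattern, sketched here. The `derive_growth` formula is a made-up placeholder and not the growth calculation in processing.py; the function names and the dict-based batch shape are also assumptions.

```python
import json
import logging

logger = logging.getLogger("processing_sketch")


def derive_growth(trend_json: str) -> float:
    """Placeholder derivation: relative change from first to last trend value."""
    trend = json.loads(trend_json)  # raises on bad JSON
    return (trend[-1] - trend[0]) / trend[0]  # raises on a zero first value


def derive_batch(batch: dict[str, str]) -> dict[str, float]:
    """Derive each keyword independently; one bad keyword never stops the batch."""
    results: dict[str, float] = {}
    for keyword, trend_json in batch.items():
        try:
            results[keyword] = derive_growth(trend_json)
        except Exception:
            # Log with traceback and move on to the next keyword.
            logger.exception("skipping keyword %r", keyword)
            continue
    return results
```

Wrapping each keyword individually, rather than the whole batch, is what limits the damage of a malformed stored array to a single row.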
Runtime per range is indeterminate: it varies with CH load, forecast model fit times, and source data shape, so do not plan around a fixed duration.
See also
- Architecture: pipeline overview
- Central hub table: read/write surface
- Decisions: the ~28 load-bearing algorithmic choices that live in this file
- KB-ANCHOR drift-defense system (see project README.md at kb/README.md for the install + workflow)
- process_keywords.slurm (in project root): the Slurm job file
- keep-running.md (Batch 3c): the orchestrator
- Memory: project_processing_recovery_script (recovery slurm location)