processing.py — Stage 2 derivation

Code: processing.py (the whole file). Entry: python processing.py <server_id> <range_id>. Last validated: 2026-05-21

What it does

The heart of the pipeline. For each (shard, range) pair, reads the raw arrays kvprocessor just wrote, computes every derived column (processed_keyword_volume, processed_volume_trend, processed_growth, processed_cpc, processed_organic_p, processed_*_distribution, processed_ke_approved, processed_volume_trend_meta, …), and writes them back to the same table. This is where the ~28 load-bearing algorithmic decisions live — see Decisions and the project README.md for the KB-ANCHOR drift-defense system.

How it runs

  • Slurm: process_keywords.slurm launches srun across 200 nodes (one per shard), each spawning 4 parallel xargs workers. Since each shard has 256 ranges, that's 64 ranges per worker per node. Workers stagger their launch by 0–10 s to avoid a connection storm against ClickHouse.
  • Per-keyword loop: an infinite while True reads ~25K keywords at a time from the shard using a last_hash cursor, derives the columns, writes back via INSERT INTO FUNCTION cluster(...), advances the cursor.
  • Preflight: the Slurm wrapper runs a CH health check before launching workers.
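
The cursor-driven batch loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in processing.py: fetch_batch, process_range, the (hash, keyword) row shape, and the integer hash are all hypothetical, and an in-memory list stands in for the ClickHouse read and the INSERT.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, str]  # (hash, keyword); the real row shape is an assumption

BATCH_SIZE = 25_000  # roughly the batch size described above


def fetch_batch(rows: List[Row], last_hash: int, limit: int) -> List[Row]:
    """Hypothetical stand-in for the ClickHouse read.

    Returns up to `limit` rows whose hash is strictly greater than the
    cursor, in ascending hash order.
    """
    return sorted(r for r in rows if r[0] > last_hash)[:limit]


def process_range(
    rows: List[Row],
    derive: Callable[[Row], str],
    write: Callable[[List[str]], None],
    batch_size: int = BATCH_SIZE,
) -> int:
    """Read, derive, write, advance the cursor; stop when a read comes back empty."""
    last_hash = -1  # assumed sentinel that sorts before every real hash
    total = 0
    while True:
        batch = fetch_batch(rows, last_hash, batch_size)
        if not batch:
            break  # cursor has passed the last keyword in the range
        write([derive(row) for row in batch])
        last_hash = batch[-1][0]  # advance the cursor to the last hash seen
        total += len(batch)
    return total
```

Because the cursor only ever moves forward past the last hash written, each keyword is read once per run regardless of how the batch boundaries fall.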

Per-range deadline

Anchor: processing.py:KB-ANCHOR:range-deadline-config.

A wall-clock cap (env PROCESSING_RANGE_DEADLINE_SECONDS, default 5400 s = 90 min) prevents one runaway range from blocking the hourly loop. check_deadline() is called at the top of every keyword batch and after every ClickHouse retry; once elapsed time exceeds the cap, the script exits with code 2.
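
A deadline check of this kind might look like the sketch below. The environment variable, the 5400 s default, and exit code 2 come from the description above; the function signature, the use of time.monotonic(), and the injectable now parameter are assumptions made so the example is self-contained and testable.

```python
import os
import sys
import time
from typing import Optional

# Env var name and default are from the text above.
DEADLINE_SECONDS = float(os.environ.get("PROCESSING_RANGE_DEADLINE_SECONDS", "5400"))
EXIT_DEADLINE = 2  # exit code described above


def check_deadline(
    started_at: float,
    limit: float = DEADLINE_SECONDS,
    now: Optional[float] = None,
) -> None:
    """Exit with code 2 once elapsed wall-clock time exceeds `limit` seconds.

    `started_at` is expected to be a time.monotonic() reading taken when the
    range started. `now` exists only so tests can inject a clock value.
    """
    current = time.monotonic() if now is None else now
    elapsed = current - started_at
    if elapsed > limit:
        print(
            f"range deadline exceeded: {elapsed:.0f}s > {limit:.0f}s",
            file=sys.stderr,
        )
        sys.exit(EXIT_DEADLINE)
```

A monotonic clock is used here so that system clock adjustments cannot shorten or extend the cap; whether processing.py does the same is not stated.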

Checkpoint mechanism (code exists, not used in production)

Anchor: processing.py:KB-ANCHOR:checkpoint-resume.

load_checkpoint() / save_checkpoint() persist the last successfully written hash to $PROCESSING_CHECKPOINT_DIR/checkpoint_{server:03d}_{range:03d}.txt. The checkpoint is rewritten after every successful batch.
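
As a rough sketch of how such a checkpoint file could be read and written: the file name pattern and the environment variable are taken from the text above, while the function signatures, the integer hash, the fallback directory, and the write-then-rename step are assumptions and may differ from processing.py.

```python
import os
from pathlib import Path
from typing import Optional


def checkpoint_path(server_id: int, range_id: int, base_dir: Optional[str] = None) -> Path:
    """Build the checkpoint path using the naming pattern described above."""
    if base_dir is None:
        # Falling back to the current directory is an assumption.
        base_dir = os.environ.get("PROCESSING_CHECKPOINT_DIR", ".")
    return Path(base_dir) / f"checkpoint_{server_id:03d}_{range_id:03d}.txt"


def save_checkpoint(path: Path, last_hash: int) -> None:
    """Persist the last written hash.

    Writes to a temporary file and renames it over the target, so a crash
    mid-write cannot leave a truncated checkpoint (an assumed design choice).
    """
    tmp = path.with_suffix(".tmp")
    tmp.write_text(str(last_hash))
    os.replace(tmp, path)


def load_checkpoint(path: Path) -> Optional[int]:
    """Return the saved hash, or None if the file is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
```

A resumed run would seed its cursor with the value from load_checkpoint() instead of starting from the beginning of the range.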

Important: the resumption mechanism only matters if some downstream automation re-runs a range whose previous run hit the deadline. In production today, no such retry is wired up. The recovery script (process_keywords_recovery.slurm) exists in ~/keyword-volume/ only and is not deployed in the hourly rotation (see memory project_processing_recovery_script). In practice, a range that exceeds its deadline drops the keywords it did not reach for that hour. The upload to ES still proceeds with whatever did complete, and the next hourly cycle picks up the missed keywords without intervention (the inputs are unchanged, but the distribution of keywords across batches will differ).

ClickHouse insert retries

Anchor: processing.py:KB-ANCHOR:ch-insert-retry-count.

insert_data_into_clickhouse(server_id, data, max_retries=20) retries failed CH inserts up to 20 times with jitter. After exhausting retries the script exits with code 1. Per-keyword exceptions during derivation (bad JSON in a trend array, etc.) are caught and the keyword is skipped without crashing the worker.
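
The retry behaviour can be illustrated with the sketch below. Only the limit of 20, the use of jitter, and exit code 1 come from the text above; the function name insert_with_retry, the exponential backoff with full jitter, the delay bounds, and the reading of max_retries as total attempts are assumptions.

```python
import random
import sys
import time
from typing import Callable

EXIT_INSERT_FAILED = 1  # exit code described above


def insert_with_retry(
    do_insert: Callable[[], None],
    max_retries: int = 20,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `do_insert` until it succeeds or `max_retries` attempts are used.

    Returns the number of attempts on success and exits the process with
    code 1 once every attempt has failed. `sleep` is injectable for tests.
    """
    for attempt in range(1, max_retries + 1):
        try:
            do_insert()
            return attempt
        except Exception as exc:
            print(f"insert attempt {attempt}/{max_retries} failed: {exc}", file=sys.stderr)
            if attempt == max_retries:
                break  # no point sleeping after the final attempt
            # Full jitter: a random delay up to a capped exponential bound,
            # so parallel workers do not all retry at the same moment.
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, bound))
    sys.exit(EXIT_INSERT_FAILED)
```

Randomising the delay matters here because many workers share the same ClickHouse cluster; without jitter, workers that failed together would also retry together.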

Failure modes summary

| Symptom | Cause | Behaviour |
| --- | --- | --- |
| Per-keyword exception | Bad JSON in stored array, divide-by-zero in growth, etc. | Log + skip to next keyword |
| CH temporary unavailability | Replica restart, network blip | Up to 20 retries with jitter → if still failing, exit(1) |
| Wall-clock deadline reached | Range too large, CH slow, runaway forecast | exit(2) mid-range; no retry/resume in production; remaining keywords wait for next hourly cycle |
| Slurm node failure | Hardware/OS-level | Slurm reports the failure; that shard's range goes unprocessed; next cycle retries |

Runtime per range is indeterminate: it varies with CH load, forecast model fit times, and source data shape, so do not plan around a fixed duration.

See also

  • Architecture — pipeline overview
  • Central hub table — read/write surface
  • Decisions — the ~28 load-bearing algorithmic choices that live in this file
  • KB-ANCHOR drift-defense system (see project README.md at kb/README.md for the install + workflow)
  • process_keywords.slurm (in project root) — the Slurm job file
  • keep-running.md (Batch 3c) — the orchestrator
  • Memory: project_processing_recovery_script (recovery slurm location)