pe_update.py — Stage 3 Position Explorer selection

Code: pe_update.py (the whole file). Last validated: 2026-05-21

What it does

Curates the top-N / bottom-N keywords that the Position Explorer feature tracks. Runs after processing.py writes its processed_* columns. Reads candidate keywords across all 200 shards × 256 ranges (scatter-gather), computes a growth-weighted score per keyword, picks the top N to add to PE tracking and the bottom N to remove, and writes the decisions to processed_pe_tracking / processed_pe_add_date / processed_pe_remove_date on the hub. Also dumps CSVs to /home/data/pe_update_logs/ for human review.

Selection logic

Anchor: pe_update.py:KB-ANCHOR:pe-cutoff-count. The shared constant nrows_to_replace = 400000 controls both the top-N add and the bottom-N remove.

Anchor: pe_update.py:KB-ANCHOR:pe-growth-weighted-score. The score formula:

score = processed_keyword_volume * (1.0 + growth_3m.clip(-50, 200) / 100.0)

The same formula is applied to the candidates for both the add path (sorted descending → take top 400 000) and the remove path (sorted ascending → take bottom 400 000). growth_3m is clipped to [-50, 200] so a single extreme growth value can't swamp the ranking.
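As a worked illustration of the formula and the two sort directions, here is a minimal pure-Python sketch. It is not the script's own code: the source expression uses a vectorised `.clip`, while this version applies the same clamp per row with `min`/`max`. The helper names (`growth_weighted_score`, `select`), the `keyword` field, and the sample rows are hypothetical. The source does not spell out how the add and remove candidate pools are built, so a single list is reused for both purely for demonstration.

```python
import heapq

NROWS_TO_REPLACE = 400_000  # mirrors nrows_to_replace in the source


def growth_weighted_score(volume: float, growth_3m: float) -> float:
    """Scalar form of the documented formula; growth is clamped to [-50, 200]."""
    clipped = max(-50.0, min(200.0, growth_3m))
    return volume * (1.0 + clipped / 100.0)


def select(candidates, n, *, add):
    """Return the n highest-scoring rows (add=True) or the n lowest (add=False)."""
    def key(row):
        return growth_weighted_score(row["processed_keyword_volume"], row["growth_3m"])

    pick = heapq.nlargest if add else heapq.nsmallest
    return pick(n, candidates, key=key)


# Hypothetical sample rows, small enough to check by hand.
rows = [
    {"keyword": "a", "processed_keyword_volume": 1000, "growth_3m": 500.0},  # clamped to 200, score 3000.0
    {"keyword": "b", "processed_keyword_volume": 2000, "growth_3m": 0.0},    # score 2000.0
    {"keyword": "c", "processed_keyword_volume": 3000, "growth_3m": -90.0},  # clamped to -50, score 1500.0
    {"keyword": "d", "processed_keyword_volume": 100, "growth_3m": 10.0},    # score of about 110
]

top = select(rows, 2, add=True)       # add path: highest scores first
bottom = select(rows, 1, add=False)   # remove path: lowest scores first
```

Keyword "a" shows why the clip matters: without it, a 500% growth value would give a score of 6000 instead of 3000 and push the keyword further ahead of higher-volume candidates.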

Scatter-gather model

  • Pool size: parallel_runs = nshards * 2 = 400
  • Per shard×range query: timeout-bounded; on failure retries with laksa ↔ prata failover (3 attempts total)
  • Batch insert size: ~5 000 rows per CH insert
  • Insert worker count: 5–10 (configurable)
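The scatter-gather settings above can be sketched roughly as follows. This is an assumption-laden illustration, not the script's implementation: it uses a thread pool (the source does not say whether the pool is thread- or process-based), it tries `laksa` first and then alternates, and `run_query`, `query_with_failover`, `scatter_gather`, and `fake_query` are hypothetical names. Timeouts are simulated by raising `TimeoutError` rather than by a real ClickHouse client setting.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

NSHARDS = 200
NRANGES = 256
PARALLEL_RUNS = NSHARDS * 2     # 400, as documented
MAX_ATTEMPTS = 3                # "3 attempts total"
REPLICAS = ("laksa", "prata")   # assumption: laksa is tried first


def query_with_failover(run_query, shard, key_range, max_attempts=MAX_ATTEMPTS):
    """Run one shard x range query, flipping replica after each failure.

    Returns the rows on success, or None once every attempt has failed.
    """
    for attempt in range(max_attempts):
        replica = REPLICAS[attempt % len(REPLICAS)]
        try:
            return run_query(replica, shard, key_range)
        except Exception:
            continue  # a real script would log the failure here
    return None


def scatter_gather(run_query, nshards=NSHARDS, nranges=NRANGES, workers=PARALLEL_RUNS):
    """Fan out over every shard x range pair and keep only successful results."""
    tasks = list(product(range(nshards), range(nranges)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda t: query_with_failover(run_query, t[0], t[1]), tasks)
        return [rows for rows in results if rows is not None]


def fake_query(replica, shard, key_range):
    """Stand-in for a real query: shard 1 fails on laksa, shard 2 fails everywhere."""
    if shard == 1 and replica == "laksa":
        raise TimeoutError("simulated slow replica")
    if shard == 2:
        raise TimeoutError("simulated dead shard")
    return [(shard, key_range, replica)]


gathered = scatter_gather(fake_query, nshards=3, nranges=2, workers=4)
```

In this toy run, shard 1 recovers on the second attempt via `prata`, while shard 2 exhausts all three attempts and is dropped from the gathered results.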

What gets written

| Column | Meaning |
| --- | --- |
| processed_pe_tracking | The flag this run decided (1 = should be tracked, 0 = should not be) |
| processed_pe_add_date | When the keyword first qualified for tracking |
| processed_pe_remove_date | When the keyword first qualified for removal |
| actual_pe_tracking, actual_pe_last_update, actual_pe_deleted_at | NOT written by this script. These are filled by the downstream PE consumer that reads processed_pe_* and reports back what it actually applied. (Deferred: name the concrete consumer when Phase 5 lands the consumers/pe-system.md page.) |

The CSV dumps (pe_add_YYYYMMDD_HHMMSS.csv and pe_remove_YYYYMMDD_HHMMSS.csv) carry the same selection plus the score, processed_keyword_volume, and growth_3m columns so the decisions are reviewable after the fact.
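For reference, the dump naming pattern and column set described above could be produced along these lines. This is only a sketch under assumptions: `dump_filename`, `write_dump`, and the `keyword` column name are hypothetical, and the example writes to an in-memory buffer rather than to /home/data/pe_update_logs/.

```python
import csv
import io
from datetime import datetime

# Assumed column order; the source only names the last three columns.
DUMP_COLUMNS = ["keyword", "score", "processed_keyword_volume", "growth_3m"]


def dump_filename(kind: str, when: datetime) -> str:
    """Build pe_add_YYYYMMDD_HHMMSS.csv or pe_remove_YYYYMMDD_HHMMSS.csv."""
    if kind not in ("add", "remove"):
        raise ValueError(f"unknown dump kind: {kind!r}")
    return f"pe_{kind}_{when:%Y%m%d_%H%M%S}.csv"


def write_dump(rows, out) -> None:
    """Write the selection rows, header first, to any text file-like object."""
    writer = csv.DictWriter(out, fieldnames=DUMP_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


name = dump_filename("add", datetime(2026, 5, 21, 13, 4, 5))

buffer = io.StringIO()
write_dump(
    [{"keyword": "example", "score": 3000.0,
      "processed_keyword_volume": 1000, "growth_3m": 500.0}],
    buffer,
)
```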

Failure modes

| Symptom | Cause | Behaviour |
| --- | --- | --- |
| Shard query timeout | CH replica slow | 3 attempts with laksa↔prata flip; if all fail, the batch is skipped. Those keywords miss this cycle's PE re-evaluation, but the next hour's run picks them up |
| CH insert failure | Replica unreachable / disk full | 5 retries; once they are exhausted, log the error and skip that batch |
| Empty candidates after concat | All shards skipped or all DataFrames empty | Logs and exits cleanly; no PE write that cycle |
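The insert-side behaviour (batching, bounded retries, skip on exhaustion, clean exit on an empty candidate set) can be sketched as below. Several details are assumptions: "5 retries" is read as 5 attempts in total, inserts run sequentially here although the script uses 5–10 insert workers, and `insert`, `batched`, `insert_with_retries`, `write_decisions`, and `flaky_insert` are hypothetical names standing in for the real ClickHouse write path.

```python
import logging

log = logging.getLogger("pe_update_sketch")

BATCH_SIZE = 5000      # "~5 000 rows per CH insert"
INSERT_ATTEMPTS = 5    # assumption: "5 retries" read as 5 attempts in total


def batched(rows, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def insert_with_retries(insert, batch, attempts=INSERT_ATTEMPTS):
    """Return True if insert(batch) succeeded within `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            insert(batch)
            return True
        except Exception as exc:
            log.warning("insert attempt %d/%d failed: %s", attempt, attempts, exc)
    log.error("giving up on a batch of %d rows", len(batch))
    return False


def write_decisions(insert, rows):
    """Write rows in batches; return (rows_written, batches_skipped)."""
    if not rows:
        log.info("no candidates after gather; nothing to write this cycle")
        return 0, 0
    written = skipped = 0
    for batch in batched(rows):
        if insert_with_retries(insert, batch):
            written += len(batch)
        else:
            skipped += 1
    return written, skipped


calls = []


def flaky_insert(batch):
    """Stand-in insert: the batch starting at row 5000 always fails."""
    calls.append(batch[0])
    if batch[0] == 5000:
        raise ConnectionError("simulated unreachable replica")


result = write_decisions(flaky_insert, list(range(12_000)))
empty_result = write_decisions(flaky_insert, [])
```

With 12 000 rows the sketch produces three batches (5 000, 5 000, 2 000); the middle one is retried five times and then skipped, so the remaining 7 000 rows are still written.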

Runtime is indeterminate: it depends on CH load and result-set sizes, and there is no fixed budget.

See also

  • Architecture
  • Central hub table — read/write surface (incl. PE columns group)
  • processing.py — upstream stage (processed_keyword_volume, processed_growth are direct inputs)
  • Decisions — the growth-classification logic that feeds growth_3m
  • (Phase 5) consumers/pe-system.md — the downstream consumer that fills actual_pe_* (TBD)
  • _archive: _archive/pe_keywords.md — original detailed notes on the 3-pass add/remove/clean logic