pe_update.py: Stage 3 Position Explorer selection

Code: pe_update.py (the whole file). Last validated: 2026-05-21

What it does

Curates the top-N / bottom-N keywords that the Position Explorer feature tracks. Runs after processing.py writes its processed_* columns. Reads candidate keywords across all 200 shards × 256 ranges (scatter-gather), computes a growth-weighted score per keyword, picks the top N to add to PE tracking and the bottom N to remove, and writes the decisions to processed_pe_tracking / processed_pe_add_date / processed_pe_remove_date on the hub. Also dumps CSVs to /home/data/pe_update_logs/ for human review.

Selection logic

Anchor: pe_update.py:KB-ANCHOR:pe-cutoff-count. The shared constant nrows_to_replace = 400000 controls both the top-N add and the bottom-N remove.

Anchor: pe_update.py:KB-ANCHOR:pe-growth-weighted-score. The score formula:

score = processed_keyword_volume * (1.0 + growth_3m.clip(-50, 200) / 100.0)

The same formula is applied to the candidates for both the add path (sorted descending → take top 400 000) and the remove path (sorted ascending → take bottom 400 000). growth_3m is clipped to [-50, 200] so a single extreme growth value can't swamp the ranking.
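The scoring and selection described above can be sketched in plain Python. This is a minimal illustration, not the code in pe_update.py: the `.clip` call in the formula suggests the real script works on pandas columns, whereas this sketch uses tuples of `(keyword, processed_keyword_volume, growth_3m)`. The function names `growth_weighted_score` and `pick` are hypothetical.

```python
NROWS_TO_REPLACE = 400_000  # mirrors nrows_to_replace in the script


def growth_weighted_score(volume: float, growth_3m: float) -> float:
    """Scale volume by 3-month growth (percent), clipped to [-50, 200]."""
    clipped = max(-50.0, min(200.0, growth_3m))
    return volume * (1.0 + clipped / 100.0)


def pick(candidates, n=NROWS_TO_REPLACE, *, highest=True):
    """Return the n keywords with the highest (or lowest) score.

    candidates: iterable of (keyword, processed_keyword_volume, growth_3m).
    highest=True models the add path, highest=False the remove path.
    """
    scored = [(growth_weighted_score(vol, growth), kw)
              for kw, vol, growth in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=highest)
    return [kw for _, kw in scored[:n]]
```

For example, a keyword with volume 1000 and growth_3m of 500 is scored as if its growth were 200, giving 1000 × 3.0 = 3000. A keyword with volume 2000 and growth_3m of -80 is scored with growth -50, giving 2000 × 0.5 = 1000.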

Scatter-gather model

Pool size: parallel_runs = nshards * 2 = 400
Per shard×range query: timeout-bounded; on failure retries with laksa ↔ prata failover (3 attempts total)
Batch insert size: ~5 000 rows per CH insert
Insert worker count: 5–10 (configurable)
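The query side of this model might look roughly like the sketch below. Several things here are assumptions: the use of a thread pool (the script may use processes or another mechanism), the `run_query(host, shard, key_range)` callable, the function names, and the choice to start on laksa and alternate hosts on each failed attempt. Only the pool size, the host names, and the three-attempt limit come from the parameters listed above.

```python
from concurrent.futures import ThreadPoolExecutor

NSHARDS = 200
PARALLEL_RUNS = NSHARDS * 2      # 400, as in the parameters above
HOSTS = ("laksa", "prata")       # starting host is an assumption
MAX_ATTEMPTS = 3


def query_with_failover(run_query, shard, key_range, max_attempts=MAX_ATTEMPTS):
    """Run one shard x range query, flipping host after each failure.

    Returns the rows, or None when every attempt failed (the caller
    then skips this shard x range for the current cycle).
    """
    for attempt in range(max_attempts):
        host = HOSTS[attempt % len(HOSTS)]
        try:
            return run_query(host, shard, key_range)
        except Exception:
            # A real script would log the failure here before retrying.
            continue
    return None


def scatter_gather(run_query, shards, ranges, workers=PARALLEL_RUNS):
    """Fan out over every shard x range pair and gather non-failed results."""
    tasks = [(shard, key_range) for shard in shards for key_range in ranges]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda task: query_with_failover(run_query, *task), tasks
        )
        return [rows for rows in results if rows is not None]
```

Returning `None` for an exhausted query, rather than raising, keeps one bad shard×range pair from aborting the whole gather.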

What gets written

Column	Meaning
`processed_pe_tracking`	The flag this run decided (`1` = should-be-tracked, `0` = should-not-be)
`processed_pe_add_date`	When the keyword first qualified for tracking
`processed_pe_remove_date`	When the keyword first qualified for removal
`actual_pe_tracking`, `actual_pe_last_update`, `actual_pe_deleted_at`	NOT written by this script. These are filled by the downstream PE consumer that reads the `processed_pe_*` columns and reports back what it actually applied. (Deferred: name the concrete consumer when Phase 5 lands the `consumers/pe-system.md` page.)

The CSV dumps (pe_add_YYYYMMDD_HHMMSS.csv and pe_remove_YYYYMMDD_HHMMSS.csv) carry the same selection plus the score, processed_keyword_volume, and growth_3m columns so the decisions are reviewable after the fact.
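As an illustration of the dump format, the sketch below writes one such file with the standard `csv` module. The helper name `dump_selection`, the `keyword` column name, and the column order are assumptions; only the filename pattern and the score, processed_keyword_volume, and growth_3m columns come from the description above.

```python
import csv
from datetime import datetime
from pathlib import Path

# "keyword" is a hypothetical name for the identifying column.
FIELDS = ["keyword", "score", "processed_keyword_volume", "growth_3m"]


def dump_selection(rows, kind, log_dir, now=None):
    """Write pe_<kind>_YYYYMMDD_HHMMSS.csv into log_dir and return its path.

    rows: list of dicts keyed by FIELDS; kind: "add" or "remove".
    """
    now = now or datetime.now()
    path = Path(log_dir) / f"pe_{kind}_{now:%Y%m%d_%H%M%S}.csv"
    with path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

In the script's setting, `log_dir` would be /home/data/pe_update_logs/.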

Failure modes

Symptom	Cause	Behaviour
Shard query timeout	CH replica slow	3 attempts with laksa↔prata flip; if all fail, the batch is skipped. Those keywords miss this cycle's PE re-evaluation, but the next hour's run picks them up
CH insert failure	Replica unreachable / disk full	5 retries; on full exhaustion, log error and skip that batch
Empty candidates after concat	All shards skipped or all DataFrames empty	Logs and exits cleanly — no PE write that cycle
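The insert-failure row can be sketched as a batched write loop that retries each batch and skips it once the retries are exhausted. This is a hedged sketch: `insert_batch` stands in for whatever CH client call the script uses, `chunked` and `write_all` are hypothetical names, and reading "5 retries" as five attempts in total is an assumption. The 5 to 10 parallel insert workers are not modelled here.

```python
import logging

log = logging.getLogger("pe_update_sketch")


def chunked(rows, size=5000):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def write_all(insert_batch, rows, batch_size=5000, max_attempts=5):
    """Insert rows batch by batch; return how many batches were skipped."""
    skipped = 0
    for batch in chunked(rows, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                insert_batch(batch)
                break  # batch stored, move on to the next one
            except Exception as exc:
                log.warning("insert attempt %d failed: %s", attempt, exc)
        else:
            # Reached only when no attempt succeeded: log and skip.
            log.error("giving up on a batch of %d rows", len(batch))
            skipped += 1
    return skipped
```

Returning the skipped-batch count lets the caller report partial writes without aborting the run.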

Runtime is indeterminate — depends on CH load and result-set sizes. No fixed budget.