pe_update.py: Stage 3 Position Explorer selection
Code: pe_update.py (the whole file).
Last validated: 2026-05-21
What it does
Curates the top-N / bottom-N keywords that the Position Explorer feature tracks. Runs after processing.py writes its processed_* columns. Reads candidate keywords across all 200 shards × 256 ranges (scatter-gather), computes a growth-weighted score per keyword, picks the top N to add to PE tracking and the bottom N to remove, and writes the decisions to processed_pe_tracking / processed_pe_add_date / processed_pe_remove_date on the hub. Also dumps CSVs to /home/data/pe_update_logs/ for human review.
Selection logic
Anchor: pe_update.py:KB-ANCHOR:pe-cutoff-count. The shared constant nrows_to_replace = 400000 controls both the top-N add and the bottom-N remove.
Anchor: pe_update.py:KB-ANCHOR:pe-growth-weighted-score. The score formula:
The same formula is applied to the candidates for both the add path (sorted descending → take top 400 000) and the remove path (sorted ascending → take bottom 400 000). growth_3m is clipped to [-50, 200] so a single extreme growth value can't swamp the ranking.
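The clip-then-rank step described above can be sketched as follows. This is a minimal illustration, not the code in pe_update.py: the score formula itself is not reproduced on this page, so `toy_score` is a hypothetical stand-in, and the `select` helper, the dict layout, and the field names `keyword` and `volume` are assumptions made for the example. Only the clip bounds, the sort directions, and the shared cutoff come from the text.

```python
# Sketch only: toy_score is NOT the real formula, and the row layout is assumed.
NROWS_TO_REPLACE = 400_000          # mirrors nrows_to_replace in the text
GROWTH_MIN, GROWTH_MAX = -50, 200   # clip bounds quoted above


def clip_growth(growth_3m):
    """Clamp growth_3m into [-50, 200] so one outlier cannot dominate."""
    return max(GROWTH_MIN, min(GROWTH_MAX, growth_3m))


def select(candidates, score_fn, n=NROWS_TO_REPLACE, *, add):
    """Return the n highest-scoring (add=True) or lowest-scoring (add=False) keywords."""
    scored = [
        (score_fn(c["volume"], clip_growth(c["growth_3m"])), c["keyword"])
        for c in candidates
    ]
    # Descending for the add path, ascending for the remove path.
    scored.sort(key=lambda pair: pair[0], reverse=add)
    return [keyword for _, keyword in scored[:n]]


def toy_score(volume, growth):
    """Hypothetical growth-weighted score, used only to make the example run."""
    return volume * (1 + growth / 100)


rows = [
    {"keyword": "a", "volume": 100, "growth_3m": 5000},  # clipped to 200, score 300
    {"keyword": "b", "volume": 400, "growth_3m": 0},     # score 400
    {"keyword": "c", "volume": 50, "growth_3m": -90},    # clipped to -50, score 25
]
to_add = select(rows, toy_score, n=1, add=True)
to_remove = select(rows, toy_score, n=1, add=False)
print(to_add, to_remove)
```

Without the clip, keyword "a" would score 5100 under the toy formula and displace "b", which is the swamping effect the bounds are meant to prevent. The example reuses one candidate list for both paths for brevity; the add and remove paths each have their own candidates.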
Scatter-gather model
- Pool size: parallel_runs = nshards * 2 = 400
- Per shard×range query: timeout-bounded; on failure retries with laksa ↔ prata failover (3 attempts total)
- Batch insert size: ~5 000 rows per CH insert
- Insert worker count: 5–10 (configurable)
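The read side of the model listed above might look roughly like the sketch below. It assumes a thread pool, assumes laksa is tried first, and takes the actual query as a caller-supplied `run_query(host, shard, rng)` callable. The function names and that signature are hypothetical and not taken from pe_update.py; only the pool size, the host names, and the three-attempt limit come from the list.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

NSHARDS = 200
NRANGES = 256
PARALLEL_RUNS = NSHARDS * 2      # 400, as in the list above
HOSTS = ("laksa", "prata")       # starting with laksa is an assumption
MAX_ATTEMPTS = 3


def query_with_failover(run_query, shard, rng, attempts=MAX_ATTEMPTS):
    """Run one shard x range query, flipping hosts after each failure.

    Returns the rows, or None when every attempt failed so the caller can skip it.
    """
    for attempt in range(attempts):
        host = HOSTS[attempt % len(HOSTS)]
        try:
            return run_query(host, shard, rng)
        except Exception:
            # A real version would log here and catch the driver's own errors.
            continue
    return None


def scatter_gather(run_query, shards, ranges, workers=PARALLEL_RUNS):
    """Fan out over every (shard, range) pair and flatten the successful results."""
    tasks = list(product(shards, ranges))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda t: query_with_failover(run_query, *t), tasks)
        return [row for rows in results if rows is not None for row in rows]


def fake_query(host, shard, rng):
    """Stand-in for the real query: shard 1 fails on laksa, shard 2 fails everywhere."""
    if shard == 2 or (shard == 1 and host == "laksa"):
        raise TimeoutError("simulated slow replica")
    return [(shard, rng, host)]


rows = scatter_gather(fake_query, shards=range(3), ranges=range(2), workers=4)
print(rows)
```

With the full 200 × 256 grid this fans out 51 200 queries across the 400-worker pool. Returning None for an exhausted task, rather than raising, keeps one bad shard from aborting the whole gather.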
What gets written
| Column | Meaning |
|---|---|
| processed_pe_tracking | The flag this run decided (1 = should-be-tracked, 0 = should-not-be) |
| processed_pe_add_date | When the keyword first qualified for tracking |
| processed_pe_remove_date | When the keyword first qualified for removal |
| actual_pe_tracking, actual_pe_last_update, actual_pe_deleted_at | NOT written by this script; these are filled by the downstream PE consumer that reads processed_pe_* and reports back what it actually applied. (Deferred: name the concrete consumer when Phase 5 lands the consumers/pe-system.md page.) |
The CSV dumps (pe_add_YYYYMMDD_HHMMSS.csv and pe_remove_YYYYMMDD_HHMMSS.csv) carry the same selection plus the score, processed_keyword_volume, and growth_3m columns so the decisions are reviewable after the fact.
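As an illustration of the dump naming, here is a small sketch using only the standard library. The `dump_selection` helper and the `keyword` column name are assumptions for the example; the file-name pattern, the log directory, and the other three columns come from this page. The real script may well write its CSVs differently.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path

LOG_DIR = Path("/home/data/pe_update_logs")   # directory named under "What it does"
# "keyword" is an assumed column name; the other three are listed above.
FIELDS = ["keyword", "score", "processed_keyword_volume", "growth_3m"]


def dump_selection(kind, rows, log_dir=LOG_DIR, now=None):
    """Write pe_<kind>_YYYYMMDD_HHMMSS.csv into log_dir and return its path."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    path = Path(log_dir) / f"pe_{kind}_{stamp}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path


# Demo against a temporary directory so nothing is written under /home/data.
with tempfile.TemporaryDirectory() as tmp:
    out = dump_selection(
        "add",
        [{"keyword": "example", "score": 1.5,
          "processed_keyword_volume": 1000, "growth_3m": 12.0}],
        log_dir=tmp,
        now=datetime(2026, 5, 21, 9, 30, 0),
    )
    name = out.name
    lines = out.read_text(encoding="utf-8").splitlines()
print(name, lines)
```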
Failure modes
| Symptom | Cause | Behaviour |
|---|---|---|
| Shard query timeout | CH replica slow | Up to 3 attempts with laksa↔prata flip; if all fail, the batch is skipped. Those keywords miss this cycle's PE re-evaluation, but the next hour's run picks them up |
| CH insert failure | Replica unreachable / disk full | 5 retries; on full exhaustion, log error and skip that batch |
| Empty candidates after concat | All shards skipped or all DataFrames empty | Logs and exits cleanly — no PE write that cycle |
Runtime is indeterminate — depends on CH load and result-set sizes. No fixed budget.
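The "CH insert failure" row in the table above, retry and then log and skip, could be sketched like this. The sketch is sequential for clarity, whereas the script uses 5–10 insert workers. It also reads "5 retries" as five attempts in total, which is an assumption. `insert_batch` is a hypothetical caller-supplied callable, not a real client API.

```python
BATCH_SIZE = 5_000      # "~5 000 rows per CH insert"
INSERT_RETRIES = 5      # interpreted here as five attempts in total (assumption)


def batched(rows, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def insert_all(insert_batch, rows, size=BATCH_SIZE, retries=INSERT_RETRIES, log=print):
    """Insert rows batch by batch; return how many batches were skipped."""
    skipped = 0
    for batch in batched(rows, size):
        last_error = None
        for _ in range(retries):
            try:
                insert_batch(batch)
                break
            except Exception as exc:
                last_error = exc
        else:
            # Every attempt failed: log and move on instead of aborting the run.
            log(f"insert failed after {retries} attempts, skipping batch: {last_error}")
            skipped += 1
    return skipped


calls = []


def flaky_insert(batch):
    """Stand-in insert: the batch starting with row 2 always fails."""
    calls.append(len(batch))
    if batch[0] == 2:
        raise ConnectionError("simulated unreachable replica")


messages = []
skipped = insert_all(flaky_insert, list(range(5)), size=2, retries=5, log=messages.append)
print(skipped, len(calls), messages)
```

Here three batches are attempted: the first and third succeed on the first try, and the second is tried five times and then skipped.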
See also
- Architecture
- Central hub table: read/write surface (incl. PE columns group)
- processing.py: upstream stage (processed_keyword_volume, processed_growth are direct inputs)
- Decisions: the growth-classification logic that feeds growth_3m
- (Phase 5) consumers/pe-system.md: the downstream consumer that fills actual_pe_* (TBD)
- _archive: _archive/pe_keywords.md, the original detailed notes on the 3-pass add/remove/clean logic