extract_gt.py: Google Trends refresh extractor
Code: src/extract_gt.py. Invoked by extract_gkp_gt.sh independently of keep_running.sh.
Last validated: 2026-05-21
What it does
Picks keywords flagged for GT refresh (needs_gt=true inside processed_volume_trend_meta, written by processing.py's refresh-detection logic), fetches their Trends timeseries via SearchAPI, and writes results to keywords_volume.google_trends on isog. The hub table is not touched directly — kvprocessor.cpp pulls the refreshed data into keywords_data_local on its next iteration.
How it runs
The extractor is decoupled from the hourly hub-loop. extract_gkp_gt.sh polls until the process_keywords Slurm job is idle, then runs:
python ./src/extract_gt.py --tier 1 # urgent refresh
python ./src/extract_gt.py --tier 2 # deferred refresh
Tier 1 is for keywords flagged "urgent" by refresh-detection; tier 2 is for the deferred long tail.
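As a rough illustration of how the --tier flag could select a per-run cap, here is a minimal sketch. The cap values are the ones listed under load-bearing constants on this page; the names TIER_CAPS, parse_args and cap_for_tier are hypothetical and not taken from src/extract_gt.py.

```python
import argparse

# Per-run keyword caps as documented on this page.
# The constant name TIER_CAPS is an assumption for this sketch.
TIER_CAPS = {1: 20000, 2: 13000}


def parse_args(argv=None):
    """Parse the --tier flag (1 = urgent refresh, 2 = deferred refresh)."""
    parser = argparse.ArgumentParser(description="GT refresh extractor (sketch)")
    parser.add_argument("--tier", type=int, choices=sorted(TIER_CAPS), required=True)
    return parser.parse_args(argv)


def cap_for_tier(tier):
    """Return the maximum number of keywords to process in one run."""
    return TIER_CAPS[tier]
```

For example, `cap_for_tier(parse_args(["--tier", "2"]).tier)` evaluates to 13000, and any tier other than 1 or 2 is rejected by argparse before the run starts.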
Architecture: 1 API dispatcher + 5 worker threads (parse + batch insert) + 1 progress monitor.
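The thread layout above can be sketched with the standard library. This is an illustration under stated assumptions, not the code in src/extract_gt.py: run_pipeline, fetch, insert_batch and batch_size are hypothetical names, parsing is elided, and the monitor simply prints a counter.

```python
import queue
import threading

NUM_WORKERS = 5    # worker count from the architecture line above
_SENTINEL = None   # tells a worker that no more responses are coming


def run_pipeline(keywords, fetch, insert_batch, batch_size=100):
    """Run 1 dispatcher, NUM_WORKERS workers and 1 monitor; return rows inserted."""
    responses = queue.Queue()
    done = threading.Event()
    lock = threading.Lock()
    inserted = 0

    def dispatcher():
        # Single thread that talks to the API (fetch is a stand-in callable).
        for kw in keywords:
            responses.put((kw, fetch(kw)))
        for _ in range(NUM_WORKERS):
            responses.put(_SENTINEL)

    def flush(batch):
        nonlocal inserted
        insert_batch(batch)
        with lock:
            inserted += len(batch)

    def worker():
        batch = []
        while True:
            item = responses.get()
            if item is _SENTINEL:
                break
            batch.append(item)  # real code would parse the response here
            if len(batch) >= batch_size:
                flush(batch)
                batch = []
        if batch:
            flush(batch)  # insert the final partial batch

    def monitor():
        # Report progress once per second until the pipeline finishes.
        while not done.wait(timeout=1.0):
            with lock:
                print(f"progress: {inserted} rows inserted")

    threads = [threading.Thread(target=dispatcher)]
    threads += [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    mon = threading.Thread(target=monitor, daemon=True)
    mon.start()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    done.set()
    mon.join()
    return inserted
```

Keeping a single dispatcher means request pacing is enforced in one place, while the slower parse-and-insert work fans out across the workers.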
Load-bearing constants
| Constant | Slug | Value | Note |
|---|---|---|---|
| Per-run keyword cap | gt-tier-caps-per-run | {1: 20000, 2: 13000} | tier 1 = 20K, tier 2 = 13K |
| SearchAPI pacing | gt-searchapi-rate | target_rate: float = 3.0 req/s | token-bucket; drops 20% on HTTP 429 then recovers |
| CH insert retry → drop rows | gt-ch-insert-retry | max_retries: int = 10 | after 10 failures the rows are dropped, not retried next cycle |
Anchors live in src/extract_gt.py. The dropped-rows behaviour is the most consequential failure mode — see below.
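The SearchAPI pacing row describes a token bucket that backs off by 20% on HTTP 429. A minimal sketch of that idea follows. The 3.0 req/s target and the 20% cut come from the table; the class name, the bucket capacity, and the gradual recovery step in on_success are assumptions, since this page does not say how the rate recovers.

```python
import time


class AdaptiveTokenBucket:
    """Token bucket whose refill rate drops on HTTP 429 and recovers later."""

    def __init__(self, target_rate=3.0, capacity=3.0, clock=time.monotonic):
        self.target_rate = target_rate  # req/s, from the constants table
        self.rate = target_rate         # current (possibly reduced) rate
        self.capacity = capacity        # assumed burst size
        self.tokens = capacity
        self._clock = clock
        self._last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, at the current rate.
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self):
        """Take one token if available; return True on success."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def on_rate_limited(self):
        """Cut the current rate by 20% after an HTTP 429."""
        self._refill()
        self.rate *= 0.8

    def on_success(self, step=0.1):
        """Assumed recovery policy: creep back toward target_rate."""
        self._refill()
        self.rate = min(self.target_rate, self.rate + step)
```

A dispatcher would call try_acquire before each request, on_rate_limited when the API answers 429, and on_success otherwise, so the effective rate never exceeds target_rate.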
Failure modes
| Symptom | Cause | Behaviour |
|---|---|---|
| SearchAPI 429 (rate limited) | Token bucket too aggressive | Reduce rate by 20% (recover_rate), sleep 5–10 s, retry up to 3× per keyword; then abandon that keyword |
| SearchAPI timeout / 5xx | Upstream blip | Retry with jitter; abandon after 3× |
| Shard scan timeout (selecting candidate keywords from hub) | CH slow | 2400 s per shard timeout, 5 failover attempts (laksa ↔ prata), then log and skip that shard |
| CH insert into keywords_volume.google_trends fails | isog slow / restarting | 10 retries with exponential back-off (5·2^attempt, cap 120 s); on full exhaustion, rows are dropped, and keyword_volume_kb_access.log will show "DROPPING N rows after 10 retries" |
The "drop rows after exhaustion" path means a small fraction of refresh attempts can be silently lost during isog incidents. The next refresh cycle will only retry those keywords if processing.py re-flags them as needs_gt.
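The insert retry and drop path can be sketched as follows. The back-off formula, the 120 s cap, the limit of 10 and the log message come from this page; the function names, the zero-based attempt index, and the reading of "10 retries" as 10 attempts in total are assumptions.

```python
import logging
import time

log = logging.getLogger("extract_gt_sketch")  # hypothetical logger name

MAX_RETRIES = 10  # max_retries from the constants table


def backoff_delay(attempt, base=5.0, cap=120.0):
    """Seconds to wait after a failed attempt: min(5 * 2**attempt, 120)."""
    return min(base * (2 ** attempt), cap)


def insert_with_retry(rows, insert, sleep=time.sleep):
    """Call insert(rows) up to MAX_RETRIES times; return True on success.

    On exhaustion the rows are dropped (not queued for a later cycle)
    and False is returned.
    """
    for attempt in range(MAX_RETRIES):
        try:
            insert(rows)
            return True
        except Exception as exc:  # a real client would catch narrower errors
            log.warning("insert attempt %d failed: %s", attempt + 1, exc)
            if attempt < MAX_RETRIES - 1:
                sleep(backoff_delay(attempt))
    log.error("DROPPING %d rows after %d retries", len(rows), MAX_RETRIES)
    return False
```

Under these assumptions the delays run 5, 10, 20, 40, 80 s and then stay at 120 s, so a batch waits about 635 s in total before it is dropped.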
Runtime is indeterminate — depends on SearchAPI latency, queue depth, and current isog write throughput.
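The shard-scan row in the failure-modes table (2400 s timeout, 5 failover attempts, then skip) might look roughly like the sketch below. It assumes laksa and prata are two interchangeable ClickHouse hosts that the scan alternates between; scan_shard and run_query are hypothetical names.

```python
import itertools

SHARD_TIMEOUT_S = 2400     # per-shard timeout from the table
FAILOVER_ATTEMPTS = 5      # failover attempts from the table
HOSTS = ("laksa", "prata") # assumed to be alternate hosts


def scan_shard(shard, run_query, hosts=HOSTS):
    """Return candidate rows for one shard, or None if every attempt times out."""
    # Alternate hosts: laksa, prata, laksa, ... for FAILOVER_ATTEMPTS tries.
    for _, host in zip(range(FAILOVER_ATTEMPTS), itertools.cycle(hosts)):
        try:
            return run_query(host, shard, timeout=SHARD_TIMEOUT_S)
        except TimeoutError:
            continue  # try the other host
    return None  # caller logs the failure and skips this shard
```

Returning None instead of raising lets the caller carry on with the remaining shards, which matches the "log and skip" behaviour described in the table.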
Auth & config
- SearchAPI key: /home/team/mimi/data/searchapi_key.txt
- ClickHouse: isog.int.ahrefs:9001, admin creds from /etc/clickhouse-client/config.json
- Staging table reuse TTL: STAGING_FRESHNESS_HOURS = 6 (lets repeated runs within 6 h re-use the candidate list)
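The staging reuse rule amounts to a simple age check. The sketch below assumes the extractor can obtain a creation timestamp for the staging table; staging_is_fresh and created_at are hypothetical names, and only STAGING_FRESHNESS_HOURS = 6 comes from this page.

```python
from datetime import datetime, timedelta, timezone

STAGING_FRESHNESS_HOURS = 6  # from the config list above


def staging_is_fresh(created_at, now=None):
    """True if the staging table was built within the freshness window."""
    now = now or datetime.now(timezone.utc)
    return now - created_at < timedelta(hours=STAGING_FRESHNESS_HOURS)
```

A run would rebuild the candidate list only when this check returns False.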
See also
- GT source page: what GT carries and the validity gates downstream of this extractor
- Central hub table: where needs_gt lives (in processed_volume_trend_meta) and where the refreshed data eventually lands (after kvprocessor pulls it)
- extract_gkp.py: the sibling extractor with a similar architecture but different API
- SearchAPI Google Trends docs: upstream API
- extract_gkp_gt.sh (in project root): the wrapper that schedules both extractors
- Anchors in source: src/extract_gt.py: KB-ANCHOR:gt-tier-caps-per-run, gt-searchapi-rate, gt-ch-insert-retry