extract_gt.py: Google Trends refresh extractor

Code: src/extract_gt.py. Invoked by extract_gkp_gt.sh independently of keep_running.sh. Last validated: 2026-05-21

What it does

Picks keywords flagged for GT refresh (needs_gt=true inside processed_volume_trend_meta, written by processing.py's refresh-detection logic), fetches their Trends timeseries via SearchAPI, and writes results to keywords_volume.google_trends on isog. The hub table is not touched directly — kvprocessor.cpp pulls the refreshed data into keywords_data_local on its next iteration.

How it runs

Decoupled from the hourly hub-loop. extract_gkp_gt.sh polls until the process_keywords Slurm job is idle, then runs:

python ./src/extract_gt.py --tier 1   # urgent refresh
python ./src/extract_gt.py --tier 2   # deferred refresh

Tier 1 is for keywords flagged "urgent" by refresh-detection; tier 2 is for the deferred long tail.
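
The poll-then-run flow can be sketched in Python as follows. This is only an illustration, not the contents of extract_gkp_gt.sh: the helper names (`job_is_running`, `wait_until_idle`, `run_tiers`), the 60 s poll interval, and the reading of "idle" as "squeue lists no RUNNING job named process_keywords" are all assumptions.

```python
import subprocess
import time


def job_is_running(name="process_keywords"):
    """Assumption: 'idle' means squeue lists no RUNNING job with this name."""
    out = subprocess.run(
        ["squeue", "--noheader", "--name", name, "--states", "RUNNING"],
        capture_output=True, text=True, check=True,
    ).stdout
    return bool(out.strip())


def wait_until_idle(is_busy=job_is_running, poll_seconds=60, sleep=time.sleep):
    """Block until is_busy() returns False, sleeping between polls."""
    while is_busy():
        sleep(poll_seconds)  # hypothetical interval, not taken from the script


def run_tiers(run=subprocess.run):
    """Run the urgent tier first, then the deferred tier."""
    for tier in (1, 2):
        run(["python", "./src/extract_gt.py", "--tier", str(tier)], check=True)
```

Calling `wait_until_idle()` followed by `run_tiers()` reproduces the sequence described above. The `is_busy`, `sleep`, and `run` parameters are injectable so the flow can be exercised without a Slurm cluster.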

Architecture: 1 API dispatcher + 5 worker threads (parse + batch insert) + 1 progress monitor.
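
A minimal sketch of that thread layout, assuming a bounded queue between the dispatcher and the workers. `fetch_trends` and `insert_batch` are hypothetical stand-ins for the SearchAPI call and the ClickHouse batch insert; the batch size, queue size, and monitor interval are likewise illustrative and not taken from src/extract_gt.py.

```python
import queue
import threading

NUM_WORKERS = 5   # worker count from the description above
_STOP = object()  # sentinel telling a worker to exit


def fetch_trends(keyword):
    """Hypothetical stand-in for the SearchAPI request."""
    return {"keyword": keyword, "points": [0, 0, 0]}


def insert_batch(rows, sink):
    """Hypothetical stand-in for the ClickHouse batch insert."""
    sink.extend(rows)


def dispatcher(keywords, q):
    """Single API dispatcher: fetch each keyword and hand the response to workers."""
    for kw in keywords:
        q.put(fetch_trends(kw))
    for _ in range(NUM_WORKERS):
        q.put(_STOP)  # one sentinel per worker


def worker(q, sink, lock, counter, batch_size=100):
    """Parse responses and insert them in batches."""
    batch = []

    def flush():
        with lock:
            insert_batch(batch, sink)
            counter[0] += len(batch)
        batch.clear()

    while True:
        item = q.get()
        if item is _STOP:
            break
        batch.append((item["keyword"], item["points"]))  # the "parse" step
        if len(batch) >= batch_size:
            flush()
    if batch:
        flush()  # write the final partial batch


def monitor(counter, total, done, interval=0.05):
    """Progress monitor: report until the run signals completion."""
    while not done.wait(interval):
        print(f"progress: {counter[0]}/{total}")


def run(keywords):
    q = queue.Queue(maxsize=1000)
    sink, lock, counter = [], threading.Lock(), [0]
    done = threading.Event()
    threads = [threading.Thread(target=dispatcher, args=(keywords, q))]
    threads += [
        threading.Thread(target=worker, args=(q, sink, lock, counter))
        for _ in range(NUM_WORKERS)
    ]
    mon = threading.Thread(target=monitor, args=(counter, len(keywords), done))
    mon.start()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    done.set()
    mon.join()
    return sink
```

Keeping a single dispatcher means request pacing is enforced in one place, while parsing and inserting scale across the workers.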

Load-bearing constants

Constant	Slug	Value	Note
Per-run keyword cap	`gt-tier-caps-per-run`	`{1: 20000, 2: 13000}`	tier 1 = 20K, tier 2 = 13K
SearchAPI pacing	`gt-searchapi-rate`	`target_rate: float = 3.0` req/s	token bucket; the rate is cut by 20% on each HTTP 429, then recovers
CH insert retry → drop rows	`gt-ch-insert-retry`	`max_retries: int = 10`	after 10 failed attempts the rows are dropped; they are not queued for the next cycle

Anchors live in src/extract_gt.py. The dropped-rows behaviour is the most consequential failure mode — see below.
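
The pacing behaviour can be illustrated with a small token bucket. This is a sketch under stated assumptions, not the code in src/extract_gt.py: the class and method names, the bucket capacity, and the gradual 5% recovery step are hypothetical. Only the 3.0 req/s target and the 20% cut on HTTP 429 come from the table above.

```python
import time


class TokenBucket:
    """Token bucket whose refill rate backs off on HTTP 429 and recovers later."""

    def __init__(self, target_rate=3.0, capacity=3.0, clock=time.monotonic):
        self.target_rate = target_rate  # req/s ceiling from the constants table
        self.rate = target_rate         # current, possibly reduced, rate
        self.capacity = capacity        # hypothetical burst size
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def wait_time(self):
        """Seconds until one token is available (0.0 if one is ready now)."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate

    def acquire(self, sleep=time.sleep):
        """Block until a token is available, then consume it."""
        delay = self.wait_time()
        if delay > 0:
            sleep(delay)
            self._refill()
        self.tokens -= 1.0

    def on_rate_limited(self):
        """HTTP 429: cut the current rate by 20%."""
        self.rate *= 0.8

    def on_success(self, step=0.05):
        """Assumed recovery policy: creep back toward the target rate."""
        self.rate = min(self.target_rate, self.rate * (1 + step))
```

A dispatcher would call `acquire()` before each request, `on_rate_limited()` when it sees a 429, and `on_success()` otherwise.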

Failure modes

Symptom	Cause	Behaviour
SearchAPI 429 (rate limited)	Token bucket too aggressive	Reduce rate by 20% (`recover_rate`), sleep 5–10 s, retry up to 3× per keyword; then abandon that keyword
SearchAPI timeout / 5xx	Upstream blip	Retry with jitter; abandon the keyword after 3 retries
Shard scan timeout (selecting candidate keywords from hub)	CH slow	2400 s timeout per shard, 5 failover attempts (laksa ↔ prata), then log and skip that shard
CH insert into `keywords_volume.google_trends` fails	isog slow / restarting	10 retries with exponential back-off (5·2^attempt, cap 120 s); on full exhaustion, rows are dropped — `keyword_volume_kb_access.log` will show "DROPPING N rows after 10 retries"

The "drop rows after exhaustion" path means a small fraction of refresh results can be lost during isog incidents, with the log line above as the only trace. The next refresh cycle retries those keywords only if processing.py re-flags them with needs_gt=true.
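
To make the insert back-off schedule concrete, here is a sketch of the retry-then-drop path. It assumes `attempt` is counted from 0 and that no sleep follows the final failure; neither detail is confirmed by the source. `insert_with_retry` and `insert_fn` are hypothetical names, while the 5·2^attempt formula, the 120 s cap, the 10 retries, and the log wording come from the table above.

```python
import logging
import time

log = logging.getLogger("extract_gt_sketch")  # hypothetical logger name


def backoff_delay(attempt, base=5, cap=120):
    """Delay in seconds before the next try: 5 * 2^attempt, capped at 120."""
    return min(cap, base * 2 ** attempt)


def insert_with_retry(insert_fn, rows, max_retries=10, sleep=time.sleep):
    """Try insert_fn(rows) up to max_retries times; drop the rows on exhaustion."""
    for attempt in range(max_retries):
        try:
            insert_fn(rows)
            return True
        except Exception as exc:  # broad on purpose: any insert failure is retried
            log.warning("insert attempt %d failed: %s", attempt + 1, exc)
            if attempt < max_retries - 1:
                sleep(backoff_delay(attempt))
    log.error("DROPPING %d rows after %d retries", len(rows), max_retries)
    return False
```

Under these assumptions the delays are 5, 10, 20, 40, 80 s and then 120 s for each remaining attempt, so a full exhaustion spends a little over 10 minutes sleeping (635 s) before the rows are dropped.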

Runtime is not predictable: it depends on SearchAPI latency, queue depth, and current isog write throughput.

Auth & config

SearchAPI key: /home/team/mimi/data/searchapi_key.txt
ClickHouse: isog.int.ahrefs:9001, admin creds from /etc/clickhouse-client/config.json
Staging table reuse TTL: STAGING_FRESHNESS_HOURS = 6 (lets repeated runs within 6 h re-use the candidate list)
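
The staging reuse rule amounts to a simple age check. The sketch below assumes the staging table's creation time is available as a timezone-aware datetime and that exactly 6 h counts as stale; `staging_is_fresh` is a hypothetical helper name, and only `STAGING_FRESHNESS_HOURS = 6` comes from the configuration above.

```python
from datetime import datetime, timedelta, timezone

STAGING_FRESHNESS_HOURS = 6  # value from the config list above


def staging_is_fresh(created_at, now=None):
    """Return True if the candidate list is younger than the freshness window."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created_at < timedelta(hours=STAGING_FRESHNESS_HOURS)
```

A run would reuse the existing candidate list when this returns True and rebuild it from the hub shards otherwise.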