extract_gt.py — Google Trends refresh extractor

Code: src/extract_gt.py. Invoked by extract_gkp_gt.sh independently of keep_running.sh. Last validated: 2026-05-21

What it does

Picks keywords flagged for GT refresh (needs_gt=true inside processed_volume_trend_meta, written by processing.py's refresh-detection logic), fetches their Trends timeseries via SearchAPI, and writes results to keywords_volume.google_trends on isog. The hub table is not touched directly — kvprocessor.cpp pulls the refreshed data into keywords_data_local on its next iteration.

How it runs

Decoupled from the hourly hub-loop. extract_gkp_gt.sh polls until the process_keywords Slurm job is idle, then runs:

python ./src/extract_gt.py --tier 1   # urgent refresh
python ./src/extract_gt.py --tier 2   # deferred refresh

Tier 1 is for keywords flagged "urgent" by refresh-detection; tier 2 is for the deferred long tail.
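
As a rough illustration of how the --tier flag could map onto the per-run caps listed under Load-bearing constants, here is a minimal sketch. The names TIER_CAPS, parse_args and cap_candidates are hypothetical and not taken from src/extract_gt.py; only the flag name and the cap values come from this page.

```python
import argparse

# Values from the "Per-run keyword cap" row; the constant name is an assumption.
TIER_CAPS = {1: 20000, 2: 13000}


def parse_args(argv=None):
    """Parse the --tier flag shown in the commands above."""
    parser = argparse.ArgumentParser(description="GT refresh extractor (sketch)")
    parser.add_argument("--tier", type=int, choices=sorted(TIER_CAPS), required=True)
    return parser.parse_args(argv)


def cap_candidates(keywords, tier):
    """Keep at most the per-run cap of candidate keywords for this tier."""
    return keywords[: TIER_CAPS[tier]]
```

How the real script orders candidates before truncating them is not stated here, so the sketch simply keeps the first N.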

Architecture: 1 API dispatcher + 5 worker threads (parse + batch insert) + 1 progress monitor.
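
The thread layout above might look roughly like the following sketch: one dispatcher feeding a queue, five workers that batch and insert, and a monitor that reports progress. Everything here (run_pipeline, the fetch and insert_batch callables, batch_size, the reporting interval) is an illustrative assumption rather than the actual structure of src/extract_gt.py.

```python
import queue
import threading

NUM_WORKERS = 5      # from the page: 5 worker threads
_STOP = object()     # sentinel telling a worker to exit


def run_pipeline(keywords, fetch, insert_batch, batch_size=100,
                 report=lambda n: None, interval=1.0):
    """Run 1 dispatcher + NUM_WORKERS workers + 1 monitor; return rows inserted."""
    responses = queue.Queue()
    done = {"count": 0}
    lock = threading.Lock()
    stop = threading.Event()

    def dispatcher():
        # Single thread talks to the API so pacing stays in one place.
        for kw in keywords:
            responses.put((kw, fetch(kw)))
        for _ in range(NUM_WORKERS):
            responses.put(_STOP)

    def flush(batch):
        insert_batch(batch)
        with lock:
            done["count"] += len(batch)

    def worker():
        batch = []
        while True:
            item = responses.get()
            if item is _STOP:
                break
            batch.append(item)          # real code would parse the payload here
            if len(batch) >= batch_size:
                flush(batch)
                batch = []
        if batch:
            flush(batch)                # flush the final partial batch

    def monitor():
        # Wake every `interval` seconds until the workers have finished.
        while not stop.wait(interval):
            report(done["count"])

    mon = threading.Thread(target=monitor)
    threads = [threading.Thread(target=dispatcher)]
    threads += [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    mon.start()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    stop.set()
    mon.join()
    return done["count"]
```

Keeping a single dispatcher means only one thread has to respect the SearchAPI rate limit, while parsing and inserting can proceed in parallel.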

Load-bearing constants

Constant | Slug | Value | Note
--- | --- | --- | ---
Per-run keyword cap | gt-tier-caps-per-run | {1: 20000, 2: 13000} | tier 1 = 20K, tier 2 = 13K
SearchAPI pacing | gt-searchapi-rate | target_rate: float = 3.0 | req/s token-bucket; drops 20% on HTTP 429 then recovers
CH insert retry → drop rows | gt-ch-insert-retry | max_retries: int = 10 | after 10 failures the rows are dropped, not retried next cycle

Anchors live in src/extract_gt.py. The dropped-rows behaviour is the most consequential failure mode — see below.
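
To make the SearchAPI pacing row concrete, here is a minimal token-bucket sketch that starts at 3.0 req/s and cuts its rate by 20% on an HTTP 429. The class name, the single-token capacity and the 5% recovery step are assumptions for illustration; the page does not say how quickly the real pacer recovers.

```python
import time


class AdaptivePacer:
    """Token bucket whose refill rate drops on 429 and creeps back up on success."""

    def __init__(self, target_rate=3.0, clock=time.monotonic, sleep=time.sleep):
        self.target_rate = target_rate   # req/s, from the constants table
        self.rate = target_rate          # current (possibly reduced) rate
        self.tokens = 1.0                # assumption: capacity of one token, no bursts
        self._clock = clock
        self._sleep = sleep
        self._last = clock()

    def _refill(self):
        now = self._clock()
        self.tokens = min(1.0, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def acquire(self):
        """Block until one request may be sent."""
        self._refill()
        if self.tokens < 1.0:
            self._sleep((1.0 - self.tokens) / self.rate)
            self._refill()
        self.tokens -= 1.0

    def on_rate_limited(self):
        """HTTP 429 seen: reduce the rate by 20%."""
        self.rate *= 0.8

    def on_success(self):
        """Assumed recovery policy: +5% per success, never above target_rate."""
        self.rate = min(self.target_rate, self.rate * 1.05)
```

The clock and sleep functions are injectable so the pacing logic can be exercised without real waiting.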

Failure modes

Symptom | Cause | Behaviour
--- | --- | ---
SearchAPI 429 (rate limited) | Token bucket too aggressive | Reduce rate by 20% (recover_rate), sleep 5–10 s, retry up to 3× per keyword; then abandon that keyword
SearchAPI timeout / 5xx | Upstream blip | Retry with jitter; abandon after 3×
Shard scan timeout (selecting candidate keywords from hub) | CH slow | 2400 s per shard timeout, 5 failover attempts (laksa ↔ prata), then log and skip that shard
CH insert into keywords_volume.google_trends fails | isog slow / restarting | 10 retries with exponential back-off (5·2^attempt, cap 120 s); on full exhaustion, rows are dropped; keyword_volume_kb_access.log will show "DROPPING N rows after 10 retries"
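
The per-keyword retry behaviour in the first two rows could be sketched as below. This is a hedged illustration: fetch is a hypothetical callable returning (status, payload), "3×" is read as three attempts in total, and the jitter range for timeouts and 5xx responses is an assumption, since the page gives no figure for it.

```python
import random
import time

MAX_ATTEMPTS = 3  # assumption: "3x" means three attempts in total


def fetch_with_retry(keyword, fetch, on_rate_limited=lambda: None,
                     sleep=time.sleep, rng=random.uniform):
    """Return the payload, or None if the keyword is abandoned."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            status, payload = fetch(keyword)
        except TimeoutError:
            status, payload = None, None   # treat a timeout like a retryable error
        if status == 200:
            return payload
        retryable = status is None or status == 429 or 500 <= status < 600
        if not retryable:
            return None                    # other 4xx: retrying will not help
        if status == 429:
            on_rate_limited()              # e.g. cut the pacer rate by 20%
        if attempt < MAX_ATTEMPTS - 1:
            if status == 429:
                sleep(rng(5.0, 10.0))      # 5-10 s pause, per the table
            else:
                sleep(rng(0.5, 2.0))       # assumed jitter range
    return None                            # attempts exhausted: abandon keyword
```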

Because rows are dropped once retries are exhausted, a small fraction of refresh results can be lost during isog incidents, with the log line above as the only trace. The next refresh cycle retries those keywords only if processing.py re-flags them as needs_gt.
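
The insert retry schedule from the table (5·2^attempt seconds, capped at 120 s, 10 retries, then drop) works out as in the sketch below. The function names are hypothetical, attempts are assumed to be numbered from 0, and it is assumed that no sleep follows the final failure; only the formula, the cap, the retry count and the log wording come from this page.

```python
import logging
import time

log = logging.getLogger("extract_gt_sketch")


def backoff_delay(attempt, base=5.0, cap=120.0):
    """Delay in seconds before the next try: 5 * 2**attempt, capped at 120."""
    return min(cap, base * 2 ** attempt)


def insert_with_retry(rows, do_insert, max_retries=10, sleep=time.sleep):
    """Try the insert up to max_retries times; return False if rows are dropped."""
    for attempt in range(max_retries):
        try:
            do_insert(rows)
            return True
        except Exception as exc:  # sketch only: real code would catch narrower errors
            log.warning("insert attempt %d failed: %s", attempt + 1, exc)
            if attempt < max_retries - 1:
                sleep(backoff_delay(attempt))
    # Exhausted: rows are not re-queued, matching the behaviour described above.
    log.error("DROPPING %d rows after %d retries", len(rows), max_retries)
    return False
```

Under these assumptions the delays run 5, 10, 20, 40, 80 s and then 120 s for every later attempt, so a full exhaustion waits 635 s in total before the rows are dropped.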

Runtime is indeterminate: it depends on SearchAPI latency, queue depth, and current isog write throughput.

Auth & config

  • SearchAPI key: /home/team/mimi/data/searchapi_key.txt
  • ClickHouse: isog.int.ahrefs:9001, admin creds from /etc/clickhouse-client/config.json
  • Staging table reuse TTL: STAGING_FRESHNESS_HOURS = 6 (lets repeated runs within 6 h re-use the candidate list)
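
The staging reuse rule in the last bullet amounts to a simple age check, sketched here under assumptions: staging_is_fresh and its created_at argument are hypothetical, and whether the real comparison is strict at exactly 6 h is not stated on this page.

```python
from datetime import datetime, timedelta, timezone

STAGING_FRESHNESS_HOURS = 6  # from the bullet above


def staging_is_fresh(created_at, now=None):
    """True if the staged candidate list is younger than the freshness window."""
    now = now or datetime.now(timezone.utc)
    return now - created_at < timedelta(hours=STAGING_FRESHNESS_HOURS)
```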

See also

  • GT source page — what GT carries and the validity gates downstream of this extractor
  • Central hub table — where needs_gt lives (in processed_volume_trend_meta) and where the refreshed data eventually lands (after kvprocessor pulls it)
  • extract_gkp.py — the sibling extractor with a similar architecture but different API
  • SearchAPI Google Trends docs — upstream API
  • extract_gkp_gt.sh (in project root) — the wrapper that schedules both extractors
  • Anchors in source: src/extract_gt.py:KB-ANCHOR:gt-tier-caps-per-run, gt-searchapi-rate, gt-ch-insert-retry