extract_gkp.py — Google Keyword Planner refresh extractor

Code: src/extract_gkp.py. Invoked by extract_gkp_gt.sh independently of keep_running.sh. Last validated: 2026-05-21

What it does

Picks keywords flagged for GKP refresh (needs_gkp=true inside processed_volume_trend_meta), fetches their absolute monthly volumes and CPC via DataForSEO, and writes the results to keywords_volume.gkp_trends_raw on isog. Like extract_gt.py, it is decoupled from the hub loop: kvprocessor.cpp pulls the refreshed data into keywords_data_local on its next run.

How it runs

Architecture: 1 API dispatcher (paces requests under DFSEO's rate cap) + 10 worker threads (merge old + new trends, batch-insert) + 1 progress monitor.
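The dispatcher's pacing can be pictured with a minimal sketch like the one below. It is not the code in src/extract_gkp.py: the class name MinuteRatePacer and its methods are hypothetical, and even spacing of requests is an assumption about how the throttle works. Only queries_per_minute = 10 comes from this page.

```python
import time
from typing import Callable


class MinuteRatePacer:
    """Hypothetical pacer: spaces calls evenly so that at most
    `queries_per_minute` requests start in any one minute."""

    def __init__(
        self,
        queries_per_minute: int = 10,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if queries_per_minute <= 0:
            raise ValueError("queries_per_minute must be positive")
        self.interval = 60.0 / queries_per_minute  # 6 s at 10 req/min
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()

    def wait_for_slot(self) -> float:
        """Block until the next request slot; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_slot - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule from whichever is later, so idle time is never
        # "banked" into a burst that could exceed the per-minute cap.
        self._next_slot = max(now, self._next_slot) + self.interval
        return delay
```

A dispatcher loop would call wait_for_slot() before each DFSEO request and hand the response to the worker threads. The clock and sleep parameters are injectable only so the pacing can be checked without real waiting.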

The per-shard candidate cap defaults to 100 000 keywords, which keeps the staging table size bounded. Each DFSEO API request carries up to ~1 000 keywords.
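As a rough illustration of how the cap and the batch size interact, the sketch below truncates a shard's candidate list and splits it into request-sized batches. The function names and the exact constants 100_000 and 1_000 as hard limits are assumptions for illustration; the real script may select and batch candidates differently.

```python
from typing import Iterator, List, Sequence

PER_SHARD_CAP = 100_000   # default cap from this page
DFSEO_BATCH_SIZE = 1_000  # "up to ~1 000" per call; exact value assumed


def cap_candidates(candidates: Sequence[str], cap: int = PER_SHARD_CAP) -> List[str]:
    """Keep at most `cap` candidates from one shard (hypothetical helper)."""
    return list(candidates[:cap])


def chunk_keywords(
    keywords: Sequence[str], batch_size: int = DFSEO_BATCH_SIZE
) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` keywords."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(keywords), batch_size):
        yield list(keywords[start:start + batch_size])
```

Under these assumptions a full shard of 100 000 candidates yields 100 API calls, which at 10 requests per minute is about 10 minutes of dispatch time per shard before retries.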

Load-bearing constants

| Constant | Slug | Value | Note |
|---|---|---|---|
| DataForSEO request rate | gkp-dfseo-rate-limit | queries_per_minute = 10 | DFSEO documented hard cap is 12 req/min; we leave headroom |
| API retry + invalid-char strip | gkp-api-retry-strip | nretries = 3, max_strips = 5 | DFSEO returns error code 40501 for keywords with restricted symbols/emojis; we strip the offending keyword up to 5 times, then abandon the batch |
| CH insert retry → FATAL | gkp-ch-insert-retry | max_retries: int = 20 | 20 retries, then raises a FATAL exception; much harsher than GT's "drop rows and continue" |

Anchors live in src/extract_gkp.py.
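The retry-and-strip behaviour behind gkp-api-retry-strip can be sketched roughly as follows. This is an illustration under assumptions, not the script's code: the two exception classes, the send callable, and the way the offending keyword is identified are all hypothetical, and nretries is read here as "retries after the first attempt". Only nretries = 3, max_strips = 5, and the 20-30 s sleep come from this page.

```python
import random
import time
from typing import Any, Callable, List, Optional, Sequence


class TransientApiError(Exception):
    """Hypothetical: timeout or generic DFSEO error."""


class RestrictedKeywordError(Exception):
    """Hypothetical: DFSEO 40501, carrying the offending keyword."""

    def __init__(self, keyword: str) -> None:
        super().__init__(keyword)
        self.keyword = keyword


def fetch_with_retry_and_strip(
    keywords: Sequence[str],
    send: Callable[[List[str]], Any],
    nretries: int = 3,
    max_strips: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Any]:
    """Return the API result, or None when the batch is abandoned."""
    batch = list(keywords)
    retries = strips = 0
    while batch:
        try:
            return send(batch)
        except RestrictedKeywordError as err:
            if strips >= max_strips:
                return None  # still failing after 5 strips: abandon
            strips += 1
            batch = [k for k in batch if k != err.keyword]
        except TransientApiError:
            if retries >= nretries:
                return None  # 3 retries used up: abandon
            retries += 1
            sleep(random.uniform(20, 30))
    return None  # every keyword was stripped
```

Stripped keywords are simply dropped from the request in this sketch, so nothing is written for them in that run.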

Failure modes

| Symptom | Cause | Behaviour |
|---|---|---|
| DFSEO API timeout / generic error | Upstream blip | 3 retries with 20–30 s sleep, then abandon batch |
| DFSEO 40501 (restricted symbols / adult content) | Keyword violates Google Ads policy | Strip offending keyword up to 5 times; if still failing, abandon batch. The keyword stays flagged needs_gkp=true for next cycle. |
| Dispatcher pacing drift | Throttle algorithm misfires | DFSEO may reject subsequent requests until the per-minute window resets; API quota is burned for that window |
| Shard scan timeout | CH slow | 2400 s per shard, 5 failover attempts (laksa ↔ prata), log and skip |
| CH insert into gkp_trends_raw fails | isog slow / restarting | 20 retries with 10–20 s random backoff; on full exhaustion, raises a FATAL exception that stops the run. This is intentionally harsher than GT's drop-rows behaviour because GKP data is more expensive to re-fetch (cost-per-call billing). |

The FATAL-on-exhaustion path is the operationally distinct behaviour vs. extract_gt.py. It means a sustained isog write outage halts GKP refresh and someone has to restart the script.
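A minimal sketch of the FATAL-on-exhaustion pattern is shown below, assuming max_retries counts retries after the first attempt. FatalInsertError, insert_with_retry, and the insert callable are hypothetical names; the page confirms only max_retries = 20, the 10-20 s random backoff, and that exhaustion raises an exception that stops the run.

```python
import random
import time
from typing import Any, Callable, Optional, Sequence


class FatalInsertError(RuntimeError):
    """Hypothetical exception type: raised when every insert attempt failed."""


def insert_with_retry(
    insert: Callable[[Sequence[Any]], None],
    rows: Sequence[Any],
    max_retries: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Insert `rows`, retrying on any error; return the attempt that succeeded."""
    last_error: Optional[Exception] = None
    # One initial attempt plus `max_retries` retries.
    for attempt in range(1, max_retries + 2):
        try:
            insert(rows)
            return attempt
        except Exception as exc:
            last_error = exc
            if attempt <= max_retries:
                sleep(random.uniform(10, 20))  # random backoff
    # Unlike a drop-rows policy, the fetched data is not discarded
    # silently: the error propagates and halts the run.
    raise FatalInsertError(
        f"insert failed after {max_retries} retries"
    ) from last_error
```

With a 10-20 s backoff, 20 retries cover roughly 3 to 7 minutes of outage before the run stops, so shorter isog restarts should be absorbed without operator action.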

Runtime is indeterminate: it depends on DFSEO latency, the size of the batches that complete, and CH write throughput.

Auth & config

  • DataForSEO key: /home/team/mimi/data/dataforseo_key.txt (2-line file: username + password)
  • ClickHouse: isog.int.ahrefs:9001, admin creds from /etc/clickhouse-client/config.json
  • Staging table reuse TTL: STAGING_FRESHNESS_HOURS = 24 (wider than GT's 6 h because GKP refresh is less frequent)
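For orientation, the sketch below shows one way the 2-line key file and the staging TTL could be handled. Both helper names are hypothetical, and the rule "reuse the staging table if it was created less than STAGING_FRESHNESS_HOURS ago" is an assumption about what the TTL means; how the script obtains the table's creation time is not covered on this page.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional, Tuple, Union

STAGING_FRESHNESS_HOURS = 24  # value from this page


def read_dfseo_credentials(path: Union[str, Path]) -> Tuple[str, str]:
    """Read (username, password) from a 2-line key file."""
    text = Path(path).read_text(encoding="utf-8")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError(f"expected 2 non-empty lines, found {len(lines)}")
    return lines[0], lines[1]


def staging_is_fresh(
    created_at: datetime,
    now: Optional[datetime] = None,
    freshness_hours: int = STAGING_FRESHNESS_HOURS,
) -> bool:
    """True if a staging table created at `created_at` may be reused.

    Both datetimes are expected to be timezone-aware.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created_at < timedelta(hours=freshness_hours)
```

Rejecting a key file that does not have exactly two non-empty lines surfaces a misconfigured key at startup rather than as a string of authentication failures against the API.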

Important interaction with GKP all-zero handling

The "GKP returns null/zero for keywords prohibited under Google Ads policy" caveat (adult content, emojis, restricted symbols) is documented on the GKP source page. That caveat is partly why this extractor strips offending keywords rather than treating the 40501 response as a failure. Downstream, however, the gkp-validity-with-js-fallback and gt-gsc-correlation-gate decisions need to know that an all-zero GKP series could be a prohibited-keyword artefact rather than a true zero.
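A downstream check for the ambiguous case might look like the sketch below. The helper name and the representation of a series as a list of optional integers are assumptions for illustration; the actual gates are described on the GKP source page, not here.

```python
from typing import Iterable, Optional


def is_all_zero_series(monthly_volumes: Iterable[Optional[int]]) -> bool:
    """True if a non-empty series holds only zeros and nulls.

    Such a series is ambiguous: it may be a true zero-volume keyword
    or a keyword GKP refuses to report on under Google Ads policy.
    """
    values = list(monthly_volumes)
    return bool(values) and all(v is None or v == 0 for v in values)
```

An empty series returns False here, on the assumption that "no data at all" is a separate condition from "data present but all zero".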

See also

  • GKP source page — what GKP carries, the validity gates, and the prohibited-keywords caveat
  • Central hub table — where needs_gkp lives and where refreshed data eventually lands
  • extract_gt.py — sibling extractor; same architecture pattern, different API + failure semantics
  • DataForSEO Google Ads — Search Volume Live — upstream API docs
  • extract_gkp_gt.sh (in project root) — wrapper scheduler
  • Anchors: src/extract_gkp.py:KB-ANCHOR:gkp-dfseo-rate-limit, gkp-api-retry-strip, gkp-ch-insert-retry