extract_gkp.py: Google Keyword Planner refresh extractor
Code: src/extract_gkp.py. Invoked by extract_gkp_gt.sh independently of keep_running.sh.
Last validated: 2026-05-21
What it does
Picks keywords flagged for GKP refresh (needs_gkp=true inside processed_volume_trend_meta), fetches their absolute monthly volumes + CPC via DataForSEO, and writes to keywords_volume.gkp_trends_raw on isog. Like extract_gt.py, this is decoupled from the hub loop — kvprocessor.cpp pulls the refreshed data into keywords_data_local on its next run.
How it runs
Architecture: 1 API dispatcher (paces requests under DFSEO's rate cap) + 10 worker threads (merge old + new trends, batch-insert) + 1 progress monitor.
The per-shard candidate cap defaults to 100 000 keywords, which keeps the staging table size bounded. DFSEO API requests are batched at up to ~1 000 keywords per call.
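The dispatcher's pacing and batching can be sketched as below. This is a minimal illustration, not the code in src/extract_gkp.py: the names `Pacer` and `batched` are hypothetical, and a simple fixed-interval throttle is assumed (one request every 60 / queries_per_minute seconds). The real throttle algorithm may differ.

```python
import time
from typing import Callable, Iterator, List, Sequence

QUERIES_PER_MINUTE = 10    # value documented on this page
MAX_BATCH_KEYWORDS = 1000  # "up to ~1 000 keywords per call"


def batched(keywords: Sequence[str], size: int = MAX_BATCH_KEYWORDS) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` keywords."""
    for start in range(0, len(keywords), size):
        yield list(keywords[start:start + size])


class Pacer:
    """Hypothetical fixed-interval pacer: one request slot every 60/qpm seconds."""

    def __init__(self, qpm: int = QUERIES_PER_MINUTE,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = 60.0 / qpm
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()  # the first request may go out immediately

    def wait(self) -> float:
        """Block until the next slot is free and return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_slot - now)
        if delay:
            self._sleep(delay)
        # Reserve the following slot relative to whichever is later.
        self._next_slot = max(now, self._next_slot) + self.interval
        return delay
```

A dispatcher thread would call `pacer.wait()` before each API request and hand every response to the worker queue. Injecting `clock` and `sleep` keeps the pacing logic testable without real waiting.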
Load-bearing constants
| Constant | Slug | Value | Note |
|---|---|---|---|
| DataForSEO request rate | gkp-dfseo-rate-limit | queries_per_minute = 10 | DFSEO documented hard cap is 12 req/min; we leave headroom |
| API retry + invalid-char strip | gkp-api-retry-strip | nretries = 3, max_strips = 5 | DFSEO returns error code 40501 for keywords with restricted symbols/emojis; we strip the offending keyword up to 5 times, then abandon the batch |
| CH insert retry → FATAL | gkp-ch-insert-retry | max_retries: int = 20 | 20 retries, then raises a FATAL exception; much harsher than GT's "drop rows and continue" |
Anchors live in src/extract_gkp.py.
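The retry and strip behaviour described by gkp-api-retry-strip can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: `DfseoError`, `fetch_with_strip`, and `call_api` are hypothetical names, and the sketch assumes the error response identifies the offending keyword. How the real code locates that keyword is not documented on this page.

```python
import random
import time
from typing import Any, Callable, List, Optional

NRETRIES = 3            # generic-error retries (documented value)
MAX_STRIPS = 5          # 40501 strip attempts (documented value)
RESTRICTED_CODE = 40501


class DfseoError(Exception):
    """Hypothetical error type; `keyword` names the offender when one is reported."""

    def __init__(self, code: int, keyword: Optional[str] = None) -> None:
        super().__init__(f"DFSEO error {code}")
        self.code = code
        self.keyword = keyword


def fetch_with_strip(batch: List[str],
                     call_api: Callable[[List[str]], Any],
                     sleep: Callable[[float], None] = time.sleep) -> Any:
    """Return the API result, or None if the batch is abandoned."""
    keywords = list(batch)
    retries = strips = 0
    while keywords:
        try:
            return call_api(keywords)
        except DfseoError as err:
            if err.code == RESTRICTED_CODE and err.keyword in keywords:
                if strips >= MAX_STRIPS:
                    return None          # strip budget exhausted: abandon batch
                strips += 1
                keywords.remove(err.keyword)
                continue                 # resend immediately without the offender
            if retries >= NRETRIES:
                return None              # generic errors exhausted: abandon batch
            retries += 1
            sleep(random.uniform(20, 30))  # 20-30 s pause, per the failure-mode table
    return None                          # every keyword was stripped
```

Keywords in an abandoned batch are simply not written, which is consistent with them staying flagged needs_gkp=true for the next cycle.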
Failure modes
| Symptom | Cause | Behaviour |
|---|---|---|
| DFSEO API timeout / generic error | Upstream blip | 3 retries with 20–30 s sleep, then abandon batch |
| DFSEO 40501 (restricted symbols / adult content) | Keyword violates Google Ads policy | Strip offending keyword up to 5 times; if still failing, abandon batch. The keyword stays flagged needs_gkp=true for next cycle. |
| Dispatcher pacing drift | Throttle algorithm misfires | DFSEO may reject subsequent requests until the per-minute window resets — API quota burned for that window |
| Shard scan timeout | CH slow | 2400 s per shard, 5 failover attempts (laksa ↔ prata), log and skip |
| CH insert into gkp_trends_raw fails | isog slow / restarting | 20 retries with 10–20 s random backoff; on full exhaustion, raises a FATAL exception that stops the run. This is intentionally harsher than GT's drop-rows behaviour because GKP data is more expensive to re-fetch (cost-per-call billing). |
The FATAL-on-exhaustion path is the operationally distinct behaviour vs. extract_gt.py. It means a sustained isog write outage halts GKP refresh and someone has to restart the script.
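A minimal sketch of the retry-then-FATAL insert path, assuming a caller-supplied `insert` callable. `FatalInsertError` and `insert_with_retry` are hypothetical names, and counting the first attempt as one of the 20 tries is an assumption; the real code may count differently.

```python
import random
import time
from typing import Any, Callable, Optional, Sequence


class FatalInsertError(RuntimeError):
    """Hypothetical stand-in for the FATAL exception that stops the run."""


def insert_with_retry(rows: Sequence[Any],
                      insert: Callable[[Sequence[Any]], None],
                      max_retries: int = 20,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Try the insert up to `max_retries` times; return the attempt that succeeded."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_retries + 1):
        try:
            insert(rows)
            return attempt
        except Exception as err:  # broad on purpose: any write failure is retried
            last_error = err
            if attempt < max_retries:
                sleep(random.uniform(10, 20))  # 10-20 s random backoff
    # Unlike a drop-rows strategy, exhaustion is not swallowed: the run stops here.
    raise FatalInsertError(
        f"insert failed after {max_retries} attempts"
    ) from last_error
```

Because the exception propagates instead of being logged and skipped, the already-paid-for DFSEO responses are never silently discarded; the cost is that an operator has to restart the script.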
Runtime is indeterminate: it depends on DFSEO latency, how many batches complete, and CH write throughput.
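For contrast with the FATAL insert path, the shard-scan row of the table (2400 s timeout, 5 failover attempts, log and skip) can be sketched like this. `scan_shard` and `run_query` are hypothetical names, and strict alternation between the two hosts, starting with laksa, is an assumption.

```python
from itertools import cycle, islice
from typing import Any, Callable, Optional, Sequence

SHARD_TIMEOUT_S = 2400       # per-shard scan timeout (documented value)
FAILOVER_ATTEMPTS = 5        # documented value
HOSTS = ("laksa", "prata")   # failover pair named in the table


def scan_shard(shard: int,
               run_query: Callable[[str, int, int], Any],
               hosts: Sequence[str] = HOSTS,
               attempts: int = FAILOVER_ATTEMPTS,
               log: Callable[[str], None] = print) -> Optional[Any]:
    """Try the scan on alternating hosts; return None if every attempt fails."""
    for attempt, host in enumerate(islice(cycle(hosts), attempts), start=1):
        try:
            return run_query(host, shard, SHARD_TIMEOUT_S)
        except Exception as err:
            log(f"shard {shard}: attempt {attempt} on {host} failed: {err}")
    log(f"shard {shard}: skipped after {attempts} attempts")
    return None  # the shard is skipped, the run continues
```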
Auth & config
- DataForSEO key: /home/team/mimi/data/dataforseo_key.txt (2-line file: username + password)
- ClickHouse: isog.int.ahrefs:9001, admin creds from /etc/clickhouse-client/config.json
- Staging table reuse TTL: STAGING_FRESHNESS_HOURS = 24 (wider than GT's 6 h because GKP refresh is less frequent)
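The two config items above that involve logic, the 2-line key file and the staging reuse TTL, could be handled roughly as follows. The function names are hypothetical, and the sketch assumes the staging table's creation time is available as a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional, Tuple

STAGING_FRESHNESS_HOURS = 24  # documented value


def read_dfseo_credentials(path: str) -> Tuple[str, str]:
    """Read (username, password) from a 2-line key file."""
    lines = [ln.strip() for ln in Path(path).read_text(encoding="utf-8").splitlines()]
    lines = [ln for ln in lines if ln]  # tolerate a trailing blank line
    if len(lines) != 2:
        raise ValueError(f"expected 2 non-empty lines in {path}, got {len(lines)}")
    return lines[0], lines[1]


def staging_is_fresh(created_at: datetime, now: Optional[datetime] = None) -> bool:
    """True if a staging table created at `created_at` is still within the reuse TTL."""
    now = now or datetime.now(timezone.utc)
    return now - created_at < timedelta(hours=STAGING_FRESHNESS_HOURS)
```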
Important interaction with GKP all-zero handling
The "GKP returns null/zero for keywords prohibited under Google Ads policy" caveat (adult content, emojis, restricted symbols) is documented on the GKP source page. That caveat is partly why this extractor strips offending keywords rather than treating the 40501 response as a failure — but downstream, the gkp-validity-with-js-fallback and gt-gsc-correlation-gate decisions need to know that an all-zero GKP series could be a prohibited-keyword artefact rather than a true zero.
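As an illustration of the check a downstream gate might run, the helper below flags a series in which every month is null or zero. The name `is_suspect_all_zero` is hypothetical, and it does not describe how gkp-validity-with-js-fallback or gt-gsc-correlation-gate actually work.

```python
from typing import Optional, Sequence


def is_suspect_all_zero(monthly_volumes: Sequence[Optional[int]]) -> bool:
    """True when a non-empty series contains only null or zero months."""
    return len(monthly_volumes) > 0 and all(
        v is None or v == 0 for v in monthly_volumes
    )
```

A series flagged this way cannot be distinguished from a true zero by the GKP data alone, which is why the gates need a second signal.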
See also
- GKP source page: what GKP carries, the validity gates, and the prohibited-keywords caveat
- Central hub table: where needs_gkp lives and where refreshed data eventually lands
- extract_gt.py: sibling extractor; same architecture pattern, different API + failure semantics
- DataForSEO Google Ads - Search Volume Live: upstream API docs
- extract_gkp_gt.sh (in project root): wrapper scheduler
- Anchors: src/extract_gkp.py:KB-ANCHOR:gkp-dfseo-rate-limit, gkp-api-retry-strip, gkp-ch-insert-retry