kvprocessor: Stage 1 ingestion
Code: kvprocessor.cpp (the whole file is the program).
Last validated: 2026-05-21
What it does
The C++ binary that pulls every raw source into the central hub table on every iteration of the hourly loop. It is standalone — not launched by Slurm. The orchestrator (keep_running.sh) just invokes ./_build/kvprocessor directly.
Five operations run concurrently:
- PullGTNewData() / PullGTData(): fetch fresh Google Trends rows (and historical rows on first init) from keywords_volume.google_trends on isog
- PullJSFirstSeenData(): read JS first-seen metadata from jumpshot.keyword_first_seen on isog
- ProcessGSCRangeData(): pull GSC from peer laksa shards, dedup, roll up to monthly, write to the hub
- ProcessGKPRangeData(): pull GKP per shard/range from keywords_volume.gkp_trends_raw on isog and write to the hub
Each writes its assigned columns on keywords.keywords_data_local; the anyLast engine semantics mean none of them clobber each other (see Central hub table).
How it runs
- Concurrency: per-shard ThreadPools with 8–25 threads per shard (configurable). One range = one task; data flow stays within a shard, with no scatter-gather across shards.
- ClickHouse client: 7200 s recv timeout, 300 s send timeout, TCP keepalive every 60 s. Single connection per server, multiplexed across threads.
- Iteration model: each pull function spawns per-shard worker tasks; ProcessGSCRangeData covers all 256 ranges of all 200 shards.
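The "one range = one task" model can be sketched as follows. This is a minimal illustration, not the code in kvprocessor.cpp: the real ThreadPool class is not shown here, so plain std::thread workers stand in for it, and RunShardRanges and kRangesPerShard are hypothetical names. The 256-range count comes from the text above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical constant: the prose states 256 ranges per shard.
constexpr int kRangesPerShard = 256;

// Runs one task per range on a fixed set of worker threads that all belong
// to a single shard (no work crosses shard boundaries).
// Returns the number of ranges whose task ran.
int RunShardRanges(int shard_id, int num_threads,
                   const std::function<void(int shard, int range)>& task) {
    std::atomic<int> next_range{0};
    std::atomic<int> done{0};
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(num_threads));
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            // Each worker claims the next unclaimed range until none remain.
            for (;;) {
                const int range = next_range.fetch_add(1);
                if (range >= kRangesPerShard) break;
                task(shard_id, range);
                done.fetch_add(1);
            }
        });
    }
    for (auto& w : workers) w.join();
    return done.load();
}
```

A caller would invoke RunShardRanges once per shard with a thread count in the configured 8–25 band; the task must be safe to call from several threads at once.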
Pulling from another laksa: retry + failover
Anchor: kvprocessor.cpp:KB-ANCHOR:ch-pull-retry-policy (PullDataFromAnotherLaksa).
On a failed pull (CH timeout, network blip, replica unreachable) the function retries up to 10 times with 30-second back-off between attempts. Each retry alternates the target between the laksa{NNN} and prata{NNN} hosts (the replica pair), so a single-host outage is transparent. After 10 failures the shard is skipped, not retried in the same iteration — the iteration completes with that shard's pull missing.
This means a flaky CH cluster degrades the iteration's coverage but never blocks the orchestrator. The next iteration of keep_running.sh retries everything from scratch.
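A rough sketch of the retry policy described above, under stated assumptions: "up to 10 times" is read as 10 attempts in total, the first attempt targets the laksa host, and {NNN} is a zero-padded three-digit shard number. RetryPolicy, ReplicaHost and PullWithFailover are hypothetical names, not the signature of PullDataFromAnotherLaksa.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <thread>

// Values from the prose: 10 attempts, 30 s back-off between attempts.
struct RetryPolicy {
    int max_attempts = 10;
    std::chrono::milliseconds backoff{30000};
};

// Assumption: hosts are named laksa007 / prata007 (zero-padded to 3 digits).
std::string ReplicaHost(int shard, bool use_prata) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%s%03d", use_prata ? "prata" : "laksa", shard);
    return buf;
}

// Tries the pull against alternating replicas. Returns the host that served
// the pull, or an empty string if every attempt failed and the shard is
// skipped for this iteration.
std::string PullWithFailover(int shard, const RetryPolicy& policy,
                             const std::function<bool(const std::string&)>& try_pull) {
    for (int attempt = 0; attempt < policy.max_attempts; ++attempt) {
        // Even attempts go to laksa, odd attempts to prata (assumed order).
        const std::string host = ReplicaHost(shard, attempt % 2 == 1);
        if (try_pull(host)) return host;
        // Back off before the next attempt, but not after the last one.
        if (attempt + 1 < policy.max_attempts) {
            std::this_thread::sleep_for(policy.backoff);
        }
    }
    return std::string();
}
```

Returning an empty result instead of throwing mirrors the behaviour described here: the caller records the shard as missing and moves on.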
Source-side freshness filter (GSC)
Anchor: kvprocessor.cpp:KB-ANCHOR:gsc-aggregated-45-day-window.
The GSC source-table read includes WHERE metrics_imported_at_max > DATE_SUB(curdate(), 45). This is a performance knob, not a data-quality threshold — it bounds the per-shard scan so kvprocessor completes in the hourly window without blowing out the laksa server workers. Older GSC history is still present in the upstream raw tables; it's just not part of the kvprocessor read path. (See GSC source for the longer framing.)
Failure modes
Pipeline-blocking failures are rare; most "failures" are partial-coverage events that the next hourly cycle corrects.
| Symptom | Cause | Behaviour |
|---|---|---|
| Single shard's GSC pull fails | CH replica slow / unreachable | 10 retries with laksa↔prata flip, then skip shard for this iteration |
| Source table unavailable (e.g. isog down) | Upstream outage | That source's pull halts; ProcessGSCRangeData and other pulls continue with stale upstream data |
| Range-level CH query exception | Schema mismatch, dependent dictionary stale | Logged, that range skipped |
| Wall-clock blowup | Source table backlog | No internal deadline — the iteration just takes longer; keep_running.sh waits for it before launching Stage 2 |
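The "logged, that range skipped" row can be pictured as a per-range try/catch. This is only a sketch: RunRangesSkippingFailures is a hypothetical helper, and the actual logging call and exception types used in kvprocessor.cpp are not shown in this page.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <vector>

// Runs `query` for ranges [0, num_ranges). A range whose query throws is
// logged and skipped; the remaining ranges still run.
// Returns the list of skipped ranges.
std::vector<int> RunRangesSkippingFailures(int num_ranges,
                                           const std::function<void(int)>& query) {
    std::vector<int> skipped;
    for (int range = 0; range < num_ranges; ++range) {
        try {
            query(range);
        } catch (const std::exception& e) {
            // Stand-in for the real logger: report and carry on.
            std::cerr << "range " << range << " skipped: " << e.what() << '\n';
            skipped.push_back(range);
        }
    }
    return skipped;
}
```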
Logs land in /home/data/kvprocess_logs/log_YYYY-MM-DD_HH-MM-SS.txt plus stdout.
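For reference, a log path in that format can be produced with strftime. LogPathFor is a hypothetical helper; whether the real timestamp is local time or UTC is not stated here, so the caller supplies the broken-down time.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>

// Formats `when` as <dir>/log_YYYY-MM-DD_HH-MM-SS.txt.
std::string LogPathFor(const std::tm& when,
                       const std::string& dir = "/home/data/kvprocess_logs") {
    char stamp[32];
    const std::size_t n =
        std::strftime(stamp, sizeof stamp, "%Y-%m-%d_%H-%M-%S", &when);
    return dir + "/log_" + std::string(stamp, n) + ".txt";
}
```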
Runtime characteristics
Per-iteration duration is indeterminate — depends on CH load, source-table sizes, replica availability. Don't expect a steady "X minutes per run".
See also
- Architecture — full pipeline diagram
- Central hub table — destination of every write
- GSC · GT · GKP · Jumpshot — what each pull function carries
- keep-running.md (Phase 2 · Batch 3c): the orchestrator that invokes this binary
- Build: make build (uses CMakeLists.txt / Makefile); deploy: make build_on_isog