kvprocessor: Stage 1 ingestion

Code: kvprocessor.cpp (the whole file is the program). Last validated: 2026-05-21

What it does

kvprocessor is the C++ binary that pulls every raw source into the central hub table on every iteration of the hourly loop. It is standalone and is not launched by Slurm: the orchestrator (keep_running.sh) invokes ./_build/kvprocessor directly.

Five pull functions, grouped into four operations, run concurrently:

PullGTNewData() / PullGTData() — fetch fresh Google Trends rows (and historical on first init) from keywords_volume.google_trends on isog
PullJSFirstSeenData() — read JS first-seen metadata from jumpshot.keyword_first_seen on isog
ProcessGSCRangeData() — pull GSC from peer laksa shards, dedup, roll up to monthly, write to the hub
ProcessGKPRangeData() — pull GKP per shard/range from keywords_volume.gkp_trends_raw on isog and write to the hub

Each operation writes only its assigned columns to keywords.keywords_data_local; the table's anyLast semantics mean that none of them clobbers another's columns (see Central hub table).
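The concurrent launch described above can be sketched as follows. This is a minimal illustration, not the code in kvprocessor.cpp: it assumes std::async as the launch mechanism, and PullOp and RunPullsConcurrently are hypothetical names. The callables stand in for the real pull functions, whose signatures are not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <future>
#include <stdexcept>
#include <string>
#include <vector>

// One named pull operation. `run` is a stand-in for a real pull function
// (PullGTNewData, ProcessGSCRangeData, ...); the real signatures may differ.
struct PullOp {
    std::string name;
    std::function<void()> run;
};

// Launches every operation on its own thread and waits for all of them.
// Returns the names of the operations that threw, so that one failing
// source does not stop the others.
std::vector<std::string> RunPullsConcurrently(const std::vector<PullOp>& ops) {
    assert(!ops.empty());
    std::vector<std::future<void>> futures;
    futures.reserve(ops.size());
    for (const PullOp& op : ops) {
        futures.push_back(std::async(std::launch::async, op.run));
    }
    std::vector<std::string> failed;
    for (std::size_t i = 0; i < futures.size(); ++i) {
        try {
            futures[i].get();  // rethrows any exception raised by the worker
        } catch (const std::exception&) {
            failed.push_back(ops[i].name);
        }
    }
    return failed;
}
```

Collecting failures instead of aborting mirrors the behaviour this page describes for a source outage, where the remaining pulls continue.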

How it runs

Concurrency: per-shard ThreadPools with 8–25 threads per shard (configurable). One range = one task; data flow is within-shard — no scatter-gather across shards.
ClickHouse client: 7200 s recv timeout, 300 s send timeout, TCP keepalive every 60 s. Single connection per server, multiplexed across threads.
Iteration model: each pull function spawns per-shard worker tasks; ProcessGSCRangeData covers all 256 ranges of all 200 shards.
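The "one range = one task" model within a single shard can be sketched as below. This is an illustration under stated assumptions rather than the actual ThreadPool in kvprocessor.cpp: it uses a shared atomic counter in place of a task queue, and ProcessShardRanges and the process_range callback are hypothetical names. The 256 ranges per shard come from the text above.

```cpp
#include <atomic>
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <thread>
#include <vector>

constexpr int kRangesPerShard = 256;  // per the iteration model above

// Processes every range of one shard with `num_threads` workers.
// `process_range(shard, range)` is a hypothetical callback standing in for
// the per-range pull and write. Returns the number of ranges skipped
// because the callback threw.
int ProcessShardRanges(int shard, int num_threads,
                       const std::function<void(int, int)>& process_range) {
    assert(num_threads >= 1);
    std::atomic<int> next_range{0};
    std::atomic<int> skipped{0};
    auto worker = [&] {
        for (;;) {
            // Each worker claims the next unprocessed range: one range, one task.
            const int range = next_range.fetch_add(1);
            if (range >= kRangesPerShard) return;
            try {
                process_range(shard, range);
            } catch (const std::exception&) {
                ++skipped;  // skip this range, keep the worker alive
            }
        }
    };
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (int i = 0; i < num_threads; ++i) threads.emplace_back(worker);
    for (std::thread& t : threads) t.join();
    return skipped.load();
}
```

Because all work stays inside one shard, the workers share nothing except the range counter, which matches the "no scatter-gather across shards" note.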

Pulling from another laksa: retry + failover

Anchor: kvprocessor.cpp:KB-ANCHOR:ch-pull-retry-policy (PullDataFromAnotherLaksa).

On a failed pull (CH timeout, network blip, replica unreachable) the function retries up to 10 times with 30-second back-off between attempts. Each retry alternates the target between the laksa{NNN} and prata{NNN} hosts (the replica pair), so a single-host outage is transparent. After 10 failures the shard is skipped, not retried in the same iteration — the iteration completes with that shard's pull missing.

This means a flaky CH cluster degrades the iteration's coverage but never blocks the orchestrator. The next iteration of keep_running.sh re-tries everything from scratch.
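The retry policy above can be sketched as a small loop. This is not the body of PullDataFromAnotherLaksa; it is a simplified model with several assumptions: ten attempts in total, laksa tried first, three-digit zero-padded host names (inferred from the {NNN} placeholder), and a hypothetical pull callback that returns true on success. HostFor and PullWithFailover are illustrative names.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <functional>
#include <optional>
#include <string>
#include <thread>

// Builds a host name such as "laksa007" or "prata007".
// The zero padding is an assumption based on the {NNN} placeholder.
std::string HostFor(bool use_prata, int shard) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%s%03d", use_prata ? "prata" : "laksa", shard);
    return buf;
}

// Calls `pull(host)` up to `max_attempts` times, alternating between the
// laksa and prata hosts and sleeping `backoff` between failed attempts.
// Returns the host that succeeded, or std::nullopt when the shard should
// be skipped for this iteration.
std::optional<std::string> PullWithFailover(
    int shard,
    const std::function<bool(const std::string&)>& pull,
    int max_attempts = 10,
    std::chrono::milliseconds backoff = std::chrono::seconds(30)) {
    assert(max_attempts >= 1);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        const std::string host = HostFor(attempt % 2 == 1, shard);
        if (pull(host)) return host;
        // No sleep after the final failure: the caller just moves on.
        if (attempt + 1 < max_attempts) std::this_thread::sleep_for(backoff);
    }
    return std::nullopt;
}
```

Returning an empty optional rather than throwing keeps the "skip, don't block" behaviour explicit at the call site: the caller records the missing shard and continues.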

Source-side freshness filter (GSC)

Anchor: kvprocessor.cpp:KB-ANCHOR:gsc-aggregated-45-day-window.

The GSC source-table read includes WHERE metrics_imported_at_max > DATE_SUB(curdate(), 45), so only rows imported within the last 45 days are scanned. This is a performance knob, not a data-quality threshold: it bounds the per-shard scan so that kvprocessor completes within the hourly window without overloading the laksa server workers. Older GSC history is still present in the upstream raw tables; it is just not part of the kvprocessor read path. (See GSC source for the longer framing.)
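To make the "performance knob" framing concrete, the predicate could be produced by a helper that takes the window as a parameter. This is a hypothetical sketch: GscFreshnessPredicate is not a function known to exist in kvprocessor.cpp, and only the predicate text itself is taken from this page.

```cpp
#include <cassert>
#include <string>

// Builds the freshness predicate quoted above with a configurable window.
// The default of 45 days matches the value cited in the text.
std::string GscFreshnessPredicate(int window_days = 45) {
    assert(window_days > 0);
    return "metrics_imported_at_max > DATE_SUB(curdate(), " +
           std::to_string(window_days) + ")";
}
```

Widening the window increases the per-shard scan and therefore the iteration time; it does not change which data exists upstream.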

Failure modes

Pipeline-blocking failures are rare; most "failures" are partial-coverage events that the next hourly cycle corrects.

Symptom	Cause	Behaviour
Single shard's GSC pull fails	CH replica slow / unreachable	10 retries with laksa↔prata flip, then skip shard for this iteration
Source table unavailable (e.g. isog down)	Upstream outage	That source's pull halts; ProcessGSCRangeData and other pulls continue with stale upstream data
Range-level CH query exception	Schema mismatch, dependent dictionary stale	Logged, that range skipped
Wall-clock blowup	Source table backlog	No internal deadline; the iteration just takes longer, and keep_running.sh waits for it before launching Stage 2

Logs land in /home/data/kvprocess_logs/log_YYYY-MM-DD_HH-MM-SS.txt plus stdout.
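When locating the log for a given run, the file name pattern above maps directly onto a strftime format string. The helper below is an illustrative sketch (LogPathFor is a hypothetical name); it takes an already broken-down time because this page does not say whether the timestamps are local time or UTC.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>

// Formats a broken-down time as <dir>/log_YYYY-MM-DD_HH-MM-SS.txt.
// The default directory is the one named in the text above.
std::string LogPathFor(const std::tm& when,
                       const std::string& dir = "/home/data/kvprocess_logs") {
    char stamp[32];
    const std::size_t n =
        std::strftime(stamp, sizeof(stamp), "%Y-%m-%d_%H-%M-%S", &when);
    assert(n > 0);  // the buffer is large enough for this fixed-width format
    (void)n;
    return dir + "/log_" + stamp + ".txt";
}
```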

Runtime characteristics

Per-iteration duration is not fixed: it depends on CH load, source-table sizes, and replica availability. Don't expect a steady "X minutes per run".