kvprocessor — Stage 1 ingestion

Code: kvprocessor.cpp (the whole file is the program). Last validated: 2026-05-21

What it does

kvprocessor is the C++ binary that pulls every raw source into the central hub table on every iteration of the hourly loop. It is standalone and is not launched by Slurm: the orchestrator (keep_running.sh) invokes ./_build/kvprocessor directly.

Five pull functions run concurrently, listed here in four groups (the two Google Trends functions share a bullet):

  • PullGTNewData() / PullGTData() — fetch fresh Google Trends rows (and historical on first init) from keywords_volume.google_trends on isog
  • PullJSFirstSeenData() — read JS first-seen metadata from jumpshot.keyword_first_seen on isog
  • ProcessGSCRangeData() — pull GSC from peer laksa shards, dedup, roll up to monthly, write to the hub
  • ProcessGKPRangeData() — pull GKP per shard/range from keywords_volume.gkp_trends_raw on isog and write to the hub

Each operation writes only its assigned columns in keywords.keywords_data_local; because of the anyLast engine semantics, none of them clobbers another's columns (see Central hub table).
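The concurrent launch described above can be sketched as follows. This is a minimal illustration, not the code in kvprocessor.cpp: the `PullOp` wrapper and `RunAllConcurrently` are hypothetical names, and the real pull functions may take arguments and report errors differently. It only shows the pattern of starting independent operations together and collecting the ones that failed without letting one failure stop the rest.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Hypothetical wrapper for one pull operation (assumption: each pull can be
// expressed as a callable that returns true on success).
struct PullOp {
    std::string name;
    std::function<bool()> run;
};

// Start every operation on its own thread, then wait for all of them.
// Returns the names of the operations that returned false or threw.
std::vector<std::string> RunAllConcurrently(const std::vector<PullOp>& ops) {
    std::vector<std::future<bool>> futures;
    futures.reserve(ops.size());
    for (const auto& op : ops) {
        assert(static_cast<bool>(op.run));  // every op needs a callable
        futures.push_back(std::async(std::launch::async, op.run));
    }

    std::vector<std::string> failed;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        bool ok = false;
        try {
            ok = futures[i].get();  // blocks until this operation finishes
        } catch (...) {
            ok = false;  // an exception in one pull does not stop the others
        }
        if (!ok) failed.push_back(ops[i].name);
    }
    return failed;
}
```

Because each operation owns a disjoint set of columns, the sketch needs no locking between operations; only the final collection of results is serialized.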

How it runs

  • Concurrency: per-shard ThreadPools with 8–25 threads per shard (configurable). One range = one task; data flow is within-shard — no scatter-gather across shards.
  • ClickHouse client: 7200 s recv timeout, 300 s send timeout, TCP keepalive every 60 s. Single connection per server, multiplexed across threads.
  • Iteration model: each pull function spawns per-shard worker tasks; ProcessGSCRangeData covers all 256 ranges of all 200 shards.
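The "one range = one task" model within a shard could look roughly like the sketch below. It is an assumption-laden illustration: `ProcessShard` and the `process_range` callback are hypothetical, and the actual ThreadPool class in kvprocessor.cpp is not shown in this page. Workers claim range indices from a shared atomic counter, so no range is processed twice and no data crosses shard boundaries.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

constexpr int kRangesPerShard = 256;  // range count stated in the text above

// Process every range of one shard with `num_threads` workers.
// `process_range(shard, range)` is a hypothetical callback that must not
// throw; it returns true when the range was handled successfully.
// Returns the number of ranges that succeeded.
int ProcessShard(int shard, int num_threads,
                 const std::function<bool(int, int)>& process_range) {
    assert(num_threads > 0);
    std::atomic<int> next_range{0};
    std::atomic<int> succeeded{0};

    auto worker = [&] {
        for (;;) {
            const int range = next_range.fetch_add(1);  // claim the next task
            if (range >= kRangesPerShard) return;       // nothing left to do
            if (process_range(shard, range)) succeeded.fetch_add(1);
        }
    };

    std::vector<std::thread> pool;
    pool.reserve(num_threads);
    for (int i = 0; i < num_threads; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return succeeded.load();
}
```

A caller would pick `num_threads` from the configured 8 to 25 per shard; a failed range only lowers the returned count, which matches the "logged, that range skipped" behaviour.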

Pulling from another laksa: retry + failover

Anchor: kvprocessor.cpp:KB-ANCHOR:ch-pull-retry-policy (PullDataFromAnotherLaksa).

On a failed pull (CH timeout, network blip, replica unreachable) the function retries up to 10 times with 30-second back-off between attempts. Each retry alternates the target between the laksa{NNN} and prata{NNN} hosts (the replica pair), so a single-host outage is transparent. After 10 failures the shard is skipped, not retried in the same iteration — the iteration completes with that shard's pull missing.

This means a flaky CH cluster degrades the iteration's coverage but never blocks the orchestrator. The next iteration of keep_running.sh retries everything from scratch.
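The retry policy above can be sketched as a small loop. This is not the body of PullDataFromAnotherLaksa: `ReplicaHost`, `PullWithFailover`, and the `pull` callback are hypothetical, the three-digit zero padding of the host number is inferred from the laksa{NNN} pattern, and "10 retries" is read here as 10 attempts in total.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <thread>

// Build a replica hostname for a shard, e.g. shard 7 -> "laksa007".
// Assumption: NNN is the shard number zero-padded to three digits.
std::string ReplicaHost(int shard, bool use_prata) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%s%03d",
                  use_prata ? "prata" : "laksa", shard);
    return buf;
}

// Call `pull(host)` up to `max_attempts` times, alternating between the
// laksa and prata replica and sleeping `backoff` between attempts.
// `pull` is a hypothetical callback that returns false on any failure.
bool PullWithFailover(int shard,
                      const std::function<bool(const std::string&)>& pull,
                      int max_attempts = 10,
                      std::chrono::seconds backoff = std::chrono::seconds(30)) {
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        // Even attempts target laksa, odd attempts target prata.
        const std::string host = ReplicaHost(shard, attempt % 2 == 1);
        if (pull(host)) return true;
        // No point sleeping after the final failed attempt.
        if (attempt + 1 < max_attempts) std::this_thread::sleep_for(backoff);
    }
    return false;  // caller skips this shard for the current iteration
}
```

Returning false instead of throwing keeps the caller's decision simple: record the shard as missing and move on, leaving the retry to the next hourly iteration.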

Source-side freshness filter (GSC)

Anchor: kvprocessor.cpp:KB-ANCHOR:gsc-aggregated-45-day-window.

The GSC source-table read includes WHERE metrics_imported_at_max > DATE_SUB(curdate(), 45), so only rows whose latest import falls within the last 45 days are read. This is a performance knob, not a data-quality threshold: it bounds the per-shard scan so that kvprocessor completes within the hourly window without overloading the laksa server workers. Older GSC history is still present in the upstream raw tables; it is just not part of the kvprocessor read path. (See GSC source for the longer framing.)

Failure modes

Pipeline-blocking failures are rare; most "failures" are partial-coverage events that the next hourly cycle corrects.

| Symptom | Cause | Behaviour |
| --- | --- | --- |
| Single shard's GSC pull fails | CH replica slow / unreachable | 10 retries with laksa↔prata flip, then skip shard for this iteration |
| Source table unavailable (e.g. isog down) | Upstream outage | That source's pull halts; ProcessGSCRangeData and other pulls continue with stale upstream data |
| Range-level CH query exception | Schema mismatch, dependent dictionary stale | Logged, that range skipped |
| Wall-clock blowup | Source table backlog | No internal deadline: the iteration just takes longer; keep_running.sh waits for it before launching Stage 2 |

Logs land in /home/data/kvprocess_logs/log_YYYY-MM-DD_HH-MM-SS.txt plus stdout.
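For tooling that needs to locate or predict a log file, the naming pattern above can be reproduced with `std::strftime`. This is only a sketch: `LogPathFor` is a hypothetical helper, and whether the binary stamps the name in local time or UTC is not stated here, so the caller supplies the broken-down time.

```cpp
#include <cstddef>
#include <ctime>
#include <string>

// Build the log path for a given start time, following the
// log_YYYY-MM-DD_HH-MM-SS.txt pattern described above.
// Assumption: the timestamp is the moment the run started.
std::string LogPathFor(const std::tm& when) {
    char buf[64];
    const std::size_t n =
        std::strftime(buf, sizeof(buf), "log_%Y-%m-%d_%H-%M-%S.txt", &when);
    return "/home/data/kvprocess_logs/" + std::string(buf, n);
}
```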

Runtime characteristics

Per-iteration duration is indeterminate: it depends on CH load, source-table sizes, and replica availability. Don't expect a steady "X minutes per run".

See also

  • Architecture — full pipeline diagram
  • Central hub table — destination of every write
  • GSC · GT · GKP · Jumpshot — what each pull function carries
  • keep-running.md (Phase 2 · Batch 3c) — the orchestrator that invokes this binary
  • Build: make build (uses CMakeLists.txt / Makefile); deploy: make build_on_isog