upload-backend: Stage 4 distribution
Code: upload_backend.sh + upload_keywords_metrics_ch.slurm + upload_keywords_metrics_es.slurm (all in the project root).
Last validated: 2026-05-21
What it does
The fan-out step. Reads the processed_* columns from keywords.keywords_data_local and replicates them into the two customer-facing stores: ClickHouse (keywords_metrics_local, used by internal SERP / backend services) and Elasticsearch (keywords-all-* + keywords-lite-* indices, used by the product API).
Both jobs run on the same 200-laksa-node fleet via Slurm. They are submitted sequentially by upload_backend.sh (sbatch --wait) but each is independent — failure of one does not stop the other.
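The submit-and-continue behaviour described above can be sketched as follows. This is a minimal illustration, not the contents of upload_backend.sh: the function name `submit_sequentially`, the `SBATCH` override variable, and the CH-then-ES order in the usage comment are assumptions. It relies only on the documented behaviour of `sbatch --wait`, which blocks until the job terminates and exits with the job's exit code.

```bash
# Sketch only. SBATCH is overridable (hypothetical variable) so the function
# can be exercised without a Slurm cluster.
SBATCH="${SBATCH:-sbatch}"

# Submit each job file in order and block until it finishes.
# A failing job is logged and counted, but the remaining jobs are still submitted.
submit_sequentially() {
  local failed=0
  local job
  for job in "$@"; do
    if "$SBATCH" --wait "$job"; then
      echo "finished: $job"
    else
      echo "failed: $job (continuing with the next job)" >&2
      failed=$((failed + 1))
    fi
  done
  # Non-zero status if any job failed, so a caller can log it.
  [ "$failed" -eq 0 ]
}

# Usage (assumed order; commented out so sourcing this file submits nothing):
# submit_sequentially upload_keywords_metrics_ch.slurm upload_keywords_metrics_es.slurm
```

Because the failure is only counted and never triggers an early return, a failed ClickHouse job still lets the Elasticsearch job run, which matches the independence described above.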
How it runs
ClickHouse job: upload_keywords_metrics_ch.slurm
Four sequential srun steps per node:
| Step | Function | What it does |
|---|---|---|
| 1 | upload() | Runs ./bin/ahrefs clickstream with -stream-processor ch_keywords_metrics. 50 workers, batch size 1000, scroll size 1000. Reads keywords_data_local, writes keywords_metrics_local. |
| 2 | sync_replicas() | SYSTEM SYNC REPLICA ON CLUSTER … LIGHTWEIGHT (3600 s timeout): flushes follower replicas. |
| 3 | refresh_keywords_metrics_ch() | Four-step dictionary refresh: confine_shard → delete_old_dict → refresh_dict → apply_dict. |
| 4 | restart_serps() | POSTs /exit to the hotdog SERP service on ports 8093–8100. Forces a restart so the SERP service picks up the freshly refreshed dictionary. |
Slurm config: --cpus-per-task=50, --mem=64GB, 1–200 nodes, --no-kill so a single-node failure doesn't kill the array.
The SERP restart at step 4 is a real coupling between the keyword-volume pipeline and the SERP service: changes to the upload here can affect SERP availability briefly. The /exit POSTs assume hotdog node naming (hotdog{NNN}.int.ahrefs) symmetric with laksa{NNN}.
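To make the naming symmetry and the port sweep concrete, here is a hedged sketch of what a restart loop along these lines might look like. It is not the restart_serps function from the .slurm file: the helper `serp_host_for`, the `CURL` override variable, the `http://` scheme, and the 10-second `--max-time` are all assumptions made for illustration. Only the hotdog{NNN}.int.ahrefs naming, the 8093–8100 port range, the /exit path, and the log-and-continue behaviour come from this page.

```bash
# Sketch only. CURL is overridable (hypothetical variable) so the loop can be
# dry-run without touching any SERP host.
CURL="${CURL:-curl}"

# Map a laksa node name to its hotdog counterpart, e.g. laksa017 -> hotdog017.int.ahrefs.
serp_host_for() {
  printf 'hotdog%s.int.ahrefs\n' "${1#laksa}"
}

# POST /exit to every SERP port on the matching hotdog node.
# A failed request is logged and the loop moves on to the next port.
restart_serps() {
  local host port
  host="$(serp_host_for "$1")"
  port=8093
  while [ "$port" -le 8100 ]; do
    # http:// and the 10 s limit are assumptions, not taken from the job file.
    if ! "$CURL" --silent --show-error --max-time 10 -X POST "http://${host}:${port}/exit"; then
      echo "restart_serps: POST to ${host}:${port} failed, continuing" >&2
    fi
    port=$((port + 1))
  done
  return 0
}

# Usage (commented out so sourcing this file restarts nothing):
# restart_serps laksa017
```

Returning 0 regardless of individual failures mirrors the behaviour in the failure-mode table: an unreachable hotdog node leaves that node's SERP service on the old dictionary without failing the job.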
Elasticsearch job: upload_keywords_metrics_es.slurm
Single srun upload() step per node:
- `./bin/ahrefs clickstream` with `-stream-processor es_keywords_import`
- 6 workers, batch size 2000, scroll size 2000, min-ranges 1024
- Indices: `-all-20220613` (full keyword index) and `-lite-20241025` (lite)
- `-write-follower-indices`: replicates writes to follower indices automatically
- `-keywords-import-reader-cluster laksa`: reads from the laksa cluster
Slurm config: --cpus-per-task=6, --mem=64GB, 1–200 nodes.
Failure modes
| Symptom | Cause | Behaviour |
|---|---|---|
| CH job fails on some nodes | CH replica slow, ahrefs binary exception | --no-kill keeps other nodes running; partial replication; next hourly cycle re-uploads everything |
| ES job fails on some nodes | ES cluster pressure, ahrefs binary exception | Same — partial replication |
| CH succeeds, ES fails (or vice versa) | Independent runs | Brief version skew between the two stores until the next iteration |
| SERP restart fails | hotdog node unreachable | restart_serps just logs the failed curl and moves to next port; SERP service stays on the old dict for that node |
| sync_replicas 3600 s timeout | Replica too far behind | SYNC REPLICA returns error; downstream dict refresh may run against incomplete data; next cycle re-syncs |
The "no early-exit on stage failure" pattern from keep_running.sh applies here too: even if upload partially fails, the next iteration tries again with fresh data.
Runtime is indeterminate — depends on the size of changes since last upload and CH/ES throughput.
Load-bearing constants
These live in the two .slurm files. The .slurm files are intentionally not KB-ANCHORed — the project rule (see CLAUDE.md) is to not modify Slurm job files. When the constants below drift, update this page's last_validated date manually and re-verify the line refs.
| Where | Constant | Value(s) |
|---|---|---|
| upload_keywords_metrics_ch.slurm upload args | workers / batch / scroll | 50 / 1000 / 1000 |
| upload_keywords_metrics_ch.slurm restart_serps | SERP restart ports | POST hotdog{NNN}:[8093-8100]/exit |
| upload_keywords_metrics_es.slurm upload args | workers / batch / scroll / min-ranges | 6 / 2000 / 2000 / 1024 |
| upload_keywords_metrics_es.slurm indices | keyword index suffix / lite index suffix | -all-20220613 / -lite-20241025 |
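Since the .slurm files carry no anchors, one way to notice drift is a read-only check that greps the job files for the literal values listed in the table. The sketch below is an illustration only: `check_constant` and `check_page_constants` are hypothetical helpers, and it assumes the port bounds and index suffixes appear verbatim in the files (for example, a port range expressed some other way would be reported as drift). It never writes to the .slurm files.

```bash
# Sketch only: read-only drift check for the constants documented on this page.

# check_constant FILE LITERAL: succeed if FILE contains LITERAL as a fixed string.
check_constant() {
  if grep -qF -- "$2" "$1"; then
    echo "ok: $1 contains '$2'"
  else
    echo "DRIFT: $1 no longer contains '$2'" >&2
    return 1
  fi
}

# check_page_constants CH_FILE ES_FILE: run every check, report all mismatches,
# and return non-zero if at least one value is missing.
check_page_constants() {
  local ch="$1"
  local es="$2"
  local rc=0
  check_constant "$ch" "8093" || rc=1            # first SERP restart port
  check_constant "$ch" "8100" || rc=1            # last SERP restart port
  check_constant "$es" "-all-20220613" || rc=1   # full keyword index suffix
  check_constant "$es" "-lite-20241025" || rc=1  # lite index suffix
  return "$rc"
}

# Usage (commented out so sourcing this file reads nothing):
# check_page_constants upload_keywords_metrics_ch.slurm upload_keywords_metrics_es.slurm
```

Every check runs even after a mismatch, so a single pass lists all drifted values before the page is re-validated.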
See also
- Central hub table: read surface
- processing.py: upstream stage that fills the processed_* columns
- pe_update.py: upstream stage that fills processed_pe_*
- Consumers: the customer-facing API and SERP services that read the destinations (Phase 5)
- (Phase 2 · Batch 3c) `keep-running.md`: orchestrator that invokes this step