upload-backend: Stage 4 distribution

Code: `upload_backend.sh`, `upload_keywords_metrics_ch.slurm`, `upload_keywords_metrics_es.slurm` (all in the project root).

Last validated: 2026-05-21

What it does

The fan-out step. Reads the `processed_*` columns from `keywords.keywords_data_local` and replicates them into the two customer-facing stores: ClickHouse (`keywords_metrics_local`, used by internal SERP / backend services) and Elasticsearch (`keywords-all-*` and `keywords-lite-*` indices, used by the product API).

Both jobs run via Slurm on the same fleet of 200 laksa nodes. `upload_backend.sh` submits them one after the other with `sbatch --wait`, but the jobs are independent: a failure in one does not stop the other.
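The submission pattern can be sketched in shell. This is a minimal illustration, not the contents of `upload_backend.sh`: the function name `submit_all`, the `SBATCH` override variable, and the log message are hypothetical. It relies only on the documented behaviour of `sbatch --wait`, which blocks until the job terminates and exits with the job's exit code.

```shell
# Hypothetical sketch; the real upload_backend.sh may differ.
# SBATCH can be overridden (for example SBATCH=echo) for a dry run.
SBATCH="${SBATCH:-sbatch}"

# submit_all JOBFILE...
# Submits each job file in order and waits for it to finish.
# A failed job is logged but does not prevent the next submission.
# Returns the number of failed jobs.
submit_all() {
  local job failed=0
  for job in "$@"; do
    if ! "$SBATCH" --wait "$job"; then
      echo "job failed: $job" >&2
      failed=$((failed + 1))
    fi
  done
  return "$failed"
}
```

Called as `submit_all upload_keywords_metrics_ch.slurm upload_keywords_metrics_es.slurm` (the ClickHouse-first order is an assumption), the second job is submitted whether or not the first one succeeded.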

How it runs

ClickHouse job: `upload_keywords_metrics_ch.slurm`

Four sequential srun steps per node:

| Step | Function | What it does |
|---|---|---|
| 1 | `upload()` | Runs `./bin/ahrefs clickstream` with `-stream-processor ch_keywords_metrics`. 50 workers, batch size 1000, scroll size 1000. Reads `keywords_data_local`, writes `keywords_metrics_local`. |
| 2 | `sync_replicas()` | `SYSTEM SYNC REPLICA ON CLUSTER … LIGHTWEIGHT` (3600 s timeout): waits until follower replicas have caught up. |
| 3 | `refresh_keywords_metrics_ch()` | Four-step dictionary refresh: `confine_shard` → `delete_old_dict` → `refresh_dict` → `apply_dict`. |
| 4 | `restart_serps()` | POSTs `/exit` to the hotdog SERP service on ports 8093–8100. Forces a restart so that the SERP service picks up the freshly refreshed dictionary. |

Slurm config: --cpus-per-task=50, --mem=64GB, 1–200 nodes, --no-kill so a single-node failure doesn't kill the array.

The SERP restart in step 4 couples the keyword-volume pipeline to the SERP service: changes to the upload here can briefly affect SERP availability. The `/exit` POSTs assume that hotdog node names (`hotdog{NNN}.int.ahrefs`) mirror the laksa numbering (`laksa{NNN}`).
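The naming assumption and the per-port restart can be sketched as follows. This is an illustration under stated assumptions, not the body of `restart_serps()`: the helper names `hotdog_host` and `restart_serps_sketch`, the `CURL` override variable, the `http://` scheme, and the 10-second `curl` timeout are all hypothetical.

```shell
# Hypothetical sketch; the real restart_serps() may differ.
# CURL can be overridden (for example CURL=echo) for a dry run.
CURL="${CURL:-curl}"

# hotdog_host LAKSA_NODE
# Maps laksa{NNN} to hotdog{NNN}.int.ahrefs, assuming symmetric numbering.
hotdog_host() {
  local node="${1%%.*}"              # drop any domain suffix
  echo "hotdog${node#laksa}.int.ahrefs"
}

# restart_serps_sketch LAKSA_NODE
# POSTs /exit to ports 8093-8100 on the matching hotdog node.
# A failed request is logged and the loop moves on to the next port.
restart_serps_sketch() {
  local host port=8093
  host="$(hotdog_host "$1")"
  while [ "$port" -le 8100 ]; do
    "$CURL" -sS -m 10 -X POST "http://${host}:${port}/exit" \
      || echo "restart failed: ${host}:${port}" >&2
    port=$((port + 1))
  done
}
```

Because a failed request only produces a log line, an unreachable hotdog node leaves its SERP service on the old dictionary without failing the job.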

Elasticsearch job: `upload_keywords_metrics_es.slurm`

Single `srun` `upload()` step per node:

- `./bin/ahrefs clickstream` with `-stream-processor es_keywords_import`
- 6 workers, batch size 2000, scroll size 2000, min-ranges 1024
- Indices: `-all-20220613` (full keyword index) and `-lite-20241025` (lite)
- `-write-follower-indices`: replicates writes to follower indices automatically
- `-keywords-import-reader-cluster laksa`: reads from the laksa cluster

Slurm config: --cpus-per-task=6, --mem=64GB, 1–200 nodes.

Failure modes

| Symptom | Cause | Behaviour |
|---|---|---|
| CH job fails on some nodes | CH replica slow, ahrefs binary exception | `--no-kill` keeps other nodes running; partial replication; next hourly cycle re-uploads everything |
| ES job fails on some nodes | ES cluster pressure, ahrefs binary exception | Same as above: partial replication |
| CH succeeds, ES fails (or vice versa) | Independent runs | Brief version skew between the two stores until the next iteration |
| SERP restart fails | hotdog node unreachable | `restart_serps` logs the failed `curl` and moves on to the next port; the SERP service stays on the old dict for that node |
| `sync_replicas` 3600 s timeout | Replica too far behind | `SYNC REPLICA` returns an error; the downstream dict refresh may run against incomplete data; the next cycle re-syncs |

The "no early-exit on stage failure" pattern from keep_running.sh applies here too: even if upload partially fails, the next iteration tries again with fresh data.

Runtime is not fixed: it depends on the volume of changes since the last upload and on ClickHouse and Elasticsearch throughput.

Load-bearing constants

These live in the two `.slurm` files. The `.slurm` files are intentionally not KB-ANCHORed, because the project rule (see CLAUDE.md) is not to modify Slurm job files. When the constants below drift, update this page's last_validated date manually and re-verify the line refs.

| Where | Constant | Value(s) |
|---|---|---|
| `upload_keywords_metrics_ch.slurm` upload args | workers / batch / scroll | 50 / 1000 / 1000 |
| `upload_keywords_metrics_ch.slurm` restart_serps | SERP restart ports | POST `hotdog{NNN}:[8093-8100]/exit` |
| `upload_keywords_metrics_es.slurm` upload args | workers / batch / scroll / min-ranges | 6 / 2000 / 2000 / 1024 |
| `upload_keywords_metrics_es.slurm` indices | keyword index suffix / lite index suffix | `-all-20220613` / `-lite-20241025` |
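Because the `.slurm` files carry no anchors, drift has to be detected from the outside. One possible approach is a read-only check that greps a file for the literal strings this page documents. The sketch below is an assumption about how such a check could look, not existing tooling: the function name `check_constants` is hypothetical, and the exact spelling of each constant inside the `.slurm` files has to be confirmed by hand before relying on it.

```shell
# Hypothetical sketch of a read-only drift check; not part of the project.

# check_constants FILE LITERAL...
# Reports every expected literal that does not occur in FILE.
# Succeeds only if all literals are present. Never modifies FILE.
check_constants() {
  local file="$1" missing=0 s
  shift
  for s in "$@"; do
    # -F: match literally, -q: no output, -e: allow patterns starting with '-'
    if ! grep -Fq -e "$s" -- "$file"; then
      echo "drift: '$s' not found in $file" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

For example, `check_constants upload_keywords_metrics_es.slurm es_keywords_import -all-20220613 -lite-20241025` exits non-zero and names the missing literal if an index suffix has changed.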