upload-backend — Stage 4 distribution

Code: upload_backend.sh + upload_keywords_metrics_ch.slurm + upload_keywords_metrics_es.slurm (all in the project root). Last validated: 2026-05-21

What it does

The fan-out step. Reads the processed_* columns from keywords.keywords_data_local and replicates them into the two customer-facing stores: ClickHouse (keywords_metrics_local, used by internal SERP / backend services) and Elasticsearch (keywords-all-* + keywords-lite-* indices, used by the product API).

Both jobs run on the same 200-laksa-node fleet via Slurm. They are submitted sequentially by upload_backend.sh (sbatch --wait) but each is independent — failure of one does not stop the other.
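The submission pattern above can be sketched as follows. This is a minimal Python illustration, not the contents of upload_backend.sh: the `submit_all` helper and its injectable `run` parameter are hypothetical, and only the `sbatch --wait` invocation and the two job file names come from this page.

```python
import subprocess
from typing import Callable, Dict, Sequence

# Job files named on this page, in the order they are submitted.
JOB_FILES = [
    "upload_keywords_metrics_ch.slurm",
    "upload_keywords_metrics_es.slurm",
]


def sbatch_wait(job_file: str) -> int:
    """Submit one job and block until it finishes; return sbatch's exit code."""
    return subprocess.run(["sbatch", "--wait", job_file]).returncode


def submit_all(
    job_files: Sequence[str],
    run: Callable[[str], int] = sbatch_wait,
) -> Dict[str, int]:
    """Run every job in order, recording exit codes without stopping on failure."""
    results: Dict[str, int] = {}
    for job_file in job_files:
        try:
            results[job_file] = run(job_file)
        except OSError as exc:  # e.g. sbatch is not on PATH
            print(f"{job_file}: could not submit ({exc})")
            results[job_file] = -1
    return results
```

Calling `submit_all(JOB_FILES)` would submit the ClickHouse job, wait for it, then submit the Elasticsearch job regardless of the first job's exit code. The returned mapping makes the "one failed, one succeeded" case visible to the caller.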

How it runs

ClickHouse job — upload_keywords_metrics_ch.slurm

Four sequential srun steps per node:

| Step | Function | What it does |
| --- | --- | --- |
| 1 | upload() | Runs ./bin/ahrefs clickstream with -stream-processor ch_keywords_metrics. 50 workers, batch size 1000, scroll size 1000. Reads keywords_data_local, writes keywords_metrics_local. |
| 2 | sync_replicas() | SYSTEM SYNC REPLICA ON CLUSTER … LIGHTWEIGHT (3600 s timeout); flushes follower replicas. |
| 3 | refresh_keywords_metrics_ch() | Four-step dictionary refresh: confine_shard → delete_old_dict → refresh_dict → apply_dict. |
| 4 | restart_serps() | POSTs /exit to the hotdog SERP service on ports 8093–8100. Forces a restart so the SERP service picks up the freshly refreshed dictionary. |

Slurm config: --cpus-per-task=50, --mem=64GB, 1–200 nodes, and --no-kill so that a single-node failure doesn't terminate the whole job.

The SERP restart at step 4 is a real coupling between the keyword-volume pipeline and the SERP service — changes to the upload here can affect SERP availability briefly. The /exit POSTs assume hotdog node naming (hotdog{NNN}.int.ahrefs) symmetric with laksa{NNN}.
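The host and port layout behind restart_serps can be sketched as follows. This is a hedged Python illustration rather than the code in the .slurm file: the helper names, the `http` scheme, the 10-second timeout, and the assumption that nodes are addressed by short names such as `laksa012` are all hypothetical. Only the `hotdog{NNN}.int.ahrefs` pattern, the 8093–8100 port range, the `/exit` path, and the log-and-continue behaviour come from this page.

```python
import re
import urllib.request
from typing import Callable, List

SERP_PORTS = range(8093, 8101)  # 8093-8100 inclusive


def hotdog_host(laksa_host: str) -> str:
    """Map laksa{NNN} to hotdog{NNN}.int.ahrefs, keeping the digits as written."""
    match = re.fullmatch(r"laksa(\d+)", laksa_host)
    if match is None:
        raise ValueError(f"not a laksa node name: {laksa_host!r}")
    return f"hotdog{match.group(1)}.int.ahrefs"


def serp_exit_urls(laksa_host: str, scheme: str = "http") -> List[str]:
    """Build one /exit URL per SERP port for the matching hotdog node."""
    host = hotdog_host(laksa_host)
    return [f"{scheme}://{host}:{port}/exit" for port in SERP_PORTS]


def http_post(url: str, timeout: float = 10.0) -> None:
    """Send an empty POST; raises an OSError subclass on network failure."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def restart_serps(urls: List[str], post: Callable[[str], None] = http_post) -> List[str]:
    """POST to every URL; log failures and move on. Returns the URLs that failed."""
    failed = []
    for url in urls:
        try:
            post(url)
        except OSError as exc:
            print(f"restart failed for {url}: {exc}")
            failed.append(url)
    return failed
```

For example, `serp_exit_urls("laksa012")` yields eight URLs on `hotdog012.int.ahrefs`, one per port. If the naming symmetry between laksa and hotdog nodes ever breaks, `hotdog_host` is the single place in this sketch that would need to change.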

Elasticsearch job — upload_keywords_metrics_es.slurm

Single srun upload() step per node:

  • ./bin/ahrefs clickstream with -stream-processor es_keywords_import
  • 6 workers, batch size 2000, scroll size 2000, min-ranges 1024
  • Indices: -all-20220613 (full keyword index) and -lite-20241025 (lite)
  • -write-follower-indices — replicates writes to follower indices automatically
  • -keywords-import-reader-cluster laksa — reads from the laksa cluster

Slurm config: --cpus-per-task=6, --mem=64GB, 1–200 nodes.

Failure modes

| Symptom | Cause | Behaviour |
| --- | --- | --- |
| CH job fails on some nodes | CH replica slow, ahrefs binary exception | --no-kill keeps other nodes running; partial replication; next hourly cycle re-uploads everything |
| ES job fails on some nodes | ES cluster pressure, ahrefs binary exception | Same: partial replication |
| CH succeeds, ES fails (or vice versa) | Independent runs | Brief version skew between the two stores until the next iteration |
| SERP restart fails | hotdog node unreachable | restart_serps just logs the failed curl and moves to the next port; SERP service stays on the old dict for that node |
| sync_replicas 3600 s timeout | Replica too far behind | SYNC REPLICA returns an error; downstream dict refresh may run against incomplete data; next cycle re-syncs |

The "no early-exit on stage failure" pattern from keep_running.sh applies here too: even if upload partially fails, the next iteration tries again with fresh data.

Runtime is indeterminate — depends on the size of changes since last upload and CH/ES throughput.

Load-bearing constants

These live in the two .slurm files. The .slurm files are intentionally not KB-ANCHORed — the project rule (see CLAUDE.md) is to not modify Slurm job files. When the constants below drift, update this page's last_validated date manually and re-verify the line refs.

| Where | Constant | Value(s) |
| --- | --- | --- |
| upload_keywords_metrics_ch.slurm upload args | workers / batch / scroll | 50 / 1000 / 1000 |
| upload_keywords_metrics_ch.slurm restart_serps | SERP restart ports | POST hotdog{NNN}:[8093-8100]/exit |
| upload_keywords_metrics_es.slurm upload args | workers / batch / scroll / min-ranges | 6 / 2000 / 2000 / 1024 |
| upload_keywords_metrics_es.slurm indices | keyword index suffix / lite index suffix | -all-20220613 / -lite-20241025 |
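Because the .slurm files carry no KB-ANCHORs, re-verifying this page is a manual step. One way to make it cheaper is a read-only drift check, sketched below in Python. It assumes the literals quoted on this page appear verbatim in the job files, which may not hold if a flag is spelled differently there. The `EXPECTED` mapping and both function names are hypothetical, and the numeric worker, batch, and scroll values are left out because this page does not record the flag names that carry them.

```python
from typing import Dict, List

# Literals taken from this page; substring matching is deliberately coarse.
EXPECTED: Dict[str, List[str]] = {
    "upload_keywords_metrics_ch.slurm": [
        "ch_keywords_metrics",
        "--no-kill",
    ],
    "upload_keywords_metrics_es.slurm": [
        "es_keywords_import",
        "-all-20220613",
        "-lite-20241025",
        "-write-follower-indices",
    ],
}


def missing_literals(text: str, expected: List[str]) -> List[str]:
    """Return the expected literals that do not occur anywhere in text."""
    return [literal for literal in expected if literal not in text]


def check_drift(
    files: Dict[str, str],
    expected: Dict[str, List[str]] = EXPECTED,
) -> Dict[str, List[str]]:
    """Map each file name to its missing literals; files with none are omitted."""
    drift = {}
    for name, literals in expected.items():
        missing = missing_literals(files.get(name, ""), literals)
        if missing:
            drift[name] = missing
    return drift
```

A caller would build `files` by reading the two job files from the project root (for instance with `pathlib.Path(name).read_text()`) and treat a non-empty result as a prompt to re-validate this page. The check never writes to the .slurm files, so it stays within the rule against modifying them.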

See also

  • Central hub table — read surface
  • processing.py — upstream stage that fills the processed_* columns
  • pe_update.py — upstream stage that fills processed_pe_*
  • Consumers — the customer-facing API and SERP services that read the destinations (Phase 5)
  • (Phase 2 · Batch 3c) keep-running.md — orchestrator that invokes this step