keep_running.sh — the hourly orchestrator
Code: keep_running.sh. Two copies exist — see working copies for why.
Last validated: 2026-05-21
What it does
Drives the hourly hub-loop: kvprocessor → processing.py (Slurm) → pe_update.py → 3 snapshot scripts → upload_backend.sh → sleep → re-exec. Runs forever in a screen session.
The loop
cd ~/keyword-volume || exit 1
./_build/kvprocessor # Stage 1 ingestion
sbatch --wait process_keywords.slurm # Stage 2 derivation (blocks until Slurm completes)
python pe_update.py # Stage 3 PE selection
python record_confidence_snapshot.py # Monitoring snapshot
python record_prediction_coverage_snapshot.py # Monitoring snapshot
python record_bot_spike_counts_snapshot.py # Monitoring snapshot
./upload_backend.sh # Stage 4 distribution
# KB-ANCHOR: hourly-cadence
sleep 3600
exec "$SCRIPT_PATH" "$@"
Anchor: keep_running.sh:KB-ANCHOR:hourly-cadence (the sleep 3600 — the loop cadence).
The exec "$SCRIPT_PATH" "$@" line means each new iteration re-execs the script rather than recursing. Two consequences:
- Latest code is picked up every hour: if the script is edited while it runs, the next iteration runs the new version.
- No stack growth: exec replaces the current process, so there's no recursion depth concern even after months of running.
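The re-exec behaviour can be observed in isolation. The following is a minimal sketch, not the real script: it writes a throwaway script to a temp file, has it re-exec itself twice, and prints the PID on each pass. The way SCRIPT_PATH is derived here (from BASH_SOURCE) and the iteration counter are assumptions made for the demo; how keep_running.sh sets SCRIPT_PATH is not shown in this page.

```shell
#!/usr/bin/env bash
# Sketch only: a throwaway script that re-execs itself a bounded number of times.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
# Assumption: SCRIPT_PATH is derived from BASH_SOURCE; the real script may differ.
SCRIPT_PATH="${BASH_SOURCE[0]}"
count="${1:-0}"
echo "iteration=$count pid=$$"
# Bounded for the demo; the real loop never stops on its own.
[ "$count" -ge 2 ] && exit 0
# `exec bash` avoids needing the execute bit on the temp file.
exec bash "$SCRIPT_PATH" "$((count + 1))"
EOF

reexec_output="$(bash "$demo")"
rm -f "$demo"
echo "$reexec_output"
```

All three lines should report the same PID, because exec replaces the process image in place instead of starting a child.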
Why no early-exit on failure
There's no set -e, no traps, no error handling. The only guard is cd ~/keyword-volume || exit 1 at the top. If any stage fails:
- kvprocessor failure → next stages still run with stale upstream data
- Slurm job failure → next stages still run (likely with empty / partial data from this iteration)
- pe_update.py exception → next stages still run; PE tracking just doesn't update this cycle
- Snapshot script failure → next stages still run; monitoring history skips one row
- upload_backend.sh partial failure → version skew briefly; next iteration re-uploads
This is intentional. Partial coverage in one iteration is corrected on the next. The pipeline is designed to degrade gracefully rather than block on transient failures.
The only catastrophic failure that breaks the loop is cd ~/keyword-volume failing — which would require the home directory to be unmounted (very rare).
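The difference between an unguarded stage and the guarded cd can be reproduced with a small sketch. The function name run_demo, the stand-in commands, and the nonexistent directory are all hypothetical; only the pattern (no set -e, one `|| exit`-style guard) mirrors what this section describes.

```shell
#!/usr/bin/env bash
# Sketch only: without `set -e`, a failing command does not stop the script.
run_demo() {
  echo "stage1 ok"
  false   # stands in for a failing stage (non-zero exit status)
  echo "stage2 still runs (previous exit status was $?)"
  # The guarded form is the only one that stops execution.
  cd /nonexistent-dir-for-demo 2>/dev/null || { echo "guard tripped"; return 1; }
  echo "never reached"
}

demo_status=0
demo_output="$(run_demo)" || demo_status=$?
echo "$demo_output"
echo "exit status: $demo_status"
```

The unguarded `false` is simply skipped over, while the guarded cd ends the run with a non-zero status.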
Runtime + observability
- Tracing: set -x with PS4='[$(date "+%Y-%m-%d %H:%M:%S")] ', so every command line prints with a timestamp prefix.
- Logs: redirected by however the operator launched the screen session, typically screen -L -Logfile ... or nohup.
- Per-iteration duration: indeterminate. The loop's effective period is 60 minutes + cumulative stage time.
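To see what the trace output looks like, the PS4 setting can be tried in a child shell. This is a minimal sketch using the PS4 value quoted above; the `echo hello` command is a stand-in for a pipeline stage, and capturing the trace into a variable is done here only for demonstration.

```shell
#!/usr/bin/env bash
# Sketch only: run a child bash with timestamped xtrace and capture its output.
# The heredoc is quoted ('EOF') so PS4 is expanded by the child at trace time.
trace_output="$(bash 2>&1 <<'EOF'
PS4='[$(date "+%Y-%m-%d %H:%M:%S")] '
set -x
echo hello
EOF
)"
echo "$trace_output"
```

The trace line goes to stderr and is prefixed with the wall-clock time at which the command started, so the gaps between consecutive prefixes give a rough per-stage duration.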
Operator notes
- Launched once, by an operator, inside a screen session named process_keyword on data@isog. Project memory feedback_demo_app_systemd does NOT apply here: this script is not systemd-managed.
- To stop: attach to the screen session and Ctrl-C. The current iteration finishes whatever stage it's in; no graceful shutdown.
- To restart with new code: the running iteration re-execs at the end of its cycle. No manual restart needed unless something blocks the loop.
See also
- Architecture — pipeline overview
- working-copies concept — why this script exists in two locations
- Each stage: kvprocessor, processing.py, pe_update.py, monitoring snapshots, upload-backend
- extract_gkp_gt.sh (scheduler): the parallel orchestrator for refresh extraction, independent of this loop