
keep_running.sh — the hourly orchestrator

Code: keep_running.sh. Two copies exist — see working copies for why. Last validated: 2026-05-21

What it does

Drives the hourly hub-loop: kvprocessor → processing.py (Slurm) → pe_update.py → 3 snapshot scripts → upload_backend.sh → sleep → re-exec. Runs forever in a screen session.

The loop

cd ~/keyword-volume || exit 1

./_build/kvprocessor                                        # Stage 1 ingestion
sbatch --wait process_keywords.slurm                        # Stage 2 derivation (blocks until Slurm completes)
python pe_update.py                                         # Stage 3 PE selection
python record_confidence_snapshot.py                        # Monitoring snapshot
python record_prediction_coverage_snapshot.py               # Monitoring snapshot
python record_bot_spike_counts_snapshot.py                  # Monitoring snapshot
./upload_backend.sh                                         # Stage 4 distribution
# KB-ANCHOR: hourly-cadence
sleep 3600

exec "$SCRIPT_PATH" "$@"

Anchor: keep_running.sh:KB-ANCHOR:hourly-cadence (the sleep 3600 — the loop cadence).

The exec "$SCRIPT_PATH" line means a new iteration re-execs the script rather than recursing. Two consequences:

  1. Latest code is picked up every hour: edit the script while it runs, and the next iteration runs the new version.
  2. No stack growth: exec replaces the current process, so there's no recursion depth concern even after months of running.
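
The re-exec behaviour can be tried in isolation. What follows is a minimal sketch, not the production script. The temporary file, the ITER counter, and the three-pass stop condition are hypothetical and exist only so the demo terminates. It also re-execs through bash rather than executing the file directly, so it still works where the temp directory is mounted noexec.

```shell
#!/usr/bin/env bash
# Sketch of the re-exec pattern. demo_dir, demo_script and ITER are
# hypothetical names for illustration; none of them come from keep_running.sh.
demo_dir=$(mktemp -d)
demo_script="$demo_dir/reexec_demo.sh"

cat > "$demo_script" <<'EOF'
SCRIPT_PATH="$0"            # assumption: how the real script sets SCRIPT_PATH is not shown
ITER="${ITER:-1}"
echo "iteration $ITER pid $$"
if [ "$ITER" -ge 3 ]; then  # the real loop never stops; the demo stops after 3 passes
  exit 0
fi
export ITER=$((ITER + 1))
exec bash "$SCRIPT_PATH" "$@"   # replaces this process: same PID, file read afresh
EOF

# Capture the three "iteration N pid P" lines printed across the re-execs.
reexec_output=$(bash "$demo_script")
echo "$reexec_output"
rm -rf "$demo_dir"
```

Every line reports the same PID, which is what "replaces the current process" means in practice. Because the file is read afresh on each exec, an edit saved between passes would take effect on the next one.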

Why no early-exit on failure

There's no set -e, no traps, no error handling. The only guard is cd ~/keyword-volume || exit 1 at the top. If any stage fails:

  • kvprocessor failure → next stages still run with stale upstream data
  • Slurm job failure → next stages still run (likely with empty / partial data from this iteration)
  • pe_update.py exception → next stages still run; PE tracking just doesn't update this cycle
  • Snapshot script failure → next stages still run; monitoring history skips one row
  • upload_backend.sh partial failure → version skew briefly; next iteration re-uploads

This is intentional. Partial coverage in one iteration is corrected on the next. The pipeline is designed to degrade gracefully rather than block on transient failures.

The only catastrophic failure that breaks the loop is cd ~/keyword-volume failing — which would require the home directory to be unmounted (very rare).
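
The difference between this design and a fail-fast script can be seen in a few lines. This is a hedged sketch that uses `false` as a stand-in for a failing stage and a made-up directory name for the guard. It runs each variant in a child bash so the outcomes can be compared side by side.

```shell
#!/usr/bin/env bash
# Each variant runs in its own child shell; `false` stands in for a failed stage.

# Without set -e (what this page describes): the next stage runs anyway.
lenient_output=$(bash -c 'false; echo "stage 2 ran after stage 1 failed"')

# With set -e (NOT what keep_running.sh does): the first failure ends the run.
strict_output=$(bash -c 'set -e; false; echo "stage 2 ran"' || true)

# The one guard the script does have: a failed cd exits immediately.
# The directory name is hypothetical and assumed not to exist.
guard_status=0
bash -c 'cd /nonexistent-dir-for-demo 2>/dev/null || exit 1; echo "unreachable"' \
  || guard_status=$?

echo "lenient: ${lenient_output:-<nothing>}"
echo "strict:  ${strict_output:-<nothing>}"
echo "cd guard exit status: $guard_status"
```

Only the lenient variant reaches its second command. That is the trade-off the loop accepts: a bad stage costs one iteration's freshness rather than stopping the pipeline.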

Runtime + observability

  • Tracing: set -x with PS4='[$(date "+%Y-%m-%d %H:%M:%S")] ' — every command line prints with a timestamp prefix.
  • Logs: redirected by however the operator launched the screen session — typically screen -L -Logfile ... or nohup.
  • Per-iteration duration: indeterminate. The loop's effective period is 60 minutes plus cumulative stage time, so iterations drift later each cycle rather than firing on the hour.
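
To see what the timestamped trace looks like, the tracing setup can be reproduced on its own. This is a small sketch assuming bash. The PS4 value is the one quoted above, while `true` is only a placeholder for a pipeline stage, and the trace is captured from stderr, which is where `set -x` writes.

```shell
#!/usr/bin/env bash
# Run a child bash with the PS4 from this page and capture only its stderr,
# which is where xtrace output goes. `true` is a placeholder stage.
trace_output=$(bash 2>&1 >/dev/null <<'EOF'
PS4='[$(date "+%Y-%m-%d %H:%M:%S")] '
set -x
true
EOF
)
echo "$trace_output"
```

Each traced command is prefixed with the wall-clock time at which it started, so the gap between consecutive prefixes in the log gives a rough per-stage duration.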

Operator notes

  • Launched once, by an operator, inside a screen session named process_keyword on data@isog. Project memory feedback_demo_app_systemd does NOT apply here — this script is not systemd-managed.
  • To stop: attach to the screen session and press Ctrl-C. The current iteration finishes whatever stage it's in; no graceful shutdown.
  • To restart with new code: the running iteration re-execs at the end of its cycle. No manual restart needed unless something blocks the loop.

See also