
extract_gkp_gt.sh — refresh-extractor scheduler

Code: extract_gkp_gt.sh. Last validated: 2026-05-21

What it does

extract_gkp_gt.sh is the wrapper that drives both refresh extractors. It is independent of keep_running.sh and runs two cooperating loops:

  • GT loop (foreground): extract_gt.py --tier 1, then extract_gt.py --tier 2, per cycle
  • GKP loop (background): extract_gkp.py --max-batches 1000 per cycle

Plus an hourly throughput monitor (background, only when run in loop mode) that appends [HOURLY ...] summaries with inserted this hour / cumulative counts to the current logs.
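
The two-loop layout can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the function names, the *_CMD variables, the python launcher, and the max_cycles cap (added only so the sketch can terminate) are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Function names, the *_CMD variables, and the max_cycles cap
# are hypothetical; the "python" launcher is also an assumption.

GT_TIER1_CMD=${GT_TIER1_CMD:-"python extract_gt.py --tier 1"}
GT_TIER2_CMD=${GT_TIER2_CMD:-"python extract_gt.py --tier 2"}
GKP_CMD=${GKP_CMD:-"python extract_gkp.py --max-batches 1000"}

# One GT cycle: tier 1 finishes before tier 2 starts.
# Assumption: tier 2 is skipped if tier 1 fails.
gt_cycle() {
    $GT_TIER1_CMD && $GT_TIER2_CMD
}

# One GKP cycle.
gkp_cycle() {
    $GKP_CMD
}

# Repeat a cycle function, sleeping between iterations.
# $1 = cycle function, $2 = sleep seconds, $3 = cycle cap (0 = loop forever).
run_loop() {
    local cycle_fn="$1" sleep_time="$2" max_cycles="${3:-0}" n=0
    while :; do
        "$cycle_fn" || return 1          # stop on failure; no automatic restart
        n=$((n + 1))
        if [ "$max_cycles" -gt 0 ] && [ "$n" -ge "$max_cycles" ]; then
            return 0
        fi
        sleep "$sleep_time"
    done
}

# GKP loop in the background, GT loop in the foreground.
start_loops() {
    local sleep_time="$1" max_cycles="${2:-0}" gkp_pid
    run_loop gkp_cycle "$sleep_time" "$max_cycles" &
    gkp_pid=$!
    run_loop gt_cycle "$sleep_time" "$max_cycles"
    wait "$gkp_pid"
}
```

Calling start_loops 1800 would mirror the 30-minute production cadence; the optional second argument exists only to bound the sketch for experimentation.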

Two ways to invoke

# one-shot — runs each extractor once, then exits
./extract_gkp_gt.sh

# loop forever, sleeping <sleep_time> seconds between iterations
./extract_gkp_gt.sh 1800     # e.g. 30-minute cycles

The loop variant is the production mode (typically launched once and left in a long-lived screen session, similar to how keep_running.sh is launched).
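
A minimal sketch of how the optional <sleep_time> argument could select between the two modes. The parse_mode helper and its validation are hypothetical; the real script may accept its argument without checking it.

```shell
#!/usr/bin/env bash
# Hypothetical argument handling; the real script may not validate its input.

# Print "oneshot" with no argument, "loop <seconds>" for a non-negative
# integer, and fail for anything else.
parse_mode() {
    case "${1:-}" in
        "")         echo "oneshot" ;;
        *[!0-9]*)   echo "invalid sleep_time: $1" >&2; return 1 ;;
        *)          echo "loop $1" ;;
    esac
}

# Example launch in a detached screen session (the session name is made up):
#   screen -dmS refresh ./extract_gkp_gt.sh 1800
```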

Interlock with processing.py

Both loops call wait_for_slurm_idle() before each cycle. This blocks until no process_keywords Slurm job is running, then proceeds. While waiting, the script sleeps 12 hours between rechecks.

The interlock keeps refresh traffic off laksa / prata / isog while the main pipeline is doing heavy ClickHouse (CH) work. Both processing.py and the extractors hit the same ClickHouse instances, and running them simultaneously starves both.
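
The interlock could look roughly like the sketch below. Only the function name wait_for_slurm_idle, the job name process_keywords, and the 12-hour recheck come from this page; the squeue query, the count_running_jobs helper, and the variable names are assumptions about how the check might be written.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes the job is visible to squeue under the name
# "process_keywords" and that only RUNNING jobs should block the extractors.

JOB_NAME=${JOB_NAME:-process_keywords}
RECHECK_SECONDS=${RECHECK_SECONDS:-43200}   # 12 hours

# Print the number of matching jobs currently in the RUNNING state.
# Hypothetical helper, kept separate so it can be stubbed without Slurm.
count_running_jobs() {
    squeue --noheader --name="$JOB_NAME" --states=RUNNING | wc -l | tr -d ' '
}

# Block until no matching job is running, rechecking at a fixed interval.
wait_for_slurm_idle() {
    while [ "$(count_running_jobs)" -gt 0 ]; do
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] $JOB_NAME still running; rechecking in ${RECHECK_SECONDS}s"
        sleep "$RECHECK_SECONDS"
    done
}
```

With a recheck interval this long, a cycle can start up to 12 hours after the pipeline actually goes idle, which is worth keeping in mind when reading the hourly throughput numbers.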

Logs

  • Directory: /home/data/data_extract_logs/
  • GT tier 1: gt_tier1_YYYYMMDD_HHMMSS.log
  • GT tier 2: gt_tier2_YYYYMMDD_HHMMSS.log
  • GKP: gkp_YYYYMMDD_HHMMSS.log
  • Cleanup: files older than 30 days are deleted on each cycle (cleanup_old_logs).
  • Pointer files (/tmp mktemp) track the current log path so the hourly monitor knows where to append its summaries.
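
The naming, retention, and pointer-file behavior listed above can be sketched as follows. The directory, the file-name pattern, the 30-day retention, and the cleanup_old_logs name come from this page; new_log_path, start_log, and the variable names are hypothetical, and the find invocation is one plausible way to implement the cleanup rather than the script's confirmed method.

```shell
#!/usr/bin/env bash
# Sketch only. new_log_path and start_log are hypothetical helpers.

LOG_DIR=${LOG_DIR:-/home/data/data_extract_logs}
RETENTION_DAYS=${RETENTION_DAYS:-30}

# Build a path such as $LOG_DIR/gkp_YYYYMMDD_HHMMSS.log.
# $1 = prefix: gt_tier1, gt_tier2, or gkp.
new_log_path() {
    printf '%s/%s_%s.log\n' "$LOG_DIR" "$1" "$(date +%Y%m%d_%H%M%S)"
}

# Delete *.log files whose age in whole days exceeds RETENTION_DAYS.
cleanup_old_logs() {
    find "$LOG_DIR" -maxdepth 1 -type f -name '*.log' \
        -mtime +"$RETENTION_DAYS" -delete
}

# Pick a new log path, record it in a pointer file, and print it.
# $1 = prefix, $2 = pointer file (for example one created with mktemp).
start_log() {
    local path
    path=$(new_log_path "$1")
    printf '%s\n' "$path" > "$2"
    printf '%s\n' "$path"
}
```

A monitor process that knows the pointer file can then resolve the current log with cat "$pointer_file" and append to it, without needing to share variables with the loop that created the log.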

Failure modes

  • Either extractor crashes: the script's trap EXIT kills both background processes (GKP loop + hourly monitor) cleanly. There is no automatic restart; the operator has to relaunch.
  • process_keywords never idles: both loops block indefinitely, rechecking every 12 h. The pipeline must be healthy enough to leave windows of idleness, or refresh starves forever.
  • Log directory full / write failure: the extractors fail when they cannot write their output to the log; there is no special recovery.
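
The exit-time cleanup described in the first bullet can be sketched like this. The use of trap EXIT is from this page; the cleanup function and the PID variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Variable and function names are hypothetical.

GKP_LOOP_PID=""    # set to $! after starting the background GKP loop
MONITOR_PID=""     # set to $! after starting the hourly monitor

# Send SIGTERM to each recorded background process, ignoring ones
# that have already exited.
cleanup() {
    local pid
    for pid in "$GKP_LOOP_PID" "$MONITOR_PID"; do
        if [ -n "$pid" ]; then
            kill "$pid" 2>/dev/null || true
        fi
    done
    return 0
}

# Run cleanup whenever the script exits, whether normally or on error.
trap cleanup EXIT
```

Because nothing restarts the loops after cleanup runs, an operator who finds the screen session gone has to relaunch the script by hand.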

See also

  • extract_gt.py — the GT extractor this script schedules
  • extract_gkp.py — the GKP extractor this script schedules
  • keep_running.md (Sub-batch 3c) — the orchestrator for the main hourly hub-loop (parallel to this scheduler)
  • Architecture — how the two schedulers (this one + keep_running.sh) fit into the pipeline