extract_gkp_gt.sh: refresh-extractor scheduler
Code: extract_gkp_gt.sh.
Last validated: 2026-05-21
What it does
The wrapper that drives both refresh extractors. Independent of keep_running.sh. Two cooperating loops:
- GT loop (foreground): extract_gt.py --tier 1 → extract_gt.py --tier 2 per cycle
- GKP loop (background): extract_gkp.py --max-batches 1000 per cycle
In loop mode only, a background hourly throughput monitor also runs. It appends [HOURLY ...] summaries (inserted-this-hour and cumulative counts) to the current logs.
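The foreground/background split described here can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `run_loop`, `run_gt_cycle`, and `run_gkp_cycle` are hypothetical names, the extractor calls are replaced by `echo` stand-ins, and the loops are capped at two cycles so the example terminates.

```shell
# Stand-ins for the real extractor calls (assumption: the real script runs
# extract_gt.py --tier 1, then --tier 2, and extract_gkp.py --max-batches 1000).
run_gt_cycle()  { echo "gt tier 1"; echo "gt tier 2"; }
run_gkp_cycle() { echo "gkp"; }

# Run function $2 for $1 cycles, sleeping $3 seconds between cycles.
run_loop() {
  cycles=$1; fn=$2; pause=$3
  i=0
  while [ "$i" -lt "$cycles" ]; do
    "$fn"
    i=$((i + 1))
    if [ "$i" -lt "$cycles" ]; then sleep "$pause"; fi
  done
}

out_dir=$(mktemp -d)
run_loop 2 run_gkp_cycle 0 > "$out_dir/gkp.out" &   # GKP loop in the background
gkp_pid=$!
run_loop 2 run_gt_cycle 0 > "$out_dir/gt.out"       # GT loop in the foreground
wait "$gkp_pid"   # demo only: in loop mode the real loops never finish
```

Running one loop in the background lets the two extractors progress at their own pace, while keeping the other in the foreground ties the script's lifetime to it.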
Two ways to invoke
# one-shot — runs each extractor once, then exits
./extract_gkp_gt.sh
# loop forever, sleeping <sleep_time> seconds between iterations
./extract_gkp_gt.sh 1800 # e.g. 30-minute cycles
The loop variant is the production mode (typically launched once and left in a long-lived screen session, similar to how keep_running.sh is launched).
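One way the optional argument could select between the two modes is sketched below. `select_mode` is a hypothetical helper invented for this example; the real script's argument handling and validation may differ.

```shell
# Hypothetical helper: print "oneshot" when no argument is given and
# "loop <seconds>" when the argument is a non-negative integer.
# Anything else is rejected with exit status 2.
select_mode() {
  if [ "$#" -eq 0 ]; then
    echo "oneshot"
    return 0
  fi
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: extract_gkp_gt.sh [sleep_seconds]" >&2
      return 2
      ;;
  esac
  echo "loop $1"
}

select_mode         # prints: oneshot
select_mode 1800    # prints: loop 1800
```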
Interlock with processing.py
Both loops call wait_for_slurm_idle() before each cycle. This blocks until no process_keywords Slurm job is running, then proceeds. While waiting, the script sleeps 12 hours between rechecks.
The interlock exists to keep refresh traffic off laksa / prata / isog while the main pipeline is doing heavy CH work — both processing.py and the extractors hit the same ClickHouse instances. Running them simultaneously starves both.
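A polling interlock of this kind might look like the sketch below. It is an assumption-laden illustration, not the script's implementation: it assumes the Slurm job name is exactly `process_keywords`, and `SLURM_RECHECK_SECONDS` and `count_running_jobs` are names made up for the example. The `squeue` options used are `-h` (no header), `-n` (filter by job name), and `-t` (filter by state).

```shell
# Recheck interval; 43200 s is the 12 hours mentioned above (variable name is illustrative).
SLURM_RECHECK_SECONDS=${SLURM_RECHECK_SECONDS:-43200}

# Count running Slurm jobs named process_keywords (assumed job name).
count_running_jobs() {
  squeue -h -n process_keywords -t RUNNING | wc -l | tr -d ' '
}

# Block until no matching job is running, sleeping between rechecks.
wait_for_slurm_idle() {
  while [ "$(count_running_jobs)" -gt 0 ]; do
    echo "process_keywords still running; rechecking in ${SLURM_RECHECK_SECONDS}s"
    sleep "$SLURM_RECHECK_SECONDS"
  done
}
```

Polling is simple and needs no coordination with the main pipeline, at the cost of reacting only at the next recheck after the job ends.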
Logs
- Directory: /home/data/data_extract_logs/
- GT tier 1: gt_tier1_YYYYMMDD_HHMMSS.log
- GT tier 2: gt_tier2_YYYYMMDD_HHMMSS.log
- GKP: gkp_YYYYMMDD_HHMMSS.log
- Cleanup: files older than 30 days are deleted on each cycle (cleanup_old_logs).
- Pointer files (created with mktemp under /tmp) track the current log path so the hourly monitor knows where to append its summaries.
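The naming, cleanup, and pointer-file conventions listed above could be implemented along these lines. This is a sketch under assumptions: `new_log_path` is a hypothetical helper, the body of `cleanup_old_logs` is a guess at one reasonable implementation rather than the script's own, and the pointer-file name template is invented.

```shell
# Log directory from the list above; overridable for testing.
LOG_DIR=${LOG_DIR:-/home/data/data_extract_logs}

# Build a timestamped log path such as <LOG_DIR>/gkp_YYYYMMDD_HHMMSS.log.
new_log_path() {
  echo "$LOG_DIR/$1_$(date +%Y%m%d_%H%M%S).log"
}

# Delete *.log files in LOG_DIR whose last modification is more than 30 days old.
cleanup_old_logs() {
  find "$LOG_DIR" -maxdepth 1 -type f -name '*.log' -mtime +30 -exec rm -f {} +
}

# Record the current log path in a pointer file so a separate process
# (here, the hourly monitor) can look up where to append.
pointer_file=$(mktemp /tmp/gkp_log_pointer.XXXXXX)
new_log_path gkp > "$pointer_file"
```

A pointer file avoids having the monitor guess the newest log by listing the directory, since each cycle starts a fresh timestamped file.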
Failure modes
- Either extractor crashes: the script's trap EXIT kills both background processes (GKP loop + hourly monitor) cleanly. No automatic restart; the operator has to relaunch.
- process_keywords never idles: both loops block indefinitely, rechecking every 12 h. The pipeline must be healthy enough to leave windows of idleness, or refresh starves forever.
- Log directory full / write failure: extractors will fail with output to the dead log; no special recovery.
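The EXIT-trap cleanup in the first failure mode follows a common shell pattern, sketched below. The names `bg_pids` and `cleanup` are illustrative, and the `sleep` commands stand in for the GKP loop and the hourly monitor; the real script's trap body may differ.

```shell
# PIDs of background helpers started by this script (illustrative variable name).
bg_pids=""

# Kill every recorded background process; ignore ones that already exited.
cleanup() {
  for pid in $bg_pids; do
    kill "$pid" 2>/dev/null || true
  done
}
trap cleanup EXIT   # runs when the script exits, normally or on error

sleep 300 &         # stand-in for the GKP loop
bg_pids="$bg_pids $!"
sleep 300 &         # stand-in for the hourly monitor
bg_pids="$bg_pids $!"
```

Note that the trap only tears things down. Nothing in this pattern restarts the work, which matches the "operator has to relaunch" behaviour described above.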
See also
- extract_gt.py — the GT extractor this script schedules
- extract_gkp.py — the GKP extractor this script schedules
- keep_running.md (Sub-batch 3c) — the orchestrator for the main hourly hub-loop (parallel to this scheduler)
- Architecture: how the two schedulers (this one + keep_running.sh) fit into the pipeline