extract_gkp_gt.sh: refresh-extractor scheduler

Code: extract_gkp_gt.sh. Last validated: 2026-05-21

What it does

This wrapper drives both refresh extractors and runs independently of keep_running.sh. It consists of two cooperating loops:

GT loop (foreground): extract_gt.py --tier 1 → extract_gt.py --tier 2 per cycle
GKP loop (background): extract_gkp.py --max-batches 1000 per cycle

In loop mode only, a background hourly throughput monitor also runs. It appends [HOURLY ...] summaries, giving the inserted-this-hour and cumulative counts, to the current logs.

Two ways to invoke

# one-shot — runs each extractor once, then exits
./extract_gkp_gt.sh

# loop forever, sleeping <sleep_time> seconds between iterations
./extract_gkp_gt.sh 1800     # e.g. 30-minute cycles

The loop variant is the production mode (typically launched once and left in a long-lived screen session, similar to how keep_running.sh is launched).
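As a rough illustration of how the optional argument could select between the two modes, here is a minimal sketch. It is not the actual script: `select_mode`, `main`, `run_gt_cycle`, and `run_gkp_cycle` are hypothetical names, the argument validation is an assumption, and so is running the two extractors one after the other in one-shot mode.

```shell
# Sketch only: names and validation are assumptions, not taken from extract_gkp_gt.sh.

# Hypothetical stand-ins for the real extractor invocations; they only print.
run_gt_cycle()  { echo "GT cycle: extract_gt.py --tier 1, then --tier 2"; }
run_gkp_cycle() { echo "GKP cycle: extract_gkp.py --max-batches 1000"; }

# Classify the optional first argument:
#   no argument          -> one-shot
#   non-negative integer -> loop (value is the sleep time in seconds)
#   anything else        -> invalid
select_mode() {
    case "${1:-}" in
        '')        echo "one-shot" ;;
        *[!0-9]*)  echo "invalid" ;;
        *)         echo "loop" ;;
    esac
}

main() {
    local sleep_time="${1:-}"
    case "$(select_mode "$sleep_time")" in
        one-shot)
            # Assumption: in one-shot mode the extractors run sequentially.
            run_gt_cycle
            run_gkp_cycle
            ;;
        loop)
            # GKP loop in the background, GT loop in the foreground.
            ( while true; do run_gkp_cycle; sleep "$sleep_time"; done ) &
            while true; do run_gt_cycle; sleep "$sleep_time"; done
            ;;
        *)
            echo "usage: extract_gkp_gt.sh [sleep_time_seconds]" >&2
            return 2
            ;;
    esac
}
```

In this sketch, calling `main` with no argument takes the one-shot path and returns, while `main 1800` never returns and cycles every 30 minutes.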

Interlock with `processing.py`

Both loops call wait_for_slurm_idle() before each cycle. This blocks until no process_keywords Slurm job is running, then proceeds. While waiting, the script sleeps 12 hours between rechecks.

The interlock exists to keep refresh traffic off laksa / prata / isog while the main pipeline is doing heavy ClickHouse (CH) work. Both processing.py and the extractors hit the same ClickHouse instances, so running them simultaneously starves both.
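The blocking behaviour described above could be sketched as follows. This is an assumption-laden illustration rather than the script's real code: only the name `wait_for_slurm_idle` comes from this page, while `process_keywords_running` and `RECHECK_SECONDS` are hypothetical, and the sketch assumes the pipeline job can be found by its Slurm job name via `squeue`.

```shell
# Sketch only. RECHECK_SECONDS and process_keywords_running are illustrative names.
RECHECK_SECONDS="${RECHECK_SECONDS:-43200}"   # 12 hours, as described above

# Succeeds (returns 0) while at least one process_keywords job is RUNNING.
# Assumption: the job is identifiable by its Slurm job name.
# Caveat: if squeue is missing or fails, the output is empty and this
# sketch treats the cluster as idle.
process_keywords_running() {
    [ -n "$(squeue --noheader --name=process_keywords --states=RUNNING 2>/dev/null)" ]
}

# Block until no process_keywords job is running, rechecking periodically.
wait_for_slurm_idle() {
    while process_keywords_running; do
        echo "process_keywords is running; rechecking in ${RECHECK_SECONDS}s" >&2
        sleep "$RECHECK_SECONDS"
    done
}
```

Keeping the Slurm query in its own function makes the wait loop easy to exercise without a cluster, for example by overriding `process_keywords_running` with a stub and setting `RECHECK_SECONDS=0`.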

Logs

Directory: /home/data/data_extract_logs/
GT tier 1: gt_tier1_YYYYMMDD_HHMMSS.log
GT tier 2: gt_tier2_YYYYMMDD_HHMMSS.log
GKP: gkp_YYYYMMDD_HHMMSS.log
Cleanup: files older than 30 days are deleted on each cycle (cleanup_old_logs).
Pointer files (created with mktemp under /tmp) record the current log path so the hourly monitor knows where to append its summaries.
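The naming, pointer-file, and cleanup conventions above might look roughly like this. It is a sketch under stated assumptions: `cleanup_old_logs` is the name used on this page, but its body here is a guess, and `new_log_path` and `set_current_log` are hypothetical helpers.

```shell
# Sketch only. Helper names other than cleanup_old_logs are illustrative.
LOG_DIR="${LOG_DIR:-/home/data/data_extract_logs}"

# Build a timestamped log path for a prefix such as gt_tier1, gt_tier2, or gkp.
new_log_path() {
    local prefix="$1"
    echo "${LOG_DIR}/${prefix}_$(date +%Y%m%d_%H%M%S).log"
}

# Write the current log path into a pointer file so that another process
# (here, the hourly monitor) can look up where to append.
set_current_log() {
    local pointer_file="$1" log_path="$2"
    printf '%s\n' "$log_path" > "$pointer_file"
}

# Delete *.log files in LOG_DIR that were last modified more than 30 days ago.
# -maxdepth and -delete are GNU/BSD find extensions.
cleanup_old_logs() {
    find "$LOG_DIR" -maxdepth 1 -type f -name '*.log' -mtime +30 -delete
}
```

A cycle could then create a pointer with `ptr="$(mktemp)"`, call `set_current_log "$ptr" "$(new_log_path gkp)"`, and have the monitor read the path back with `cat "$ptr"`.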

Failure modes

Either extractor crashes: the script's trap EXIT kills both background processes (GKP loop + hourly monitor) cleanly. There is no automatic restart; the operator has to relaunch.
process_keywords never idles: both loops block indefinitely, rechecking every 12 h. The pipeline must be healthy enough to leave windows of idleness, or refresh starves forever.
Log directory full / write failure: the extractors fail while trying to write to a log that can no longer be written. There is no special recovery.
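The EXIT-trap cleanup mentioned in the first failure mode can be illustrated with a small sketch. The names `BG_PIDS`, `start_background`, and `cleanup_background` are hypothetical and not taken from the script; the general pattern is to record each background PID and kill them all when the shell exits.

```shell
# Sketch only. BG_PIDS, start_background, and cleanup_background are illustrative names.
BG_PIDS=""

# Kill every background process recorded in BG_PIDS.
# Errors are ignored because a process may already have exited.
cleanup_background() {
    for pid in $BG_PIDS; do
        kill "$pid" 2>/dev/null || true
    done
}

# Run the cleanup whenever this shell exits.
# An EXIT trap does not run if the shell itself is killed with SIGKILL.
trap cleanup_background EXIT

# Start a command in the background and remember its PID.
start_background() {
    "$@" &
    BG_PIDS="$BG_PIDS $!"
}
```

In the real script the background commands would be the GKP loop and the hourly monitor; in a quick experiment, `start_background sleep 300` followed by exiting the shell should leave no stray `sleep` behind.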