Bot-like threshold (cross-source upstream masking)¶
The first of two conditions that gate the cross-source upstream bot masking path. Asks: "is the peak so extreme that, given the surrounding baseline, this is unambiguously not a real surge?" If yes AND recovery is also observed, the months get masked out of GSC, GT, and GKP arrays before blending — they're replaced with interpolated values so the blend never sees the bot inflation.
Distinct from the in-blend spike detector — that one substitutes a single GSC month with a GT-implied value during blending and records to bot_spike_months meta. This one rewrites all three source arrays before blending begins, and the change is permanent for the iteration.
What it is¶
def _is_bot_like(run, pre_window=12, post_window=6, bot_threshold=500.0):
pre = [v for v in vals[max(0, start-12) : start] if v > 0]
post = [v for v in vals[end+1 : end+1+6] if v > 0]
if len(post) < 3:
return False # not enough post-spike data, defer
reference = max(max(pre) if pre else 0, max(post))
if reference == 0:
return True
return (run['peak_value'] / reference) >= 500.0
A "run" is a contiguous span of months where the value is ≥ 10× the source baseline. For each run, the function compares the peak value to the max of pre-spike-window and post-spike-window non-zero values — the run must be 500× larger than the largest neighboring real value to qualify as bot-like.
The function returns False when fewer than 3 post-spike months exist — the decision is deferred, not denied. A keyword whose spike hasn't yet shown its tail gets re-evaluated on the next iteration when post-data lands.
How it's computed¶
At processing.py:KB-ANCHOR:bot-like-threshold. Called from detect_transient_cross_source_spikes() once per source run that overlaps a multi-source cross-spike cluster. The function passes / fails on its own; the masking only fires if both _is_bot_like AND _has_recovery return true for at least one participating run, AND ≥ 2 sources have spike runs at the same month, AND no participating run is seasonal.
Why this choice¶
Aggressive masking → only mask the obvious cases. Because the cross-source path destructively rewrites source arrays before blending, false-positives here are expensive: an incorrectly-masked real surge replaces actual data with an interpolation, hiding genuine signal from the rest of the pipeline.
500× is a deliberately conservative threshold. Anything below 500× peak/baseline gets handled by the gentler in-blend single-month detector — which uses a 10× factor but only substitutes one month and records the substitution, leaving the source arrays untouched.
Two paths, two risk profiles:
| Path | Threshold | Side effect | When it fires |
|---|---|---|---|
| In-blend (gsc-spike-factor) | 10× median | Substitutes one month's value; records to bot_spike_months |
During GT-blend, per-month |
| Upstream cross-source (this page) | 500× peak/baseline | Rewrites GSC, GT, GKP arrays via interpolation | Before blending, multi-source coincidence required |
Edge cases¶
- Fewer than 3 post-spike months — function returns
False, the run is deferred. The keyword's bot status will be revisited in a future iteration once data accumulates. Critical for catching very recent bot incursions: we'd rather wait one cycle than mistakenly mask a real surge whose tail hasn't materialized yet. reference == 0— neither pre- nor post-window has any non-zero values. ReturnsTrue(definitely bot-like — there's nothing else in the keyword's history at all).- Pre-window shorter than 12 — values from before the keyword existed are silently absent;
max(pre)adapts. Newly-tracked keywords with limited history rely on the post-window alone.
See also¶
- Recovery window gate — the companion condition that must also hold for masking to fire
- GSC spike factor — the other bot-detection path (gentler, in-blend)
detect_transient_cross_source_spikes()— the orchestrating function (processing.py:608)