Skip to content

Bot-like threshold (cross-source upstream masking)

The first of two conditions that gate the cross-source upstream bot masking path. Asks: "is the peak so extreme that, given the surrounding baseline, this is unambiguously not a real surge?" If yes AND recovery is also observed, the months get masked out of GSC, GT, and GKP arrays before blending — they're replaced with interpolated values so the blend never sees the bot inflation.

Distinct from the in-blend spike detector — that one substitutes a single GSC month with a GT-implied value during blending and records to bot_spike_months meta. This one rewrites all three source arrays before blending begins, and the change is permanent for the iteration.

What it is

def _is_bot_like(run, pre_window=12, post_window=6, bot_threshold=500.0):
    pre  = [v for v in vals[max(0, start-12) : start] if v > 0]
    post = [v for v in vals[end+1 : end+1+6]          if v > 0]
    if len(post) < 3:
        return False                                    # not enough post-spike data, defer
    reference = max(max(pre) if pre else 0, max(post))
    if reference == 0:
        return True
    return (run['peak_value'] / reference) >= 500.0

A "run" is a contiguous span of months where the value is ≥ 10× the source baseline. For each run, the function compares the peak value to the max of pre-spike-window and post-spike-window non-zero values — the run must be 500× larger than the largest neighboring real value to qualify as bot-like.

The function returns False when fewer than 3 post-spike months exist — the decision is deferred, not denied. A keyword whose spike hasn't yet shown its tail gets re-evaluated on the next iteration when post-data lands.

How it's computed

At processing.py:KB-ANCHOR:bot-like-threshold. Called from detect_transient_cross_source_spikes() once per source run that overlaps a multi-source cross-spike cluster. The function passes / fails on its own; the masking only fires if both _is_bot_like AND _has_recovery return true for at least one participating run, AND ≥ 2 sources have spike runs at the same month, AND no participating run is seasonal.

Why this choice

Aggressive masking → only mask the obvious cases. Because the cross-source path destructively rewrites source arrays before blending, false-positives here are expensive: an incorrectly-masked real surge replaces actual data with an interpolation, hiding genuine signal from the rest of the pipeline.

500× is a deliberately conservative threshold. Anything below 500× peak/baseline gets handled by the gentler in-blend single-month detector — which uses a 10× factor but only substitutes one month and records the substitution, leaving the source arrays untouched.

Two paths, two risk profiles:

Path Threshold Side effect When it fires
In-blend (gsc-spike-factor) 10× median Substitutes one month's value; records to bot_spike_months During GT-blend, per-month
Upstream cross-source (this page) 500× peak/baseline Rewrites GSC, GT, GKP arrays via interpolation Before blending, multi-source coincidence required

Edge cases

  • Fewer than 3 post-spike months — function returns False, the run is deferred. The keyword's bot status will be revisited in a future iteration once data accumulates. Critical for catching very recent bot incursions: we'd rather wait one cycle than mistakenly mask a real surge whose tail hasn't materialized yet.
  • reference == 0 — neither pre- nor post-window has any non-zero values. Returns True (definitely bot-like — there's nothing else in the keyword's history at all).
  • Pre-window shorter than 12 — values from before the keyword existed are silently absent; max(pre) adapts. Newly-tracked keywords with limited history rely on the post-window alone.

See also

  • Recovery window gate — the companion condition that must also hold for masking to fire
  • GSC spike factor — the other bot-detection path (gentler, in-blend)
  • detect_transient_cross_source_spikes() — the orchestrating function (processing.py:608)