
Keyword-Volume Knowledge Base

End-to-end documentation of the keyword-volume calculation: data sources, ingestion, processing, decisions, methods, metrics, and serving.

Why this exists

The keyword-volume system blends multiple raw sources (GSC, Google Trends, GKP, JS) into a single ClickHouse-resident pipeline and emits derived metrics — volume, trend, growth, CPC, organic %, confidence — consumed by the product API, Position Explorer, and the demo dashboard. The logic lives across kvprocessor.cpp, processing.py, pe_update.py, and the upload step; the why behind each threshold, gate, and blend weight has historically lived only in people's heads and in scattered topic-specific markdown files.

This knowledge base unifies all of that:

  • System — what the pipeline does today (sources, tables, processes, fields, decisions, metrics, consumers).
  • Methods — the curated library of forecasting / signal-processing techniques the pipeline draws on (Croston, SBA, IMAPA, Holt-Winters, Savitzky-Golay, …). Each method links to the decision nodes that route to it.
  • Problems — symptom-first cross-index: "volume shows 0", "trend is flat", "growth tag says crashing but customer disagrees" → which decision/field nodes to inspect.

How to navigate

Two main entry points:

  1. Architecture — start here for the end-to-end pipeline diagram and prose narrative (sources → ingestion → processing → serving). Best for new contributors.
  2. Problems (by symptom) — start here when debugging a specific keyword's wrong-looking value.

Or browse by node type under System in the sidebar, or by topic under Methods.

How to read an entry

Every node has YAML frontmatter exposing its typed relations, followed by human prose:

---
title: <name>
type: source | table | process | field | decision | method | metric | consumer | concept
status: production | experimental | proposed | deprecated
tags: [topic-tags]
inputs: [upstream-nodes]
feeds: [downstream-nodes]
implemented_at: processing.py:L<line>
last_validated: YYYY-MM-DD
---
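As a rough illustration of how this frontmatter could be checked mechanically, the sketch below parses a flat `key: value` block and validates the `type`, `status`, and `last_validated` fields against the values listed above. It is a minimal sketch using only the standard library, not the KB's actual tooling: the function names `parse_frontmatter` and `validate`, the example node names, the line number, and the date are all hypothetical.

```python
import re
from datetime import date

# Allowed values copied from the frontmatter template above.
ALLOWED_TYPES = {"source", "table", "process", "field", "decision",
                 "method", "metric", "consumer", "concept"}
ALLOWED_STATUS = {"production", "experimental", "proposed", "deprecated"}
LIST_KEYS = {"tags", "inputs", "feeds"}


def parse_frontmatter(text):
    """Parse a flat `key: value` block delimited by `---` lines (hypothetical helper)."""
    lines = text.strip().splitlines()
    if len(lines) < 2 or lines[0].strip() != "---":
        raise ValueError("entry must start with a '---' line")
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        # Split on the first colon only, so values such as
        # `processing.py:L120` keep their own colon.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {line!r}")
        key, value = key.strip(), value.strip()
        if key in LIST_KEYS:
            inner = value.strip("[]")
            meta[key] = [item.strip() for item in inner.split(",") if item.strip()]
        else:
            meta[key] = value
    raise ValueError("missing closing '---'")


def validate(meta):
    """Return a list of human-readable problems; an empty list means the entry passed."""
    problems = []
    if meta.get("type") not in ALLOWED_TYPES:
        problems.append(f"unknown type: {meta.get('type')!r}")
    if meta.get("status") not in ALLOWED_STATUS:
        problems.append(f"unknown status: {meta.get('status')!r}")
    stamp = meta.get("last_validated", "")
    try:
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", stamp):
            raise ValueError(stamp)
        date.fromisoformat(stamp)  # also rejects impossible dates such as 2024-02-30
    except ValueError:
        problems.append("last_validated is not a valid YYYY-MM-DD date")
    return problems


# Hypothetical entry: names, line number, and date are made up for illustration.
EXAMPLE = """---
title: example_decision
type: decision
status: production
tags: [growth]
inputs: [example_field]
feeds: [example_metric]
implemented_at: processing.py:L120
last_validated: 2024-01-15
---
"""

meta = parse_frontmatter(EXAMPLE)
print(validate(meta))
```

A real check would more likely use a YAML library; the hand-rolled parser here only covers the flat shape shown in the template and keeps the example dependency-free.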

Body sections:

  • What it is — one paragraph, plain language.
  • How it works — the substantive explanation. For decisions: inputs, threshold, branches.
  • Why this choice — the rationale that today lives in tribal knowledge: past incidents, rejected alternatives, the "we tried X, was too noisy" stories.
  • Edge cases / known issues — numbered list.
  • Related — cross-links to other nodes.

Method entries additionally carry Source / Link / Retrieved / Fit for our model (citing processing.py:L…) per the curated-method convention.
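Because every node declares its `inputs`, the symptom-first lookup described under Problems can be approximated by walking upstream from the affected metric and collecting the decision and field nodes along the way. The following is a minimal sketch under that assumption; the `NODES` dictionary, its node names, and the `upstream` helper are hypothetical and stand in for whatever the parsed frontmatter would actually contain.

```python
from collections import deque

# Hypothetical nodes: in practice these would come from each entry's frontmatter.
NODES = {
    "example_metric": {"type": "metric", "inputs": ["example_decision"]},
    "example_decision": {"type": "decision",
                         "inputs": ["example_field", "example_table"]},
    "example_field": {"type": "field", "inputs": ["example_table"]},
    "example_table": {"type": "table", "inputs": ["example_source"]},
    "example_source": {"type": "source", "inputs": []},
}


def upstream(nodes, start, wanted_types=("decision", "field")):
    """Breadth-first walk over `inputs`, returning upstream nodes of the wanted types."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        name = queue.popleft()
        for parent in nodes.get(name, {}).get("inputs", []):
            if parent in seen:
                continue  # already visited via another path
            seen.add(parent)
            queue.append(parent)
            if nodes.get(parent, {}).get("type") in wanted_types:
                found.append(parent)
    return found


# Nodes to inspect when "example_metric" looks wrong, nearest first.
print(upstream(NODES, "example_metric"))
```

Breadth-first order lists the closest nodes first, which matches how one would usually debug: start with the decision that directly feeds the metric, then move further up.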

Status legend (unified across the KB)

Status | Meaning
🟩 production | Live in the current pipeline.
🟧 experimental | Piloted in a notebook, not yet in production.
🟦 proposed | Captured as worth considering; not yet tried.
🟥 deprecated | Was production, now removed or replaced (see superseded_by:).

Where this lives

  • Public URL (VPN required): https://keyword-volume-kb.yep.tools/
  • Source: data_science/keyword-volume/kb/ in the monorepo
  • Service: mkdocs running as systemd user unit keyword-volume-kb on isog, port 8090
  • Reverse proxy / TLS: Caddy on yepsand (config in ~/mimi/yep/ops/caddy/Caddyfile_tools)

Status of this KB

This documentation is being built in phases — see the plan in /home/data/.claude/plans/. The Methods section is fully populated (ported from the prior forecasting KB). The System section is being filled phase by phase; sub-sections show "TBD — Phase N" placeholders until their phase lands.

The original eight project-root .md files (readme.md, hybrid_growth.md, confidence_score.md, refresh_detection.md, REFRESH_PIPELINE.md, hybrid_growth_timeframes.md, pe_keywords.md, README_demo.md) live under docs/_archive/ and are being decomposed into typed nodes.