Keyword-Volume Knowledge Base
End-to-end documentation of the keyword-volume calculation: data sources, ingestion, processing, decisions, methods, metrics, and serving.
Why this exists
The keyword-volume system blends multiple raw sources (GSC, Google Trends, GKP, JS) into a single ClickHouse-resident pipeline and emits derived metrics — volume, trend, growth, CPC, organic %, confidence — consumed by the product API, Position Explorer, and the demo dashboard. The logic lives across kvprocessor.cpp, processing.py, pe_update.py, and the upload step; the why behind each threshold, gate, and blend weight has historically lived only in people's heads and in scattered topic-specific markdown files.
This knowledge base unifies all of that:
- System — what the pipeline does today (sources, tables, processes, fields, decisions, metrics, consumers).
- Methods — the curated library of forecasting / signal-processing techniques the pipeline draws on (Croston, SBA, IMAPA, Holt-Winters, Savitzky-Golay, …). Each method links to the decision nodes that route to it.
- Problems — symptom-first cross-index: "volume shows 0", "trend is flat", "growth tag says crashing but customer disagrees" → which decision/field nodes to inspect.
How to navigate
Two main entry points:
- Architecture — start here for the end-to-end pipeline diagram and prose narrative (sources → ingestion → processing → serving). Best for new contributors.
- Problems (by symptom) — start here when debugging a specific keyword's wrong-looking value.
Alternatively, browse by node type under System in the sidebar, or by topic under Methods.
How to read an entry
Every node has YAML frontmatter exposing its typed relations, followed by human prose:
---
title: <name>
type: source | table | process | field | decision | method | metric | consumer | concept
status: production | experimental | proposed | deprecated
tags: [topic-tags]
inputs: [upstream-nodes]
feeds: [downstream-nodes]
implemented_at: processing.py:L<line>
last_validated: YYYY-MM-DD
---
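To illustrate how tooling could consume these typed relations, the sketch below parses a frontmatter block of the shape shown above and checks type, status, and last_validated against the allowed values. It is a minimal, stdlib-only sketch and not the KB's actual tooling: the function names parse_frontmatter and validate, the hand-rolled parser (which handles only flat key: value pairs and inline [a, b] lists), and every value in the example node are assumptions made for illustration. A real YAML library would be the more robust choice.

```python
from datetime import date

# Allowed values copied from the frontmatter template above.
ALLOWED_TYPES = {"source", "table", "process", "field", "decision",
                 "method", "metric", "consumer", "concept"}
ALLOWED_STATUSES = {"production", "experimental", "proposed", "deprecated"}


def parse_frontmatter(text):
    """Parse a flat frontmatter block delimited by '---' lines into a dict.

    Illustrative only: handles 'key: value' pairs and inline '[a, b]' lists.
    """
    lines = text.strip().splitlines()
    if len(lines) < 2 or lines[0].strip() != "---":
        raise ValueError("frontmatter must start with '---'")
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter: stop before the prose
            return meta
        # Split at the first colon only, so 'processing.py:L120' stays intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'key: value'
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            value = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        meta[key.strip()] = value
    raise ValueError("frontmatter is missing its closing '---'")


def validate(meta):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if meta.get("type") not in ALLOWED_TYPES:
        problems.append(f"unknown type: {meta.get('type')!r}")
    if meta.get("status") not in ALLOWED_STATUSES:
        problems.append(f"unknown status: {meta.get('status')!r}")
    try:
        date.fromisoformat(meta.get("last_validated", ""))
    except (TypeError, ValueError):
        problems.append("last_validated is not a YYYY-MM-DD date")
    return problems


# Hypothetical node: every name, line number, and date below is made up.
EXAMPLE = """\
---
title: example-decision
type: decision
status: production
tags: [routing, intermittent]
inputs: [example-field]
feeds: [example-method]
implemented_at: processing.py:L120
last_validated: 2024-01-15
---
Prose body starts here.
"""

meta = parse_frontmatter(EXAMPLE)
print(meta["feeds"])
print(validate(meta))
```

Returning a list of problems rather than raising on the first one lets a single pass report everything wrong with a node.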
Body sections:
- What it is — one paragraph, plain language.
- How it works — the substantive explanation. For decisions: inputs, threshold, branches.
- Why this choice — the rationale that today lives in tribal knowledge: past incidents, rejected alternatives, the "we tried X, was too noisy" stories.
- Edge cases / known issues — numbered list.
- Related — cross-links to other nodes.
Method entries additionally carry Source / Link / Retrieved / Fit for our model (citing processing.py:L…) per the curated-method convention.
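The body-section convention above could be checked mechanically. The sketch below reports which expected sections are missing from a node body. It rests on assumptions the text does not confirm: that each section appears as a Markdown heading line with exactly the title listed above, and that the extra method items (Source, Link, Retrieved, Fit for our model) are also headings. The function name missing_sections and the sample body are hypothetical.

```python
import re

# Section titles as listed in the text above.
REQUIRED_SECTIONS = [
    "What it is",
    "How it works",
    "Why this choice",
    "Edge cases / known issues",
    "Related",
]
# Extra items the text lists for method entries (assumed to be headings).
METHOD_SECTIONS = ["Source", "Link", "Retrieved", "Fit for our model"]

# Assumption: a section is a Markdown heading line such as '## Related'.
HEADING = re.compile(r"^#+[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def missing_sections(body, node_type):
    """Return the expected section titles not found as headings in body."""
    found = {match.group(1) for match in HEADING.finditer(body)}
    expected = list(REQUIRED_SECTIONS)
    if node_type == "method":
        expected += METHOD_SECTIONS
    return [title for title in expected if title not in found]


# Hypothetical node body with two sections left out.
body = """\
## What it is
One paragraph.

## How it works
Inputs, threshold, branches.

## Related
Links.
"""

print(missing_sections(body, "decision"))
# → ['Why this choice', 'Edge cases / known issues']
```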
Status legend (unified across the KB)
| Status | Meaning |
|---|---|
| 🟩 production | Live in the current pipeline. |
| 🟧 experimental | Piloted in a notebook, not yet in production. |
| 🟦 proposed | Captured as worth considering; not yet tried. |
| 🟥 deprecated | Was production, now removed or replaced (see superseded_by:). |
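
As a small illustration, the legend can be expressed as a lookup table plus one soft check. The sketch below is an assumption-laden example, not existing tooling: the names STATUS_BADGES, status_label, and status_warnings are hypothetical, and the check treats a missing superseded_by on a deprecated node as a warning only, since the legend allows a node to be removed without a replacement.

```python
# Badge per status, copied from the legend table above.
STATUS_BADGES = {
    "production": "🟩",
    "experimental": "🟧",
    "proposed": "🟦",
    "deprecated": "🟥",
}


def status_label(meta):
    """Render the badge-plus-name label for a node's status."""
    status = meta["status"]
    return f"{STATUS_BADGES[status]} {status}"


def status_warnings(meta):
    """Return soft warnings about a node's status metadata."""
    status = meta.get("status")
    if status not in STATUS_BADGES:
        return [f"unknown status: {status!r}"]
    if status == "deprecated" and not meta.get("superseded_by"):
        # Only a warning: a deprecated node may have been removed outright.
        return ["deprecated node names no superseded_by"]
    return []


# Hypothetical nodes, given as already-parsed frontmatter dicts.
print(status_label({"status": "production"}))
print(status_warnings({"status": "deprecated"}))
```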
Where this lives
- Public URL (VPN required): https://keyword-volume-kb.yep.tools/
- Source: data_science/keyword-volume/kb/ in the monorepo
- Service: mkdocs running as systemd user unit keyword-volume-kb on isog, port 8090
- Reverse proxy / TLS: Caddy on yepsand (config in ~/mimi/yep/ops/caddy/Caddyfile_tools)
Status of this KB
This documentation is being built in phases — see the plan in /home/data/.claude/plans/. The Methods section is fully populated (ported from the prior forecasting KB). The System section is being filled phase by phase; sub-sections show "TBD — Phase N" placeholders until their phase lands.
The original eight project-root .md files (readme.md, hybrid_growth.md, confidence_score.md, refresh_detection.md, REFRESH_PIPELINE.md, hybrid_growth_timeframes.md, pe_keywords.md, README_demo.md) live under docs/_archive/ and are being decomposed into typed nodes.