Skip to content

feat(knowledge): surface semantic->keyword fallback_reason in search meta + telemetry (observability for #297)#301

Merged
mkreyman merged 1 commit into
masterfrom
feat/semantic-fallback-observability
Jul 4, 2026
Merged

feat(knowledge): surface semantic->keyword fallback_reason in search meta + telemetry (observability for #297)#301
mkreyman merged 1 commit into
masterfrom
feat/semantic-fallback-observability

Conversation

@mkreyman

@mkreyman mkreyman commented Jul 4, 2026

Copy link
Copy Markdown
Owner

What

Make the silent semantic->keyword-only fallback observable (enabling step for diagnosing #297). Does not change fallback behavior and does not attempt the root-cause fix — the root cause needs prod data + the tenant's embedding key set.

At ~77k-article scale every knowledge_search/context call silently returned fallback: true, search_mode: "keyword_only" as a 200, and the fallback clause discarded the embedding-error reason, so nobody could tell WHY semantic never engaged.

Changes

  • Capture the discarded reason in search_combined/3's fallback clause, the semantic-only keyless fallback (controller), and the /knowledge/context path. It is sanitized through Loopctl.Llm.ProviderError.sanitize/1 (from fix(embeddings): close 15 BYO review regressions — tenant-scoped/error-classed circuit breaker, provider-error sanitizer, permanent-error/idempotent embed, keyless degrade #296) FIRST, so no api key / provider body can reach the tag, telemetry, or logs. Mapping (bounded set): no_embedding_key, embedding_circuit_open, embedding_timeout, embedding_provider_error_<status> (status only), embedding_request_failed, embedding_crash, embedding_error.
  • Surface meta.fallback_reason alongside the existing fallback: true / search_mode: "keyword_only" on all three paths (JSON views + get_context propagation).
  • Distinguish embed-fail from recall-broken: when the embedding SUCCEEDS but semantic ranking returns ZERO rows (recall/HNSW, not an embed failure), combined mode reports meta.semantic_result_count: 0 with fallback: false and a distinct semantic_empty telemetry reason — this is the difference between "embed failed" and "embed worked but recall is broken", both candidate root causes for knowledge_search/context: semantic ranking never engages at ~77k scale — every query silently falls back to keyword_only (follow-up to #180) #297.
  • Telemetry + log: [:loopctl, :knowledge, :semantic_fallback] (measurements count/query_len/semantic_result_count; metadata tenant_id/reason — query TEXT never included, only its byte length, matching the sibling knowledge telemetry) + a Logger.warning naming reason + tenant. Turns a silent outage into an alertable one.
  • Prometheus counter loopctl.knowledge.semantic_fallback.count tagged by reason ONLY (no tenant_id label — reason is a small fixed set, so cardinality stays bounded per AC-27.15.3).
  • Docs: OpenApiSpex search + context response schemas document the new optional meta fields; MCP knowledge_search / knowledge_context tool descriptions mention fallback_reason (mcp-server 2.33.0 + CHANGELOG).

Tests

Per fallback cause (:no_api_key, :timeout, {:api_error, 401, body-with-a-fake-key}, genuinely-open breaker): assert the right fallback_reason in meta, that the reason/telemetry NEVER contain the fake key or provider body (sanitizer), and that the telemetry fires with the right reason (tenant-scoped handler, no cross-test leakage). Plus: the semantic_empty marker path, a successful semantic setting NEITHER fallback nor fallback_reason and firing no telemetry, context-path reason propagation, and the reason-only Prometheus counter shape.

Root-cause hypothesis for the follow-up (NOT fixed here)

Reading the fallback path closely, #297's "silent keyword_only at scale" most likely has two separable causes this PR now tells apart:

  1. Embedding always failing on the global/tenant key — mandatory BYO (feat(llm): extend per-tenant BYO to embeddings — mandatory tenant OpenAI key, gated enqueue, usage tracking #295/fix(embeddings): close 15 BYO review regressions — tenant-scoped/error-classed circuit breaker, provider-error sanitizer, permanent-error/idempotent embed, keyless degrade #296) means a tenant with no configured embedding key gets :no_api_key on EVERY combined search and silently degrades. If the ~77k-article tenant never had embedding_api_key set, fallback_reason: "no_embedding_key" will now show that immediately. This is my leading hypothesis — it matches "semantic never engages" for one tenant.
  2. Breaker latched open — if early failures tripped the tenant-scoped breaker, subsequent calls short-circuit to :circuit_open (embedding_circuit_open) without ever calling the provider. Secondary, but the tag will reveal it.
  3. Recall/HNSW empty — if the embedding SUCCEEDS but search_semantic returns zero (index not populated / articles unembedded / filter starves the pool), you'll now see semantic_result_count: 0 with fallback: false and reason: "semantic_empty" — a different fix (backfill embeddings / index) than 1–2.

The new signals let the follow-up diagnosis read the reason distribution from prod telemetry/logs and pick the right fix, rather than guessing.

Refs #297.

…meta + telemetry (observability for #297)

At ~77k-article scale every knowledge_search/context silently returned
`fallback: true, search_mode: "keyword_only"` and the fallback path DISCARDED
the embedding-error reason, making the outage undiagnosable. Capture that
reason (never the api key / provider body) and make the degradation observable.
Behavior is unchanged; keyword_only fallback is still the correct failure mode;
this only ADDs the reason/telemetry/log so a silent 200 becomes alertable.

- Map the discarded embedding error to a STABLE, non-sensitive tag via
  Loopctl.Llm.ProviderError.sanitize/1 (from #296): no_embedding_key,
  embedding_circuit_open, embedding_timeout, embedding_provider_error_<status>
  (status only), embedding_request_failed, embedding_crash, embedding_error.
- Surface it as meta.fallback_reason on the combined fallback, the
  semantic-only keyless fallback, and the /knowledge/context path.
- Distinguish the OTHER silent cause: embedding OK but semantic ranking returned
  ZERO rows (recall/HNSW, not embed failure) -> meta.semantic_result_count: 0
  with fallback: false + a distinct semantic_empty reason.
- Emit [:loopctl, :knowledge, :semantic_fallback] telemetry (measurements
  count/query_len/semantic_result_count; metadata tenant_id/reason; query TEXT
  never included, only its byte length) + a Logger.warning naming reason+tenant.
- Add a bounded Prometheus counter loopctl.knowledge.semantic_fallback.count
  tagged by reason ONLY (no tenant_id label, cardinality stays bounded).
- Document fallback_reason / semantic_result_count in the OpenApiSpex search +
  context response schemas and the MCP knowledge_search / knowledge_context tool
  descriptions (mcp-server 2.33.0 + CHANGELOG).

Tests: per-cause fallback_reason mapping, no key/body leak through the
sanitizer, telemetry fires with the right reason (tenant-scoped handler),
semantic_empty marker, and a successful semantic setting neither fallback nor
fallback_reason.

Refs #297.
@mkreyman mkreyman marked this pull request as ready for review July 4, 2026 02:23
@mkreyman mkreyman merged commit 39f0a90 into master Jul 4, 2026
10 checks passed
@mkreyman mkreyman deleted the feat/semantic-fallback-observability branch July 4, 2026 02:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant