feat(knowledge): surface semantic->keyword fallback_reason in search meta + telemetry (observability for #297)#301
Merged
Conversation
…meta + telemetry (observability for #297) At ~77k-article scale every knowledge_search/context silently returned `fallback: true, search_mode: "keyword_only"` and the fallback path DISCARDED the embedding-error reason, making the outage undiagnosable. Capture that reason (never the api key / provider body) and make the degradation observable. Behavior is unchanged; keyword_only fallback is still the correct failure mode; this only ADDs the reason/telemetry/log so a silent 200 becomes alertable. - Map the discarded embedding error to a STABLE, non-sensitive tag via Loopctl.Llm.ProviderError.sanitize/1 (from #296): no_embedding_key, embedding_circuit_open, embedding_timeout, embedding_provider_error_<status> (status only), embedding_request_failed, embedding_crash, embedding_error. - Surface it as meta.fallback_reason on the combined fallback, the semantic-only keyless fallback, and the /knowledge/context path. - Distinguish the OTHER silent cause: embedding OK but semantic ranking returned ZERO rows (recall/HNSW, not embed failure) -> meta.semantic_result_count: 0 with fallback: false + a distinct semantic_empty reason. - Emit [:loopctl, :knowledge, :semantic_fallback] telemetry (measurements count/query_len/semantic_result_count; metadata tenant_id/reason; query TEXT never included, only its byte length) + a Logger.warning naming reason+tenant. - Add a bounded Prometheus counter loopctl.knowledge.semantic_fallback.count tagged by reason ONLY (no tenant_id label, cardinality stays bounded). - Document fallback_reason / semantic_result_count in the OpenApiSpex search + context response schemas and the MCP knowledge_search / knowledge_context tool descriptions (mcp-server 2.33.0 + CHANGELOG). Tests: per-cause fallback_reason mapping, no key/body leak through the sanitizer, telemetry fires with the right reason (tenant-scoped handler), semantic_empty marker, and a successful semantic setting neither fallback nor fallback_reason. Refs #297.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Make the silent semantic->keyword-only fallback observable (enabling step for diagnosing #297). Does not change fallback behavior and does not attempt the root-cause fix — the root cause needs prod data + the tenant's embedding key set.
At ~77k-article scale every
knowledge_search/contextcall silently returnedfallback: true, search_mode: "keyword_only"as a 200, and the fallback clause discarded the embedding-error reason, so nobody could tell WHY semantic never engaged.Changes
search_combined/3's fallback clause, the semantic-only keyless fallback (controller), and the/knowledge/contextpath. It is sanitized throughLoopctl.Llm.ProviderError.sanitize/1(from fix(embeddings): close 15 BYO review regressions — tenant-scoped/error-classed circuit breaker, provider-error sanitizer, permanent-error/idempotent embed, keyless degrade #296) FIRST, so no api key / provider body can reach the tag, telemetry, or logs. Mapping (bounded set):no_embedding_key,embedding_circuit_open,embedding_timeout,embedding_provider_error_<status>(status only),embedding_request_failed,embedding_crash,embedding_error.meta.fallback_reasonalongside the existingfallback: true/search_mode: "keyword_only"on all three paths (JSON views + get_context propagation).meta.semantic_result_count: 0withfallback: falseand a distinctsemantic_emptytelemetry reason — this is the difference between "embed failed" and "embed worked but recall is broken", both candidate root causes for knowledge_search/context: semantic ranking never engages at ~77k scale — every query silently falls back to keyword_only (follow-up to #180) #297.[:loopctl, :knowledge, :semantic_fallback](measurementscount/query_len/semantic_result_count; metadatatenant_id/reason— query TEXT never included, only its byte length, matching the sibling knowledge telemetry) + aLogger.warningnaming reason + tenant. Turns a silent outage into an alertable one.loopctl.knowledge.semantic_fallback.counttagged byreasonONLY (notenant_idlabel — reason is a small fixed set, so cardinality stays bounded per AC-27.15.3).knowledge_search/knowledge_contexttool descriptions mentionfallback_reason(mcp-server 2.33.0 + CHANGELOG).Tests
Per fallback cause (
:no_api_key,:timeout,{:api_error, 401, body-with-a-fake-key}, genuinely-open breaker): assert the rightfallback_reasonin meta, that the reason/telemetry NEVER contain the fake key or provider body (sanitizer), and that the telemetry fires with the right reason (tenant-scoped handler, no cross-test leakage). Plus: thesemantic_emptymarker path, a successful semantic setting NEITHERfallbacknorfallback_reasonand firing no telemetry, context-path reason propagation, and the reason-only Prometheus counter shape.Root-cause hypothesis for the follow-up (NOT fixed here)
Reading the fallback path closely, #297's "silent keyword_only at scale" most likely has two separable causes this PR now tells apart:
:no_api_keyon EVERY combined search and silently degrades. If the ~77k-article tenant never hadembedding_api_keyset,fallback_reason: "no_embedding_key"will now show that immediately. This is my leading hypothesis — it matches "semantic never engages" for one tenant.:circuit_open(embedding_circuit_open) without ever calling the provider. Secondary, but the tag will reveal it.search_semanticreturns zero (index not populated / articles unembedded / filter starves the pool), you'll now seesemantic_result_count: 0withfallback: falseandreason: "semantic_empty"— a different fix (backfill embeddings / index) than 1–2.The new signals let the follow-up diagnosis read the reason distribution from prod telemetry/logs and pick the right fix, rather than guessing.
Refs #297.