feat(resolver): asset classifier mapping identifiers to specific taxonomy#33
feat(resolver): asset classifier mapping identifiers to specific taxonomy#33robertoecf wants to merge 7 commits into
Conversation
…xonomy Add resolve_asset(): turns any Brazilian asset identifier (ticker/CNPJ/ISIN/ name) into a classification mapped to the consolidation macro taxonomy (RF/RV/Multimercado/Alternativos/Estruturados) plus an orthogonal exposure axis (Brasil/Internacional), subclasse, underlying_nature, debenture Lei-12.431 facts, source, confidence, and an audit cascade. Deterministic, cacheable, no PII. Core is offline (curated ETF/global-fund seed + structural rules); external providers (Mais Retorno / CVM-B3 / web search) are an injectable fallback chain. Exposed over REST (/resolver/resolve) and as the resolve_asset MCP tool. Passes the Wealthuman spec test set: IFRA11→RF/Indexada à Inflação, ARBOR/WHG→RV+Internacional, DEB PETROBRAS IPCA+→RF incentivada+isento, COE→Estruturados (never ETF), "Crédito Estruturado"→RF (name-trap), IVVB11→RV+Internacional, FIIs→RV. Geography is modeled as the exposure axis, not a macro class (macro_class is pure asset class). Hardened after cross-host adversarial review: heuristic Lei-12.431 isento kept below the cascade short-circuit so a provider confirms by ISIN; bare-token collisions (IE/LC/LF, substring LCI) removed; API length caps; as_of stamped in America/Sao_Paulo. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (11)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (6)
📝 WalkthroughWalkthroughAdds a new resolver package for deterministic classification of Brazilian asset identifiers, wires it into REST and MCP surfaces, and expands tests, docs, and configuration to match the new resolver behavior. ChangesAsset-classification resolver
Sequence DiagramsequenceDiagram
participant Client
participant FastAPI as /resolver/resolve
participant resolve_asset
participant classify
participant lookup_seed
participant _rule_payload
participant AssetProvider
Client->>FastAPI: GET /resolver/resolve
FastAPI->>resolve_asset: optional identifiers
resolve_asset->>classify: normalized input
classify->>lookup_seed: seed match lookup
lookup_seed-->>classify: seed or none
classify->>_rule_payload: structural rules
_rule_payload-->>classify: payload
classify-->>resolve_asset: AssetClassification
loop optional providers
resolve_asset->>AssetProvider: norm, current
AssetProvider-->>resolve_asset: enriched result or None
end
resolve_asset-->>FastAPI: AssetClassification
FastAPI-->>Client: JSON response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 markdownlint-cli2 (0.22.1)CHANGELOG.mdmarkdownlint-cli2 wrapper config was not available before execution docs/MCP_SURFACE.mdmarkdownlint-cli2 wrapper config was not available before execution docs/RESOLVER.mdmarkdownlint-cli2 wrapper config was not available before execution Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a new asset-classification resolver module and its corresponding API endpoints and MCP tool, allowing Brazilian assets to be mapped to a macro taxonomy and exposure axis. The code review feedback focuses on enhancing the robustness of the rule-based classification engine. Key recommendations include restricting keywords like 'PARTICIPACOES', 'MACRO', 'ACOES', and 'EQUITY' to a fund context to prevent collisions with direct equities or commercial names. Additionally, the feedback suggests mapping federal public bond abbreviations (e.g., LFT, LTN) to their respective indexers to avoid incorrect classification as 'Crédito Privado', and removing a redundant string folding operation in the seed lookup.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/MCP_SURFACE.md (1)
56-72: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winThe heading count and the listed tools disagree.
This section is titled "The 25 curated tools", but the block still includes the optional
findata_run_code, so the rendered list shows 26 names. Either move the code-mode tool below this section or rename the heading to make the 25+1 split explicit.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/MCP_SURFACE.md` around lines 56 - 72, The section title and tool list are inconsistent because the `findata_run_code` entry is still included under “The 25 curated tools,” making the rendered count 26. Update the `MCP_SURFACE` tool catalog block by either removing `findata_run_code` from this section and placing it elsewhere as the opt-in code-mode tool, or renaming the heading to clearly indicate a 25+1 split so the count matches the listed tools.
🧹 Nitpick comments (3)
tests/test_mcp_surface.py (1)
22-23: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAssert the new resolver surface explicitly, not just the totals.
Bumping the counts alone will still pass if some unrelated route/tool is added while
resolve_assetstays unwired. Since this PR's contract is the new resolver surface, please pin it directly in the test.Example strengthening
def test_curated_mcp_is_a_small_fraction_of_the_rest_surface() -> None: mcp_ids = _operation_ids(mcp_app) rest_ids = _operation_ids(app) + assert "resolve_asset" in mcp_ids + assert "/resolver/resolve" in app.openapi()["paths"] assert len(mcp_ids) == EXPECTED_TOOLS assert len(rest_ids) == EXPECTED_REST_OPERATIONS🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_mcp_surface.py` around lines 22 - 23, The surface test currently only checks aggregate tool and REST counts, so it can pass even if the new resolver is not actually wired in. Update the assertions in test_mcp_surface.py to explicitly verify the resolver surface by name, including resolve_asset and its related MCP/REST exposure, alongside the existing totals. Use the existing EXPECTED_TOOLS, EXPECTED_REST_OPERATIONS, and the test that enumerates the surface to pin the new contract directly instead of relying on count-only coverage.pyproject.toml (1)
140-142: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winPrefer a targeted Ruff suppression over a file-wide ignore.
This disables complexity checks for every future function in
src/findata/resolver/engine.py, not just the intentional rule cascade. A function-level# noqaon the specific dispatcher(s) keeps the exception local and preserves signal for new helpers.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pyproject.toml` around lines 140 - 142, Move the Ruff suppression out of the pyproject.toml file-wide ignore and localize it to the intentional dispatcher in src/findata/resolver/engine.py. Keep the complexity exceptions only on the specific function(s) implementing the classification cascade, using a function-level noqa on the relevant dispatcher symbol instead of suppressing C901, PLR0912, and PLR0911 for the entire module.src/findata/resolver/seed.py (1)
146-150: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick winFail fast on duplicate curated identifiers.
These dict comprehensions silently let the last duplicate ticker/CNPJ win. One accidental duplicate in
SEED_ENTRIESwill change resolver output with no signal, which is risky for a hand-maintained taxonomy seed.Proposed fix
+def _build_unique_index(entries: list[SeedEntry], attr: str) -> dict[str, SeedEntry]: + index: dict[str, SeedEntry] = {} + for entry in entries: + key = getattr(entry, attr) + if not key: + continue + if key in index: + raise ValueError(f"Duplicate seed {attr}: {key}") + index[key] = entry + return index + + SEED_ENTRIES: list[SeedEntry] = _build_etf_seed() + _GLOBAL_FUNDS # Index by ticker for O(1) hits (the common path). -_BY_TICKER: dict[str, SeedEntry] = {e.ticker: e for e in SEED_ENTRIES if e.ticker} -_BY_CNPJ: dict[str, SeedEntry] = {e.cnpj: e for e in SEED_ENTRIES if e.cnpj} +_BY_TICKER: dict[str, SeedEntry] = _build_unique_index(SEED_ENTRIES, "ticker") +_BY_CNPJ: dict[str, SeedEntry] = _build_unique_index(SEED_ENTRIES, "cnpj")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/findata/resolver/seed.py` around lines 146 - 150, The seed indexes in the resolver silently overwrite duplicate tickers/CNPJs when building _BY_TICKER and _BY_CNPJ from SEED_ENTRIES, so add an explicit duplicate check during initialization and fail fast if any curated identifier repeats. Update the logic around _build_etf_seed, _GLOBAL_FUNDS, and the _BY_TICKER/_BY_CNPJ construction so duplicates are detected before the dicts are created, and raise a clear error naming the repeated identifier and its SeedEntry source.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/MCP_SURFACE.md`:
- Line 58: The fenced block in the MCP_SURFACE documentation is missing a
language identifier and is triggering MD040; update that markdown fence to
include an appropriate language tag by locating the affected fenced block in the
docs content and adding the missing specifier so the Markdown linter passes.
In `@src/findata/api/mcp_app.py`:
- Around line 101-126: The resolve_asset_tool endpoint metadata is advertising
the wrong contract by including Intl/Internacional in the macro taxonomy. Update
the router summary and the resolve_asset_tool docstring to match the actual
resolver output and wording, keeping the taxonomy focused on the real macro
classes and moving any geography-related concept to exposure instead. Use the
resolve_asset_tool function and its response description as the source of truth
so the text clearly matches what resolve_asset returns.
In `@src/findata/api/routers/resolver.py`:
- Around line 27-33: Update the resolver route docstring in the resolver module
so the documented macro_class values no longer include Internacional. Keep the
description aligned with the current contract by reflecting that geography is
now represented via exposure, and make sure the text around the
resolver/classification behavior matches the symbols used by the API route (for
example, the resolver docstring and its macro_class field description).
In `@src/findata/resolver/__init__.py`:
- Around line 3-7: The public resolver docs still list Internacional as a macro
class, but the contract now uses geography only in exposure. Update the
documentation tied to resolve_asset in __init__.py so the macro taxonomy only
includes the actual macro_class values and describes Internacional as an
exposure attribute instead. Keep the rest of the resolver contract wording
aligned with the symbols resolve_asset and its returned classification fields.
In `@src/findata/resolver/engine.py`:
- Around line 440-475: The ticker-only fallback in the resolver engine is too
aggressive for ambiguous B3 suffixes and should not force every non-BDR symbol
into a specific asset class. Update the fallback logic in the ticker branch of
the resolver so only high-confidence cases like known BDR suffixes stay
classified, and ambiguous suffixes such as units or other non-share codes return
Indefinido to let the HITL/provider cascade decide. Keep the change localized
around the ticker suffix handling in the same resolver method and preserve the
existing confidence/notes pattern for truly safe classifications.
- Around line 223-248: The debenture classification in
classify()/resolve_asset() is setting tax-exempt fields too early on the
heuristic Lei 12.431 path. Update the logic around _infer_incentivada so that
deb["incentivada_1243"] and tax["isento"] are only populated when the basis is
explicit (provider-confirmed by ISIN), and leave them unset for the heuristic
issuer+IPCA branch. Keep the confidence handling and other classification fields
unchanged.
- Around line 557-565: The optional enrichment loop in resolve_asset() currently
lets a single provider exception abort the whole classification path. Wrap each
await provider(norm, result) call in per-provider exception handling inside the
provider loop, keep the last successful result unchanged on failure, and
continue to the next provider; preserve the existing early-stop check and the
cascade merge behavior in the resolve_asset logic.
- Around line 530-555: The deterministic offline classification path in
classify()/resolve_asset() ignores ISIN even though normalize() preserves it, so
ISIN-only inputs still resolve to Indefinido unless an external provider runs.
Update the core path to use norm.isin in the seed lookup and/or _rule_payload
logic, or remove isin from the public resolve_asset() surface if it is not meant
to be supported. Make the change in the classify and _rule_payload flow so
bare-ISIN assets are handled consistently without relying on providers.
In `@src/findata/resolver/models.py`:
- Around line 74-124: The resolver contract models currently allow unknown
fields to pass silently, which can hide typos and produce incomplete
classifications. Update IdentifierResolved, CvmInfo, DebentureInfo, TaxInfo, and
AssetClassification to use strict Pydantic config with extra fields forbidden,
ideally via a shared base or ConfigDict(extra="forbid"), so payloads passed into
DebentureInfo and TaxInfo fail validation on unexpected keys.
In `@src/findata/resolver/normalize.py`:
- Around line 61-68: The predicate helpers in Normalize are too dependent on
callers pre-folding inputs; update has_token() and name_contains() to normalize
their candidate arguments internally before matching against self.tokens and
self.name_folded. Keep the behavior of the Normalize class consistent by
applying the same folding/normalization logic used elsewhere in the resolver so
natural-string callers work correctly without silently missing matches.
In `@src/findata/resolver/seed.py`:
- Around line 154-168: The name-based lookup in lookup_seed is using raw
substring checks, so short markers like FIA can incorrectly match longer tokens
such as FIAGRO and return the wrong SeedEntry. Update the matching logic in
lookup_seed and/or the SeedEntry name_substrings handling to be token-aware or
boundary-aware for structural markers, while preserving the existing ticker/CNPJ
exact-match behavior and the fold-based normalization.
---
Outside diff comments:
In `@docs/MCP_SURFACE.md`:
- Around line 56-72: The section title and tool list are inconsistent because
the `findata_run_code` entry is still included under “The 25 curated tools,”
making the rendered count 26. Update the `MCP_SURFACE` tool catalog block by
either removing `findata_run_code` from this section and placing it elsewhere as
the opt-in code-mode tool, or renaming the heading to clearly indicate a 25+1
split so the count matches the listed tools.
---
Nitpick comments:
In `@pyproject.toml`:
- Around line 140-142: Move the Ruff suppression out of the pyproject.toml
file-wide ignore and localize it to the intentional dispatcher in
src/findata/resolver/engine.py. Keep the complexity exceptions only on the
specific function(s) implementing the classification cascade, using a
function-level noqa on the relevant dispatcher symbol instead of suppressing
C901, PLR0912, and PLR0911 for the entire module.
In `@src/findata/resolver/seed.py`:
- Around line 146-150: The seed indexes in the resolver silently overwrite
duplicate tickers/CNPJs when building _BY_TICKER and _BY_CNPJ from SEED_ENTRIES,
so add an explicit duplicate check during initialization and fail fast if any
curated identifier repeats. Update the logic around _build_etf_seed,
_GLOBAL_FUNDS, and the _BY_TICKER/_BY_CNPJ construction so duplicates are
detected before the dicts are created, and raise a clear error naming the
repeated identifier and its SeedEntry source.
In `@tests/test_mcp_surface.py`:
- Around line 22-23: The surface test currently only checks aggregate tool and
REST counts, so it can pass even if the new resolver is not actually wired in.
Update the assertions in test_mcp_surface.py to explicitly verify the resolver
surface by name, including resolve_asset and its related MCP/REST exposure,
alongside the existing totals. Use the existing EXPECTED_TOOLS,
EXPECTED_REST_OPERATIONS, and the test that enumerates the surface to pin the
new contract directly instead of relying on count-only coverage.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6929b1a2-994c-4b49-8a04-fb9f422ff799
📒 Files selected for processing (14)
CHANGELOG.mddocs/MCP_SURFACE.mdpyproject.tomlsrc/findata/api/app.pysrc/findata/api/mcp_app.pysrc/findata/api/routers/resolver.pysrc/findata/resolver/__init__.pysrc/findata/resolver/engine.pysrc/findata/resolver/models.pysrc/findata/resolver/normalize.pysrc/findata/resolver/seed.pysrc/findata/resolver/taxonomy.pytests/test_mcp_surface.pytests/test_resolver.py
…tion The flat incentivada_1243/isento bools cannot tell a structurally certain infra signal (explicit name / FI-Infra ETF) apart from a weak issuer+IPCA heuristic. Add two Literal status fields that carry that certainty: - DebentureInfo.lei_12431_status: confirmed | candidate | not_applicable | unknown - TaxInfo.isento_status: confirmed_exempt | candidate_exempt | confirmed_taxable | unknown Existing bool fields are kept unchanged (spec/test contract). Engine debenture rule, CRA/CRI and LCI/LCA statutory exemptions, and the IFRA11 seed now stamp these statuses; Tesouro-backed RF ETFs get not_applicable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A non-debenture ETF should not carry a stub DebentureInfo just to hold lei_12431_status=not_applicable; None is the honest 'no debenture facts' shape. Status not_applicable remains on actual non-infra debentures. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a machine-readable `signals` list to AssetClassification alongside the free-text `notes` and coarse `cascade`, so an auditor sees WHICH rule fired and WHAT concrete token/phrase/ticker matched. Each `_rule_payload` branch records its rule id + real matched evidence (via `_first_matching_token`/`_phrase` helpers); the curated-seed path synthesizes a `curated_seed` signal in `classify()` without mutating the frozen seed entry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cross-host CI reviewers (Gemini, CodeRabbit) on PR #33: - Gate the FIP/Multimercado/FIA keyword paths ("PARTICIPACOES", "MACRO", "ACOES"/"EQUITY") on a fund context so company/trade names ("Randon Participações SA", "Macro Atacadista") are not misclassified as funds. - Derive the Tesouro subclasse from the bond code when the name lacks the index word: NTN-B → Indexada à Inflação, LFT → Pós-fixada, LTN/NTN-F → Prefixada; unmapped public bonds → "Título Público", never "Crédito Privado". - Drop the redundant second fold() in lookup_seed (input is pre-folded). - Fix doc drift: macro_class no longer lists "Internacional" (it is the exposure axis) in the REST router, MCP tool summary/docstring, package docstring, and MCP_SURFACE.md; add a language to the fenced block. 6 new regression tests. 272 → 278 passed, 15 deselected. ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…deliverable) Add docs/RESOLVER.md: the MCP/REST/Python contract the consolidator calls — input, full output schema with the two-axis model (asset class + exposure), fiscal-certainty status, cascade, verified test-set table, and the pre-prod pending list. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- models: strict ConfigDict(extra="forbid") base for the contract models, so a
typo in an internally-built DebentureInfo/TaxInfo payload fails loudly instead
of silently dropping a field.
- engine: per-provider try/except in the cascade — a raising provider logs
provider_error:<Type> on the cascade and continues, never nuking the
deterministic core result.
- engine: tighten the ticker fallback — only suffixes 3-8 map to Ações; other
suffixes (subscription rights / odd codes) defer to Indefinido/HITL. Bare ...11
stays FII (spec test set: HGLG11/MXRF11).
- seed: token-aware name-marker matching, so ("ARBOR","FIA") matches
"ARBOR FIC FIA" but not "ARBOR FIAGRO" (FIA not a token of FIAGRO).
- normalize: fold candidate args inside has_token/name_contains so natural-string
callers cannot silently miss rules.
- tests: regressions for each of the above.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
O que
resolve_asset(identifier)— resolver determinístico que mapeia qualquer identificador de ativo brasileiro (ticker/CNPJ/ISIN/nome) para a taxonomia de consolidação (cliente: Wealthuman). Exposto como biblioteca (findata.resolver.resolve_asset()), REST (GET /resolver/resolve) e tool MCPresolve_asset.📄 Contrato/entrega para o demandante:
docs/RESOLVER.md.Dois eixos (decisão de modelo)
macro_class= classe de ativo pura: RF / RV / Multimercado / Alternativos / Estruturados (+ Indefinido).exposure= geografia ortogonal: Brasil / Internacional. B3 = domicílio do ativo, não a exposição → IVVB11, BDR, FIA global = RV + Internacional.Campos de auditoria
lei_12431_status(confirmed/candidate/not_applicable/unknown) +isento_status. Heurística Petrobras →candidate(confirmar ISIN); IFRA11/infra explícito →confirmed; CRA/CRI/LCI-LCA →confirmed_exempt.signals[]: trilha estruturada — qual regra disparou + evidência concreta (ex.:debenture / DEB / basis=heuristic;indexador=IPCA+).source+confidence+as_of(America/Sao_Paulo) +cascade.Cascata
Núcleo
openfindata(curated → rules) resolve o test set offline; providers externos (Mais Retorno MCP / CVM-B3 / web search restrito) são ponto de extensão injetável (AssetProvider), consultado só quando o resultado é fraco. No estado atual só o degrau 1 está ligado (externos = stubs a conectar no deploy).Test set (passa 100%, offline)
Reviews aplicadas
candidate+ confidence 0.7 abaixo do short-circuit;IE/LC/LFcrus, substringLCI, seedARBOR, ETF RF underlying,as_offuso → corrigidos.PARTICIPACOES/MACRO/ACOESem contexto de fundo (evita nomes de empresa), subclasse de Tesouro pelo código do título (NTN-B → inflação), double-fold, e doc drift (macro_class não lista mais "Internacional").Validação
Commits
feat(resolver)— classificador base + taxonomia + cascatafeat(resolver)— eixo de certeza fiscal (lei_12431_status / isento_status)fix(resolver)— debenture=None pra ETF de Tesourofeat(resolver)— trilha estruturada signals[]fix(resolver)— review-bot findings (collisions, Tesouro subclasse) + doc driftdocs(resolver)— contrato client-facing (docs/RESOLVER.md)Follow-ups restantes (pós-merge)
openfindata-mcp-spec.md) no repo do cliente — handoff já preparado🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Documentation