fix: detect upper-accented+currency mojibake at string start (fixes #222)#232

Open
gaoflow wants to merge 1 commit into rspeer:main from gaoflow:main

@gaoflow gaoflow commented Jun 24, 2026

The badness heuristic in is_bad() misses UTF-8 mojibake when an upper-accented letter followed by a currency symbol appears at the very beginning of a string. An existing pattern (\s [{upper_accented}] [{currency}]) already catches this when preceded by whitespace, but the start-of-string case was missing.

For example, fix_encoding("Ã¥klagarmyndighets") returns "Ã¥klagarmyndighets" unchanged instead of correctly decoding to "åklagarmyndighets".
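
To make the example concrete, here is a minimal sketch of how this particular mojibake arises, using only the Python standard library. It assumes the text was encoded as UTF-8 and then wrongly decoded as Latin-1; it does not call ftfy itself, and the variable names are illustrative.

```python
# Sketch of the encoding mix-up behind the example above (standard library only).
original = "åklagarmyndighets"

# "å" is U+00E5, which UTF-8 encodes as the two bytes C3 A5.
# Decoding those bytes as Latin-1 yields "Ã" (C3) followed by "¥" (A5, YEN SIGN).
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the mistake recovers the intended text; this is the result
# fix_encoding() is expected to produce for this input.
restored = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(restored)
```

The garbled string starts with an upper-accented letter followed directly by a currency symbol, which is exactly the sequence the heuristic needs to recognize at position 0.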

Root cause: The existing BADNESS_RE patterns for [{upper_accented}][{currency}] require a preceding space or lowercase letter, so none of them can match the sequence at position 0.

Fix: Add ^[{upper_accented}][{currency}]\w to detect this sequence at string start. The trailing \w is required so the pattern does not match the isolated 2-character substring that decode_inconsistent_utf8 passes to is_bad(), preventing false positives on ambiguous embedded sequences (like the "Dråber" negative test case from issue #202).
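
The behavior of the two patterns can be illustrated in isolation. The sketch below is not ftfy's code: `UPPER_ACCENTED` and `CURRENCY` are small hypothetical stand-ins for ftfy's much larger character categories, and the helper `flags` is invented for illustration. Only the shape of the two alternations comes from this PR.

```python
import re

# Hypothetical, heavily reduced stand-ins for ftfy's character categories.
UPPER_ACCENTED = "ÀÁÂÃÄÅ"
CURRENCY = "¢£¤¥"

# Shape of the existing alternation: needs whitespace before the pair.
after_space = re.compile(rf"\s[{UPPER_ACCENTED}][{CURRENCY}]")

# Shape of the proposed alternation: anchored at string start, and the
# pair must be followed by a word character.
at_start = re.compile(rf"^[{UPPER_ACCENTED}][{CURRENCY}]\w")


def flags(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern finds a match anywhere in the text."""
    return pattern.search(text) is not None


# The reported input is missed by the whitespace pattern but caught by the new one.
print(flags(after_space, "Ã¥klagarmyndighets"))
print(flags(at_start, "Ã¥klagarmyndighets"))

# The isolated 2-character substring is not flagged, thanks to the trailing \w.
print(flags(at_start, "Ã¥"))
```

Without the trailing `\w`, the anchored pattern would also match the bare pair `"Ã¥"`, which is the case the PR wants to leave alone.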

Changes:

  • ftfy/badness.py: Add new BADNESS_RE alternation (4 lines)
  • tests/test-cases/synthetic.json: Add test case for the reported bug

When an upper-accented letter (such as Ã) is followed by a currency
symbol (such as ¥, YEN SIGN) at the very beginning of a string, the
badness heuristic failed to detect it as mojibake.  An existing pattern
(`\s [{upper_accented}] [{currency}]`) already caught this case when
the sequence was preceded by whitespace, but the start-of-string case
was missing.

Add a BADNESS_RE pattern `^[{upper_accented}][{currency}]\w` that
matches this sequence at position 0 only when followed by a word
character.  The trailing `\w` ensures the pattern does not match the
isolated 2-character substring that `decode_inconsistent_utf8` passes
to `is_bad()` during processing of other text, preventing false
positives on ambiguous embedded sequences like "Dråber".

Fixes rspeer#222.