fix: detect upper-accented+currency mojibake at string start (fixes #222) by gaoflow · Pull Request #232 · rspeer/python-ftfy

gaoflow · 2026-06-24T10:25:26Z

The badness heuristic in is_bad() misses UTF-8 mojibake when an upper-accented letter followed by a currency symbol appears at the very beginning of a string. An existing pattern (\s [{upper_accented}] [{currency}]) already catches this when preceded by whitespace, but the start-of-string case was missing.

For example, fix_encoding("Ã¥klagarmyndighets") returns "Ã¥klagarmyndighets" unchanged instead of correctly decoding to "åklagarmyndighets".
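To make the example concrete, here is a minimal sketch of how this particular mojibake arises and what a correct repair looks like. It uses only the standard library and does not call ftfy. It assumes the text was mis-decoded as Latin-1 (Windows-1252 maps these two bytes to the same characters).

```python
# "å" (U+00E5) is encoded in UTF-8 as the two bytes 0xC3 0xA5.
original = "åklagarmyndighets"

# Decoding those UTF-8 bytes as Latin-1 turns 0xC3 into "Ã" and 0xA5 into "¥",
# which is exactly the upper-accented + currency pair described above.
mojibake = original.encode("utf-8").decode("latin-1")

# The repair is the inverse round trip: re-encode as Latin-1, decode as UTF-8.
restored = mojibake.encode("latin-1").decode("utf-8")

print(mojibake)
print(restored)
```

The repair itself is mechanical. The hard part, and the part this PR changes, is deciding whether the input looks bad enough to attempt it.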

Root cause: No BADNESS_RE pattern matches [{upper_accented}][{currency}] at position 0 without a preceding space or lowercase letter.

Fix: Add ^[{upper_accented}][{currency}]\w to detect this sequence at string start. The trailing \w is required so the pattern does not match the isolated 2-character substring that decode_inconsistent_utf8 passes to is_bad(), preventing false positives on ambiguous embedded sequences (like the "DrÃ¥ber" negative test case from issue #202).
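The effect of the trailing `\w` can be seen in a reduced sketch. This is not ftfy's actual code: `UPPER_ACCENTED`, `CURRENCY`, `AFTER_SPACE`, `AT_START`, and `looks_bad` are hypothetical stand-ins, and the character classes are cut down to a handful of characters for illustration.

```python
import re

# Hypothetical, heavily reduced stand-ins for ftfy's character categories.
UPPER_ACCENTED = "ÂÃ"
CURRENCY = "¢£¤¥"

# Analogue of the existing pattern: the pair preceded by whitespace.
AFTER_SPACE = re.compile(rf"\s[{UPPER_ACCENTED}][{CURRENCY}]")

# Analogue of the new pattern: the pair at position 0, followed by a word character.
AT_START = re.compile(rf"^[{UPPER_ACCENTED}][{CURRENCY}]\w")


def looks_bad(text: str) -> bool:
    """Return True if either simplified pattern matches (illustration only)."""
    return bool(AFTER_SPACE.search(text) or AT_START.search(text))


print(looks_bad("Ã¥klagarmyndighets"))  # pair at string start, followed by "k"
print(looks_bad("en Ã¥klagare"))        # pair preceded by a space
print(looks_bad("Ã¥"))                  # isolated 2-character substring
print(looks_bad("DrÃ¥ber"))             # pair embedded after a lowercase letter
```

In this sketch the first two inputs are flagged and the last two are not. The isolated `"Ã¥"` is skipped because nothing follows the pair for `\w` to match, and `"DrÃ¥ber"` is skipped because the pair is neither at position 0 nor preceded by whitespace.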

Changes:

- `ftfy/badness.py`: Add new BADNESS_RE alternation (4 lines)
- `tests/test-cases/synthetic.json`: Add test case for the reported bug

When an upper-accented letter (such as Ã) is followed by a currency symbol (such as ¥, YEN SIGN) at the very beginning of a string, the badness heuristic failed to detect it as mojibake. An existing pattern (`\s [{upper_accented}] [{currency}]`) already caught this case when a preceding whitespace was present, but the start-of-string case was missing.

Add a BADNESS_RE pattern `^[{upper_accented}][{currency}]\w` that matches this sequence at position 0 only when followed by a word character. The trailing `\w` ensures the pattern does not match the isolated 2-character substring that `decode_inconsistent_utf8` passes to `is_bad()` during processing of other text, preventing false positives on ambiguous embedded sequences like "DrÃ¥ber".

Fixes rspeer#222.

gaoflow wants to merge 1 commit into rspeer:main from gaoflow:main
