Commit 1982bba

20250728_00 - Release
- Switched from Unidecode to ftfy: Replaced aggressive Unicode-to-ASCII conversion with intelligent text fixing
- Preserves Extended ASCII: Now correctly preserves 8-bit extended ASCII characters (128-255) like é, ñ, ü, etc.
- Smarter Unicode Handling: Only converts problematic Unicode characters while preserving intentional extended ASCII usage
- Updated Dependencies: Replaced Unidecode dependency with ftfy in requirements.txt
- Maintains AI Artifact Removal: Still removes smart quotes, EM/EN dashes, and other "AI tells" as designed
1 parent 4e31d08 commit 1982bba

3 files changed: +28 -11 lines changed

CHANGELOG.md

Lines changed: 12 additions & 2 deletions
@@ -1,8 +1,18 @@
 # Changelog for UnicodeFix
 
+## 2025-07-28
+
+### **Extended ASCII Preservation Fix**
+
+- **Switched from Unidecode to ftfy:** Replaced aggressive Unicode-to-ASCII conversion with intelligent text fixing
+- **Preserves Extended ASCII:** Now correctly preserves 8-bit extended ASCII characters (128-255) like é, ñ, ü, etc.
+- **Smarter Unicode Handling:** Only converts problematic Unicode characters while preserving intentional extended ASCII usage
+- **Updated Dependencies:** Replaced `Unidecode` dependency with `ftfy` in requirements.txt
+- **Maintains AI Artifact Removal:** Still removes smart quotes, EM/EN dashes, and other "AI tells" as designed
+
 ## 2025-07-23
 
-**Test Suite Fixes & Validation**
+### **Test Suite Fixes & Validation**
 
 - **Fixed Default Scenario:** Corrected test script to properly handle default behavior (creates `.clean.ext` files) without using `-o` flag that caused errors with multiple files
 - **Cascading File Prevention:** Added filtering to prevent processing already-cleaned `.clean.ext` files in subsequent test runs
@@ -19,7 +29,7 @@
 
 ## 2025-07-22
 
-**Major Release - "Enough of Your AI Nonsense" Edition**
+### **Major Release - "Enough of Your AI Nonsense" Edition**
 
 - **CLI Supercharged:** Added new power flags:
   `-i` / `--invisible` (preserve zero-width/invisible Unicode)
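
The 2025-07-28 changelog entries above describe the intended behavior: typographic "AI tells" are replaced while characters such as é, ñ, and ü pass through untouched. Below is a minimal standard-library sketch of that idea, not the project's actual implementation. The helper name `strip_ai_tells` is hypothetical, the quote mappings are taken from the `replacements` dict visible in the `bin/cleanup-text.py` diff, and the dash-to-hyphen mappings are an assumption.

```python
# Hypothetical illustration; not the code shipped in bin/cleanup-text.py.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",  # smart single quotes (from the diff)
    "\u201C": '"', "\u201D": '"',  # smart double quotes (from the diff)
    "\u2013": "-", "\u2014": "-",  # EN and EM dashes (assumed mapping)
}

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TABLE = str.maketrans(REPLACEMENTS)


def strip_ai_tells(text: str) -> str:
    """Replace typographic punctuation and leave every other character alone."""
    return text.translate(_TABLE)


sample = "\u201cCaf\u00e9\u201d \u2014 se\u00f1or"
print(strip_ai_tells(sample))  # the accented letters survive unchanged
```

Because only the listed code points are translated, accented Latin-1 letters are never touched. That is the property the changelog attributes to the move away from Unidecode.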

bin/cleanup-text.py

Lines changed: 15 additions & 8 deletions
@@ -22,13 +22,13 @@
 import re
 import sys
 
-# Check for unidecode dependency early, with a clear message if missing
+# Check for ftfy dependency early, with a clear message if missing
 try:
-    from unidecode import unidecode # noqa: F401
+    import ftfy
 except ImportError:
     print(
-        "[✗] Missing dependency: 'Unidecode'. Please install it with:\n"
-        " pip install Unidecode\n"
+        "[✗] Missing dependency: 'ftfy'. Please install it with:\n"
+        " pip install ftfy\n"
         "Or install all requirements with:\n"
         " pip install -r requirements.txt",
         file=sys.stderr
@@ -63,6 +63,10 @@ def clean_text(text: str, preserve_invisible: bool = False) -> str:
     Returns:
         str: The cleaned text with normalized ASCII characters
     """
+    # Use ftfy for intelligent text fixing and normalization
+    text = ftfy.fix_text(text)
+
+    # Handle specific cases that unidecode might not handle perfectly
     replacements = {
         '\u2018': "'", '\u2019': "'", # Smart single quotes
         '\u201C': '"', '\u201D': '"', # Smart double quotes
@@ -153,11 +157,14 @@ def main():
         # No files provided: filter mode (STDIN to STDOUT)
         raw = sys.stdin.read()
         cleaned = clean_text(raw, preserve_invisible=args.invisible)
-        # Add or suppress newline at EOF based on -n/--no-newline
+
+        # Handle newline at EOF based on -n/--no-newline
         if not args.no_newline:
-            cleaned = ensure_single_newline(cleaned)
-        else:
-            cleaned = cleaned.rstrip('\r\n')
+            # Only add newline if there isn't one already
+            if not cleaned.endswith('\n'):
+                cleaned += '\n'
+        # If --no-newline is specified, leave the file exactly as is (no changes to newlines)
+
         sys.stdout.write(cleaned)
         return
 
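
The last hunk above changes how filter mode treats the end of the output. By default a newline is appended only when one is missing. With `--no-newline` the text is now written unchanged, whereas the removed code stripped trailing `\r\n` characters. A minimal sketch of the new logic follows, wrapped in a function so it can be exercised on its own. The name `finalize_newline` and its `no_newline` parameter are hypothetical stand-ins for the inline code and `args.no_newline`.

```python
def finalize_newline(cleaned: str, no_newline: bool = False) -> str:
    """Mirror the newline handling added in this commit (hypothetical wrapper)."""
    if not no_newline:
        # Only add a newline if there isn't one already
        if not cleaned.endswith("\n"):
            cleaned += "\n"
    # With no_newline set, the text is returned exactly as received
    return cleaned


print(repr(finalize_newline("abc")))          # 'abc\n'
print(repr(finalize_newline("abc\n\n")))      # 'abc\n\n'
print(repr(finalize_newline("abc\n", True)))  # 'abc\n'
```

In this sketch, input that already ends in several newlines keeps all of them, because the check only tests whether the final character is `\n`.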

requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-Unidecode
+ftfy
