Commit 69a11c5

Updated script and README.md
1 parent c051916 commit 69a11c5

2 files changed, 19 additions and 2 deletions

README.md

Lines changed: 9 additions & 0 deletions
@@ -40,6 +40,15 @@ options:
 python bin/cleanup-text.py <input_file>
 ```

+## What's in this repo:
+
+- [bin/cleanup-text.py](bin/cleanup-text.py) - The script that cleans up the text.
+- [setup.sh](setup.sh) - A script that sets up the environment to run the script.
+- [LICENSE](LICENSE) - The license for the project.
+- [README.md](README.md) - This file.
+- [requirements.txt](requirements.txt) - The dependencies for the project.
+- [data/](data/) - A directory with sample files full of unicode to test with.
+
 ## Coming SOon
 - macSO Shortcut

bin/cleanup-text.py

Lines changed: 10 additions & 2 deletions
@@ -1,10 +1,16 @@
 #!/usr/bin/env python3
-import sys
+#
 import argparse
 import re
+import sys
+
 from unidecode import unidecode

+
 def clean_text(text):
+    """
+    Clean Unicode quirks from text.
+    """
     replacements = {
         '\u2018': "'", '\u2019': "'",
         '\u201C': '"', '\u201D': '"',
@@ -15,6 +21,7 @@ def clean_text(text):
     text = re.sub(r'[\u200B\u200C\u200D\uFEFF]', '', text)
     return unidecode(text)

+
 def main():
     parser = argparse.ArgumentParser(description="Clean Unicode quirks from text.")
     parser.add_argument('infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin,
@@ -27,5 +34,6 @@ def main():
     cleaned = clean_text(input_text)
     args.output.write(cleaned)

+
 if __name__ == '__main__':
-    main()
+    main()
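
For readers who want to see what the visible parts of `clean_text` do, here is a minimal sketch, not the script itself. It assumes only the four quote replacements shown in the hunk (the rest of the table is elided in the diff), applies them with `str.translate` (an assumption, since the diff does not show how the table is applied), and leaves out the final `unidecode()` pass so it runs with the standard library alone. The name `clean_text_sketch` is hypothetical.

```python
import re

# Only the four replacements visible in the diff; the full table is elided there.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201C": '"', "\u201D": '"',   # curly double quotes
}

# Same character class as the re.sub call in the diff: zero-width space,
# zero-width non-joiner, zero-width joiner, and the byte order mark.
ZERO_WIDTH = re.compile(r"[\u200B\u200C\u200D\uFEFF]")


def clean_text_sketch(text: str) -> str:
    """Approximate clean_text without the final unidecode() pass."""
    # Assumption: the replacement table is applied character by character.
    text = text.translate(str.maketrans(REPLACEMENTS))
    # Drop invisible zero-width characters entirely.
    return ZERO_WIDTH.sub("", text)


if __name__ == "__main__":
    sample = "\u201CIt\u2019s\u200B fine\u201D\uFEFF"
    print(clean_text_sketch(sample))
```

The real function additionally passes the result through `unidecode`, which transliterates any remaining non-ASCII characters, so its output can differ from this sketch on text that contains other Unicode characters.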
