Commit 603440e

Author: Leif Arne Storset
Replace invalid characters with U+FFFD (fixes #96)
1 parent f5fd711 commit 603440e

3 files changed: 4 additions, 0 deletions

AUTHORS.rst

Lines changed: 1 addition & 0 deletions
@@ -32,3 +32,4 @@ Patches and suggestions
 - Juan Carlos Garcia Segovia
 - Mike West
 - Marc DM
+- Leif Arne Storset

CHANGES.rst

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,8 @@ Change Log
 Released on XXX, 2014

 * XXX
+* Fix #96: replace invalid characters from "Preprocessing the input stream" with
+  U+FFFD, preventing problems in lxml.


 0.999

html5lib/inputstream.py

Lines changed: 1 addition & 0 deletions
@@ -270,6 +270,7 @@ def readChunk(self, chunkSize=None):
         # Replace invalid characters
         # Note U+0000 is dealt with in the tokenizer
         data = self.replaceCharactersRegexp.sub("\ufffd", data)
+        data = invalid_unicode_re.sub("\ufffd", data)

         data = data.replace("\r\n", "\n")
         data = data.replace("\r", "\n")
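
The added line can be illustrated with a small standalone sketch. It assumes a simplified pattern, here called `INVALID_CHARS_RE` (a hypothetical name), that covers only the control characters and BMP noncharacters listed in "Preprocessing the input stream". It is not html5lib's actual `invalid_unicode_re`, whose definition is not part of this diff, and `replace_invalid` is likewise a hypothetical helper.

```python
import re

# Hypothetical, simplified pattern (an assumption for illustration only,
# not html5lib's invalid_unicode_re): control characters other than
# tab, LF, FF and CR, plus the BMP noncharacters.
# U+0000 is deliberately excluded, mirroring the comment in the diff
# that it is dealt with in the tokenizer.
INVALID_CHARS_RE = re.compile(
    "[\u0001-\u0008\u000b\u000e-\u001f\u007f-\u009f\ufdd0-\ufdef\ufffe\uffff]"
)


def replace_invalid(data):
    """Return data with each matched character replaced by U+FFFD."""
    return INVALID_CHARS_RE.sub("\ufffd", data)


if __name__ == "__main__":
    cleaned = replace_invalid("a\x01b\ufdd0c\n")
    # Show the code points so the replacements are visible.
    print([hex(ord(ch)) for ch in cleaned])
```

Each offending character is replaced one for one, so the length of the chunk does not change, and ordinary whitespace such as tab and newline passes through untouched.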
