Description
In html5lib/inputstream.py, unicode_literals is imported from __future__. This causes html5lib.inputstream.BufferedStream to misbehave, specifically in the _readFromBuffer method, which ends with return "".join(rv). Because "" is a unicode literal there, any read after the first returns a chunk of unicode instead of a chunk of bytes.
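The type change can be seen in isolation with a minimal sketch that does not touch html5lib at all. The `chunks` list and the `join_with_text_literal` helper are made up for illustration. On Python 2 the text-literal join silently returns unicode; on Python 3 the same join raises TypeError, which the helper turns into None so the snippet runs on both.

```python
from __future__ import unicode_literals

# Hypothetical byte chunks, standing in for what a binary stream buffers.
chunks = [b"<html>", b"<head>"]

# Joining with a bytes separator keeps the result as bytes.
joined_bytes = b"".join(chunks)


def join_with_text_literal(parts):
    """Join with "", which is a unicode literal under unicode_literals."""
    try:
        # Python 2: implicitly decodes the chunks and returns unicode.
        return "".join(parts)
    except TypeError:
        # Python 3: refuses to join bytes with a str separator.
        return None


joined_text = join_with_text_literal(chunks)

print(type(joined_bytes))
print(type(joined_text))
```

Either way, the result of the text-literal join is not bytes, which is what the assert later in the stream code checks for.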
An example of the problem caused:
from urllib2 import Request, urlopen
from html5lib.inputstream import HTMLBinaryInputStream
req = Request(url='http://example.org/')
source = urlopen(req)
HTMLBinaryInputStream(source)
Causing:
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File ".../html5lib/inputstream.py", line 411, in __init__
    self.charEncoding = self.detectEncoding(parseMeta, chardet)
  File ".../html5lib/inputstream.py", line 448, in detectEncoding
    encoding = self.detectEncodingMeta()
  File ".../html5lib/inputstream.py", line 535, in detectEncodingMeta
    assert isinstance(buffer, bytes)
AssertionError
(That is, when HTMLBinaryInputStream is given a file-like object, such as the result of urllib2.urlopen, it wraps it in a BufferedStream, which then fails at line 535 on assert isinstance(buffer, bytes).)
This can be fixed by using a byte literal in _readFromBuffer instead, i.e. return b"".join(rv). (String literals are used like this in at least three places in inputstream.py: lines 117, 318 and 348.)
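As a rough sketch of the idea, here is a much-simplified buffering wrapper that joins its chunks with a byte literal. `TinyBufferedStream` is a hypothetical stand-in written for this report, not html5lib's actual BufferedStream, and io.BytesIO stands in for the urlopen response.

```python
from __future__ import unicode_literals

import io


class TinyBufferedStream(object):
    """Hypothetical, simplified wrapper that remembers what it has read."""

    def __init__(self, stream):
        self.stream = stream
        self.buffer = []    # byte chunks read from the underlying stream
        self.position = 0   # current offset into the buffered data

    def seek(self, pos):
        self.position = pos

    def read(self, size):
        # b"" keeps the joined buffer as bytes even with unicode_literals.
        buffered = b"".join(self.buffer)
        missing = self.position + size - len(buffered)
        if missing > 0:
            data = self.stream.read(missing)
            if data:
                self.buffer.append(data)
            buffered = b"".join(self.buffer)
        result = buffered[self.position:self.position + size]
        self.position += len(result)
        return result


wrapped = TinyBufferedStream(io.BytesIO(b"<html><head>"))
first = wrapped.read(6)    # comes straight from the underlying stream
wrapped.seek(0)
second = wrapped.read(12)  # partly served from the buffer

print(type(first), type(second))
```

With the byte literal, reads that are served from the buffer have the same type as reads that come directly from the underlying stream.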