scrapely.tool: add support for non-ascii <text> and <data> arguments
<text> and <data> arguments are parsed by the parse_criteria function
(it uses shlex and optparse for parsing). The data passed to parse_criteria
is extracted from the "line" argument of the do_<…> methods.
This "line" argument is read from self.stdin by cmd.Cmd and
passed to do_ methods. In Python 2.x sys.stdin (which is
default for cmd.Cmd.stdin) is binary, so "line" is a bytestring;
its encoding is self.stdin.encoding. That's why <text> and <data>
argument values were previously bytestrings; when passed to
other scrapely functions they were eventually decoded implicitly
using sys.getdefaultencoding(), which usually raises
UnicodeDecodeError if the input text is non-ascii.
The fix is to decode these arguments using self.stdin.encoding
before passing them to scrapely. This is done after shlex call
because shlex doesn't support unicode. Non-ascii "field" arguments
are still unsupported.
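
The decoding step described above could look roughly like the sketch below. It is only an illustration, not the code from this commit: the helper name decode_args is hypothetical, and the fallback to "utf-8" when the stream reports no encoding is an assumption added here.

```python
import sys


def decode_args(args, encoding=None):
    """Decode bytestring arguments to text; leave text values untouched.

    Hypothetical helper: in the scenario above it would run on the
    tokens produced by shlex, i.e. after the split, not before it.
    """
    # Assumption: fall back to utf-8 if the stream has no known encoding.
    encoding = encoding or getattr(sys.stdin, "encoding", None) or "utf-8"
    return [a.decode(encoding) if isinstance(a, bytes) else a for a in args]


# Bytestring tokens as they might arrive from a binary stdin.
raw = [u"caf\u00e9".encode("utf-8"), u"plain".encode("utf-8")]
decoded = decode_args(raw, "utf-8")
```

Decoding each token explicitly with a known encoding avoids the implicit decode via sys.getdefaultencoding() that caused the UnicodeDecodeError.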