From f2d780edd52d6a83c6222df7eebfaa1dd54b55b1 Mon Sep 17 00:00:00 2001
From: Mikhail Korobov
Date: Thu, 3 Oct 2013 04:25:56 +0600
Subject: [PATCH 1/2] scrapely.tool: add support for non-ascii and arguments
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

and arguments are parsed by the parse_criteria function (it uses
shlex and optparse for parsing). The data passed to parse_criteria is
extracted from the "line" argument of the do_<…> methods. This "line"
argument is read from self.stdin by cmd.Cmd and passed to the do_
methods.

In Python 2.x sys.stdin (which is the default for cmd.Cmd.stdin) is
binary, so "line" is a bytestring; its encoding is self.stdin.encoding.
That's why and argument values were previously bytestrings; when passed
to other scrapely functions they eventually got implicitly decoded using
sys.getdefaultencoding() - this usually leads to UnicodeDecodeError if
the input text is non-ascii.

The fix is to decode these arguments using self.stdin.encoding before
passing them to scrapely. This is done after the shlex call because
shlex doesn't support unicode.

Non-ascii "field" arguments are still unsupported.
---
 scrapely/tool.py | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/scrapely/tool.py b/scrapely/tool.py
index 172c695..6a5b23e 100644
--- a/scrapely/tool.py
+++ b/scrapely/tool.py
@@ -43,7 +43,7 @@ def do_t(self, line):
         """t