-
Notifications
You must be signed in to change notification settings - Fork 119
Fix incorrectly formatted description property #119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
7e952b0
ee4da8b
2a7ae11
f380f2a
e463414
92d0646
4e2dbf9
48bb049
c02c349
86f47d2
c8dc164
f902dad
75686f1
0f5a632
c2b59b1
1e990b3
17c2982
670702e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,11 +18,32 @@ | |
| from urllib.parse import urljoin | ||
|
|
||
| import lxml.etree | ||
| from lxml.html.clean import Cleaner | ||
| from w3lib.html import strip_html5_whitespace | ||
| import html_text | ||
|
|
||
| from extruct.utils import parse_html | ||
|
|
||
|
|
||
| # Cleaner which is similar to html_text cleaner, but is less aggressive | ||
| cleaner = Cleaner( | ||
| scripts=True, | ||
| javascript=False, # onclick attributes are fine | ||
| comments=True, | ||
| style=True, | ||
| links=True, | ||
| meta=True, | ||
| page_structure=False, # <title> may be nice to have | ||
| processing_instructions=True, | ||
| embedded=False, # keep embedded content | ||
| frames=False, # keep frames | ||
| forms=False, # keep forms | ||
| annoying_tags=False, | ||
| remove_unknown_tags=False, | ||
| safe_attrs_only=False, | ||
| ) | ||
|
|
||
|
|
||
| class LxmlMicrodataExtractor(object): | ||
| _xp_item = lxml.etree.XPath('descendant-or-self::*[@itemscope]') | ||
| _xp_prop = lxml.etree.XPath("""set:difference(.//*[@itemprop], | ||
|
|
@@ -182,7 +203,8 @@ def _extract_property_value(self, node, items_seen, base_url, force=False): | |
| return self._extract_textContent(node) | ||
|
|
||
| def _extract_textContent(self, node): | ||
| return u"".join(self._xp_clean_text(node)).strip() | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hey @lopuhin! Didn't notice that. Sure, will remove it 👍 |
||
| clean_node = cleaner.clean_html(node) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hey! I'm concerned about performance implications of this; we're copying & cleaning a tree many times for each page. Could you please run it on a large sample of pages, to see how bad is an impact?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Indeed, it makes sense to do so - will check that 👍 |
||
| return html_text.etree_to_text(clean_node) | ||
|
|
||
|
|
||
| MicrodataExtractor = LxmlMicrodataExtractor | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -7,5 +7,6 @@ requests | |
| rdflib | ||
| rdflib-jsonld | ||
| mf2py>=1.1.0 | ||
| six | ||
| six>=1.11 | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jakubwasikowski why did you updated six version? Maybe this is useful for @croqaz in #120
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I did exactly the same thing, but updated to 1.12 :D
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, forgot to mention it in description of PR. I had to select minimal version because six in version 1.10.0 causes errors for python3.4 when installing.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So when I added six>=1.12 it still crashes for Python 3.4.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For 1.11 it works, but for 1.12 it doesn't? Weird 🤦♂️ Especially taking into account that 1.12 is the latest version. |
||
| w3lib | ||
| html-text | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I see, the difference between this cleaner and the html_text one is
embedded=Falseandframes=False. It would be nice to include this in the comment and the reasoning about why. I imagine the reason is that we want to include frames and iframes content as well, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ivanprado! This is because in the previous version the only removed tags were script and style, so probably removing the other ones will be too strict. I can imagine a situation with
<embed>like this:And similar thing applies to frames.