[wwwsearch-commits] r44313 - in wwwsearch/mechanize/trunk: mechanize test
jjlee at codespeak.net
Sat Jun 16 23:49:03 CEST 2007
Author: jjlee
Date: Sat Jun 16 23:48:59 2007
New Revision: 44313
Modified:
wwwsearch/mechanize/trunk/mechanize/_html.py
wwwsearch/mechanize/trunk/test/test_html.doctest
Log:
Fix BeautifulSoup RobustLinksFactory (hence RobustFactory) link text parsing for the case where the link text contains tags (titus at idyll.org)
Modified: wwwsearch/mechanize/trunk/mechanize/_html.py
==============================================================================
--- wwwsearch/mechanize/trunk/mechanize/_html.py (original)
+++ wwwsearch/mechanize/trunk/mechanize/_html.py Sat Jun 16 23:48:59 2007
@@ -379,15 +379,15 @@
 if not url:
     continue
 url = _rfc3986.clean_url(url, encoding)
-text = link.firstText(lambda t: True)
-if text is _beautifulsoup.Null:
+text = link.fetchText(lambda t: True)
+if not text:
     # follow _pullparser's weird behaviour rigidly
     if link.name == "a":
         text = ""
     else:
         text = None
 else:
-    text = self.compress_re.sub(" ", text.strip())
+    text = self.compress_re.sub(" ", " ".join(text).strip())
 yield Link(base_url, url, text, link.name, attrs)
Modified: wwwsearch/mechanize/trunk/test/test_html.doctest
==============================================================================
--- wwwsearch/mechanize/trunk/test/test_html.doctest (original)
+++ wwwsearch/mechanize/trunk/test/test_html.doctest Sat Jun 16 23:48:59 2007
@@ -161,3 +161,55 @@
Traceback (most recent call last):
...
StopIteration
+
+
+Link text parsing
+
+>>> def get_first_link_text_bs(html):
+...     factory = RobustLinksFactory()
+...     soup = MechanizeBs("utf-8", html)
+...     factory.set_soup(soup, "http://example.com/", "utf-8")
+...     return list(factory.links())[0].text
+
+>>> def get_first_link_text_sgmllib(html):
+...     factory = LinksFactory()
+...     response = test_html_response(html)
+...     factory.set_response(response, "http://example.com/", "utf-8")
+...     return list(factory.links())[0].text
+
+Whitespace gets compressed down to single spaces. Tags are removed.
+
+>>> html = ("""\
+... <html><head><title>Title</title></head><body>
+... <p><a href="http://example.com/">The quick\tbrown fox jumps
+... over the <i><b>lazy</b></i> dog </a>
+... </body></html>
+... """)
+>>> get_first_link_text_bs(html)
+'The quick brown fox jumps over the lazy dog'
+>>> get_first_link_text_sgmllib(html)
+'The quick brown fox jumps over the lazy dog'
+
+Empty <a> links have empty link text
+
+>>> html = ("""\
+... <html><head><title>Title</title></head><body>
+... <p><a href="http://example.com/"></a>
+... </body></html>
+... """)
+>>> get_first_link_text_bs(html)
+''
+>>> get_first_link_text_sgmllib(html)
+''
+
+But for backwards-compatibility, empty non-<a> links have None link text
+
+>>> html = ("""\
+... <html><head><title>Title</title></head><body>
+... <p><frame src="http://example.com/"></frame>
+... </body></html>
+... """)
+>>> print get_first_link_text_bs(html)
+None
+>>> print get_first_link_text_sgmllib(html)
+None
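
For readers without the old BeautifulSoup API to hand, the join-and-compress step in the patched branch can be sketched in isolation. This is a minimal Python 3 illustration, not the mechanize code itself: `link_text`, `fragments` and `COMPRESS_RE` are hypothetical names, `fragments` stands in for the list that `fetchText` returns in the diff, and the regex is assumed to behave like `compress_re` (collapse each whitespace run to one space).

```python
import re

# Assumed equivalent of compress_re in the patch: any whitespace run -> one space.
COMPRESS_RE = re.compile(r"\s+")


def link_text(fragments, tag_name):
    """Build link text from the text fragments found inside a link tag.

    `fragments` is a plain list of strings, one per text node, standing in
    for the result of fetchText in the diff above (hypothetical stand-in).
    """
    if not fragments:
        # Same special case as the patched branch: an empty <a> yields "",
        # any other empty link tag yields None.
        return "" if tag_name == "a" else None
    # Join all fragments, trim the ends, then collapse internal whitespace.
    return COMPRESS_RE.sub(" ", " ".join(fragments).strip())


# Text nodes of the first doctest link, split where the <i><b> tags sit.
parts = ["The quick\tbrown fox jumps\nover the ", "lazy", " dog "]
print(link_text(parts, "a"))
print(link_text([], "a") == "", link_text([], "frame") is None)
```

One consequence of joining with a single space is that a tag splitting a word also inserts a space in this sketch: `link_text(["foo", "bar"], "a")` gives `'foo bar'`.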