[wwwsearch-commits] r44313 - in wwwsearch/mechanize/trunk: mechanize test

jjlee at codespeak.net
Sat Jun 16 23:49:03 CEST 2007


Author: jjlee
Date: Sat Jun 16 23:48:59 2007
New Revision: 44313

Modified:
   wwwsearch/mechanize/trunk/mechanize/_html.py
   wwwsearch/mechanize/trunk/test/test_html.doctest
Log:
Fix link text parsing in the BeautifulSoup-based RobustLinksFactory (and hence RobustFactory) for the case where the link text contains tags (titus at idyll.org)

Modified: wwwsearch/mechanize/trunk/mechanize/_html.py
==============================================================================
--- wwwsearch/mechanize/trunk/mechanize/_html.py	(original)
+++ wwwsearch/mechanize/trunk/mechanize/_html.py	Sat Jun 16 23:48:59 2007
@@ -379,15 +379,15 @@
                 if not url:
                     continue
                 url = _rfc3986.clean_url(url, encoding)
-                text = link.firstText(lambda t: True)
-                if text is _beautifulsoup.Null:
+                text = link.fetchText(lambda t: True)
+                if not text:
                     # follow _pullparser's weird behaviour rigidly
                     if link.name == "a":
                         text = ""
                     else:
                         text = None
                 else:
-                    text = self.compress_re.sub(" ", text.strip())
+                    text = self.compress_re.sub(" ", " ".join(text).strip())
                 yield Link(base_url, url, text, link.name, attrs)
 
 

Modified: wwwsearch/mechanize/trunk/test/test_html.doctest
==============================================================================
--- wwwsearch/mechanize/trunk/test/test_html.doctest	(original)
+++ wwwsearch/mechanize/trunk/test/test_html.doctest	Sat Jun 16 23:48:59 2007
@@ -161,3 +161,55 @@
 Traceback (most recent call last):
 ...
 StopIteration
+
+
+Link text parsing
+
+>>> def get_first_link_text_bs(html):
+...     factory = RobustLinksFactory()
+...     soup = MechanizeBs("utf-8", html)
+...     factory.set_soup(soup, "http://example.com/", "utf-8")
+...     return list(factory.links())[0].text
+
+>>> def get_first_link_text_sgmllib(html):
+...     factory = LinksFactory()
+...     response = test_html_response(html)
+...     factory.set_response(response, "http://example.com/", "utf-8")
+...     return list(factory.links())[0].text
+
+Whitespace gets compressed down to single spaces.  Tags are removed.
+
+>>> html = ("""\
+... <html><head><title>Title</title></head><body>
+... <p><a href="http://example.com/">The  quick\tbrown fox jumps
+...   over the <i><b>lazy</b></i> dog </a>
+... </body></html>
+... """)
+>>> get_first_link_text_bs(html)
+'The quick brown fox jumps over the lazy dog'
+>>> get_first_link_text_sgmllib(html)
+'The quick brown fox jumps over the lazy dog'
+
+Empty <a> links have empty link text
+
+>>> html = ("""\
+... <html><head><title>Title</title></head><body>
+... <p><a href="http://example.com/"></a>
+... </body></html>
+... """)
+>>> get_first_link_text_bs(html)
+''
+>>> get_first_link_text_sgmllib(html)
+''
+
+But for backwards-compatibility, empty non-<a> links have None link text
+
+>>> html = ("""\
+... <html><head><title>Title</title></head><body>
+... <p><frame src="http://example.com/"></frame>
+... </body></html>
+... """)
+>>> print get_first_link_text_bs(html)
+None
+>>> print get_first_link_text_sgmllib(html)
+None
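
The difference between taking only the first text node of a link and joining all of them, which is what the _html.py change appears to address, can be illustrated outside mechanize. The following is a minimal sketch and not the mechanize implementation: it uses the Python 3 standard-library html.parser instead of BeautifulSoup, and the names LinkTextCollector, link_text_chunks, first_chunk_text and joined_text are hypothetical, introduced only for this illustration.

```python
import re
from html.parser import HTMLParser

# Same idea as compress_re in the diff: collapse runs of whitespace.
COMPRESS_RE = re.compile(r"\s+")


class LinkTextCollector(HTMLParser):
    """Collect every text node found inside the first <a> element."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while we are inside the first <a>
        self.done = False     # True once the first <a> has been closed
        self.chunks = []      # text nodes, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "a" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def link_text_chunks(html):
    """Return the list of text nodes inside the first <a> of html."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


def first_chunk_text(chunks):
    """Use only the first text node: text after a nested tag is lost."""
    if not chunks:
        return ""
    return COMPRESS_RE.sub(" ", chunks[0].strip())


def joined_text(chunks):
    """Join all text nodes with spaces, then compress whitespace."""
    if not chunks:
        return ""
    return COMPRESS_RE.sub(" ", " ".join(chunks).strip())


html = ('<a href="http://example.com/">The  quick\tbrown fox jumps\n'
        '  over the <i><b>lazy</b></i> dog </a>')
chunks = link_text_chunks(html)
print(first_chunk_text(chunks))
print(joined_text(chunks))
```

Note that joining with a single space also puts a space between text nodes that were adjacent in the markup, so in this sketch foo<b>bar</b> would yield "foo bar" rather than "foobar".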

