@# This file is processed by EmPy to colorize Python source code
@# http://wwwsearch.sf.net/bits/colorize.py
@{
from colorize import colorize
import time
import release
last_modified = release.svn_id_to_time("$Id$")
try:
    base
except NameError:
    base = False
}
pullparser
@[if base]@[end if]

pullparser

This module is currently unmaintained. It is now part of mechanize, but its interface there is no longer public.

A simple "pull API" for HTML parsing, after Perl's HTML::TokeParser. Many simple HTML parsing tasks are simpler this way than with the HTMLParser module. pullparser.PullParser is a subclass of HTMLParser.HTMLParser.

Examples:

This program extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <a>...</a> tags:

@{colorize(r"""
import pullparser, sys
f = file(sys.argv[1])
p = pullparser.PullParser(f)
for token in p.tags("a"):
    if token.type == "endtag": continue
    url = dict(token.attrs).get("href", "-")
    text = p.get_compressed_text(endat=("endtag", "a"))
    print "%s\t%s" % (url, text)
""")}

This program extracts the <title> from the document:

@{colorize(r"""
import pullparser, sys
f = file(sys.argv[1])
p = pullparser.PullParser(f)
if p.get_tag("title"):
    title = p.get_compressed_text()
    print "Title: %s" % title
""")}
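For readers who do not have pullparser to hand, the pull idea can be approximated with the Python 3 standard library. The following is only a rough sketch, not pullparser's implementation: `Token`, `TokenCollector` and `extract_links` are hypothetical names invented for this illustration, and unlike a real pull parser it parses the whole document up front before handing out tokens. It mirrors the link-extraction example by letting the caller pull tokens from an iterator instead of writing callbacks.

```python
from collections import namedtuple
from html.parser import HTMLParser

# Hypothetical token record for this sketch; pullparser's own token
# objects may look different.
Token = namedtuple("Token", "type data attrs")


class TokenCollector(HTMLParser):
    """Turn HTMLParser's callbacks into a flat list of tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(Token("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.tokens.append(Token("endtag", tag, None))

    def handle_data(self, data):
        self.tokens.append(Token("data", data, None))


def extract_links(html):
    """Return (url, text) pairs for each <a>...</a> in an HTML string."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    tokens = iter(collector.tokens)
    links = []
    for token in tokens:
        if token.type != "starttag" or token.data != "a":
            continue
        url = dict(token.attrs).get("href", "-")
        parts = []
        # Pull further tokens from the same iterator until the closing </a>.
        for inner in tokens:
            if inner.type == "endtag" and inner.data == "a":
                break
            if inner.type == "data":
                parts.append(inner.data)
        # Collapse runs of whitespace into single spaces (an assumption
        # about what "compressed" text should look like).
        links.append((url, " ".join("".join(parts).split())))
    return links
```

Calling `extract_links(open(path).read())` on a saved page yields one `(url, text)` pair per link, with `"-"` standing in for a missing `href` as in the example above.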

Thanks to Gisle Aas, who wrote HTML::TokeParser.

Download

All documentation (including this web page) is included in the distribution.

Stable release.

For installation instructions, see the INSTALL file included in the distribution.

Subversion

The Subversion (SVN) trunk is http://codespeak.net/svn/wwwsearch/pullparser/trunk, so to check out the source:

svn co http://codespeak.net/svn/wwwsearch/pullparser/trunk pullparser

See also

I recommend Beautiful Soup over pullparser for new web scraping code. It is widely recommended, and is more robust and flexible than this module.

FAQs

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, @(time.strftime("%B %Y", last_modified)).