This module is unmaintained (it is now part of mechanize, but the interface is no longer public).
A simple "pull API" for HTML parsing, after Perl's
HTML::TokeParser. Many simple HTML parsing tasks are
simpler this way than with the module
HTMLParser. pullparser.PullParser is a subclass of
HTMLParser.HTMLParser.
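To illustrate the "pull" idea: the caller drives the parse, requesting tokens one at a time, instead of the parser pushing events into callbacks. The following minimal sketch is invented for illustration only (the Token class and regex tokenizer are not pullparser's actual implementation):

```python
import re

# A crude tag/text tokenizer, purely to demonstrate pull-style parsing.
# Real parsers (HTMLParser, sgmllib) handle far more cases than this regex.
TAG_RE = re.compile(r"<(/?)(\w+)[^>]*>|([^<]+)")

class Token:
    def __init__(self, type, data):
        self.type = type  # "starttag", "endtag" or "data"
        self.data = data

def pull_tokens(html):
    """Yield Token objects one at a time, on demand ("pull" style)."""
    for match in TAG_RE.finditer(html):
        close, name, text = match.groups()
        if text is not None:
            yield Token("data", text)
        elif close:
            yield Token("endtag", name)
        else:
            yield Token("starttag", name)

# The caller asks for tokens as it needs them:
tokens = pull_tokens("<a href='x'>link</a>")
first = next(tokens)
print(first.type, first.data)  # starttag a
```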
This program extracts all links from a document. It will print one line for
each link, containing the URL and the textual description between the
<a> and </a> tags:

    import pullparser, sys
    f = open(sys.argv[1])
    p = pullparser.PullParser(f)
    for token in p.tags("a"):
        if token.type == "endtag":
            continue
        url = dict(token.attrs).get("href", "-")
        text = p.get_compressed_text(endat=("endtag", "a"))
        print "%s\t%s" % (url, text)
This program extracts the <title> from the document:

    import pullparser, sys
    f = open(sys.argv[1])
    p = pullparser.PullParser(f)
    if p.get_tag("title"):
        title = p.get_compressed_text()
        print "Title: %s" % title
All documentation (including this web page) is included in the distribution.
For installation instructions, see the INSTALL file included in the distribution.
The Subversion (SVN) trunk is http://codespeak.net/svn/wwwsearch/pullparser/trunk, so to check out the source:
svn co http://codespeak.net/svn/wwwsearch/pullparser/trunk pullparser
I recommend Beautiful Soup over pullparser for new web scraping code: it is more robust and flexible than this module.
Python 2.2.1 or above.
HTMLParser is fussy about real-world HTML. Try
pullparser.TolerantPullParser instead, which uses module
sgmllib. Note that if you use this class, self-closing tags (<foo/>)
will show up as 'starttag' tokens, not 'startendtag' tokens - this is a
limitation of module sgmllib.

HTMLParser.HTMLParser isn't very robust. It would be fairly easy to (perhaps optionally) rebase pullparser on the other standard library HTML parsing module,
sgmllib.SGMLParser (which, despite its name, is really an HTML parser, not a full SGML parser). I'm not going to do that, though.
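For readers unfamiliar with the starttag/startendtag distinction: an HTMLParser-based parser can report an XHTML-style empty tag like <br/> as a separate "startendtag" event, which an sgmllib-based parser cannot. A small sketch of this, assuming a modern Python where module HTMLParser has moved to html.parser:

```python
from html.parser import HTMLParser

# Records which event HTMLParser fires for each tag. With
# handle_startendtag overridden, <br/> produces a single "startendtag"
# event; sgmllib-based parsers would report it as an ordinary "starttag".
class EventRecorder(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))
    def handle_startendtag(self, tag, attrs):
        self.events.append(("startendtag", tag))
    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

r = EventRecorder()
r.feed("<p></p><br/>")
print(r.events)  # [('starttag', 'p'), ('endtag', 'p'), ('startendtag', 'br')]
```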
I prefer questions and comments to be sent to the mailing list rather than direct to me.
John J. Lee, May 2006.