This module is no longer maintained. Its code is now part of mechanize, but the interface is no longer public there.

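pullparser provides a simple "pull API" for HTML parsing, modelled on Perl's HTML::TokeParser: your code asks the parser for the next token instead of receiving callbacks. Many small HTML parsing tasks are easier to write this way than with the callback-based HTMLParser module. pullparser.PullParser is a subclass of HTMLParser.HTMLParser.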


This program extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <a>...</a> tags:

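import pullparser, sys
f = open(sys.argv[1])  # HTML file named on the command line
p = pullparser.PullParser(f)
# tags("a") returns end tags as well as start tags, so skip the end tags
for token in p.tags("a"):
    if token.type == "endtag": continue
    url = dict(token.attrs).get("href", "-")  # "-" if the tag has no href
    # text between here and the closing </a>
    text = p.get_compressed_text(endat=("endtag", "a"))
    print "%s\t%s" % (url, text)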
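
For comparison, here is a rough sketch of the same link extraction written in callback style. It uses Python 3's standard html.parser module rather than pullparser or the Python 2 HTMLParser module, so it is an illustration of the contrast and not part of this distribution. The names LinkCollector and extract_links are made up for the example, and the whitespace handling only approximates get_compressed_text().

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical callback-style collector of (url, text) pairs."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (url, text) pairs
        self._in_link = False  # True while inside an open <a>
        self._href = None      # href of the <a> currently open
        self._chunks = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href", "-")
            self._chunks = []

    def handle_data(self, data):
        if self._in_link:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Collapse runs of whitespace into single spaces
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._in_link = False


def extract_links(html):
    """Return a list of (url, text) pairs for every <a> in the string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

The state that the pull version keeps implicitly in its loop (which link is open, what text has been seen) has to be stored on the parser object here, which is the bookkeeping a pull API avoids.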

This program extracts the <title> from the document:

import pullparser, sys
f = open(sys.argv[1])  # HTML file named on the command line
p = pullparser.PullParser(f)
# advance to the first <title> tag, if there is one
if p.get_tag("title"):
    title = p.get_compressed_text()
    print "Title: %s" % title
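
To illustrate the idea of pulling tokens, the following sketch layers a very small token stream on top of Python 3's html.parser and uses it to find the title. It is not pullparser's implementation: Token, TokenCollector, iter_tokens and get_title are hypothetical names, and unlike a real pull parser reading from a file, this version buffers the whole document before handing out tokens.

```python
from collections import namedtuple
from html.parser import HTMLParser

# Hypothetical token record: type is "starttag", "endtag" or "data"
Token = namedtuple("Token", "type data attrs")


class TokenCollector(HTMLParser):
    """Record parser callbacks as a flat list of tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(Token("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.tokens.append(Token("endtag", tag, None))

    def handle_data(self, data):
        self.tokens.append(Token("data", data, None))


def iter_tokens(html):
    """Parse the whole string, then return an iterator over its tokens."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return iter(collector.tokens)


def get_title(html):
    """Return the whitespace-collapsed <title> text, or None if absent."""
    tokens = iter_tokens(html)
    for token in tokens:
        if token.type == "starttag" and token.data == "title":
            parts = []
            # Keep pulling from the same iterator until the text ends
            for inner in tokens:
                if inner.type != "data":
                    break
                parts.append(inner.data)
            return " ".join("".join(parts).split())
    return None
```

Because the caller drives the iteration, the "find the tag, then read the text after it" logic reads top to bottom, much like the pullparser programs above.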

Thanks to Gisle Aas, who wrote HTML::TokeParser.


All documentation (including this web page) is included in the distribution.

Stable release.

For installation instructions, see the INSTALL file included in the distribution.


The Subversion (SVN) trunk is, so to check out the source:

svn co pullparser

See also

I recommend Beautiful Soup over pullparser for new web scraping code. It is widely recommended, and it is more robust and flexible than this module.


I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, May 2006.