pullparser

This module is unmaintained. Its code is now part of mechanize, where it is no longer a public interface.

A simple "pull API" for HTML parsing, modelled on Perl's HTML::TokeParser. Many straightforward HTML parsing tasks are easier to write this way than with the callback-based HTMLParser module. pullparser.PullParser is a subclass of HTMLParser.HTMLParser.
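
To make the "pull" idea concrete without installing this module, here is a minimal sketch built on Python 3's standard html.parser. It is not pullparser's implementation or API: the names Token, iter_tokens and extract_links are hypothetical and exist only for this illustration. The callbacks merely record tokens, and the calling code then asks for tokens one at a time and decides what to do with each.

```python
# Sketch only: a tiny pull-style tokenizer on Python 3's html.parser.
# Token, iter_tokens and extract_links are hypothetical names,
# not part of pullparser.
from collections import namedtuple
from html.parser import HTMLParser

Token = namedtuple("Token", "type data attrs")


class _Collector(HTMLParser):
    """Record parser callbacks as tokens instead of acting on them."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(Token("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.tokens.append(Token("endtag", tag, None))

    def handle_data(self, data):
        self.tokens.append(Token("data", data, None))


def iter_tokens(html):
    """Yield tokens one at a time so that the caller drives the loop."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    for token in collector.tokens:
        yield token


def extract_links(html):
    """Return (url, text) pairs, one for each <a> element."""
    links = []
    tokens = iter_tokens(html)
    for token in tokens:
        if token.type != "starttag" or token.data != "a":
            continue
        url = dict(token.attrs).get("href", "-")
        parts = []
        # Pull further tokens from the same stream until the closing </a>.
        for inner in tokens:
            if inner.type == "endtag" and inner.data == "a":
                break
            if inner.type == "data":
                parts.append(inner.data)
        # Collapse runs of whitespace in the link text.
        links.append((url, " ".join("".join(parts).split())))
    return links
```

For simplicity the sketch parses the whole document before yielding anything. The point is only the shape of the calling code: a plain loop that pulls tokens, rather than a set of handler methods that the parser calls.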

Examples:

This program extracts all links from a document. It will print one line for each link, containing the URL and the textual description between the <a>...</a> tags:

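import pullparser, sys
f = open(sys.argv[1])  # HTML file named on the command line
p = pullparser.PullParser(f)
for token in p.tags("a"):
    # tags("a") returns end tags as well as start tags; skip the end tags
    if token.type == "endtag": continue
    # token.attrs is a sequence of (name, value) pairs
    url = dict(token.attrs).get("href", "-")
    # collect the text up to the closing </a>
    text = p.get_compressed_text(endat=("endtag", "a"))
    print "%s\t%s" % (url, text)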

This program extracts the <title> from the document:

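import pullparser, sys
f = open(sys.argv[1])  # HTML file named on the command line
p = pullparser.PullParser(f)
# advance to the <title> tag, then read the text that follows it
if p.get_tag("title"):
    title = p.get_compressed_text()
    print "Title: %s" % title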

Thanks to Gisle Aas, who wrote HTML::TokeParser.

Download

All documentation (including this web page) is included in the distribution.

Stable release.

For installation instructions, see the INSTALL file included in the distribution.

Subversion

The Subversion (SVN) trunk is http://codespeak.net/svn/wwwsearch/pullparser/trunk, so to check out the source:

svn co http://codespeak.net/svn/wwwsearch/pullparser/trunk pullparser

See also

Beautiful Soup is widely recommended, and I recommend it over pullparser for new web scraping code. It is more robust and flexible than this module.

FAQs

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, May 2006.