mechanize — Documentation

Support
- Documentation
  - FAQ
  - Handlers etc.
  - Forms
  - Hints
- Changelog

Examples
Important note
Cooperating with Browsers
Saving cookies in a file
Supplying a CookieJar
Additional Handlers
Seekable responses
Request object lifetime
Adding headers
Automatically-added headers
Initiating unverifiable transactions
RFC 2965 support
Parsing HTTP dates
Dealing with bad HTML
Note about cookie standards

This documentation is in need of reorganisation!

This page is the old ClientCookie documentation. It deals with operation on the level of urllib2 Handler objects, and also with adding headers, debugging, and cookie handling. See the front page for more typical use.

Examples

import mechanize
response = mechanize.urlopen("http://example.com/")

This function behaves identically to urllib2.urlopen(), except that it deals with cookies automatically.

Here is a more complicated example, involving Request objects (useful if you want to pass Requests around, add headers to them, etc.):

import mechanize
request = mechanize.Request("http://example.com/")
# note we're using the urlopen from mechanize, not urllib2
response = mechanize.urlopen(request)
# let's say this next request requires a cookie that was set
# in response
request2 = mechanize.Request("http://example.com/spam.html")
response2 = mechanize.urlopen(request2)

print response2.geturl()
print response2.info()  # headers
print response2.read()  # body (readline and readlines work too)

In these examples, the workings are hidden inside the mechanize.urlopen() function, which is an extension of urllib2.urlopen(). Redirects, proxies and cookies are handled automatically by this function (note that you may need a bit of configuration to get your proxies correctly set up: see urllib2 documentation).

There is also a urlretrieve() function, which works like urllib.urlretrieve().

An example at a slightly lower level shows how the module processes cookies more clearly:

# Don't copy this blindly!  You probably want to follow the examples
# above, not this one.
import mechanize

# Build an opener that *doesn't* automatically call .add_cookie_header()
# and .extract_cookies(), so we can do it manually without interference.
class NullCookieProcessor(mechanize.HTTPCookieProcessor):
def http_request(self, request): return request
def http_response(self, request, response): return response
opener = mechanize.build_opener(NullCookieProcessor)

request = mechanize.Request("http://example.com/")
response = mechanize.urlopen(request)
cj = mechanize.CookieJar()
cj.extract_cookies(response, request)
# let's say this next request requires a cookie that was set in response
request2 = mechanize.Request("http://example.com/spam.html")
cj.add_cookie_header(request2)
response2 = mechanize.urlopen(request2)

The CookieJar class does all the work. There are essentially two operations: .extract_cookies() extracts HTTP cookies from Set-Cookie (the original Netscape cookie standard) and Set-Cookie2 (RFC 2965) headers from a response if and only if they should be set given the request, and .add_cookie_header() adds Cookie headers if and only if they are appropriate for a particular HTTP request. Incoming cookies are checked for acceptability based on the host name, etc. Cookies are only set on outgoing requests if they match the request’s host name, path, etc.

Note that if you’re using mechanize.urlopen() (or if you’re using mechanize.HTTPCookieProcessor by some other means), you don’t need to call .extract_cookies() or .add_cookie_header() yourself. If, on the other hand, you want to use mechanize to provide cookie handling for an HTTP client other than mechanize itself, you will need to use this pair of methods. You can make your own request and response objects, which must support the interfaces described in the docstrings of .extract_cookies() and .add_cookie_header().

There are also some CookieJar subclasses which can store cookies in files and databases. FileCookieJar is the abstract class for CookieJars that can store cookies in disk files. LWPCookieJar saves cookies in a format compatible with the libwww-perl library. This class is convenient if you want to store cookies in a human-readable file:

import mechanize
cj = mechanize.LWPCookieJar()
cj.revert("cookie3.txt")
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
r = opener.open("http://foobar.com/")
cj.save("cookie3.txt")

The .revert() method discards all existing cookies held by the CookieJar (it won’t lose any existing cookies if the load fails). The .load() method, on the other hand, adds the loaded cookies to existing cookies held in the CookieJar (old cookies are kept unless overwritten by newly loaded ones).

MozillaCookieJar can load and save to the Mozilla/Netscape/lynx-compatible 'cookies.txt' format. This format loses some information (unusual and nonstandard cookie attributes such as comment, and also information specific to RFC 2965 cookies). The subclass MSIECookieJar can load (but not save) from Microsoft Internet Explorer’s cookie files on Windows.

Important note

Only use names you can import directly from the mechanize package, and that don’t start with a single underscore. Everything else is subject to change or disappearance without notice.

Cooperating with Browsers

Firefox since version 3 persists cookies in an sqlite database, which is not supported by MozillaCookieJar.

The subclass MozillaCookieJar differs from CookieJar only in storing cookies using a different, Firefox 2/Mozilla/Netscape-compatible, file format known as “cookies.txt”. The lynx browser also uses this format. This file format can’t store RFC 2965 cookies, so they are downgraded to Netscape cookies on saving. LWPCookieJar itself uses a libwww-perl specific format (`Set-Cookie3’) — see the example above. Python and your browser should be able to share a cookies file (note that the file location here will differ on non-unix OSes):

WARNING: you may want to back up your browser’s cookies file if you use MozillaCookieJar to save cookies. I think it works, but there have been bugs in the past!

import os, mechanize
cookies = mechanize.MozillaCookieJar()
cookies.load(os.path.join(os.environ["HOME"], "/.netscape/cookies.txt"))
# see also the save and revert methods

Note that cookies saved while Mozilla is running will get clobbered by Mozilla — see MozillaCookieJar.__doc__.

MSIECookieJar does the same for Microsoft Internet Explorer (MSIE) 5.x and 6.x on Windows, but does not allow saving cookies in this format. In future, the Windows API calls might be used to load and save (though the index has to be read directly, since there is no API for that, AFAIK; there’s also an unfinished MSIEDBCookieJar, which uses (reads and writes) the Windows MSIE cookie database directly, rather than storing copies of cookies as MSIECookieJar does).

import mechanize
cj = mechanize.MSIECookieJar(delayload=True)
cj.load_from_registry()  # finds cookie index file from registry

A true delayload argument speeds things up.

On Windows 9x (win 95, win 98, win ME), you need to supply a username to the .load_from_registry() method:

cj.load_from_registry(username="jbloggs")

Konqueror/Safari and Opera use different file formats, which aren’t yet supported.

Saving cookies in a file

If you have no need to co-operate with a browser, the most convenient way to save cookies on disk between sessions in human-readable form is to use LWPCookieJar. This class uses a libwww-perl specific format (`Set-Cookie3’). Unlike MozilliaCookieJar, this file format doesn’t lose information.

Supplying a CookieJar

You might want to do this to use your browser’s cookies, to customize CookieJar’s behaviour by passing constructor arguments, or to be able to get at the cookies it will hold (for example, for saving cookies between sessions and for debugging).

If you’re using the higher-level urllib2-like interface (urlopen(), etc), you’ll have to let it know what CookieJar it should use:

import mechanize
cookies = mechanize.CookieJar()
# build_opener() adds standard handlers (such as HTTPHandler and
# HTTPCookieProcessor) by default.  The cookie processor we supply
# will replace the default one.
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

r = opener.open("http://example.com/")  # GET
r = opener.open("http://example.com/", data)  # POST

The urlopen() function uses a global OpenerDirector instance to do its work, so if you want to use urlopen() with your own CookieJar, install the OpenerDirector you built with build_opener() using the mechanize.install_opener() function, then proceed as usual:

mechanize.install_opener(opener)
r = mechanize.urlopen("http://example.com/")

Of course, everyone using urlopen is using the same global CookieJar instance!

You can set a policy object (must satisfy the interface defined by mechanize.CookiePolicy), which determines which cookies are allowed to be set and returned. Use the policy argument to the CookieJar constructor, or use the .set\_policy() method. The default implementation has some useful switches:

from mechanize import CookieJar, DefaultCookiePolicy as Policy
cookies = CookieJar()
# turn on RFC 2965 cookies, be more strict about domains when setting and
# returning Netscape cookies, and block some domains from setting cookies
# or having them returned (read the DefaultCookiePolicy docstring for the
# domain matching rules here)
policy = Policy(rfc2965=True, strict_ns_domain=Policy.DomainStrict,
                blocked_domains=["ads.net", ".ads.net"])
cookies.set_policy(policy)

Additional Handlers

The following handlers are provided in addition to those provided by urllib2:

HTTPRobotRulesProcessor: WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. This kind of program can place significant loads on web servers, so there is a standard for a robots.txt file by which web site operators can request robots to keep out of their site, or out of particular areas of it. This handler uses the standard Python library’s robotparser module. It raises mechanize.RobotExclusionError (subclass of mechanize.HTTPError) if an attempt is made to open a URL prohibited by robots.txt.
HTTPEquivProcessor: The <META HTTP-EQUIV> tag is a way of including data in HTML to be treated as if it were part of the HTTP headers. mechanize can automatically read these tags and add the HTTP-EQUIV headers to the response object’s real HTTP headers. The HTML is left unchanged.
HTTPRefreshProcessor: The Refresh HTTP header is a non-standard header which is widely used. It requests that the user-agent follow a URL after a specified time delay. mechanize can treat these headers (which may have been set in <META HTTP-EQUIV> tags) as if they were 302 redirections. Exactly when and how Refresh headers are handled is configurable using the constructor arguments.
HTTPRefererProcessor: The Referer HTTP header lets the server know which URL you’ve just visited. Some servers use this header as state information, and don’t like it if this is not present. It’s a chore to add this header by hand every time you make a request. This adds it automatically. NOTE: this only makes sense if you use each handler for a single chain of HTTP requests (so, for example, if you use a single HTTPRefererProcessor to fetch a series of URLs extracted from a single page, this will break). mechanize.Browser does this properly.

Example:

import mechanize
cookies = mechanize.CookieJar()

opener = mechanize.build_opener(mechanize.HTTPRefererProcessor,
                                mechanize.HTTPEquivProcessor,
                                mechanize.HTTPRefreshProcessor,
                                )
opener.open("http://www.rhubarb.com/")

Seekable responses

Response objects returned from (or raised as exceptions by) mechanize.SeekableResponseOpener, mechanize.UserAgent (if .set_seekable_responses(True) has been called) and mechanize.Browser() have .seek(), .get_data() and .set_data() methods:

import mechanize
opener = mechanize.OpenerFactory(mechanize.SeekableResponseOpener).build_opener()
response = opener.open("http://example.com/")
# same return value as .read(), but without affecting seek position
total_nr_bytes = len(response.get_data())
assert len(response.read()) == total_nr_bytes
assert len(response.read()) == 0  # we've already read the data
response.seek(0)
assert len(response.read()) == total_nr_bytes
response.set_data("blah\n")
assert response.get_data() == "blah\n"
...

This caching behaviour can be avoided by using mechanize.OpenerDirector. It can also be avoided with mechanize.UserAgent. Note that HTTPEquivProcessor and HTTPResponseDebugProcessor require seekable responses and so are not compatible with mechanize.OpenerDirector and mechanize.UserAgent.

import mechanize
ua = mechanize.UserAgent()
ua.set_seekable_responses(False)
ua.set_handle_equiv(False)
ua.set_debug_responses(False)

Note that if you turn on features that use seekable responses (currently: HTTP-EQUIV handling and response body debug printing), returned responses may be seekable as a side-effect of these features. However, this is not guaranteed (currently, in these cases, returned response objects are seekable, but raised respose objects — mechanize.HTTPError instances — are not seekable). This applies regardless of whether you use mechanize.UserAgent or mechanize.OpenerDirector. If you explicitly request seekable responses by calling .set_seekable_responses(True) on a mechanize.UserAgent instance, or by using mechanize.Browser or mechanize.SeekableResponseOpener, which always return seekable responses, then both returned and raised responses are guaranteed to be seekable.

Handlers should call response = mechanize.seek_wrapped_response(response) if they require the .seek(), .get_data() or .set_data() methods.

Request object lifetime

Note that handlers may create new Request instances (for example when performing redirects) rather than adding headers to existing Request objects.

Adding headers

Adding headers is done like so:

import mechanize
req = mechanize.Request("http://foobar.com/")
req.add_header("Referer", "http://wwwsearch.sourceforge.net/mechanize/")
r = mechanize.urlopen(req)

You can also use the headers argument to the mechanize.Request constructor.

mechanize adds some headers to Request objects automatically — see the next section for details.

Automatically-added headers

OpenerDirector automatically adds a User-Agent header to every Request.

To change this and/or add similar headers, use your own OpenerDirector:

import mechanize
cookies = mechanize.CookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
                     ("From", "responsible.person@example.com")]

Again, to use urlopen(), install your OpenerDirector globally:

mechanize.install_opener(opener)
r = mechanize.urlopen("http://example.com/")

Also, a few standard headers (Content-Length, Content-Type and Host) are added when the Request is passed to urlopen() (or OpenerDirector.open()). You shouldn’t need to change these headers, but since this is done by AbstractHTTPHandler, you can change the way it works by passing a subclass of that handler to build_opener() (or, as always, by constructing an opener yourself and calling .add_handler()).

Initiating unverifiable transactions

This section is only of interest for correct handling of third-party HTTP cookies. See below for an explanation of ‘third-party’.

First, some terminology.

An unverifiable request (defined fully by (RFC 2965) is one whose URL the user did not have the option to approve. For example, a transaction is unverifiable if the request is for an image in an HTML document, and the user had no option to approve the fetching of the image from a particular URL.

The request-host of the origin transaction (defined fully by RFC 2965) is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this is the request-host of the request for the page containing the image.

mechanize knows that redirected transactions are unverifiable, and will handle that on its own (ie. you don’t need to think about the origin request-host or verifiability yourself).

If you want to initiate an unverifiable transaction yourself (which you should if, for example, you’re downloading the images from a page, and ‘the user’ hasn’t explicitly OKed those URLs):

request = Request(origin_req_host="www.example.com", unverifiable=True)

RFC 2965 support

Support for the RFC 2965 protocol is switched off by default, because few browsers implement it, so the RFC 2965 protocol is essentially never seen on the internet. To switch it on, see here.

Parsing HTTP dates

A function named str2time is provided by the package, which may be useful for parsing dates in HTTP headers. str2time is intended to be liberal, since HTTP date/time formats are poorly standardised in practice. There is no need to use this function in normal operations: CookieJar instances keep track of cookie lifetimes automatically. This function will stay around in some form, though the supported date/time formats may change.

Dealing with bad HTML

XXX Intro

XXX Test me

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, March 2011.