This documentation is in need of reorganisation!
This page is the old ClientCookie documentation. It deals with operation on the level of urllib2 Handler objects, and also with adding headers, debugging, and cookie handling. Documentation for the higher-level browser-style interface is elsewhere.
    import mechanize
    response = mechanize.urlopen("http://foo.bar.com/")
This function behaves identically to urllib2.urlopen(), except
that it deals with cookies automatically.
Here is a more complicated example, involving Request objects
(useful if you want to pass
Requests around, add headers to them, etc.):
    import mechanize
    request = mechanize.Request("http://www.acme.com/")
    # note we're using the urlopen from mechanize, not urllib2
    response = mechanize.urlopen(request)
    # let's say this next request requires a cookie that was set in response
    request2 = mechanize.Request("http://www.acme.com/flying_machines.html")
    response2 = mechanize.urlopen(request2)

    print response2.geturl()
    print response2.info()  # headers
    print response2.read()  # body (readline and readlines work too)
(The above example would also work with urllib2.Request objects, since
mechanize.HTTPRequestUpgradeProcessor knows about
that class, but don't use them if you can avoid it, because this is an obscure hack for
compatibility purposes only.)
In these examples, the workings are hidden inside the
mechanize.urlopen() function, which is an extension of
urllib2.urlopen(). Redirects, proxies and cookies are handled
automatically by this function (note that you may need a bit of configuration
to get your proxies correctly set up: see
Cookie processing (etc.) is handled by processor objects, which are an
extension of urllib2's handlers: HTTPCookieProcessor,
HTTPRefererProcessor etc. They are used like any other handler.
There is quite a bit of other
urllib2-workalike code, too. Note:
This duplication has gone away in Python 2.4, since 2.4's urllib2
contains the processor extensions from mechanize, so you can simply use
mechanize's processor classes direct with 2.4's urllib2. Also,
mechanize's cookie functionality is included in Python 2.4 as module
cookielib.
There is also a
urlretrieve() function, which works like urllib.urlretrieve().
An example at a slightly lower level shows how the module processes cookies more clearly:
    # Don't copy this blindly!  You probably want to follow the examples
    # above, not this one.
    import mechanize

    # Build an opener that *doesn't* automatically call .add_cookie_header()
    # and .extract_cookies(), so we can do it manually without interference.
    class NullCookieProcessor(mechanize.HTTPCookieProcessor):
        def http_request(self, request): return request
        def http_response(self, request, response): return response
    opener = mechanize.build_opener(NullCookieProcessor)

    request = mechanize.Request("http://www.acme.com/")
    # use the opener built above, so that cookies really are left alone
    response = opener.open(request)
    cj = mechanize.CookieJar()
    cj.extract_cookies(response, request)
    # let's say this next request requires a cookie that was set in response
    request2 = mechanize.Request("http://www.acme.com/flying_machines.html")
    cj.add_cookie_header(request2)
    response2 = opener.open(request2)
The CookieJar class does all the work. There are essentially
two operations:
.extract_cookies() extracts HTTP cookies from
Set-Cookie (the original Netscape cookie standard) and
Set-Cookie2 (RFC 2965) headers from a
response if and only if they should be set given the request, and
.add_cookie_header() adds Cookie headers if and only
if they are appropriate for a particular HTTP request. Incoming cookies are
checked for acceptability based on the host name, etc. Cookies are only set on
outgoing requests if they match the request's host name, path, etc.
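As a rough offline illustration of these two operations, the sketch below uses Python 3's standard-library http.cookiejar and urllib.request modules (http.cookiejar grew out of this cookie code) rather than mechanize itself. It is an assumption, not something verified here, that mechanize.CookieJar behaves the same way for these calls. FakeResponse is a hypothetical stand-in for a real HTTP response, and the cookie name and value are made up.

```python
import email
import urllib.request
from http.cookiejar import CookieJar


class FakeResponse:
    """Hypothetical response object: only .info() is needed by CookieJar."""

    def __init__(self, header_lines):
        self._headers = email.message_from_string("\n".join(header_lines) + "\n\n")

    def info(self):
        return self._headers


jar = CookieJar()

# The server at www.acme.com sets a cookie in its response.
request = urllib.request.Request("http://www.acme.com/")
response = FakeResponse(["Set-Cookie: sessionid=abc123; Path=/"])
jar.extract_cookies(response, request)

# A later request to the same host gets a Cookie header added.
request2 = urllib.request.Request("http://www.acme.com/flying_machines.html")
jar.add_cookie_header(request2)
cookie_header = request2.get_header("Cookie")

# A request to an unrelated host does not.
other = urllib.request.Request("http://www.example.org/")
jar.add_cookie_header(other)
```

The jar decides on its own which requests a cookie belongs to; the caller only hands it each response and each outgoing request.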
Note that if you're using
mechanize.urlopen() (or if you're using
mechanize.HTTPCookieProcessor by some other
means), you don't need to call .extract_cookies() or
.add_cookie_header() yourself. If, on the other hand,
you don't want to use
urllib2, you will need to use this pair of
methods. You can make your own request and response
objects, which must support the interfaces described in the docstrings of
.extract_cookies() and .add_cookie_header().
There are also some
CookieJar subclasses which can store
cookies in files and databases.
FileCookieJar is the abstract class for
CookieJars that can store cookies in disk files.
LWPCookieJar saves cookies in a format compatible with the
libwww-perl library. This class is convenient if you want to store cookies in
a human-readable file:
    import mechanize
    cj = mechanize.LWPCookieJar()
    cj.revert("cookie3.txt")
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
    r = opener.open("http://foobar.com/")
    cj.save("cookie3.txt")
The .revert() method discards all existing cookies held by the
CookieJar (it won't lose any existing cookies if the load fails).
The .load() method, on the other hand, adds the loaded cookies to
existing cookies held in the
CookieJar (old cookies are kept
unless overwritten by newly loaded ones).
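The difference between loading and reverting can be sketched offline as follows. This uses Python 3's standard-library http.cookiejar.LWPCookieJar as a stand-in, on the assumption that mechanize.LWPCookieJar behaves alike; make_cookie and names are hypothetical helpers, and the cookie names and values are made up.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value):
    """Hypothetical helper: a persistent Netscape-style cookie for www.acme.com."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="www.acme.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=int(time.time()) + 3600, discard=False,
        comment=None, comment_url=None, rest={},
    )


def names(jar):
    """Sorted cookie names currently held by a jar."""
    return sorted(cookie.name for cookie in jar)


path = os.path.join(tempfile.mkdtemp(), "cookies.lwp")

# Put one cookie on disk.
on_disk = LWPCookieJar()
on_disk.set_cookie(make_cookie("saved", "1"))
on_disk.save(path)

# A second jar already holds a different cookie.
live = LWPCookieJar()
live.set_cookie(make_cookie("live", "2"))

live.load(path)                   # merge: file cookies are added to the jar
after_load = names(live)          # ['live', 'saved']

live.revert(path)                 # replace: the jar now mirrors the file
after_revert = names(live)        # ['saved']

try:
    live.revert(path + ".missing")  # this load fails: no such file
except OSError:
    pass
after_failed_revert = names(live)   # ['saved'], nothing was lost
```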
MozillaCookieJar can load and save to the
Mozilla/Netscape/lynx-compatible 'cookies.txt' format. This
format loses some information (unusual and nonstandard cookie attributes such
as comment, and also information specific to RFC 2965 cookies). The subclass
MSIECookieJar can load (but not save, yet) from Microsoft Internet
Explorer's cookie files (on Windows).
BSDDBCookieJar (NOT FULLY
TESTED!) saves to a BSDDB database using the standard library's
bsddb module. There's an unfinished MSIEDBCookieJar,
which uses (reads and writes) the Windows MSIE cookie database directly, rather
than storing copies of cookies as MSIECookieJar does.
MozillaCookieJar differs from
CookieJar only in storing cookies using a different,
Mozilla/Netscape-compatible, file format. The lynx browser also uses this
format. This file format can't store RFC 2965 cookies, so they are downgraded
to Netscape cookies on saving.
LWPCookieJar itself uses a
libwww-perl specific format (`Set-Cookie3') - see the example above. Python
and your browser should be able to share a cookies file (note that the file
location here will differ on non-unix OSes):
WARNING: you may want to backup your browser's cookies file
if you use
MozillaCookieJar to save cookies. I think it
works, but there have been bugs in the past!
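    import os, mechanize
    cookies = mechanize.MozillaCookieJar()
    # no leading slash on the second component: os.path.join() would
    # otherwise discard the home directory
    cookies.load(os.path.join(os.environ["HOME"], ".netscape/cookies.txt"))
    # see also the save and revert methods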
Note that cookies saved while Mozilla is running will get clobbered by
Mozilla - see
MSIECookieJar does the same for Microsoft Internet Explorer
(MSIE) 5.x and 6.x on Windows, but does not allow saving cookies in this
format. In future, the Windows API calls might be used to load and save
(though the index has to be read directly, since there is no API for that,
AFAIK; there's also an unfinished
MSIEDBCookieJar, which uses
(reads and writes) the Windows MSIE cookie database directly, rather than
storing copies of cookies as MSIECookieJar does).
    import mechanize
    cj = mechanize.MSIECookieJar(delayload=True)
    cj.load_from_registry()  # finds cookie index file from registry
The delayload argument speeds things up.
On Windows 9x (win 95, win 98, win ME), you need to supply a username to the
.load_from_registry() method.
If you have no need to co-operate with a browser, the most convenient way to
save cookies on disk between sessions in human-readable form is to use
LWPCookieJar. This class uses a libwww-perl specific format
(`Set-Cookie3'). Unlike
MozillaCookieJar, this file format
doesn't lose information.
You might want to do this to use your
browser's cookies, to customize
CookieJar's behaviour by
passing constructor arguments, or to be able to get at the cookies it will hold
(for example, for saving cookies between sessions and for debugging).
If you're using the higher-level urllib2-like interface
(urlopen(), etc), you'll have to let it know what
CookieJar it should use:
    import mechanize
    cookies = mechanize.CookieJar()
    # build_opener() adds standard handlers (such as HTTPHandler and
    # HTTPCookieProcessor) by default.  The cookie processor we supply
    # will replace the default one.
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

    r = opener.open("http://acme.com/")  # GET
    r = opener.open("http://acme.com/", data)  # POST
The urlopen() function uses a global
OpenerDirector instance to do its work, so if you want to use
urlopen() with your own
CookieJar, install the
OpenerDirector you built with build_opener() using the
mechanize.install_opener() function, then proceed as usual:
    mechanize.install_opener(opener)
    r = mechanize.urlopen("http://www.acme.com/")
You can set a policy object (must satisfy the interface defined by
mechanize.CookiePolicy), which determines which cookies are
allowed to be set and returned. Use the policy argument to the
CookieJar constructor, or use the .set_policy() method. The
default implementation has some useful switches:
    from mechanize import CookieJar, DefaultCookiePolicy as Policy
    cookies = CookieJar()
    # turn on RFC 2965 cookies, be more strict about domains when setting and
    # returning Netscape cookies, and block some domains from setting cookies
    # or having them returned (read the DefaultCookiePolicy docstring for the
    # domain matching rules here)
    policy = Policy(rfc2965=True, strict_ns_domain=Policy.DomainStrict,
                    blocked_domains=["ads.net", ".ads.net"])
    cookies.set_policy(policy)
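To see a blocked-domains policy take effect without touching the network, here is a minimal sketch using Python 3's standard-library http.cookiejar, assumed (not verified here) to behave like the mechanize classes of the same names. FakeResponse and try_set_cookie are hypothetical helpers, and the cookie is made up.

```python
import email
import urllib.request
from http.cookiejar import CookieJar, DefaultCookiePolicy


class FakeResponse:
    """Hypothetical response object: only .info() is needed by CookieJar."""

    def __init__(self, header_lines):
        self._headers = email.message_from_string("\n".join(header_lines) + "\n\n")

    def info(self):
        return self._headers


# "ads.net" blocks that exact host; ".ads.net" blocks its subdomains.
policy = DefaultCookiePolicy(blocked_domains=["ads.net", ".ads.net"])
jar = CookieJar(policy=policy)


def try_set_cookie(url):
    """Pretend the server at url answered with a Set-Cookie header."""
    request = urllib.request.Request(url)
    response = FakeResponse(["Set-Cookie: id=42; Path=/"])
    jar.extract_cookies(response, request)


try_set_cookie("http://www.ads.net/banner")
count_after_blocked_host = len(jar)   # 0: the policy refused the cookie

try_set_cookie("http://www.acme.com/")
count_after_allowed_host = len(jar)   # 1: this host is not blocked
```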
These are implemented as processor classes. Processors are an extension of
urllib2's handlers (now a standard part of urllib2 in Python 2.4):
you just pass them to
build_opener() (example code below).
WWW Robots (also called wanderers or spiders) are programs that traverse
many pages in the World Wide Web by recursively retrieving linked pages. This
kind of program can place significant loads on web servers, so there is a standard for a
robots.txt file by which web site operators can request robots to keep
out of their site, or out of particular areas of it. This processor uses the
standard Python library's
robotparser module. It raises
mechanize.RobotExclusionError (subclass of
urllib2.HTTPError) if an attempt is made to open a URL prohibited
by robots.txt. XXX ATM, this makes use of code in the
robotparser module that uses
urllib - this will
likely change in future to use urllib2.
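The kind of check the robots.txt rules imply can be sketched with the standard library's robot parser alone (urllib.robotparser in Python 3, robotparser in Python 2). This shows only the rule matching, not the mechanize processor; the rules, user-agent string and URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules as a site operator might publish them in robots.txt.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A robot should ask before fetching each URL.
may_fetch_private = rules.can_fetch(
    "MyProgram/0.1", "http://www.acme.com/private/plans.html")
may_fetch_public = rules.can_fetch(
    "MyProgram/0.1", "http://www.acme.com/index.html")
```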
<META HTTP-EQUIV> tag is a way of including data
in HTML to be treated as if it were part of the HTTP headers. mechanize can
automatically read these tags and add the
HTTP-EQUIV headers to
the response object's real HTTP headers. The HTML is left unchanged.
The Refresh HTTP header is a non-standard header which is
widely used. It requests that the user-agent follow a URL after a specified
time delay. mechanize can treat these headers (which may have been set in
<META HTTP-EQUIV> tags) as if they were 302 redirections.
Exactly when and how
Refresh headers are handled is configurable
using the constructor arguments.
The Referer HTTP header lets the server know which URL
you've just visited. Some servers use this header as state information, and
don't like it if this is not present. It's a chore to add this header by hand
every time you make a request. This processor adds it automatically.
NOTE: this only makes sense if you use each processor for a
single chain of HTTP requests (so, for example, if you use a single
HTTPRefererProcessor to fetch a series of URLs extracted from a single page,
this will break). mechanize.Browser does this properly.
    import mechanize
    cookies = mechanize.CookieJar()

    opener = mechanize.build_opener(mechanize.HTTPRefererProcessor,
                                    mechanize.HTTPEquivProcessor,
                                    mechanize.HTTPRefreshProcessor,
                                    )
    opener.open("http://www.rhubarb.com/")
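Response objects returned from (or raised as exceptions by)
mechanize.SeekableResponseOpener, mechanize.UserAgent (if
.set_seekable_responses(True) has been called) and
mechanize.Browser() have .seek(), .get_data() and .set_data() methods: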
    import mechanize
    opener = mechanize.OpenerFactory(mechanize.SeekableResponseOpener).build_opener()
    response = opener.open("http://example.com/")
    # same return value as .read(), but without affecting seek position
    total_nr_bytes = len(response.get_data())
    assert len(response.read()) == total_nr_bytes
    assert len(response.read()) == 0  # we've already read the data
    response.seek(0)
    assert len(response.read()) == total_nr_bytes
    response.set_data("blah\n")
    assert response.get_data() == "blah\n"
    ...
This caching behaviour can be avoided by using
mechanize.OpenerDirector (as long as processors that need seekable
responses, such as HTTPEquivProcessor and
HTTPResponseDebugProcessor, are not used). It can also be avoided
with mechanize.UserAgent:
    import mechanize
    ua = mechanize.UserAgent()
    ua.set_seekable_responses(False)
    ua.set_handle_equiv(False)
    ua.set_debug_responses(False)
Note that if you turn on features that use seekable responses (currently:
HTTP-EQUIV handling and response body debug printing), returned responses
may be seekable as a side-effect of these features. However, this is
not guaranteed (currently, in these cases, returned response objects are
seekable, but raised response objects, that is HTTPError
instances, are not seekable). This applies regardless of whether you
If you explicitly request seekable responses by calling
.set_seekable_responses(True) on a
mechanize.UserAgent instance, or by using
mechanize.SeekableResponseOpener, which always returns seekable
responses, then both returned and raised responses are guaranteed to be
seekable.
Handlers should call
mechanize.seek_wrapped_response(response) if they require the
.seek(), .get_data() or .set_data() methods.
Note that SeekableProcessor (and
ResponseUpgradeProcessor) are deprecated since mechanize 0.1.6b.
The reason for the deprecation is that these were really abuses of the response
processing chain (the
.process_response() support documented by
urllib2). The response processing chain is sensibly used only for processing
response headers and data, not for processing response objects,
because the same data may occur as different Python objects (this can occur for
example when HTTPError is raised by
HTTPDefaultErrorHandler), but should only get processed once.
mechanize automatically upgrades
urllib2.Request objects to
mechanize.Request, as a backwards-compatibility hack. This
means that you won't see any headers that are added to Request objects by
handlers unless you use
mechanize.Request in the first place.
Sorry about that.
Adding headers is done like so:
    import mechanize, urllib2
    req = urllib2.Request("http://foobar.com/")
    req.add_header("Referer", "http://wwwsearch.sourceforge.net/mechanize/")
    r = mechanize.urlopen(req)
You can also use the headers argument to the
urllib2.Request constructor.
urllib2 (in fact, mechanize takes over this task from
urllib2) adds some headers to Request objects
automatically - see the next section for details.
OpenerDirector automatically adds a User-Agent
header to every Request.
To change this and/or add similar headers, use your own
OpenerDirector:
    import mechanize
    cookies = mechanize.CookieJar()
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
    opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
                         ("From", "email@example.com")]
Again, to use
urlopen(), install your
OpenerDirector globally:
    mechanize.install_opener(opener)
    r = mechanize.urlopen("http://acme.com/")
Also, a few standard headers (Content-Length, Content-Type and
Host) are added when the
Request is passed to urlopen() (or
OpenerDirector.open()). You shouldn't need to change these
headers, but since this is done by
AbstractHTTPHandler, you can
change the way it works by passing a subclass of that handler to
build_opener() (or, as always, by constructing an opener yourself
and calling .add_handler()).
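To check which headers actually leave an opener without touching the network, one option is a handler subclass that records the request instead of sending it. The sketch below does this with Python 3's urllib.request, the descendant of urllib2; whether mechanize's own handlers can be swapped in exactly this way is an assumption. CaptureHandler is a hypothetical class written only for this illustration.

```python
import email.message
import io
import urllib.request
import urllib.response


class CaptureHandler(urllib.request.HTTPHandler):
    """Hypothetical handler: records outgoing headers instead of using the network."""

    def http_open(self, req):
        # By now the inherited request preprocessing has added Host and
        # the opener's addheaders to the request.
        self.sent_headers = dict(req.header_items())
        response = urllib.response.addinfourl(
            io.BytesIO(b""), email.message.Message(), req.full_url, 200)
        response.msg = "OK"  # the default error processor reads this attribute
        return response


capture = CaptureHandler()
# Passing an HTTPHandler subclass replaces the default HTTPHandler.
opener = urllib.request.build_opener(capture)
opener.addheaders = [
    ("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
    ("From", "email@example.com"),
]
opener.open("http://www.acme.com/")
sent = capture.sent_headers
```

Subclassing the HTTP handler, rather than adding an unrelated one, matters here: build_opener() then leaves out its default handler, so nothing is sent over the wire.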
This section is only of interest for correct handling of third-party HTTP cookies. See below for an explanation of 'third-party'.
First, some terminology.
An unverifiable request (defined fully by RFC 2965) is one whose URL the user did not have the option to approve. For example, a transaction is unverifiable if the request is for an image in an HTML document, and the user had no option to approve the fetching of the image from a particular URL.
The request-host of the origin transaction (defined fully by RFC 2965) is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this is the request-host of the request for the page containing the image.
mechanize knows that redirected transactions are unverifiable, and will handle that on its own (ie. you don't need to think about the origin request-host or verifiability yourself).
If you want to initiate an unverifiable transaction yourself (which you should if, for example, you're downloading the images from a page, and 'the user' hasn't explicitly OKed those URLs):
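    from mechanize import Request
    # url is the address of the embedded resource (for example, an image)
    request = Request(url, origin_req_host="www.example.com", unverifiable=True)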
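The effect of marking a request unverifiable can be sketched offline with Python 3's standard library, whose urllib.request.Request accepts the same two arguments; it is an assumption that mechanize applies the same rules. Note that the strict_ns_unverifiable switch is turned on here purely so that the refusal is visible for a Netscape cookie; it is off by default. FakeResponse is a hypothetical stand-in for a real response, and the host names and cookie are made up.

```python
import email
import urllib.request
from http.cookiejar import CookieJar, DefaultCookiePolicy


class FakeResponse:
    """Hypothetical response object: only .info() is needed by CookieJar."""

    def __init__(self, header_lines):
        self._headers = email.message_from_string("\n".join(header_lines) + "\n\n")

    def info(self):
        return self._headers


# Apply the RFC 2965 unverifiable-transaction rule to Netscape cookies too.
jar = CookieJar(policy=DefaultCookiePolicy(strict_ns_unverifiable=True))
ad_response = FakeResponse(["Set-Cookie: tracker=xyz; Path=/"])

# An image embedded in a page on www.bland.org, served by www.lurid.com:
# the user never approved this URL, and the host is a third party.
image_request = urllib.request.Request(
    "http://www.lurid.com/ad.gif",
    origin_req_host="www.bland.org",
    unverifiable=True,
)
jar.extract_cookies(ad_response, image_request)
count_after_unverifiable = len(jar)   # 0: third-party cookie refused

# The same server visited directly by the user.
direct_request = urllib.request.Request("http://www.lurid.com/")
jar.extract_cookies(ad_response, direct_request)
count_after_direct = len(jar)         # 1: cookie accepted
```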
RFC 2965 handling is switched off by default, because few browsers implement it, so the RFC 2965 protocol is essentially never seen on the internet. To switch it on, see here.
First, a few common problems. The most frequent mistake people seem to make
is to use
mechanize.urlopen(), and the
.extract_cookies() and .add_cookie_header() methods
on a cookie object themselves. If you use
mechanize.urlopen() (or
OpenerDirector.open()), the module handles extraction and
adding of cookies by itself, so you should not call
.extract_cookies() or .add_cookie_header().
Are you sure the server is sending you any cookies in the first place?
Maybe the server is keeping track of state in some other way
(HIDDEN HTML form entries (possibly in a separate page referenced
by a frame), URL-encoded session keys, IP address, HTTP
headers)? Perhaps some embedded script in the HTML is setting cookies (see
below)? Maybe you messed up your request, and the server is sending you some
standard failure page (even if the page doesn't appear to indicate any
failure). Sometimes, a server wants particular headers set to the values it
expects, or it won't play nicely. The most frequent offenders here are the
Referer [sic] and / or User-Agent
headers (see above for how to set these). The
User-Agent header may need to be set to a value like that of a
popular browser. The
Referer header may need to be set to the URL
that the server expects you to have followed a link from. Occasionally, it may
even be that operators deliberately configure a server to insist on precisely
the headers that the popular browsers (MS Internet Explorer, Mozilla/Netscape,
Opera, Konqueror/Safari) generate, but remember that incompetence (possibly on
your part) is more probable than deliberate sabotage (and if a site owner is
that keen to stop robots, you probably shouldn't be scraping it anyway).
When you .save() to or
.revert() from a file, single-session cookies
will expire unless you explicitly request otherwise with the
ignore_discard argument. This may be your problem if you find
cookies are going away after saving and loading.
    import mechanize
    cj = mechanize.LWPCookieJar()
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
    mechanize.install_opener(opener)
    r = mechanize.urlopen("http://foobar.com/")
    cj.save("/some/file", ignore_discard=True, ignore_expires=True)
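The disappearing-cookie effect is easy to reproduce offline. The sketch below uses Python 3's standard-library http.cookiejar.LWPCookieJar as a stand-in, assuming mechanize.LWPCookieJar treats ignore_discard the same way; make_session_cookie is a hypothetical helper and the cookie is made up.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_session_cookie(name, value):
    """Hypothetical helper: a cookie with no expiry date (single-session)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="www.acme.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


tmpdir = tempfile.mkdtemp()
default_path = os.path.join(tmpdir, "default.lwp")
keep_path = os.path.join(tmpdir, "keep.lwp")

jar = LWPCookieJar()
jar.set_cookie(make_session_cookie("sessionid", "abc123"))

# Default save: single-session cookies are silently left out of the file.
jar.save(default_path)
reloaded_default = LWPCookieJar()
reloaded_default.load(default_path)
kept_by_default = len(reloaded_default)   # 0

# ignore_discard=True is needed on both the save and the load.
jar.save(keep_path, ignore_discard=True)
reloaded_keep = LWPCookieJar()
reloaded_keep.load(keep_path, ignore_discard=True)
kept_with_flag = len(reloaded_keep)       # 1
```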
If none of the advice above solves your problem quickly, try comparing the
headers and data that you are sending out with those that a browser emits.
Often this will give you the clue you need. Of course, you'll want to check
that the browser is able to do manually what you're trying to achieve
programmatically before minutely examining the headers. Make sure that what you
do manually is exactly the same as what you're trying to do from
Python - you may simply be hitting a server bug that only gets revealed if you
view pages in a particular order, for example. In order to see what your
browser is sending to the server (even if HTTPS is in use), see the General FAQ page. If nothing is obviously wrong
with the requests your program is sending and you're out of ideas, you can try
the last resort of good old brute force binary-search debugging. Temporarily
switch to sending HTTP headers (with
httplib). Start by copying
Netscape/Mozilla or IE slavishly (apart from session IDs, etc., of course),
then begin the tedious process of mutating your headers and data until they
match what your higher-level code was sending. This will at least reliably
find your problem.
You can turn on display of HTTP headers:
    import mechanize
    hh = mechanize.HTTPHandler()  # you might want HTTPSHandler, too
    hh.set_http_debuglevel(1)
    opener = mechanize.build_opener(hh)
    response = opener.open(url)
Alternatively, you can examine your individual request and response
objects to see what's going on. Note, though, that mechanize upgrades
urllib2.Request objects to
mechanize.Request, so you
won't see any headers that are added to requests by handlers unless you use
mechanize.Request in the first place. In addition, requests may
involve "sub-requests" in cases such as redirection, in which case you will
also not see everything that's going on just by examining the original request
and final response. mechanize's responses can be made to have
.seek() and .get_data() methods. It's often
useful to use the
.get_data() method during debugging.
Two processors also help here:
HTTPRedirectDebugProcessor (which prints information
about redirections) and
HTTPResponseDebugProcessor (which prints
out all response bodies, including those that are read during redirections).
NOTE: as well as having these processors in your
OpenerDirector (for example, by passing them to
build_opener()) you have to turn on logging at the
INFO level or lower in order to see any output.
If you would like to see what is going on in mechanize's tiny mind, do this:
    import sys, logging
    # logging.DEBUG covers masses of debugging information,
    # logging.INFO just shows the output from HTTPRedirectDebugProcessor,
    logger = logging.getLogger("mechanize")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.DEBUG)
The DEBUG level (as opposed to the
INFO level) can
actually be quite useful, as it explains why particular cookies are accepted or
rejected and why they are or are not returned.
One final thing to note is that there are some catch-all bare
except: statements in the module, which are there to handle
unexpected bad input without crashing your program. If this happens, it's a
bug in mechanize, so please mail me the warning text.
It is possible to embed script in HTML pages (sandwiched between
<SCRIPT>here</SCRIPT> tags, and in
A function named
str2time is provided by the package,
which may be useful for parsing dates in HTTP headers.
str2time is intended to be liberal, since HTTP date/time
formats are poorly standardised in practice. There is no need to use this
function in normal operations:
CookieJar instances keep track
of cookie lifetimes automatically. This function will stay around in some
form, though the supported date/time formats may change.
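For a feel of what liberal date parsing means in practice, here is a sketch using http2time from Python 3's standard-library http.cookiejar. Treating it as comparable to str2time is an assumption; the two are not claimed to accept exactly the same formats. The sample date strings are made up for illustration.

```python
import calendar
from http.cookiejar import http2time

# Seconds since the epoch for 1994-02-09 22:23:32 UTC.
expected = calendar.timegm((1994, 2, 9, 22, 23, 32, 0, 0, 0))

samples = [
    "Wed, 09 Feb 1994 22:23:32 GMT",        # the usual HTTP format
    "09 Feb 1994 22:23:32 GMT",             # weekday left out
    "Wednesday, 09-Feb-1994 22:23:32 GMT",  # rfc850 style with a 4-digit year
]
parsed = [http2time(text) for text in samples]

# Text that is not a date at all gives None rather than raising.
unparseable = http2time("not a date")
```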
XXX Test me
    import copy
    import re
    import mechanize

    class CommentCleanProcessor(mechanize.BaseProcessor):
        def http_response(self, request, response):
            if not hasattr(response, "seek"):
                response = mechanize.response_seek_wrapper(response)
            response.seek(0)
            new_response = copy.copy(response)
            # raw string, so that \1 is a group reference and not chr(1)
            new_response.set_data(
                re.sub("<!-([^-]*)->", r"<!--\1-->", response.read()))
            return new_response
        https_response = http_response
The various cookie standards and their history form a case study of the terrible things that can happen to a protocol. The long-suffering David Kristol has written a paper about it, if you want to know the gory details.
Here is a summary.
The Netscape protocol (cookie_spec.html) is still the only standard supported by most browsers (including Internet Explorer and Netscape). Be aware that cookie_spec.html is not, and never was, actually followed to the letter (or anything close) by anyone (including Netscape, IE and mechanize): the Netscape protocol standard is really defined by the behaviour of Netscape (and now IE). Netscape cookies are also known as V0 cookies, to distinguish them from RFC 2109 or RFC 2965 cookies, which have a version cookie-attribute with a value of 1.
RFC 2109 was introduced
to fix some problems identified with the Netscape protocol, while still keeping
the same HTTP headers (Cookie and Set-Cookie). The
most prominent of these problems is the 'third-party' cookie issue, which was
an accidental feature of the Netscape protocol. When one visits www.bland.org,
one doesn't expect to get a cookie from www.lurid.com, a site one has never
visited. Depending on browser configuration, this can still happen, because
the unreconstructed Netscape protocol is happy to accept cookies from, say, an
image in a webpage (www.bland.org) that's included by linking to an
advertiser's server (www.lurid.com). This kind of event, where your browser
talks to a server that you haven't explicitly okayed by some means, is what the
RFCs call an 'unverifiable transaction'. In addition to the potential for
embarrassment caused by the presence of lurid.com's cookies on one's machine,
this may also be used to track your movements on the web, because advertising
agencies like doubleclick.net place ads on many sites. RFC 2109 tried to
change this by requiring cookies to be turned off during unverifiable
transactions with third-party servers - unless the user explicitly asks them to
be turned on. This clashed with the business model of advertisers like
doubleclick.net, who had started to take advantage of the third-party cookies
'bug'. Since the browser vendors were more interested in the advertisers'
concerns than those of the browser users, this arguably doomed both RFC 2109
and its successor, RFC 2965, from the start. Other problems than the
third-party cookie issue were also fixed by 2109. However, even ignoring the
advertising issue, 2109 was stillborn, because Internet Explorer and Netscape
behaved differently in response to its extended Set-Cookie
headers. This was not really RFC 2109's fault: it worked the way it did to
keep compatibility with the Netscape protocol as implemented by Netscape.
Microsoft Internet Explorer (MSIE) was very new when the standard was designed,
but was starting to be very popular when the standard was finalised. XXX P3P,
and MSIE & Mozilla options
XXX Apparently MSIE implements bits of RFC 2109 - but not very compliant
(surprise). Presumably other browsers do too, as a result. mechanize
already does allow Netscape cookies to have
port cookie-attributes, and as far as I know that's the extent of
the support present in MSIE. I haven't tested, though!
RFC 2965 attempted to fix
the compatibility problem by introducing two new headers,
Set-Cookie2 and
Cookie2. Unlike the Cookie header,
Cookie2 does not carry
cookies to the server - rather, it simply advertises to the server that RFC
2965 is understood.
Set-Cookie2 does carry cookies, from
server to client: the new header means that both IE and Netscape completely
ignore these cookies. This prevents breakage, but introduces a chicken-egg
problem that means 2965 may never be widely adopted, especially since Microsoft
shows no interest in it. XXX Rumour has it that the European Union is unhappy
with P3P, and might introduce legislation that requires something better,
forming a gap that RFC 2965 might fill - any truth in this? Opera is the only
browser I know of that supports the standard. On the server side, Apache's
mod_usertrack supports it. One confusing point to note about RFC
2965 is that it uses the same value (1) of the Version attribute in HTTP
headers as does RFC 2109.
Most recently, it was discovered that RFC 2965 does not fully take account of issues arising when 2965 and Netscape cookies coexist, and errata were discussed on the W3C http-state mailing list, but the list traffic died and it seems RFC 2965 is dead as an internet protocol (but still a useful basis for implementing the de-facto standards, and perhaps as an intranet protocol).
Because Netscape cookies are so poorly specified, the general philosophy of the module's Netscape cookie implementation is to start with RFC 2965 and open holes where required for Netscape protocol-compatibility. RFC 2965 cookies are always treated as RFC 2965 requires, of course!
Doesn't the standard Python library module, Cookie, do this?
No: Cookie.py does the server end of the job. It doesn't know when to accept cookies from a server or when to pass them back.
No. You probably want it, though.
There is more than one protocol, in fact (see the docs for a brief explanation of the history):
Netscape and RFC 2965. RFC 2965 handling is switched off by default.
RFC 2109 cookies are currently parsed as Netscape cookies, and treated
by default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled,
or as Netscape cookies otherwise. RFC 2109 is officially obsoleted by RFC
2965. Browsers do use a few RFC 2109 features in their Netscape cookie
implementations, and mechanize knows about that, too.
Read the debugging section of this page.
Did you call
response.read() (eg., in a debug statement),
then forget that all the data has already been read? In that case, you
may want to use
.readline() methods on your
response object as many times as you need. The
method (which is not always present, see above) still works, because mechanize
caches read data.
.load() appends cookies from a file.
.revert() discards all existing cookies held by the
CookieJar first (but it won't lose any existing cookies if
the loading fails).
No. Tested patches welcome. Clarification: As far as I know, it's perfectly possible to use mechanize in threaded code, but it provides no synchronisation: you have to provide that yourself.
The module docstrings are worth reading if you want to do something unusual.
urllib2 used "handlers", but not these "processors".
This Python library patch contains an explanation. Processors are now a standard part of urllib2 in Python 2.4.
    from mechanize import CookieJar
    print CookieJar.extract_cookies.__doc__
    print CookieJar.add_cookie_header.__doc__
I prefer questions and comments to be sent to the mailing list rather than direct to me.
John J. Lee, December 2008.