mechanize — FAQ
Which version of Python do I need?
Python 2.4, 2.5, 2.6, or 2.7. Python 3 is not yet supported.
Does mechanize depend on BeautifulSoup?
No. mechanize offers a few classes that make use of BeautifulSoup, but these classes are not required to use mechanize. mechanize bundles BeautifulSoup version 2, so that module is no longer required. A future version of mechanize will support BeautifulSoup version 3, at which point mechanize will likely no longer bundle the module.
Does mechanize depend on ClientForm?
No, ClientForm is now part of mechanize.
I’m not getting the HTML page I expected to see.
Browser doesn’t have all of the forms/links I see in the HTML. Why not?
Perhaps the default parser can’t cope with invalid HTML. Try using the included BeautifulSoup 2 parser instead:
browser = mechanize.Browser(factory=mechanize.RobustFactory())
Alternatively, you can process the HTML (and headers) arbitrarily:
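import mechanize
browser = mechanize.Browser()
browser.open("http://example.com/")
# Rewrite the markup, then hand the modified response back to the browser
html = browser.response().get_data().replace("<br/>", "<br />")
response = mechanize.make_response(
    html, [("Content-Type", "text/html")],
    "http://example.com/", 200, "OK")
browser.set_response(response)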
My HTTP response data is truncated.
mechanize.Browser’s response objects support the .seek() method, and can still be used after .close() has been called. Response data is not fetched until it is needed, so navigating away from a URL before fetching all of the response will truncate it. Call response.get_data() before navigating if you don’t want that to happen.
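A minimal sketch of that advice. It assumes it is handed an object that behaves like mechanize.Browser, meaning open(url) returns a response with a get_data() method; the helper name fetch_pages is made up for illustration and is not part of mechanize.

```python
def fetch_pages(browser, urls):
    """Return {url: body}, reading each body in full before moving on.

    `browser` is assumed to behave like mechanize.Browser: open(url)
    returns a response object that has a get_data() method.
    """
    pages = {}
    for url in urls:
        response = browser.open(url)
        # Read the whole body now. Once the browser opens the next URL,
        # data not yet fetched for this response would be lost.
        pages[url] = response.get_data()
    return pages
```

With a real browser this would be called as fetch_pages(mechanize.Browser(), ["http://example.com/"]).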
b = mechanize.Browser(
# mechanize's XHTML support needs work, so is currently switched off. If
# we want to get our work done, we have to turn it on by supplying a
# mechanize.Factory (with XHTML support turned on):
Why don’t timeouts work for me?
Timeouts are ignored with versions of Python earlier than 2.6. Timeouts do not apply to DNS lookups.
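For illustration only: on Python 2.6 or later, a timeout in seconds can be passed as the timeout keyword argument when opening a URL. The wrapper below is a hypothetical helper, not part of mechanize, and assumes a mechanize.Browser-like object. It does not try to catch the exception raised when the timeout expires.

```python
def open_with_timeout(browser, url, seconds=10.0):
    """Open url, giving up if the connection stalls for `seconds`.

    `browser` is assumed to behave like mechanize.Browser, whose open()
    accepts a `timeout` keyword argument. The timeout covers socket
    operations only; as noted above, DNS lookups are not covered.
    """
    return browser.open(url, timeout=seconds)
```

Usage would look like open_with_timeout(mechanize.Browser(), "http://example.com/", 5.0).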
Is there any example code?
Look in the examples/ directory. Note that the examples on the forms page are executable as-is. Contributions of example code would be very welcome!
Doesn’t the standard Python library module, cgi, do this?
No. The cgi module does the server end of the job. It doesn’t know how to parse or fill in a form or how to send it back to the server.
How do I figure out what control names and values to use?
print form is usually all you need. In your code, inspecting HTMLForm instances at runtime can also be useful. Note that it’s possible to use item labels instead of item names, which can be useful: use the by_label arguments to the various methods.
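One way to inspect a form at runtime, sketched on the assumption that the form has a controls list whose entries carry type, name and value attributes, as mechanize's HTMLForm and its controls do. The helper name describe_controls is hypothetical.

```python
def describe_controls(form):
    """Return a list of (type, name, value) tuples, one per control.

    `form` is assumed to behave like a mechanize HTMLForm: it has a
    `controls` list, and each control has `type` and `name` attributes.
    `value` is read defensively in case a control does not provide it.
    """
    return [(control.type, control.name, getattr(control, "value", None))
            for control in form.controls]
```

Printing each tuple from describe_controls(form) gives a quick overview of the names and current values you can use in your code.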
What do those '*' characters mean in the string representations of list controls?
A * next to an item means that item is selected.
What do those parentheses (round brackets) mean in the string representations of list controls?
Parentheses (foo) around an item mean that item is disabled.
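To make the notation concrete, here is a small imitation of it. This is not mechanize's own code; render_item and render_items are hypothetical names, and each item is modelled as a plain (name, selected, disabled) tuple.

```python
def render_item(name, selected=False, disabled=False):
    """Format one list item the way the notation above describes."""
    text = name
    if selected:
        text = "*" + text        # leading * marks a selected item
    if disabled:
        text = "(%s)" % text     # parentheses mark a disabled item
    return text


def render_items(items):
    """Format a sequence of (name, selected, disabled) tuples."""
    return "[" + ", ".join(render_item(*item) for item in items) + "]"
```

For example, render_items([("a", True, False), ("b", False, True), ("c", False, False)]) gives "[*a, (b), c]".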
Why doesn’t <some control> turn up in the data returned by .click*() when that control has non-
Either the control is disabled, or it is not successful for some other reason. ‘Successful’ (see HTML 4 specification) means that the control will cause data to get sent to the server.
Why does mechanize not follow the HTML 4.0 / RFC 1866 standards for
Because by default, it follows browser behaviour when setting the initially-selected items in list controls that have no items explicitly selected in the HTML. Use the
ParseResponse if you want to follow the RFC 1866 rules instead. Note that browser behaviour violates the HTML 4.01 specification in the case of
Why does .click()ing on a button not work for me?
Clicking on a RESET button doesn’t do anything, by design - this is a library for web automation, not an interactive browser. Even in an interactive browser, clicking on RESET sends nothing to the server, so there is little point in having .click() do anything special here.
Clicking on a BUTTON TYPE=BUTTON doesn’t do anything either, also by design. This time, the reason is that that
BUTTON whose type is
As with any control, set the control’s readonly attribute to False:
form.find_control("foo").readonly = False # allow changing .value of control foo
form.set_all_readonly(False) # allow changing the .value of all controls
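Putting the two steps together, a hedged sketch: clear the readonly flag, then assign the value. It assumes a form object that behaves like a mechanize HTMLForm (find_control() plus item assignment); the helper name force_value is made up for illustration.

```python
def force_value(form, name, value):
    """Set form[name] to value even if the control is marked read-only.

    `form` is assumed to behave like a mechanize HTMLForm.
    """
    control = form.find_control(name)
    # A read-only control refuses assignment, so clear the flag first.
    control.readonly = False
    form[name] = value
```

For example, force_value(form, "foo", "bar") for a control named foo.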
I’m having trouble debugging my code.
See here for a few relevant tips.
I have a control containing a list of integers. How do I select the one whose value is nearest to the one I want?
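import bisect

def closest_int_value(form, ctrl_name, value):
    # Assumes the control's item names are integers in ascending order.
    values = map(int, [item.name for item in form.find_control(ctrl_name).items])
    # Pick the largest item value that does not exceed the requested value.
    return str(values[bisect.bisect(values, value) - 1])

form["distance"] = [closest_int_value(form, "distance", 23)]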
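The function above relies on the items being listed in ascending order and returns the largest value that does not exceed the target. If the genuinely nearest value is wanted regardless of order, something like the following sketch may do. nearest_int_name is a hypothetical helper, not part of mechanize, and it needs Python 2.5 or later for the key argument to min().

```python
def nearest_int_name(names, target):
    """Return the item name (a string) whose integer value is closest to target.

    `names` is a list of item names that all parse as integers.
    When two names are equally close, the one listed first wins.
    """
    if not names:
        raise ValueError("no item names to choose from")
    return min(names, key=lambda name: abs(int(name) - target))
```

With a form, the names would come from [item.name for item in form.find_control("distance").items], and the result would be assigned as form["distance"] = [nearest_int_name(names, 23)].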
I want to see what my web browser is doing, but standard network sniffers like wireshark or netcat (nc) don’t work for HTTPS. How do I sniff HTTPS traffic?
Three good options:
If you come across this in a page you want to automate, you have four options. Here they are, roughly in order of simplicity.
CookieJar instance, calling methods on
urlopen, etc. See above re forms.
Instead of using mechanize, automate a browser. For example use MS Internet Explorer via its COM automation interfaces, using the Python for Windows extensions, aka pywin32, aka win32all (e.g. simple function, pamie; pywin32 chapter from the O’Reilly book) or ctypes (example). This kind of thing may also come in useful on Windows for cases where the automation API is lacking. For Firefox, there is PyXPCOM.
Selenium: In-browser web functional testing. If you need to test websites against real browsers, this is a standard way to do it.
O’Reilly book: Spidering Hacks. Very Perl-oriented.
Similar functionality for IE6 and IE7: Internet Explorer Developer Toolbar (IE8 comes with something equivalent built-in, as does Google Chrome).
A HOWTO on web scraping from Dave Kuhlman.
Will any of this code make its way into the Python standard library?
The request / response processing extensions to urllib2 from mechanize have been merged into urllib2 for Python 2.4. The cookie processing has been added, as module cookielib. There are other features that would be appropriate additions to urllib2, but since Python 2 is heading into bugfix-only mode, and I’m not using Python 3, they’re unlikely to be added.
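As a rough illustration of the parts that did reach the standard library, the sketch below builds a cookie-aware opener using only stdlib modules. The function name build_cookie_opener is hypothetical, and the fallback imports are there only so the snippet also runs on Python 3, where the modules were renamed.

```python
try:
    # Python 2 names, as used in this FAQ
    import cookielib
    import urllib2
except ImportError:
    # Python 3 renamed the same modules
    import http.cookiejar as cookielib
    import urllib.request as urllib2


def build_cookie_opener():
    """Return (opener, jar): an opener that stores received cookies in jar."""
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    return opener, jar
```

Calling opener.open(url) then behaves like urlopen, except that cookies set by the server are kept in jar and sent back on later requests made through the same opener.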
Where can I find out about the relevant standards?
I prefer questions and comments to be sent to the mailing list rather than direct to me.
John J. Lee, October 2010.