mechanize — FAQ
-
Which version of Python do I need?
Python 2.4, 2.5, 2.6, or 2.7. Python 3 is not yet supported.
-
Does mechanize depend on BeautifulSoup?
No. mechanize offers a few classes that make use of BeautifulSoup, but these classes are not required to use mechanize. mechanize bundles BeautifulSoup version 2, so that module is no longer required. A future version of mechanize will support BeautifulSoup version 3, at which point mechanize will likely no longer bundle the module.
-
Does mechanize depend on ClientForm?
No, ClientForm is now part of mechanize.
-
Which license?
mechanize is dual-licensed: you may pick either the BSD license, or the ZPL 2.1 (both are included in the distribution).
Usage
-
I’m not getting the HTML page I expected to see.
-
Browser doesn’t have all of the forms/links I see in the HTML. Why not?
Perhaps the default parser can’t cope with invalid HTML. Try using the included BeautifulSoup 2 parser instead:
import mechanize
browser = mechanize.Browser(factory=mechanize.RobustFactory())
browser.open("http://example.com/")
# forms() is a method; iterate over it to print each parsed form
for form in browser.forms():
    print(form)
Alternatively, you can process the HTML (and headers) arbitrarily:
browser = mechanize.Browser()
browser.open("http://example.com/")
html = browser.response().get_data().replace("<br/>", "<br />")
response = mechanize.make_response(
    html, [("Content-Type", "text/html")],
    "http://example.com/", 200, "OK")
browser.set_response(response)
-
Is JavaScript supported?
No. See the answer to “JavaScript is messing up my web-scraping” under General.
-
My HTTP response data is truncated.
mechanize.Browser's response objects support the .seek() method, and can still be used after .close() has been called. Response data is not fetched until it is needed, so navigation away from a URL before fetching all of the response will truncate it. Call response.get_data() before navigation if you don’t want that to happen.
-
I’m sure this page is HTML, why does mechanize.Browser think otherwise?
b = mechanize.Browser(
    # mechanize's XHTML support needs work, so is currently switched off. If
    # we want to get our work done, we have to turn it on by supplying a
    # mechanize.Factory (with XHTML support turned on):
    factory=mechanize.DefaultFactory(i_want_broken_xhtml_support=True)
)
-
Why don’t timeouts work for me?
Timeouts are ignored with versions of Python earlier than 2.6. Timeouts do not apply to DNS lookups.
-
Is there any example code?
Look in the examples/ directory. Note that the examples on the forms page are executable as-is. Contributions of example code would be very welcome!
Cookies
-
Doesn’t the standard Python library module, Cookie, do this?
No: module Cookie does the server end of the job. It doesn’t know when to accept cookies from a server or when to send them back. Part of mechanize has been contributed back to the standard library as module cookielib (there are a few differences, notably that cookielib contains thread synchronization code; mechanize does not use cookielib).
-
Which HTTP cookie protocols does mechanize support?
Netscape and RFC 2965. RFC 2965 handling is switched off by default.
-
What about RFC 2109?
RFC 2109 cookies are currently parsed as Netscape cookies, and treated by default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled, or as Netscape cookies otherwise.
-
Why don’t I have any cookies?
See here.
-
My response claims to be empty, but I know it’s not!
Did you call response.read() (e.g., in a debug statement), then forget that all the data has already been read? In that case, you may want to use mechanize.response_seek_wrapper. mechanize.Browser always returns seekable responses, so it’s not necessary to use this explicitly in that case.
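The exhausted-read effect is easy to reproduce without a network connection. The sketch below uses io.BytesIO purely as a stand-in for a seekable response object (an assumption for illustration; it is not a mechanize class, and the body content is made up): a second read() returns nothing until the stream is rewound with seek(0).

```python
import io

# Stand-in for a seekable response body (hypothetical content, not mechanize).
response = io.BytesIO(b"<html><body>hello</body></html>")

first = response.read()    # consumes the whole body
second = response.read()   # nothing left, so the response looks "empty"

response.seek(0)           # rewind, as a seekable response allows
third = response.read()    # the full body is available again
```

With a real seekable response the same pattern should apply: rewind with seek(0) after a debugging read() and before handing the response to other code.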
-
What’s the difference between the .load() and .revert() methods of CookieJar?
.load() appends cookies from a file. .revert() discards all existing cookies held by the CookieJar first (but it won’t lose any existing cookies if the loading fails).
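The difference can be seen with the standard library's cookielib (http.cookiejar in Python 3), used here as a stand-in on the assumption that mechanize's file-based cookie jars behave the same way, since cookielib was derived from mechanize. The make_cookie and names helpers and the cookie values are hypothetical, introduced only for this sketch.

```python
import os
import tempfile
import time

try:
    from http.cookiejar import Cookie, LWPCookieJar  # Python 3
except ImportError:
    from cookielib import Cookie, LWPCookieJar       # Python 2


def make_cookie(name, value):
    # Hypothetical persistent cookie for example.com, expiring in an hour.
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="example.com", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=int(time.time()) + 3600, discard=False,
        comment=None, comment_url=None, rest={})


def names(jar):
    # Sorted cookie names currently held by the jar.
    return sorted(cookie.name for cookie in jar)


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Write a file containing a single cookie named "saved".
    on_disk = LWPCookieJar()
    on_disk.set_cookie(make_cookie("saved", "1"))
    on_disk.save(path)

    loaded = LWPCookieJar()
    loaded.set_cookie(make_cookie("in_memory", "2"))
    loaded.load(path)       # appends the file's cookies to the existing ones
    after_load = names(loaded)

    reverted = LWPCookieJar()
    reverted.set_cookie(make_cookie("in_memory", "2"))
    reverted.revert(path)   # discards the in-memory cookie, then loads the file
    after_revert = names(reverted)
finally:
    os.remove(path)
```

After .load() the jar holds both cookies; after .revert() only the cookie from the file remains.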
Is it threadsafe?
No. As far as I know, you can use mechanize in threaded code, but it provides no synchronisation: you have to provide that yourself.
-
How do I do <X>?
Refer to the API documentation in docstrings.
Forms
-
Doesn’t the standard Python library module, cgi, do this?
No: the cgi module does the server end of the job. It doesn’t know how to parse or fill in a form or how to send it back to the server.
-
How do I figure out what control names and values to use?
print form is usually all you need. In your code, things like the HTMLForm.items attribute of HTMLForm instances can be useful to inspect forms at runtime. Note that it’s possible to use item labels instead of item names, which can be useful: use the by_label arguments to the various methods, and the .get_value_by_label() / .set_value_by_label() methods on ListControl.
-
What do those '*' characters mean in the string representations of list controls?
A * next to an item means that item is selected.
-
What do those parentheses (round brackets) mean in the string representations of list controls?
Parentheses (foo) around an item mean that item is disabled.
-
Why doesn’t <some control> turn up in the data returned by .click*() when that control has non-None value?
Either the control is disabled, or it is not successful for some other reason. ‘Successful’ (see HTML 4 specification) means that the control will cause data to get sent to the server.
-
Why does mechanize not follow the HTML 4.0 / RFC 1866 standards for RADIO and multiple-selection SELECT controls?
Because by default, it follows browser behaviour when setting the initially-selected items in list controls that have no items explicitly selected in the HTML. Use the select_default argument to ParseResponse if you want to follow the RFC 1866 rules instead. Note that browser behaviour violates the HTML 4.01 specification in the case of RADIO controls.
-
Why does .click()ing on a button not work for me?
Clicking on a RESET button doesn’t do anything, by design - this is a library for web automation, not an interactive browser. Even in an interactive browser, clicking on RESET sends nothing to the server, so there is little point in having .click() do anything special here.
Clicking on a BUTTON TYPE=BUTTON doesn’t do anything either, also by design. This time, the reason is that that BUTTON is only in the HTML standard so that one can attach JavaScript callbacks to its events. Their execution may result in information getting sent back to the server. mechanize, however, knows nothing about these callbacks, so it can’t do anything useful with a click on a BUTTON whose type is BUTTON.
Generally, JavaScript may be messing things up in all kinds of ways. See the answer to the next question.
-
How do I change INPUT TYPE=HIDDEN field values (for example, to emulate the effect of JavaScript code)?
As with any control, set the control’s readonly attribute false.
form.find_control("foo").readonly = False # allow changing .value of control foo
form.set_all_readonly(False) # allow changing the .value of all controls
-
I’m having trouble debugging my code.
See here for a few relevant tips.
I have a control containing a list of integers. How do I select the one whose value is nearest to the one I want?
def closest_int_value(form, ctrl_name, value):
    # Item names are strings; compare them numerically and return the name
    # of the item whose integer value is nearest to the requested value.
    names = [item.name for item in form.find_control(ctrl_name).items]
    return min(names, key=lambda name: abs(int(name) - value))

form["distance"] = [closest_int_value(form, "distance", 23)]
General
-
I want to see what my web browser is doing, but standard network sniffers like wireshark or netcat (nc) don’t work for HTTPS. How do I sniff HTTPS traffic?
Three good options:
Mozilla plugin: LiveHTTPHeaders.
ieHTTPHeaders does the same for MSIE.
Use lynx -trace, and filter out the junk with a script.
-
JavaScript is messing up my web-scraping. What do I do?
JavaScript is used in web pages for many purposes — for example: creating content that was not present in the page at load time, submitting or filling in parts of forms in response to user actions, setting cookies, etc. mechanize does not provide any support for JavaScript.
If you come across this in a page you want to automate, you have four options. Here they are, roughly in order of simplicity.
Figure out what the JavaScript is doing and emulate it in your Python code: for example, by manually adding cookies to your CookieJar instance, calling methods on HTMLForms, calling urlopen, etc. See above re forms.
Use Java’s HtmlUnit or HttpUnit from Jython, since they know some JavaScript.
Automate a real browser instead of using mechanize. For example, use MS Internet Explorer via its COM automation interfaces, using the Python for Windows extensions, aka pywin32, aka win32all (e.g. simple function, pamie; pywin32 chapter from the O’Reilly book) or ctypes (example). This kind of thing may also come in useful on Windows for cases where the automation API is lacking. For Firefox, there is PyXPCOM.
Get ambitious and automatically delegate the work to an appropriate interpreter (Mozilla’s JavaScript interpreter, for instance). This is what HtmlUnit and httpunit do. I did a spike along these lines some years ago, but I think it would (still) be quite a lot of work to do well.
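As a rough illustration of the first option, the sketch below adds by hand a cookie that a page's script would otherwise have set, then checks that it is sent with a matching request. It uses the standard library's cookielib (http.cookiejar in Python 3) as a stand-in; that a mechanize CookieJar accepts the same set_cookie() call is an assumption here, and the cookie name, value and URL are hypothetical.

```python
try:
    from http.cookiejar import Cookie, CookieJar   # Python 3
    from urllib.request import Request
except ImportError:
    from cookielib import Cookie, CookieJar        # Python 2
    from urllib2 import Request

jar = CookieJar()

# Hypothetical session cookie that the page's JavaScript would have set.
jar.set_cookie(Cookie(
    version=0, name="js_token", value="abc123",
    port=None, port_specified=False,
    domain="example.com", domain_specified=False, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={}))

# Build a request for the same host and let the jar attach matching cookies.
request = Request("http://example.com/protected")
jar.add_cookie_header(request)
cookie_header = request.get_header("Cookie")
```

Working out which cookie the script sets, and with what value, still has to be done by reading the page's JavaScript or watching the headers a real browser sends.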
-
Misc links
The following libraries can be useful for dealing with bad HTML: lxml.html, html5lib, BeautifulSoup 3, mxTidy and mu-Tidylib.
Selenium: In-browser web functional testing. If you need to test websites against real browsers, this is a standard way to do it.
O’Reilly book: Spidering Hacks. Very Perl-oriented.
Standard extensions for web development with Firefox, which are also handy if you’re scraping the web: Web Developer (amongst other things, this can display HTML form information), Firebug.
Similar functionality for IE6 and IE7: Internet Explorer Developer Toolbar (IE8 comes with something equivalent built-in, as does Google Chrome).
A HOWTO on web scraping from Dave Kuhlman.
-
Will any of this code make its way into the Python standard library?
The request / response processing extensions to urllib2 from mechanize have been merged into urllib2 for Python 2.4. The cookie processing has been added, as module cookielib. There are other features that would be appropriate additions to urllib2, but since Python 2 is heading into bugfix-only mode, and I’m not using Python 3, they’re unlikely to be added.
-
Where can I find out about the relevant standards?
I prefer questions and comments to be sent to the mailing list rather than direct to me.
John J. Lee, October 2010.