mechanize — FAQ

  • Which version of Python do I need?

    Python 2.4, 2.5, 2.6, or 2.7. Python 3 is not yet supported.

  • Does mechanize depend on BeautifulSoup?

    No. mechanize offers a few classes that make use of BeautifulSoup, but these classes are not required to use mechanize. mechanize bundles BeautifulSoup version 2, so that module is no longer required. A future version of mechanize will support BeautifulSoup version 3, at which point mechanize will likely no longer bundle the module.

  • Does mechanize depend on ClientForm?

    No, ClientForm is now part of mechanize.

  • Which license?

    mechanize is dual-licensed: you may pick either the BSD license, or the ZPL 2.1 (both are included in the distribution).

Usage

  • I’m not getting the HTML page I expected to see.

    See the debugging tips.

  • Browser doesn’t have all of the forms/links I see in the HTML. Why not?

    Perhaps the default parser can’t cope with invalid HTML. Try using the included BeautifulSoup 2 parser instead:

import mechanize

browser = mechanize.Browser(factory=mechanize.RobustFactory())
browser.open("http://example.com/")
# forms() returns an iterable of the forms found on the page
for form in browser.forms():
    print form

    Alternatively, you can process the HTML (and headers) arbitrarily:

import mechanize

browser = mechanize.Browser()
browser.open("http://example.com/")
# Patch the markup before handing it back to the browser for parsing
html = browser.response().get_data().replace("<br/>", "<br />")
response = mechanize.make_response(
    html, [("Content-Type", "text/html")],
    "http://example.com/", 200, "OK")
browser.set_response(response)

  • Is JavaScript supported?

    No, sorry. See FAQs below.

  • My HTTP response data is truncated.

    mechanize.Browser's response objects support the .seek() method, and can still be used after .close() has been called. Response data is not fetched until it is needed, so navigation away from a URL before fetching all of the response will truncate it. Call response.get_data() before navigation if you don’t want that to happen.
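
    A minimal sketch of that advice, assuming mechanize is installed. The function name fetch_both and its two URL parameters are hypothetical, chosen only for illustration:

```python
def fetch_both(first_url, second_url):
    """Return the complete bodies of two pages fetched one after the other."""
    # Imported here so the sketch can be loaded even where mechanize is absent.
    import mechanize

    browser = mechanize.Browser()
    response = browser.open(first_url)
    # Read the whole body now, before navigating away from first_url.
    first_body = response.get_data()
    second_body = browser.open(second_url).get_data()
    return first_body, second_body
```

    Calling get_data() before the second open() is the important step; the rest is ordinary navigation.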

  • I’m sure this page is HTML, why does mechanize.Browser think otherwise?

b = mechanize.Browser(
    # mechanize's XHTML support needs work, so is currently switched off. If
    # we want to get our work done, we have to turn it on by supplying a
    # mechanize.Factory (with XHTML support turned on):
    factory=mechanize.DefaultFactory(i_want_broken_xhtml_support=True)
    )

  • Why don’t timeouts work for me?

    Timeouts are ignored with versions of Python earlier than 2.6. Timeouts do not apply to DNS lookups.

  • Is there any example code?

    Look in the examples/ directory. Note that the examples on the forms page are executable as-is. Contributions of example code would be very welcome!

Cookies

  • Doesn’t the standard Python library module, Cookie, do this?

    No: module Cookie does the server end of the job. It doesn’t know when to accept cookies from a server or when to send them back. Part of mechanize has been contributed back to the standard library as module cookielib (there are a few differences, notably that cookielib contains thread synchronization code; mechanize does not use cookielib).
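
    To illustrate the split between the two ends, here is a sketch that uses only the standard library. It is written with the Python 3 module names (http.cookies and http.cookiejar; under Python 2 these are Cookie and cookielib). FakeResponse is a hypothetical stand-in for a real HTTP response, and the cookie name and URLs are made up:

```python
from email.message import Message
from http.cookiejar import CookieJar
from http.cookies import SimpleCookie
from urllib.request import Request

# Server end: build the value of a Set-Cookie header.
server_cookie = SimpleCookie()
server_cookie["session"] = "abc123"
server_cookie["session"]["path"] = "/"
set_cookie_value = server_cookie["session"].OutputString()


class FakeResponse(object):
    """Hypothetical stand-in for a response; the jar only calls info()."""

    def __init__(self, headers):
        self._headers = headers

    def info(self):
        return self._headers


headers = Message()
headers["Set-Cookie"] = set_cookie_value

# Client end: the jar decides whether to accept the cookie...
jar = CookieJar()
jar.extract_cookies(FakeResponse(headers), Request("http://example.com/login"))

# ...and whether to send it back on a later request.
follow_up = Request("http://example.com/account")
jar.add_cookie_header(follow_up)
cookie_header = follow_up.get_header("Cookie")
```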

  • Which HTTP cookie protocols does mechanize support?

    Netscape and RFC 2965. RFC 2965 handling is switched off by default.

  • What about RFC 2109?

    RFC 2109 cookies are currently parsed as Netscape cookies, and treated by default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled, or as Netscape cookies otherwise.

  • Why don’t I have any cookies?

    See here.

  • My response claims to be empty, but I know it’s not!

    Did you call response.read() (e.g., in a debug statement), then forget that all the data has already been read? In that case, you may want to use mechanize.response_seek_wrapper. mechanize.Browser always returns seekable responses, so it’s not necessary to use this explicitly in that case.
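
    The effect is easy to reproduce with any file-like object. In this sketch io.BytesIO stands in for a seekable response (an assumption for illustration; it is not a mechanize class):

```python
import io

# Stand-in for a seekable HTTP response body.
response = io.BytesIO(b"<html>hello</html>")

first = response.read()    # reads everything
second = response.read()   # nothing left: looks like an empty response
response.seek(0)           # rewind to the start
third = response.read()    # the data is available again
```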

  • What’s the difference between the .load() and .revert() methods of CookieJar?

    .load() appends cookies from a file. .revert() discards all existing cookies held by the CookieJar first (but it won’t lose any existing cookies if the loading fails).
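
    A sketch of the difference, using the standard library's LWPCookieJar (http.cookiejar is the Python 3 name for cookielib). mechanize's own jar classes are assumed to behave the same way. make_cookie is a hypothetical helper, and the cookie names are made up:

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value, domain="example.com"):
    """Build a simple session cookie for the demonstration."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookies.lwp")

# Write a file that holds one cookie.
saved = LWPCookieJar(path)
saved.set_cookie(make_cookie("from_file", "1"))
saved.save(ignore_discard=True)

# A second jar already holds a different cookie in memory.
jar = LWPCookieJar(path)
jar.set_cookie(make_cookie("in_memory", "2"))

jar.load(ignore_discard=True)      # appends: both cookies are present
names_after_load = sorted(c.name for c in jar)

jar.revert(ignore_discard=True)    # clears first: only the file's cookie remains
names_after_revert = sorted(c.name for c in jar)
```

    ignore_discard=True is needed here only because the demonstration cookies are session cookies, which are otherwise skipped when saving and loading.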

  • Is it threadsafe?

    No. As far as I know, you can use mechanize in threaded code, but it provides no synchronisation: you have to provide that yourself.

  • How do I do <X>?

    Refer to the API documentation in docstrings.

Forms

  • Doesn’t the standard Python library module, cgi, do this?

    No: the cgi module does the server end of the job. It doesn’t know how to parse or fill in a form or how to send it back to the server.

  • How do I figure out what control names and values to use?

    print form is usually all you need. In your code, things like the HTMLForm.items attribute of HTMLForm instances can be useful to inspect forms at runtime. Note that it’s possible to use item labels instead of item names, which can be useful — use the by_label arguments to the various methods, and the .get_value_by_label() / .set_value_by_label() methods on ListControl.

  • What do those '*' characters mean in the string representations of list controls?

    A * next to an item means that item is selected.

  • What do those parentheses (round brackets) mean in the string representations of list controls?

    Parentheses (foo) around an item mean that item is disabled.

  • Why doesn’t <some control> turn up in the data returned by .click*() when that control has non-None value?

    Either the control is disabled, or it is not successful for some other reason. ‘Successful’ (see HTML 4 specification) means that the control will cause data to get sent to the server.

  • Why does mechanize not follow the HTML 4.0 / RFC 1866 standards for RADIO and multiple-selection SELECT controls?

    Because by default, it follows browser behaviour when setting the initially-selected items in list controls that have no items explicitly selected in the HTML. Use the select_default argument to ParseResponse if you want to follow the RFC 1866 rules instead. Note that browser behaviour violates the HTML 4.01 specification in the case of RADIO controls.

  • Why does .click()ing on a button not work for me?

    • Clicking on a RESET button doesn’t do anything, by design - this is a library for web automation, not an interactive browser. Even in an interactive browser, clicking on RESET sends nothing to the server, so there is little point in having .click() do anything special here.

    • Clicking on a BUTTON TYPE=BUTTON doesn’t do anything either, also by design. This time, the reason is that BUTTON TYPE=BUTTON is only in the HTML standard so that one can attach JavaScript callbacks to its events. Their execution may result in information getting sent back to the server. mechanize, however, knows nothing about these callbacks, so it can’t do anything useful with a click on a BUTTON whose type is BUTTON.

    • Generally, JavaScript may be messing things up in all kinds of ways. See the answer to the next question.

  • How do I change INPUT TYPE=HIDDEN field values (for example, to emulate the effect of JavaScript code)?

    As with any control, set the control’s readonly attribute to False.

form.find_control("foo").readonly = False  # allow changing .value of control foo
form.set_all_readonly(False)  # allow changing the .value of all controls

  • I’m having trouble debugging my code.

    See here for a few relevant tips.

  • I have a control containing a list of integers. How do I select the one whose value is nearest to the one I want?

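import bisect

def closest_int_value(form, ctrl_name, value):
    # Assumes the control's items are listed in ascending numeric order.
    values = map(int, [item.name for item in form.find_control(ctrl_name).items])
    # bisect() gives the insertion point, so this picks the largest item
    # that does not exceed value.
    return str(values[bisect.bisect(values, value) - 1])

form["distance"] = [closest_int_value(form, "distance", 23)]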
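
    The bisect-based helper picks the largest item that does not exceed the target, and it relies on the items being sorted. If you want the nearest value in either direction, a minimal alternative is sketched below. It works on a plain list of item names rather than a form, and nearest_int_name is a hypothetical name:

```python
def nearest_int_name(names, target):
    """Return the item name (a string) whose integer value is closest to target."""
    # On a tie, min() keeps the first candidate in the list.
    return min(names, key=lambda name: abs(int(name) - target))


nearest = nearest_int_name(["5", "10", "25", "50"], 23)
```

    With a real form, the list of names could be built from the control’s items in the same way as in the helper above.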

General

  • I want to see what my web browser is doing, but standard network sniffers like wireshark or netcat (nc) don’t work for HTTPS. How do I sniff HTTPS traffic?

    Three good options:

  • JavaScript is messing up my web-scraping. What do I do?

    JavaScript is used in web pages for many purposes — for example: creating content that was not present in the page at load time, submitting or filling in parts of forms in response to user actions, setting cookies, etc. mechanize does not provide any support for JavaScript.

    If you come across this in a page you want to automate, you have four options. Here they are, roughly in order of simplicity.

    • Figure out what the JavaScript is doing and emulate it in your Python code: for example, by manually adding cookies to your CookieJar instance, calling methods on HTMLForms, calling urlopen, etc. See above re forms.

    • Use Java’s HtmlUnit or HttpUnit from Jython, since they know some JavaScript.

    • Automate a browser instead of using mechanize. For example, use MS Internet Explorer via its COM automation interfaces, using the Python for Windows extensions, aka pywin32, aka win32all (e.g. simple function, pamie; pywin32 chapter from the O’Reilly book) or ctypes (example). This kind of thing may also come in useful on Windows for cases where the automation API is lacking. For Firefox, there is PyXPCOM.

    • Get ambitious and automatically delegate the work to an appropriate interpreter (Mozilla’s JavaScript interpreter, for instance). This is what HtmlUnit and HttpUnit do. I did a spike along these lines some years ago, but I think it would (still) be quite a lot of work to do well.
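
    The first option often amounts to setting a cookie that a script would otherwise have set. A rough sketch with the standard library's http.cookiejar (the Python 3 name for cookielib) follows. The cookie name js_enabled and the URL are made up, and mechanize's CookieJar is assumed to accept the same calls:

```python
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request

jar = CookieJar()
# Hypothetical cookie that the page's JavaScript would normally set.
jar.set_cookie(Cookie(
    version=0, name="js_enabled", value="1",
    port=None, port_specified=False,
    domain="example.com", domain_specified=False, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
))

# Check that the jar would send the cookie on a request to the same host.
request = Request("http://example.com/page")
jar.add_cookie_header(request)
sent = request.get_header("Cookie")
```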

  • Misc links

  • Will any of this code make its way into the Python standard library?

    The request / response processing extensions to urllib2 from mechanize have been merged into urllib2 for Python 2.4. The cookie processing has been added, as module cookielib. There are other features that would be appropriate additions to urllib2, but since Python 2 is heading into bugfix-only mode, and I’m not using Python 3, they’re unlikely to be added.

  • Where can I find out about the relevant standards?

I prefer questions and comments to be sent to the mailing list rather than direct to me.

John J. Lee, October 2010.