- Is there any example code?
Look in the examples directory of mechanize.
Note that the examples on the ClientForm page
are executable as-is. Contributions of example code would be very welcome!
- HTTPS on Windows?
_socket.pyd, or use Python 2.3.
- I want to see what my web browser is doing, but standard network sniffers
like ethereal or netcat (nc) don't
work for HTTPS. How do I sniff HTTPS traffic?
Three good options:
I'm told you can also use a proxy like proxomitron (never tried it
myself). There's also a commercial MSIE
- Embedded script is messing up my web-scraping. What do I do?
It is possible to embed script in HTML pages (sandwiched between
<SCRIPT>here</SCRIPT> tags, and in
even Python. These scripts can do all sorts of things, including causing
cookies to be set in a browser, submitting or filling in parts of forms in
response to user actions, changing link colours as the mouse moves over a
If you come across this in a page you want to automate, you
have four options. Here they are, roughly in order of simplicity.
- Simply figure out what the embedded script is doing and emulate it
in your Python code: for example, by manually adding cookies to your
CookieJar instance, calling methods on
- Dump mechanize and ClientForm and automate a browser instead.
For example, use MS Internet Explorer via its COM automation interfaces, using
the Python for Windows extensions, aka pywin32, aka win32all (e.g. the
pywin32 chapter from the O'Reilly book) or
example: may be out of date, since
ctypes' COM support is
kind of thing may also come in useful on Windows for cases where the
automation API is lacking.
is a binding to the
epiphany web browser, allowing both plugins and automation code to be
written in Python.
XXX Mozilla automation & XPCOM / PyXPCOM, Konqueror & DCOP / KParts / PyKDE).
- Use Java's httpunit from
- Get ambitious and automatically delegate the work to an appropriate
approach is the one taken by DOMForm (the
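As an illustration of the first option (emulating the script by hand), here is a minimal sketch of adding a cookie to a CookieJar yourself, as if a script on the page had set it. It assumes the standard library's http.cookiejar module (the Python 3 name of cookielib) rather than mechanize's own classes; the cookie name, value, and host are hypothetical.

```python
from http.cookiejar import Cookie, CookieJar
import urllib.request


def make_cookie(name, value, domain, path="/"):
    """Build a session cookie like one a script might set via document.cookie."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = CookieJar()
# Hypothetical cookie: pretend the page's script sets "seen_intro=1".
jar.set_cookie(make_cookie("seen_intro", "1", "www.example.com"))

# An opener built this way would send the cookie on matching requests.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Check which Cookie header would be sent, without touching the network.
request = urllib.request.Request("http://www.example.com/form")
jar.add_cookie_header(request)
print(request.get_header("Cookie"))
```

The same idea should carry over to whichever CookieJar class your code already uses; only the import would change.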
- Misc links
Beautiful Soup is a widely recommended HTML-parsing module.
contains useful stuff like persistent connections, mirroring and
throttling, and it looks like most or all of it is well-integrated with
urllib2 (originally part of the yum package manager, but
now becoming a separate project).
- Another Java thing: maxq,
which provides a proxy to aid automatic generation of functional tests
written in Jython using the standard library unittest module (PyUnit)
and the "Jakarta Commons" HttpClient library.
- A useful set of Zope-oriented links on tools for testing
- O'Reilly book: Spidering Hacks. Very Perl-oriented.
which, amongst other things, can display HTML form information and
HTML table structure (thanks to Erno Kuusela for this link).
Selenium: In-browser web
source functional testing tools. A nice list.
A HOWTO on web scraping from Dave Kuhlman.
- Will any of this code make its way into the Python standard library?
The request / response processing extensions to
mechanize have been merged into
urllib2 for Python 2.4.
The cookie processing has been added, as module
Eventually, I'll submit patches to get the http-equiv, refresh, and
robots.txt code in there too, and maybe
too (but not
mechanize.Browser). The rest, probably
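To give a feel for the request / response processing mentioned above, here is a minimal sketch of a handler that pre-processes outgoing requests. It assumes the Python 3 spelling of the module (urllib.request, formerly urllib2); the RefererProcessor class and the URLs are hypothetical, not part of mechanize or the standard library.

```python
import urllib.request


class RefererProcessor(urllib.request.BaseHandler):
    """Hypothetical processor: add a fixed Referer header to outgoing requests."""

    def __init__(self, referer):
        self.referer = referer

    def http_request(self, request):
        # Called by the opener before an http request is sent.
        if not request.has_header("Referer"):
            request.add_unredirected_header("Referer", self.referer)
        return request

    # Apply the same pre-processing to https requests.
    https_request = http_request


processor = RefererProcessor("http://www.example.com/start")
opener = urllib.request.build_opener(processor)

# Run the processing step by hand, without touching the network.
request = urllib.request.Request("http://www.example.com/next")
processed = processor.http_request(request)
print(processed.get_header("Referer"))
```

A handler can likewise define http_response(request, response) to post-process responses before they are returned to the caller.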
I prefer questions and comments to be sent to the
mailing list rather than direct to me.
John J. Lee,