ClientForm is a Python module for handling HTML forms on the client
side, useful for parsing HTML forms, filling them in and returning the
completed forms to the server. It developed from a port of Gisle Aas'
Perl module HTML::Form, from the libwww-perl library, but the
interface is not the same.
Simple example:
    from urllib2 import urlopen
    from ClientForm import ParseResponse

    forms = ParseResponse(urlopen("http://www.acme.com/form.html"))
    form = forms[0]
    print form
    form["author"] = "Gisle Aas"

    # form.click returns a urllib2.Request object
    # (see HTMLForm.click.__doc__ if you don't have urllib2)
    response = urlopen(form.click("Thanks"))
A more complicated example:
    import ClientForm
    import urllib2

    request = urllib2.Request("http://www.acme.com/form.html")
    response = urllib2.urlopen(request)
    forms = ClientForm.ParseResponse(response)
    form = forms[0]
    print form  # very useful!

    # Indexing allows setting and retrieval of control values
    original_text = form["comments"]  # a string, NOT a Control instance
    form["comments"] = "Blah."

    print form.possible_values("cheeses")

    # Controls that represent lists (checkbox, select and radio lists) are
    # ListControls, and come in two flavours: single- and multiple-selection
    # lists.  Both can take a string as a value.
    form["cheeses"] = "cheddar"  # multi
    form["favorite_cheese"] = "brie"  # single
    # None is also acceptable
    form["cheeses"] = None
    # Multiple-selection lists can also take a sequence of strings
    form["cheeses"] = ["parmesan", "leicester", "cheddar"]

    # HTMLForm has some other useful methods
    form.toggle("cheeses", "gorgonzola")

    # Checkbox and radio items whose HTML has no value attribute (this is
    # often the case where a single checkbox makes up the whole checkbox
    # control) default to the value "on" (this isn't my ugly hack, it's the
    # browser manufacturers'), so to check such a control:
    form["deeppan"] = "on"  # ["on"] would also work for a checkbox
    # and to un-check
    form["deeppan"] = None  # [] would also work for a checkbox

    # The find_control method allows access to the contained Control objects
    # that represent the textareas, checkbox lists, etc, etc.  In the case of
    # lists, Controls may correspond to multiple HTML elements.
    control = form.find_control(name="cheeses")
    print control.value  # equivalent to form["cheeses"]

    # The type and nr arguments to find_control also allow more precision than
    # indexing.
    control = form.find_control(name="cheeses", type="select", nr=1)
    print control.name, control.type
    assert control.multiple
    # All Controls may be disabled (equivalent of greyed-out in browser)
    assert not control.disabled
    # TextControls may be readonly
    assert not form.find_control("comments").readonly

    # Controls also have methods on them that are useful for doing more
    # obscure things -- these two are equivalent:
    assert control.type == "select", \
        "only SelectControl has toggle_by_label method"
    control.toggle("gorgonzola")
    control.toggle_by_label(["NEW! Special Offer on Gorgonzola"])

    request2 = form.click("Submit")  # urllib2.Request object
    response2 = urllib2.urlopen(request2)

    print response2.geturl()
    print response2.info()  # headers
    for line in response2.readlines():  # body
        print line
All of the standard control types are supported: TEXT, PASSWORD,
HIDDEN, TEXTAREA, ISINDEX, RESET, BUTTON, SUBMIT, IMAGE, RADIO,
CHECKBOX, SELECT/OPTION. FILE (for file upload) is not supported in
the 0.0.x version.
The module is designed for testing and automation of web interfaces, not for implementing interactive user agents.
Security note: Remember that any passwords you store in HTMLForm
instances will be saved to disk in the clear if you pickle them
(directly or indirectly). The simplest solution to this is to avoid
pickling HTMLForm objects.
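The point about pickling can be seen with any Python object, not just forms. The following is a minimal sketch using only the standard library (written for a current Python 3, not the 1.5.2/2.x interpreters this module targets); FakeForm is a made-up stand-in and is not part of ClientForm.

```python
import pickle


class FakeForm:
    """Hypothetical stand-in for a form object that holds control values."""

    def __init__(self):
        self.values = {}


form = FakeForm()
form.values["password"] = "s3cret-hunter2"

# Pickling serialises the string as-is; nothing is encrypted or obscured.
data = pickle.dumps(form)
password_visible = b"s3cret-hunter2" in data
print(password_visible)
```

Anyone who can read the pickle file can read the password, which is why not pickling the form at all is the simplest fix.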
Python 1.5.2 or above is required. To run the tests, you need the
unittest module (from PyUnit). unittest is a standard library module
with Python 2.1 and above.
For full documentation, see the docstrings in ClientForm.py.
Note: this page describes the 0.0.x interface. See here for the old 0.0.x interface.
For installation instructions, see the INSTALL file included in the distribution.
Stable release.
Old release.
cgi, do this?
No: the cgi module does the server end of the job. It doesn't know how
to fill in a form or how to send it back to the server.
1.5.2 or above.
urllib2 required?
No.
urllib2?
Use the click and items methods. Pass a true items argument to the
click method. Don't use make_request.
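Without urllib2, encoding the form data and sending it is up to your own code. Here is a rough sketch of the encoding step, assuming the data arrives as a list of (name, value) pairs. The pairs list is invented for illustration, and the example uses the current Python 3 standard library, not ClientForm itself.

```python
from urllib.parse import urlencode

# Hypothetical data of the kind a filled-in form might yield.
pairs = [("author", "Gisle Aas"), ("cheeses", "cheddar"), ("cheeses", "brie")]

# Encode as application/x-www-form-urlencoded, the default form encoding.
# A sequence of pairs (rather than a dict) keeps repeated names intact.
body = urlencode(pairs)
print(body)
```

The resulting string is what would go in the query part of the URL for a GET form, or in the request body for a POST form.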
urllib2 do I need?
You don't. It's convenient, though. If you have Python 2.0, you need
to upgrade to the version from Python 2.1 (or use the 1.5.2-compatible
version). If you have Python 1.5.2, use this urllib2 and urllib.
Otherwise, you're OK.
The BSD license (included in distribution).
print form is usually all you need (version 0.0.9 provides a more
useful & informative HTMLForm.__str__ method than previous releases).
HTMLForm.possible_values can be useful.
Note that SelectControl supports use of item labels (which default to
OPTION element contents) as well as item names, which can be useful.
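To make the name/label distinction concrete, here is a small sketch that pulls both out of some HTML using the standard library parser in a current Python 3. OptionCollector is a made-up helper, not ClientForm's parser, and it assumes every OPTION element is explicitly closed.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect (name, label) pairs from OPTION elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.options = []   # finished (name, label) pairs
        self._value = None  # value attribute of the open OPTION, if any
        self._text = None   # text collected inside the open OPTION

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._value = dict(attrs).get("value")
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._text is not None:
            label = "".join(self._text).strip()
            # With no value attribute, the element contents double as the name.
            name = self._value if self._value is not None else label
            self.options.append((name, label))
            self._text = None


html = """
<select name="cheeses" multiple>
  <option value="gorgonzola">NEW! Special Offer on Gorgonzola</option>
  <option>cheddar</option>
</select>
"""
collector = OptionCollector()
collector.feed(html)
print(collector.options)
```

The first item has the name "gorgonzola" but a much longer label; the second has the same string for both.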
'*' characters mean in the string representations of list controls?
A * next to an item means that item is selected. Note that this is
different from LWP's HTML::Form.
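As an illustration of the convention only, the sketch below marks selected items with a leading '*'. The function name and the exact layout of the string are assumptions made for this example; they are not copied from ClientForm's own __str__ output.

```python
def describe_list_control(name, items, selected):
    """Render a list control, prefixing each selected item with '*'."""
    shown = [("*" + item) if item in selected else item for item in items]
    return "%s=[%s]" % (name, ", ".join(shown))


text = describe_list_control("cheeses", ["cheddar", "brie", "parmesan"], {"brie"})
print(text)
```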
.items() when it is listed in .value?
Either the control is disabled, or it is not successful for some other
reason. 'Successful' (see HTML 4 specification) means that the control
will cause data to get sent to the server. You probably want to use the
click method instead, anyway: if you don't, no SUBMIT or IMAGE controls
will be successful.
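The idea of a successful control can be sketched roughly as follows. This is a simplified reading of the HTML 4 rules, not ClientForm's implementation: the dict layout and the function name are invented here, and a real IMAGE control would send click coordinates, not a plain value.

```python
def successful_pairs(controls, clicked=None):
    """Return (name, value) pairs for the controls that would be submitted.

    controls is a list of dicts with keys name, type, value and (optionally)
    disabled. clicked is the name of the activated SUBMIT/IMAGE control.
    """
    pairs = []
    for c in controls:
        if c.get("disabled") or not c.get("name"):
            continue  # disabled or unnamed controls are never successful
        if c["type"] in ("reset", "button"):
            continue  # these never send data
        if c["type"] in ("submit", "image") and c["name"] != clicked:
            continue  # only the button actually clicked is successful
        if c.get("value") is None:
            continue  # e.g. an unchecked checkbox
        pairs.append((c["name"], c["value"]))
    return pairs


controls = [
    {"name": "comments", "type": "textarea", "value": "Blah."},
    {"name": "secret", "type": "text", "value": "x", "disabled": True},
    {"name": "deeppan", "type": "checkbox", "value": None},
    {"name": "Thanks", "type": "submit", "value": "Thanks"},
]
without_click = successful_pairs(controls)
with_click = successful_pairs(controls, clicked="Thanks")
print(without_click)
print(with_click)
```

Without a click, the submit button contributes nothing, which matches the behaviour described above.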
RADIO and multiple-selection SELECT controls?
Because by default, it follows browser behaviour when setting the initially-selected items in list controls that have no items explicitly selected in the HTML. Use the select_default argument to ParseResponse if you want to follow the RFC 1866 rules instead. Note that browser behaviour violates the HTML 4.01 specification in the case of RADIO controls.
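The difference between the two behaviours can be sketched like this, assuming that the RFC 1866 rule amounts to "select the first item when the HTML selects none" and that browsers leave such controls with nothing selected. The function is hypothetical and only borrows the select_default name from the ParseResponse argument.

```python
def initial_selection(items, preselected, select_default=False):
    """Initial selection for a RADIO or multiple-selection SELECT control."""
    if preselected:
        return list(preselected)  # the HTML said what to select
    if select_default and items:
        return [items[0]]         # RFC 1866 style: fall back to the first item
    return []                     # browser style: leave everything unselected


browser_style = initial_selection(["cheddar", "brie"], [])
rfc_style = initial_selection(["cheddar", "brie"], [], select_default=True)
print(browser_style, rfc_style)
```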
Clicking on a RESET button doesn't do anything, by design: this is a
library for web automation, not an interactive browser. Even in an
interactive browser, clicking on RESET sends nothing to the server, so
there is little point in having click() do anything special here.
Clicking on a BUTTON TYPE=BUTTON doesn't do anything either, also by
design. This time, the reason is that BUTTON is only in the HTML
standard so that one can attach callbacks to its events. The callbacks
are functions in SCRIPT elements (such as JavaScript) embedded in the
HTML, and their execution may result in information getting sent back
to the server. ClientForm, however, knows nothing about these
callbacks, so it can't do anything useful with a click on a BUTTON
whose type is BUTTON.
The simplest solution to this is to figure out what the script does, and emulate it in your Python code. An alternative is to dump ClientForm and automate a browser (e.g. use MS Internet Explorer via its COM automation interfaces, using the Python for Windows extensions). Eventually, ClientForm may be able to make use of a JavaScript interpreter to automatically interpret embedded code. That would solve the simpler JavaScript issues, but is pure vapourware at the moment.
The ClientCookie package makes it easy to get seek()able response
objects, which is convenient for debugging. The debugging section of
the module docstring for ClientCookie also contains a few relevant
tips.
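To show why a seek()able response helps when debugging, here is a minimal standard-library sketch for a current Python 3. It is not how ClientCookie does it; make_seekable and OneShotResponse are made-up names for this example.

```python
import io


def make_seekable(response):
    """Read a file-like response fully and return a seekable in-memory copy."""
    return io.BytesIO(response.read())


class OneShotResponse:
    """Hypothetical response whose body can be read only once."""

    def __init__(self, body):
        self._body = body

    def read(self):
        body, self._body = self._body, b""
        return body


seekable = make_seekable(OneShotResponse(b"<form action='/go'></form>"))
first = seekable.read()   # look at the page while debugging
seekable.seek(0)          # rewind so the parser can read it again
second = seekable.read()
print(first == second)
```

With a plain response, the first read would leave nothing for the form parser; rewinding avoids fetching the page twice.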
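    import bisect

    def closest_int_value(form, ctrl_name, value):
        # Assumes possible_values() lists the integers in ascending order;
        # picks the largest listed value that does not exceed `value`.
        values = map(int, form.find_control(ctrl_name).possible_values())
        return str(values[bisect.bisect(values, value) - 1])

    form["distance"] = closest_int_value(form, "distance", 23)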
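The bisect recipe above relies on the values being sorted, and it rounds down, not to the nearest value. If you want the numerically nearest item whatever the order, something like the following sketch may do. nearest_int_value is a made-up helper that takes the list of possible values directly, so it runs without ClientForm (shown for a current Python 3).

```python
def nearest_int_value(possible_values, target):
    """Return the string in possible_values whose integer value is nearest target.

    On a tie, the value that appears first in the list wins.
    """
    return min(possible_values, key=lambda v: abs(int(v) - target))


choice = nearest_int_value(["5", "10", "25", "50"], 23)
print(choice)
```

With ClientForm, the list would presumably come from form.possible_values("distance"), and the result could be assigned back with form["distance"] = choice.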
John J. Lee, December 2003.