[Lxml-checkins] r46803 - lxml/trunk/doc
ianb at codespeak.net
ianb at codespeak.net
Fri Sep 21 19:33:50 CEST 2007
Author: ianb
Date: Fri Sep 21 19:33:49 2007
New Revision: 46803
Modified:
lxml/trunk/doc/lxmlhtml.txt
Log:
Fixed all the parsing references. Fix builder import. Added section on parsing. Added docs on forms.
Modified: lxml/trunk/doc/lxmlhtml.txt
==============================================================================
--- lxml/trunk/doc/lxmlhtml.txt (original)
+++ lxml/trunk/doc/lxmlhtml.txt Fri Sep 21 19:33:49 2007
@@ -29,6 +29,35 @@
Parsing HTML fragments
----------------------
+There are several functions available to parse HTML:
+
+``parse(filename_url_or_file)``:
+ Parses the named file or url, or if the object has a ``.read()``
+ method, parses from that.
+
+ If you give a URL, or if the object has a ``.geturl()`` method (as
+ file-like objects from ``urllib.urlopen()`` have), then that URL
+ is used as the base URL.
+
+``document_fromstring(string)``:
+ Parses a document from the given string. This always creates a
+ correct HTML document, which means the parent node is ``<html>``,
+ and there is a body and possibly a head.
+
+``fragment_fromstring(string, create_parent=False)``:
+ Returns an HTML fragment from a string. The fragment must contain
+ just a single element, unless ``create_parent`` is given;
+ e.g,. ``fragment_fromstring(string, create_parent='div')`` will
+ wrap the element in a ``<div>``.
+
+``fragments_fromstring(string)``:
+ Returns a list of the elements found in the fragment.
+
+``fromstring(string)``:
+ Returns ``document_fromstring`` or ``fragment_fromstring``, based
+ on whether the string looks like a full document, or just a
+ fragment.
+
HTML Element Methods
====================
@@ -71,6 +100,16 @@
selector expression. (Note that ``.xpath(expr)`` is also
available as on all lxml elements.)
+``.label``:
+ Returns the corresponding ``<label>`` element for this element, if
+ any exists (None if there is none). Label elements have a
+ ``label.for_element`` attribute that points back to the element.
+
+``.base_url``:
+ The base URL for this element, if one was saved from the parsing.
+ This attribute is not settable. Is None when no base URL was
+ saved.
+
Running HTML doctests
=====================
@@ -94,7 +133,7 @@
document in a doctest, you can do the following::
>>> import lxml.html
- >>> html = lxml.html.HTML('''\
+ >>> html = lxml.html.fromstring('''\
... <html><body onload="" color="white">
... <p>Hi !</p>
... </body></html>
@@ -133,18 +172,18 @@
originally written by Fredrik Lundh. This allows you to quickly generate HTML
pages and fragments::
- >>> from lxml.html import builder as h
-
- >>> html = h.HTML(
- ... h.HEAD(
- ... h.LINK(rel="stylesheet", href="great.css", type="text/css"),
- ... h.TITLE("Best Page Ever")
+ >>> from lxml.html import builder as E
+ >>> from lxml.html import usedoctest
+ >>> html = E.HTML(
+ ... E.HEAD(
+ ... E.LINK(rel="stylesheet", href="great.css", type="text/css"),
+ ... E.TITLE("Best Page Ever")
... ),
- ... h.BODY(
- ... h.H1(h.CLASS("heading"), "Top News"),
- ... h.P("World News only on this page", style="font-size: 200%"),
+ ... E.BODY(
+ ... E.H1(E.CLASS("heading"), "Top News"),
+ ... E.P("World News only on this page", style="font-size: 200%"),
... "Ah, and here's some more text, by the way.",
- ... lxml.html.HTML("<p>... and this is a parsed fragment ...</p>")
+ ... lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
... )
... )
@@ -169,6 +208,13 @@
serialized as ``<script src="..." />``, which completely confuses
browsers.
+Viewing your HTML
+-----------------
+
+A handy method for viewing your HTML:
+``lxml.html.open_in_browser(lxml_doc)`` will write the document to
+disk and open it in a browser (with the `webbrowser module
+<http://python.org/doc/current/lib/module-webbrowser.html>`_).
Working with links
==================
@@ -220,7 +266,6 @@
If you want access to the context of the link, you should use
``.iterlinks()`` instead.
-
Functions
---------
@@ -236,6 +281,149 @@
(except for ``iterlinks()``), the method performed, and the new document
returned.
+Forms
+=====
+
+Any ``<form>`` elements in a document are available through
+the list ``doc.forms`` (e.g., ``doc.forms[0]``). Form, input, select,
+and textarea elements each have special methods.
+
+Input elements (including ``<select>`` and ``<textarea>``) have these
+attributes:
+
+``.name``:
+ The name of the element.
+
+``.value``:
+ The value of an input, the content of a textarea, the selected
+ option(s) of a select. This attribute can be set.
+
+ In the case of a select that takes multiple options (``<select
+ multiple>``) this will be a set of the selected options; you can
+ add or remove items to select and unselect the options.
+
+Select attributes:
+
+``.value_options``:
+ For select elements, this is all the *possible* values (the values
+ of all the options).
+
+``.multiple``:
+ For select elements, true if this is a ``<select multiple>``
+ element.
+
+Input attributes:
+
+``.type``:
+ The type attribute in ``<input>`` elements.
+
+``.checkable``:
+ True if this can be checked (i.e., true for type=radio and
+ type=checkbox).
+
+``.checked``:
+ If this element is checkable, the checked state. Raises
+ AttributeError on non-checkable inputs.
+
+The form itself has these attributes:
+
+``.inputs``:
+ A dictionary-like object that can be used to access input elements
+ by name. When there are multiple input elements with the same
+ name, this returns list-like structures that can also be used to
+ access the options and their values as a group.
+
+``.fields``:
+ A dictionary-like object used to access *values* by their name.
+ ``form.inputs`` returns elements, this only returns values.
+ Setting values in this dictionary will effect the form inputs.
+ Basically ``form.fields[x]`` is equivalent to
+ ``form.inputs[x].value`` and ``form.fields[x] = y`` is equivalent
+ to ``form.inputs[x].value = y``. (Note that sometimes
+ ``form.inputs[x]`` returns a compound object, but these objects
+ also have ``.value`` attributes.)
+
+ If you set this attribute, it is equivalent to
+ ``form.fields.clear(); form.fields.update(new_value)``
+
+``.form_values()``:
+ Returns a list of ``[(name, value), ...]``, suitable to be passed
+ to ``urllib.urlencode()`` for form submission.
+
+``.action``:
+ The ``action`` attribute. This is resolved to an absolute URL if
+ possible.
+
+``.method``:
+ The ``method`` attribute, which defaults to ``GET``.
+
+Form Filling Example
+--------------------
+
+Note that you can change any of these attributes (values, method,
+action, etc) and then serialize the form to see the updated values.
+You can, for instance, do::
+
+ >>> from lxml.html import fromstring, tostring
+ >>> form_page = fromstring('''<html><body><form>
+ ... Your name: <input type="text" name="name"> <br>
+ ... Your phone: <input type="text" name="phone"> <br>
+ ... Your favorite pets: <br>
+ ... Dogs: <input type="checkbox" name="interest" value="dogs"> <br>
+ ... Cats: <input type="checkbox" name="interest" value="cats"> <br>
+ ... Llamas: <input type="checkbox" name="interest" value="llamas"> <br>
+ ... <input type="submit"></form></body></html>''')
+ >>> form = form_page.forms[0]
+ >>> form.fields = dict(
+ ... name='John Smith',
+ ... phone='555-555-3949',
+ ... interest=set(['cats', 'llamas']))
+ >>> print tostring(form)
+ <html>
+ <body>
+ <form>
+ Your name:
+ <input name="name" type="text" value="John Smith">
+ <br>Your phone:
+ <input name="phone" type="text" value="555-555-3949">
+ <br>Your favorite pets:
+ <br>Dogs:
+ <input name="interest" type="checkbox" value="dogs">
+ <br>Cats:
+ <input checked name="interest" type="checkbox" value="cats">
+ <br>Llamas:
+ <input checked name="interest" type="checkbox" value="llamas">
+ <br>
+ <input type="submit">
+ </form>
+ </body>
+ </html>
+
+
+Form Submission
+---------------
+
+You can submit a form with ``lxml.html.submit_form(form_element)``.
+This will return a file-like object (the result of
+``urllib.urlopen()``).
+
+If you have extra input values you want to pass you can use the
+keyword argument ``extra_values``, like ``extra_values={'submit':
+'Yes!'}``. This is the only way to get submit values into the form,
+as there is no state of "submitted" for these elements.
+
+You can pass in an alternate opener with the ``open_http`` keyword
+argument, which is a function with the signature ``open_http(method,
+url, values)``.
+
+Example::
+
+ >>> from lxml.html import parse, submit_form
+ >>> page = parse('http://tinyurl.com').getroot()
+ >>> page.forms[1].fields['url'] = 'http://codespeak.net/lxml/'
+ >>> result = parse(submit_form(page.forms[1])).getroot()
+ >>> [a.attrib['href'] for a in result.xpath("//a[@target='_blank']")]
+ ['http://tinyurl.com/2xae8s', 'http://preview.tinyurl.com/2xae8s']
Cleaning up HTML
================
@@ -414,10 +602,10 @@
First we get the page::
>>> import urllib
- >>> from lxml.html import HTML
+ >>> from lxml.html import fromstring
>>> url = 'http://microformats.org/'
>>> content = urllib.urlopen(url).read()
- >>> doc = HTML(content)
+ >>> doc = fromstring(content)
>>> doc.make_links_absolute(url)
Then we create some objects to put the information in::
More information about the lxml-checkins
mailing list