[Lxml-checkins] r44764 - lxml/branch/html/doc
ianb at codespeak.net
Fri Jul 6 02:03:51 CEST 2007
Author: ianb
Date: Fri Jul 6 02:03:49 2007
New Revision: 44764
Modified:
lxml/branch/html/doc/cssselect.txt
lxml/branch/html/doc/lxmlhtml.txt
Log:
added some docs
Modified: lxml/branch/html/doc/cssselect.txt
==============================================================================
--- lxml/branch/html/doc/cssselect.txt (original)
+++ lxml/branch/html/doc/cssselect.txt Fri Jul 6 02:03:49 2007
@@ -15,7 +15,6 @@
..
1 Finding nodes
-
The CSSSelector class
=====================
@@ -23,4 +22,70 @@
provides the same interface as the XPath_ class, but accepts a CSS selector
expression as input::
- >>>
\ No newline at end of file
+ >>> sel = CSSSelector('div.content')
+ >>> sel
+ <CSSSelector ... for 'div.content'>
+
+The selector actually compiles to XPath, and you can see the
+expression by inspecting the object::
+
+ >>> sel.path
+ "descendant-or-self::div[contains(concat(' ', normalize-space(@class), ' '), ' content ')]"
+
+To use the selector, simply call it with a document or element
+object::
+
+ >>> from lxml.etree import HTML
+ >>> h = HTML('''<div id="outer">
+ ... <div id="inner" class="content body">
+ ... text
+ ... </div></div>''')
+ >>> [e.get('id') for e in sel(h)]
+ ['inner']
+
+CSS Selectors
+=============
+
+This library attempts to implement CSS selectors `as described in
+the w3c specification
+<http://www.w3.org/TR/2001/CR-css3-selectors-20011113/>`_. Many of
+the pseudo-classes do not apply in this context, including all
+`dynamic pseudo-classes
+<http://www.w3.org/TR/2001/CR-css3-selectors-20011113/#dynamic-pseudos>`_.
+In particular these will not be available:
+
+* link state: ``:link``, ``:visited``, ``:target``
+* actions: ``:hover``, ``:active``, ``:focus``
+* UI states: ``:enabled``, ``:disabled``, ``:indeterminate``
+ (``:checked`` and ``:unchecked`` *are* available)
+
+Also, none of the pseudo-elements apply, because the selector only
+returns elements and pseudo-elements select portions of text, like
+``::first-line``.
+
+Namespaces
+----------
+
+In CSS you can use ``namespace-prefix|element``, similar to
+``namespace-prefix:element`` in an XPath expression. In fact, it maps
+one-to-one, and the same rules are used to map namespace prefixes to
+namespace URIs.
+
+Limitations
+===========
+
+These applicable pseudo-classes are not yet implemented:
+
+* ``:lang(language)``
+* ``:root``
+* ``*:first-of-type``, ``*:last-of-type``, ``*:nth-of-type``,
+ ``*:nth-last-of-type``, ``*:only-of-type``. All of these work when
+ you specify an element type, but not with ``*``
+
+Unlike XPath you cannot provide parameters in your expressions -- all
+expressions are completely static.
+
+XPath has underspecified string quoting rules (there seems to be no
+string quoting at all), so if you use expressions that contain
+characters that require quoting you might have problems with the
+translation from CSS to XPath.
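
The XPath shown above for ``div.content`` pads the ``class`` attribute with spaces so that only whole class names match. The following is a minimal standard-library sketch of that idea; ``class_xpath`` and ``has_class`` are hypothetical helper names used for illustration and are not part of lxml.

```python
def class_xpath(tag, css_class):
    # Hypothetical helper: builds the same expression the docs show
    # for a ``tag.class`` selector. Not lxml's actual translator.
    return ("descendant-or-self::%s[contains(concat(' ', "
            "normalize-space(@class), ' '), ' %s ')]" % (tag, css_class))


def has_class(class_attr, css_class):
    # Plain-Python analogue of the XPath predicate: collapse whitespace
    # (roughly what normalize-space() does), pad with spaces, then look
    # for the class name surrounded by spaces.
    normalized = " ".join(class_attr.split())
    return (" %s " % css_class) in (" %s " % normalized)


print(class_xpath("div", "content"))
print(has_class("content body", "content"))  # whole token: matches
print(has_class("contents", "content"))      # substring only: no match
```

The padding is what keeps ``content`` from matching an element whose class is ``contents``.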
Modified: lxml/branch/html/doc/lxmlhtml.txt
==============================================================================
--- lxml/branch/html/doc/lxmlhtml.txt (original)
+++ lxml/branch/html/doc/lxmlhtml.txt Fri Jul 6 02:03:49 2007
@@ -28,16 +28,17 @@
One of the interesting modules in the ``lxml.html`` package deals with
doctests. It can be hard to compare two HTML pages for equality, as
-whitespace sequences need to be ignored and the structural formatting can
-differ. This is even more a problem in doctests, where output is tested for
-equality and small differences in whitespace or the order of attributes can
-let a test fail. And given the verbosity of tag-based languages, it may take
-more than a quick look to find the actual differences in the doctest output.
-
-Luckily, lxml provides the ``lxml.doctestcompare`` module that supports
-relaxed comparison of XML and HTML pages and provides a readable diff in the
-output when a test fails. It is most easily used by importing the
-``usedoctest`` module in a doctest::
+whitespace differences aren't meaningful and the structural formatting
+can differ. This is even more of a problem in doctests, where output
+is tested for equality and small differences in whitespace or the
+order of attributes can make a test fail. And given the verbosity of
+tag-based languages, it may take more than a quick look to find the
+actual differences in the doctest output.
+
+Luckily, lxml provides the ``lxml.doctestcompare`` module that
+supports relaxed comparison of XML and HTML pages and provides a
+readable diff in the output when a test fails. The HTML comparison is
+most easily used by importing the ``usedoctest`` module in a doctest::
>>> from lxml.html import usedoctest
@@ -70,8 +71,9 @@
above. This allows you to concentrate on readability in your doctests, even
if the real output is a straight ugly HTML one-liner.
-Note that there is also an ``lxml.usedoctest`` module which you can import for
-XML comparisons.
+Note that there is also an ``lxml.usedoctest`` module which you can
+import for XML comparisons. The HTML parser notably ignores
+namespaces and some other XMLisms.
Parsing HTML
@@ -120,10 +122,76 @@
</body>
</html>
+Note that you should use ``lxml.html.tostring`` and **not**
+``lxml.etree.tostring``. ``lxml.etree.tostring(doc)`` will return the
+XML representation of the document, which is not valid HTML. In
+particular, things like ``<script src="..."></script>`` will be
+serialized as ``<script src="..." />``, which completely confuses
+browsers.
Working with links
==================
+There are several methods on elements that allow you to see and modify
+the links in a document.
+
+``.iter_links()``:
+ This yields ``(element, attribute, link, pos)`` for every link in
+ the document. ``attribute`` may be None if the link is in the
+ text (as will be the case with a ``<style>`` tag with
+ ``@import``).
+
+ This finds any link in an ``action``, ``archive``, ``background``,
+ ``cite``, ``classid``, ``codebase``, ``data``, ``href``,
+ ``longdesc``, ``profile``, ``src``, ``usemap``, ``dynsrc``, or
+ ``lowsrc`` attribute. It also searches ``style`` attributes for
+ ``url(link)``, and ``<style>`` tags for ``@import`` and ``url()``.
+
+ This function does *not* pay attention to ``<base href>``.
+
+``.resolve_base_href()``:
+ This function will modify the document in-place to take account of
+ ``<base href>`` if the document contains that tag. In the process
+ it will also remove that tag from the document.
+
+``.make_links_absolute(base_href, resolve_base_href=True)``:
+ This makes all links in the document absolute, assuming that
+ ``base_href`` is the URL of the document. So if you pass
+ ``base_href="http://localhost/foo/bar.html"`` and there is a link
+ to ``baz.html`` that will be rewritten as
+ ``http://localhost/foo/baz.html``.
+
+ If ``resolve_base_href`` is true, then any ``<base href>`` tag
+ will be taken into account (just calling
+ ``self.resolve_base_href()``).
+
+``.rewrite_links(link_repl_func, resolve_base_href=True, base_href=None)``:
+ This rewrites all the links in the document using your given link
+ replacement function. If you give a ``base_href`` value, all
+ links will be passed in after they are joined with this URL.
+
+ For each link ``link_repl_func(link)`` is called. That function
+ then returns the new link, or None to remove the attribute or tag
+ that contains the link. Note that all links will be passed in,
+ including links like ``"#anchor"`` (which is purely internal), and
+ things like ``"mailto:bob@example.com"`` (or ``javascript:...``).
+
+ If you want access to the context of the link, you should use
+ ``.iter_links()`` instead.
+
+Functions
+---------
+
+In addition to these methods, there are corresponding functions:
+
+* ``iter_links(html)``
+* ``make_links_absolute(html, base_href, ...)``
+* ``rewrite_links(html, link_repl_func, ...)``
+* ``resolve_base_href(html)``
+
+These functions will parse ``html`` if it is a string, then return the
+new HTML as a string. If you pass in a document, the document will be
+copied, the method performed, and the new document returned.
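
As a rough illustration of what "making a link absolute" means in the ``make_links_absolute`` description, the sketch below joins a few links against the ``base_href`` from the example using the standard library's ``urljoin``. It shows standard URL joining only; the ``absolute`` helper is hypothetical and this is not a claim about how lxml implements these methods.

```python
from urllib.parse import urljoin

# The document URL used in the make_links_absolute example.
base_href = "http://localhost/foo/bar.html"


def absolute(link):
    # Hypothetical helper: resolve ``link`` relative to ``base_href``.
    return urljoin(base_href, link)


print(absolute("baz.html"))                # relative to the document's directory
print(absolute("/img/logo.png"))           # relative to the host root
print(absolute("#anchor"))                 # purely internal link
print(absolute("mailto:bob@example.com"))  # other schemes are left unchanged
```

This is also why a ``link_repl_func`` should expect to see values such as ``#anchor`` or ``mailto:`` links and decide for itself what to do with them.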
Cleaning up HTML
================
@@ -229,3 +297,57 @@
<img src="evil!">
</body>
</html>
+
+See the docstring of ``Cleaner`` for the details of what can be
+cleaned.
+
+autolink
+--------
+
+In addition to cleaning up malicious HTML, ``lxml.html.clean``
+contains functions to do other things to your HTML. This includes
+autolinking:
+
+ ``autolink(doc, ...)`` and ``autolink_html(html, ...)``
+
+This finds anything that looks like a link (e.g.,
+``http://example.com``) in the *text* of an HTML document, and
+turns it into an anchor. It avoids making bad links.
+
+Links in the elements ``<textarea>``, ``<pre>``, ``<code>``, and
+anything in the head of the document are not autolinked. You can pass
+in a list of elements to avoid with ``avoid_elements=['textarea', ...]``.
+
+Links to some hosts can be avoided. By default links to
+``localhost*``, ``example.*`` and ``127.0.0.1`` are not
+autolinked. Pass in ``avoid_hosts=[list_of_regexes]`` to control
+this.
+
+Elements with the ``nolink`` CSS class are not autolinked. Pass
+in ``avoid_classes=['code', ...]`` to control this.
+
+The ``autolink_html()`` version of the function parses the HTML
+string first, and returns a string.
+
+wordwrap
+--------
+
+You can also wrap long words in your HTML:
+
+ ``word_break(doc, max_width=40, ...)`` and ``word_break_html(html, ...)``
+
+This finds any long words in the text of the document and inserts
+``&#8203;`` in the document (which is the Unicode zero-width space).
+
+This avoids the elements ``<pre>``, ``<textarea>``, and ``<code>``.
+You can control this with ``avoid_elements=['textarea', ...]``.
+
+It also avoids elements with the CSS class ``nobreak``. You can
+control this with ``avoid_classes=['code', ...]``.
+
+Lastly you can control the character that is inserted with
+``break_character=u'\u200b'``. However, you cannot insert markup,
+only text.
+
+``word_break_html(html)`` parses the HTML document and returns a
+string.
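
To make the effect of the break character concrete, here is a simplified standard-library sketch that operates on a plain string. It assumes a fixed break every ``max_width`` characters; ``break_long_words`` is a hypothetical name, and the real ``word_break`` works on document text nodes and may choose break positions differently.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"


def break_long_words(text, max_width=40, break_character=ZERO_WIDTH_SPACE):
    """Insert break_character every max_width characters in long words."""
    def split_word(match):
        word = match.group(0)
        chunks = [word[i:i + max_width]
                  for i in range(0, len(word), max_width)]
        return break_character.join(chunks)

    # Only runs of non-whitespace longer than max_width are touched.
    pattern = r"\S{%d,}" % (max_width + 1)
    return re.sub(pattern, split_word, text)


broken = break_long_words("see " + "x" * 100, max_width=40)
print(broken.count(ZERO_WIDTH_SPACE))  # number of break points inserted
```

Because the inserted character is plain text rather than markup, the result can be stored back into an element's text without changing the document structure.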