[Lxml-checkins] r44756 - in lxml/branch/html: . doc

scoder at codespeak.net
Thu Jul 5 21:50:03 CEST 2007


Author: scoder
Date: Thu Jul  5 21:50:03 2007
New Revision: 44756

Added:
   lxml/branch/html/doc/cssselect.txt
   lxml/branch/html/doc/lxmlhtml.txt
Modified:
   lxml/branch/html/CREDITS.txt
   lxml/branch/html/doc/mkhtml.py
Log:
initial documentation on lxml.html and lxml.cssselect

Modified: lxml/branch/html/CREDITS.txt
==============================================================================
--- lxml/branch/html/CREDITS.txt	(original)
+++ lxml/branch/html/CREDITS.txt	Thu Jul  5 21:50:03 2007
@@ -5,6 +5,8 @@
 
 Martijn Faassen - creator of lxml and initial main developer
 
+Ian Bicking - creator of lxml.html, lxml.cssselect and the relaxed doctest support
+
 Marc-Antoine Parent - XPath extension function help and patches
 
 Olivier Grisel - improved (c)ElementTree compatibility patches, 

Added: lxml/branch/html/doc/cssselect.txt
==============================================================================
--- (empty file)
+++ lxml/branch/html/doc/cssselect.txt	Thu Jul  5 21:50:03 2007
@@ -0,0 +1,26 @@
+==============
+lxml.cssselect
+==============
+
+lxml supports a number of interesting languages for tree traversal and element
+selection.  The most important is obviously XPath_, but there is also
+ObjectPath_ in the `lxml.objectify`_ module.  The newest child of this family
+is CSS selection, which is implemented in the new ``cssselect`` module.
+
+.. _XPath: xpathxslt.html#xpath
+.. _ObjectPath: objectify.html#objectpath
+.. _`lxml.objectify`: objectify.html
+
+.. contents::
+..
+   1  Finding nodes
+
+
+The CSSSelector class
+=====================
+
+The most important class in the ``cssselect`` module is ``CSSSelector``.  It
+provides the same interface as the XPath_ class, but accepts a CSS selector
+expression as input::
+
+    >>> from lxml.cssselect import CSSSelector
\ No newline at end of file
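
The CSSSelector section in cssselect.txt describes a selector object that
accepts a CSS expression and is then applied to a tree, much like an XPath
object.  As a rough illustration of that idea, here is a minimal sketch that
uses only the standard library.  It is not lxml's implementation: the names
``css_to_path`` and ``SimpleSelector`` are hypothetical, only ``tag``, ``#id``,
``.class`` and the descendant combinator (a space) are handled, and a class is
matched by exact attribute equality rather than per token.

```python
import re
import xml.etree.ElementTree as ET

# One simple selector step: optional tag, optional #id, optional .class
_STEP = re.compile(
    r"^(?P<tag>[A-Za-z][\w-]*|\*)?(?:#(?P<id>[\w-]+))?(?:\.(?P<cls>[\w-]+))?$"
)


def css_to_path(selector):
    """Translate a tiny CSS subset into an ElementTree path expression."""
    steps = []
    for part in selector.split():
        match = _STEP.match(part)
        if match is None:
            raise ValueError("unsupported selector step: %r" % part)
        step = match.group("tag") or "*"
        if match.group("id"):
            step += "[@id='%s']" % match.group("id")
        if match.group("cls"):
            # Simplification: exact match on the whole class attribute
            step += "[@class='%s']" % match.group("cls")
        steps.append(step)
    if not steps:
        raise ValueError("empty selector")
    # Spaces in CSS mean "any descendant", which is '//' in the path language
    return ".//" + "//".join(steps)


class SimpleSelector:
    """Callable selector: compile the expression once, apply it to many trees."""

    def __init__(self, css):
        self.css = css
        self.path = css_to_path(css)

    def __call__(self, element):
        return element.findall(self.path)


doc = ET.fromstring(
    "<html><body>"
    "<div id='main'><p class='intro'>Hi</p><p>Bye</p></div>"
    "<p>Out</p>"
    "</body></html>"
)
select = SimpleSelector("div#main p")
matches = select(doc)
```

Because the generated path starts with ``.//``, this sketch only searches below
the element it is called on and never matches that element itself.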

Added: lxml/branch/html/doc/lxmlhtml.txt
==============================================================================
--- (empty file)
+++ lxml/branch/html/doc/lxmlhtml.txt	Thu Jul  5 21:50:03 2007
@@ -0,0 +1,231 @@
+=========
+lxml.html
+=========
+
+Since version 2.0, lxml provides a dedicated package for dealing with HTML:
+``lxml.html``.  It provides a special Element API for HTML elements, as well
+as a number of utilities for common tasks.
+
+.. contents::
+.. 
+   1  Running HTML doctests
+   2  Parsing HTML
+     2.1  Parsing HTML fragments
+   3  Creating HTML with the E-factory
+   4  Working with links
+   5  Cleaning up HTML
+
+
+The main API is based on the `lxml.etree`_ API, and thus, on the ElementTree_
+API.
+
+.. _`lxml.etree`: tutorial.html
+.. _ElementTree:  http://effbot.org/zone/element-index.htm
+
+
+Running HTML doctests
+=====================
+
+One of the interesting modules in the ``lxml.html`` package deals with
+doctests.  It can be hard to compare two HTML pages for equality, as
+whitespace sequences need to be ignored and the structural formatting can
+differ.  This is even more of a problem in doctests, where output is tested for
+equality and small differences in whitespace or the order of attributes can
+make a test fail.  And given the verbosity of tag-based languages, it may take
+more than a quick look to find the actual differences in the doctest output.
+
+Luckily, lxml provides the ``lxml.doctestcompare`` module that supports
+relaxed comparison of XML and HTML pages and provides a readable diff in the
+output when a test fails.  It is most easily used by importing the
+``usedoctest`` module in a doctest::
+
+    >>> from lxml.html import usedoctest
+
+Now, if you have an HTML document and want to compare it to an expected result
+document in a doctest, you can do the following::
+
+    >>> import lxml.html
+    >>> html = lxml.html.HTML('''\
+    ...    <html><body onload="" color="white">
+    ...      <p>Hi  !</p>
+    ...    </body></html>
+    ... ''')
+
+    >>> print lxml.html.tostring(html)
+    <html><body onload="" color="white"><p>Hi !</p></body></html>
+
+    >>> print lxml.html.tostring(html)
+    <html> <body color="white" onload=""> <p>Hi    !</p> </body> </html>
+
+    >>> print lxml.html.tostring(html)
+    <html>
+      <body color="white" onload="">
+        <p>Hi !</p>
+      </body>
+    </html>
+
+In documentation, you would likely prefer the pretty printed HTML output, as
+it is the most readable.  However, the three documents are equivalent from the
+point of view of an HTML tool, so the doctest will silently accept any of the
+above.  This allows you to concentrate on readability in your doctests, even
+if the real output is a straight ugly HTML one-liner.
+
+Note that there is also an ``lxml.usedoctest`` module which you can import for
+XML comparisons.
+
+
+Parsing HTML
+============
+
+
+Parsing HTML fragments
+----------------------
+
+
+Creating HTML with the E-factory
+================================
+
+.. _`E-factory`: http://FIXME/
+
+lxml.html comes with a predefined HTML vocabulary for the `E-factory`_,
+originally written by Fredrik Lundh.  This allows you to quickly generate HTML
+pages and fragments::
+
+    >>> from lxml.html import builder as h
+
+    >>> html = h.HTML(
+    ...   h.HEAD(
+    ...     h.LINK(rel="stylesheet", href="great.css", type="text/css"),
+    ...     h.TITLE("Best Page Ever")
+    ...   ),
+    ...   h.BODY(
+    ...     h.H1(h.CLASS("heading"), "Top News"),
+    ...     h.P("World News only on this page", style="font-size: 200%"),
+    ...     "Ah, and here's some more text, by the way.",
+    ...     lxml.html.HTML("<p>... and this is a parsed fragment ...</p>")
+    ...   )
+    ... )
+
+    >>> print lxml.html.tostring(html)
+    <html>
+      <head>
+        <link href="great.css" rel="stylesheet" type="text/css">
+        <title>Best Page Ever</title>
+      </head>
+      <body>
+        <h1 class="heading">Top News</h1>
+        <p style="font-size: 200%">World News only on this page</p>
+        Ah, and here's some more text, by the way.
+        <p>... and this is a parsed fragment ...</p>
+      </body>
+    </html>
+
+
+Working with links
+==================
+
+
+Cleaning up HTML
+================
+
+The module ``lxml.html.clean`` provides a ``Cleaner`` class for cleaning up
+HTML pages.  It supports removing embedded or script content, special tags,
+CSS style annotations and much more.
+
+Say you have an evil web page from an untrusted source that contains lots of
+content that upsets browsers and tries to run evil code on the client side::
+
+    >>> html = '''\
+    ... <html>
+    ...  <head>
+    ...    <script type="text/javascript" src="evil-site"></script>
+    ...    <link rel="alternate" type="text/rss" src="evil-rss">
+    ...    <style>
+    ...      body {background-image: url(javascript:do_evil)};
+    ...      div {color: expression(evil)};
+    ...    </style>
+    ...  </head>
+    ...  <body onload="evil_function()">
+    ...    <!-- I am interpreted for EVIL! -->
+    ...    <a href="javascript:evil_function()">a link</a>
+    ...    <a href="#" onclick="evil_function()">another link</a>
+    ...    <p onclick="evil_function()">a paragraph</p>
+    ...    <div style="display: none">secret EVIL!</div>
+    ...    <object> of EVIL! </object>
+    ...    <iframe src="evil-site"></iframe>
+    ...    <form action="evil-site">
+    ...      Password: <input type="password" name="password">
+    ...    </form>
+    ...    <blink>annoying EVIL!</blink>
+    ...    <a href="evil-site">spam spam SPAM!</a>
+    ...    <image src="evil!">
+    ...  </body>
+    ... </html>'''
+
+To remove all suspicious content from this unparsed document, use the
+``clean_html`` function::
+
+    >>> from lxml.html.clean import clean_html
+    
+    >>> print clean_html(html)
+    <html>
+      <body>
+        <div>
+          <style>/* deleted */</style>
+          <a href="">a link</a>
+          <a href="#">another link</a>
+          <p>a paragraph</p>
+          <div>secret EVIL!</div>
+          of EVIL!
+          Password:
+          annoying EVIL!
+          <a href="evil-site">spam spam SPAM!</a>
+          <img src="evil!">
+        </div>
+      </body>
+    </html>
+
+The ``Cleaner`` class supports several keyword arguments to control exactly
+which content is removed::
+
+    >>> from lxml.html.clean import Cleaner
+
+    >>> cleaner = Cleaner(page_structure=False, links=False)
+    >>> print cleaner.clean_html(html)
+    <html>
+      <head>
+        <link rel="alternate" src="evil-rss" type="text/rss">
+        <style>/* deleted */</style>
+      </head>
+      <body>
+        <a href="">a link</a>
+        <a href="#">another link</a>
+        <p>a paragraph</p>
+        <div>secret EVIL!</div>
+        of EVIL!
+        Password:
+        annoying EVIL!
+        <a href="evil-site">spam spam SPAM!</a>
+        <img src="evil!">
+      </body>
+    </html>
+
+    >>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
+    ...                   page_structure=False, safe_attrs_only=False)
+    
+    >>> print cleaner.clean_html(html)
+    <html>
+      <head>
+      </head>
+      <body>
+        <a href="">a link</a>
+        <a href="#">another link</a>
+        <p>a paragraph</p>
+        <div>secret EVIL!</div>
+        of EVIL!
+        Password:
+        annoying EVIL!
+        <a href="evil-site" rel="nofollow">spam spam SPAM!</a>
+        <img src="evil!">
+      </body>
+    </html>
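
The "Running HTML doctests" section in lxmlhtml.txt relies on a relaxed
comparison that ignores whitespace runs and attribute order.  The following is
a minimal standard-library sketch of that idea, not the code in
``lxml.doctestcompare``.  The function names ``equivalent`` and
``relaxed_equal`` are hypothetical, and the sketch assumes both inputs are
well-formed XML, since it parses them with ``xml.etree.ElementTree`` rather
than an HTML parser.

```python
import xml.etree.ElementTree as ET


def _norm(text):
    """Collapse whitespace runs; treat missing text as empty."""
    return " ".join((text or "").split())


def equivalent(a, b):
    """Compare two elements, ignoring whitespace layout and attribute order."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib  # dict comparison, so order is irrelevant
        and _norm(a.text) == _norm(b.text)
        and _norm(a.tail) == _norm(b.tail)
        and len(a) == len(b)
        and all(equivalent(x, y) for x, y in zip(a, b))
    )


def relaxed_equal(want, got):
    """Parse both documents and compare them structurally."""
    return equivalent(ET.fromstring(want), ET.fromstring(got))


compact = '<html><body onload="" color="white"><p>Hi !</p></body></html>'
pretty = """<html>
  <body color="white" onload="">
    <p>Hi    !</p>
  </body>
</html>"""
same = relaxed_equal(compact, pretty)
```

With these two inputs, ``same`` is true even though the serialisations differ
in indentation, attribute order and the spacing inside the paragraph.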

Modified: lxml/branch/html/doc/mkhtml.py
==============================================================================
--- lxml/branch/html/doc/mkhtml.py	(original)
+++ lxml/branch/html/doc/mkhtml.py	Thu Jul  5 21:50:03 2007
@@ -6,7 +6,8 @@
               'performance.txt', 'build.txt')),
     ('Developing with lxml', ('tutorial.txt', 'api.txt', 'parsing.txt',
                               'validation.txt', 'xpathxslt.txt',
-                              'objectify.txt')),
+                              'objectify.txt', 'lxmlhtml.txt',
+                              'cssselect.txt')),
     ('Extending lxml', ('resolvers.txt', 'extensions.txt',
                         'element_classes.txt', 'sax.txt', 'capi.txt')),
     ]

