[Lxml-checkins] r44756 - in lxml/branch/html: . doc
scoder at codespeak.net
Thu Jul 5 21:50:03 CEST 2007
Author: scoder
Date: Thu Jul 5 21:50:03 2007
New Revision: 44756
Added:
lxml/branch/html/doc/cssselect.txt
lxml/branch/html/doc/lxmlhtml.txt
Modified:
lxml/branch/html/CREDITS.txt
lxml/branch/html/doc/mkhtml.py
Log:
initial documentation on lxml.html and lxml.cssselect
Modified: lxml/branch/html/CREDITS.txt
==============================================================================
--- lxml/branch/html/CREDITS.txt (original)
+++ lxml/branch/html/CREDITS.txt Thu Jul 5 21:50:03 2007
@@ -5,6 +5,8 @@
Martijn Faassen - creator of lxml and initial main developer
+Ian Bicking - creator of lxml.html, lxml.cssselect and the relaxed doctest support
+
Marc-Antoine Parent - XPath extension function help and patches
Olivier Grisel - improved (c)ElementTree compatibility patches,
Added: lxml/branch/html/doc/cssselect.txt
==============================================================================
--- (empty file)
+++ lxml/branch/html/doc/cssselect.txt Thu Jul 5 21:50:03 2007
@@ -0,0 +1,26 @@
+==============
+lxml.cssselect
+==============
+
+lxml supports a number of interesting languages for tree traversal and element
+selection. The most important is obviously XPath_, but there is also
+ObjectPath_ in the `lxml.objectify`_ module. The newest child of this family
+is CSS selection, which is implemented in the new ``cssselect`` module.
+
+.. _XPath: xpathxslt.html#xpath
+.. _ObjectPath: objectify.html#objectpath
+.. _`lxml.objectify`: objectify.html
+
+.. contents::
+..
+ 1 Finding nodes
+
+
+The CSSSelector class
+=====================
+
+The most important class in the ``cssselect`` module is ``CSSSelector``. It
+provides the same interface as the XPath_ class, but accepts a CSS selector
+expression as input::
+
+ >>>
\ No newline at end of file
Added: lxml/branch/html/doc/lxmlhtml.txt
==============================================================================
--- (empty file)
+++ lxml/branch/html/doc/lxmlhtml.txt Thu Jul 5 21:50:03 2007
@@ -0,0 +1,231 @@
+=========
+lxml.html
+=========
+
+Since version 2.0, lxml provides a dedicated package for dealing with HTML:
+``lxml.html``. It provides a special Element API for HTML elements, as well
+as a number of utilities for common tasks.
+
+.. contents::
+..
+ 1 Running HTML doctests
+ 2 Parsing HTML
+ 2.1 Parsing HTML fragments
+ 3 Creating HTML with the E-factory
+ 4 Working with links
+ 5 Cleaning up HTML
+
+
+The main API is based on the `lxml.etree`_ API, and thus, on the ElementTree_
+API.
+
+.. _`lxml.etree`: tutorial.html
+.. _ElementTree: http://effbot.org/zone/element-index.htm
+
+
+Running HTML doctests
+=====================
+
+One of the interesting modules in the ``lxml.html`` package deals with
+doctests. It can be hard to compare two HTML pages for equality, as
+whitespace sequences need to be ignored and the structural formatting can
+differ. This is even more of a problem in doctests, where output is tested for
+equality and small differences in whitespace or the order of attributes can
+make a test fail. And given the verbosity of tag-based languages, it may take
+more than a quick look to find the actual differences in the doctest output.
+
+Luckily, lxml provides the ``lxml.doctestcompare`` module that supports
+relaxed comparison of XML and HTML pages and provides a readable diff in the
+output when a test fails. It is most easily used by importing the
+``usedoctest`` module in a doctest::
+
+ >>> from lxml.html import usedoctest
+
+Now, if you have an HTML document and want to compare it to an expected result
+document in a doctest, you can do the following::
+
+ >>> import lxml.html
+ >>> html = lxml.html.HTML('''\
+ ... <html><body onload="" color="white">
+ ... <p>Hi !</p>
+ ... </body></html>
+ ... ''')
+
+ >>> print lxml.html.tostring(html)
+ <html><body onload="" color="white"><p>Hi !</p></body></html>
+
+ >>> print lxml.html.tostring(html)
+ <html> <body color="white" onload=""> <p>Hi !</p> </body> </html>
+
+ >>> print lxml.html.tostring(html)
+ <html>
+ <body color="white" onload="">
+ <p>Hi !</p>
+ </body>
+ </html>
+
+In documentation, you would likely prefer the pretty-printed HTML output, as
+it is the most readable. However, the three documents are equivalent from the
+point of view of an HTML tool, so the doctest will silently accept any of the
+above. This allows you to concentrate on readability in your doctests, even
+if the real output is a straight ugly HTML one-liner.
+
+Note that there is also an ``lxml.usedoctest`` module which you can import for
+XML comparisons.
+
+
+Parsing HTML
+============
+
+
+Parsing HTML fragments
+----------------------
+
+
+Creating HTML with the E-factory
+================================
+
+.. _`E-factory`: http://FIXME/
+
+lxml.html comes with a predefined HTML vocabulary for the `E-factory`_,
+originally written by Fredrik Lundh. This allows you to quickly generate HTML
+pages and fragments::
+
+ >>> from lxml.html import builder as h
+
+ >>> html = h.HTML(
+ ... h.HEAD(
+ ... h.LINK(rel="stylesheet", href="great.css", type="text/css"),
+ ... h.TITLE("Best Page Ever")
+ ... ),
+ ... h.BODY(
+ ... h.H1(h.CLASS("heading"), "Top News"),
+ ... h.P("World News only on this page", style="font-size: 200%"),
+ ... "Ah, and here's some more text, by the way.",
+ ... lxml.html.HTML("<p>... and this is a parsed fragment ...</p>")
+ ... )
+ ... )
+
+ >>> print lxml.html.tostring(html)
+ <html>
+ <head>
+ <link href="great.css" rel="stylesheet" type="text/css">
+ <title>Best Page Ever</title>
+ </head>
+ <body>
+ <h1 class="heading">Top News</h1>
+ <p style="font-size: 200%">World News only on this page</p>
+ Ah, and here's some more text, by the way.
+ <p>... and this is a parsed fragment ...</p>
+ </body>
+ </html>
+
+
+Working with links
+==================
+
+
+Cleaning up HTML
+================
+
+The module ``lxml.html.clean`` provides a ``Cleaner`` class for cleaning up
+HTML pages. It supports removing embedded or script content, special tags,
+CSS style annotations and much more.
+
+Say you have an evil web page from an untrusted source that contains lots of
+content that upsets browsers and tries to run evil code on the client side::
+
+ >>> html = '''\
+ ... <html>
+ ... <head>
+ ... <script type="text/javascript" src="evil-site"></script>
+ ... <link rel="alternate" type="text/rss" src="evil-rss">
+ ... <style>
+ ... body {background-image: url(javascript:do_evil)};
+ ... div {color: expression(evil)};
+ ... </style>
+ ... </head>
+ ... <body onload="evil_function()">
+ ... <!-- I am interpreted for EVIL! -->
+ ... <a href="javascript:evil_function()">a link</a>
+ ... <a href="#" onclick="evil_function()">another link</a>
+ ... <p onclick="evil_function()">a paragraph</p>
+ ... <div style="display: none">secret EVIL!</div>
+ ... <object> of EVIL! </object>
+ ... <iframe src="evil-site"></iframe>
+ ... <form action="evil-site">
+ ... Password: <input type="password" name="password">
+ ... </form>
+ ... <blink>annoying EVIL!</blink>
+ ... <a href="evil-site">spam spam SPAM!</a>
+ ... <image src="evil!">
+ ... </body>
+ ... </html>'''
+
+To remove all the suspicious content from this unparsed document, use the
+``clean_html`` function::
+
+ >>> from lxml.html.clean import clean_html
+
+ >>> print clean_html(html)
+ <html>
+ <body>
+ <div>
+ <style>/* deleted */</style>
+ <a href="">a link</a>
+ <a href="#">another link</a>
+ <p>a paragraph</p>
+ <div>secret EVIL!</div>
+ of EVIL!
+ Password:
+ annoying EVIL!
+ <a href="evil-site">spam spam SPAM!</a>
+ <img src="evil!">
+ </div>
+ </body>
+ </html>
+
+The ``Cleaner`` class supports several keyword arguments to control exactly
+which content is removed::
+
+ >>> from lxml.html.clean import Cleaner
+
+ >>> cleaner = Cleaner(page_structure=False, links=False)
+ >>> print cleaner.clean_html(html)
+ <html>
+ <head>
+ <link rel="alternate" src="evil-rss" type="text/rss">
+ <style>/* deleted */</style>
+ </head>
+ <body>
+ <a href="">a link</a>
+ <a href="#">another link</a>
+ <p>a paragraph</p>
+ <div>secret EVIL!</div>
+ of EVIL!
+ Password:
+ annoying EVIL!
+ <a href="evil-site">spam spam SPAM!</a>
+ <img src="evil!">
+ </body>
+ </html>
+
+ >>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
+ ... page_structure=False, safe_attrs_only=False)
+
+ >>> print cleaner.clean_html(html)
+ <html>
+ <head>
+ </head>
+ <body>
+ <a href="">a link</a>
+ <a href="#">another link</a>
+ <p>a paragraph</p>
+ <div>secret EVIL!</div>
+ of EVIL!
+ Password:
+ annoying EVIL!
+ <a href="evil-site" rel="nofollow">spam spam SPAM!</a>
+ <img src="evil!">
+ </body>
+ </html>
Modified: lxml/branch/html/doc/mkhtml.py
==============================================================================
--- lxml/branch/html/doc/mkhtml.py (original)
+++ lxml/branch/html/doc/mkhtml.py Thu Jul 5 21:50:03 2007
@@ -6,7 +6,8 @@
'performance.txt', 'build.txt')),
('Developing with lxml', ('tutorial.txt', 'api.txt', 'parsing.txt',
'validation.txt', 'xpathxslt.txt',
- 'objectify.txt')),
+ 'objectify.txt', 'lxmlhtml.txt',
+ 'cssselect.txt')),
('Extending lxml', ('resolvers.txt', 'extensions.txt',
'element_classes.txt', 'sax.txt', 'capi.txt')),
]
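
The "Running HTML doctests" section in the new lxmlhtml.txt describes a relaxed comparison that ignores whitespace runs and attribute order. As a rough illustration of that idea only, here is a minimal sketch built on the standard library's ``html.parser``. It is not how ``lxml.doctestcompare`` is implemented, and the names ``_Normalizer``, ``normalized_events`` and ``html_equivalent`` are hypothetical helpers invented for this example.

```python
from html.parser import HTMLParser


class _Normalizer(HTMLParser):
    """Hypothetical helper: record parse events in a normalized form."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes by name so that their source order does not matter.
        ordered = tuple(sorted(attrs, key=lambda item: item[0]))
        self.events.append(("start", tag, ordered))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Collapse whitespace runs and drop whitespace-only text nodes.
        text = " ".join(data.split())
        if text:
            self.events.append(("text", text))


def normalized_events(markup):
    """Return the normalized event list for an HTML string."""
    parser = _Normalizer()
    parser.feed(markup)
    parser.close()
    return parser.events


def html_equivalent(first, second):
    """Assumption: two pages are 'equal' if their normalized events match."""
    return normalized_events(first) == normalized_events(second)
```

With this sketch, the one-line and the pretty-printed forms of the ``Hi !`` page from the documentation compare as equivalent, while a change in the text content does not. A real comparison tool would also need to treat whitespace inside elements such as ``<pre>`` as significant, which this example does not attempt.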