[lxml-dev] lxml 2.1 released

Stefan Behnel stefan_ml at behnel.de
Wed Jul 9 15:27:21 CEST 2008


Hi,

lxml 2.1 finally made it to PyPI!

This is a major new release that follows the 2.0 series with a couple of
cleanups and tons of new features. The complete changelog follows below.

This is also the first version that officially supports Python 3, as released
in 3.0beta1.

Have fun,
Stefan


2.1 (2008-07-09)
================

Features added
--------------

* Smart strings can be switched off in XPath (``smart_strings`` keyword
  option).

* ``lxml.html.rewrite_links()`` strips links to work around documents
  with whitespace in URL attributes.

* Pickling ``ElementTree`` objects in lxml.objectify.

* Major overhaul of ``tools/xpathgrep.py`` script.

* Support for parsing from file-like objects that return unicode
  strings.

* New function ``etree.cleanup_namespaces(el)`` that removes unused
  namespace declarations from a (sub)tree (experimental).

* XSLT results support the buffer protocol in Python 3.

* Polymorphic functions in ``lxml.html`` that accept either a tree or
  a parsable string will return either a UTF-8 encoded byte string, a
  unicode string or a tree, based on the type of the input.
  Previously, the result was always a byte string or a tree.

* Support for Python 2.6 and 3.0 beta.

* File name handling now uses a heuristic to convert between byte
  strings (usually filenames) and unicode strings (usually URLs).

* Parsing from a plain file object frees the GIL under Python 2.x.

* Running ``iterparse()`` on a plain file (or filename) frees the GIL
  on reading under Python 2.x.

* Conversion functions ``html_to_xhtml()`` and ``xhtml_to_html()`` in
  lxml.html (experimental).

* Most features in lxml.html work for XHTML namespaced tag names
  (experimental).

* All parse functions in lxml.html take a ``parser`` keyword argument.

* lxml.html has a new parser class ``XHTMLParser`` and a module
  attribute ``xhtml_parser`` that provide XML parsers that are
  pre-configured for the lxml.html package.

* Error logging in Schematron (requires libxml2 2.6.32 or later).

* Parser option ``strip_cdata`` for normalising or keeping CDATA
  sections.  Defaults to ``True`` as before, thus replacing CDATA
  sections by their text content.

* ``CDATA()`` factory to wrap string content as CDATA section.

* New event types 'comment' and 'pi' in ``iterparse()``.

* ``XSLTAccessControl`` instances have a property ``options`` that
  returns a dict of access configuration options.

* Constant instances ``DENY_ALL`` and ``DENY_WRITE`` on
  ``XSLTAccessControl`` class.

* Extension elements for XSLT (experimental!)

* ``Element.base`` property returns the xml:base or HTML base URL of
  an Element.

* ``docinfo.URL`` property is writable.
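
A few of the features above can be tried out with a short sketch like
the following.  It is only an illustration, not taken from the lxml
documentation: the sample documents and variable names are made up, and
it assumes the XPath keyword is spelled ``smart_strings`` as in current
lxml releases.

```python
from lxml import etree

# CDATA() wraps text so that it is serialised as a CDATA section.
root = etree.Element("root")
root.text = etree.CDATA("a < b")
serialised = etree.tostring(root)

# The default parser (strip_cdata=True) replaces the section by its text.
stripped = etree.tostring(etree.fromstring(serialised))

# A parser created with strip_cdata=False keeps the section as it is.
keeping_parser = etree.XMLParser(strip_cdata=False)
kept = etree.tostring(etree.fromstring(serialised, keeping_parser))

# cleanup_namespaces() drops declarations that nothing in the tree uses.
# The namespace URI below is a made-up example.
doc = etree.fromstring(
    '<doc xmlns:unused="http://example.com/unused"><item>text</item></doc>')
etree.cleanup_namespaces(doc)
cleaned = etree.tostring(doc)

# Smart strings carry a getparent() method, plain strings do not.
smart = doc.xpath("//item/text()")[0]
plain = doc.xpath("//item/text()", smart_strings=False)[0]
```

Switching smart strings off is mostly of interest when many string
results are kept around, since each smart string keeps a reference to
its parent element.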


Bugs fixed
----------

* Custom resolvers were not used for XMLSchema includes/imports and
  XInclude processing.

* CSS selector parser dropped remaining expression after a function
  with parameters.

* Descending dot-separated classes in CSS selectors were not resolved
  correctly.

* ``ElementTree.parse()`` didn't handle target parser result.

* Potential threading problem in XInclude.

* Crash in Element class lookup classes when the __init__() method of
  the super class is not called from Python subclasses.

* A number of problems related to unicode/byte string conversion of
  filenames and error messages were fixed.

* Building on MacOS-X now passes the "flat_namespace" option to the C
  compiler, which reportedly prevents build quirks and crashes on this
  platform.

* Windows build was broken.

* Rare crash when serialising to a file object with certain encodings.

* Incorrect evaluation of ``el.find("tag[child]")``.

* Moving a subtree from a document created in one thread into a
  document of another thread could crash when the rest of the source
  document is deleted while the subtree is still in use.

* Passing an nsmap when creating an Element will no longer strip
  redundantly defined namespace URIs.  This prevented the definition
  of more than one prefix for a namespace on the same Element.

* Resolving to a filename in custom resolvers didn't work.

* lxml did not honour libxslt's second error state "STOPPED", which
  let some XSLT errors pass silently.

* Memory leak in Schematron with libxml2 >= 2.6.31.

* lxml.etree accepted non well-formed namespace prefix names.

* Hanging thread in conjunction with GTK threading.

* Crash bug in iterparse when moving elements into other documents.

* HTML elements' ``.cssselect()`` method was broken.

* ``ElementTree.find*()`` didn't accept QName objects.

* Default encoding for plain text serialisation was different from
  that of XML serialisation (UTF-8 instead of ASCII).
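
Three of the fixes above are easy to check from the Python side.  The
following is a minimal sketch under the assumption of a current lxml
release.  The namespace URI and the tag names are hypothetical.

```python
from lxml import etree

NS = "http://example.com/ns"  # hypothetical namespace URI

# Two prefixes for the same namespace URI can be defined on one Element.
el = etree.Element("{%s}root" % NS, nsmap={"a": NS, "b": NS})
prefixes = sorted(el.nsmap)

# A child predicate selects the first <item> that contains a <child>.
root = etree.fromstring("<root><item/><item><child/></item></root>")
with_child = root.find("item[child]")

# ElementTree.find() accepts a QName object in place of a "{uri}tag" string.
tree = etree.ElementTree(etree.fromstring('<r xmlns="%s"><item/></r>' % NS))
by_qname = tree.find(etree.QName(NS, "item"))
```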


Other changes
-------------

* ``objectify.enableRecursiveStr()`` was removed, use
  ``objectify.enable_recursive_str()`` instead.

* Speed-up when running XSLTs on documents from other threads.

* Non-ASCII characters in attribute values are no longer escaped on
  serialisation.

* Passing non-ASCII byte strings or invalid unicode strings as .tag,
  namespaces, etc. will result in a ValueError instead of an
  AssertionError (just like the tag well-formedness check).

* Up to several times faster attribute access (i.e. tree traversal) in
  lxml.objectify.

* lxml should now build without problems on MacOS-X.

* If the default namespace is redundantly defined with a prefix on the
  same Element, the prefix will now be preferred for subelements and
  attributes.  This allows users to work around a problem in libxml2
  where attributes from the default namespace could serialise without
  a prefix even when they appear on an Element with a different
  namespace (i.e. they would end up in the wrong namespace).

* Major cleanup in internal ``moveNodeToDocument()`` function, which
  takes care of namespace cleanup when moving elements between
  different namespace contexts.

* New Elements created through the ``makeelement()`` method of an HTML
  parser or through lxml.html now end up in a new HTML document
  (doctype HTML 4.01 Transitional) instead of a generic XML document.
  This mostly impacts the serialisation and the availability of a DTD
  context.

* Minor API speed-ups.

* The benchmark suite now uses tail text in the trees, which makes the
  absolute numbers incomparable to previous results.

* Generating the HTML documentation now requires Pygments_, which is
  used to enable syntax highlighting for the doctest examples.

.. _Pygments: http://pygments.org/

Most long-time deprecated functions and methods were removed:

- ``etree.clearErrorLog()``, use ``etree.clear_error_log()``

- ``etree.useGlobalPythonLog()``, use
  ``etree.use_global_python_log()``

- ``etree.ElementClassLookup.setFallback()``, use
  ``etree.ElementClassLookup.set_fallback()``

- ``etree.getDefaultParser()``, use ``etree.get_default_parser()``

- ``etree.setDefaultParser()``, use ``etree.set_default_parser()``

- ``etree.setElementClassLookup()``, use
  ``etree.set_element_class_lookup()``

  Note that ``parser.setElementClassLookup()`` has not been removed
  yet, although ``parser.set_element_class_lookup()`` should be used
  instead.

- ``xpath_evaluator.registerNamespace()``, use
  ``xpath_evaluator.register_namespace()``

- ``xpath_evaluator.registerNamespaces()``, use
  ``xpath_evaluator.register_namespaces()``

- ``objectify.setPytypeAttributeTag()``, use
  ``objectify.set_pytype_attribute_tag()``

- ``objectify.setDefaultParser()``, use
  ``objectify.set_default_parser()``
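
For code that still uses the removed names, the change is a plain
rename.  As a rough sketch of what the new spellings look like in use
(the sample documents and the namespace URI are made up for
illustration):

```python
from lxml import etree

# Replacement for the removed etree.clearErrorLog().
etree.clear_error_log()

# Replacement for etree.setDefaultParser(): install a parser as the
# default for the current thread.
parser = etree.XMLParser(remove_blank_text=True)
etree.set_default_parser(parser)
try:
    compact = etree.tostring(etree.fromstring("<root>  <child/>  </root>"))
finally:
    # Calling it without a parser restores the built-in default.
    etree.set_default_parser()

# Replacement for xpath_evaluator.registerNamespace().
NS = "http://example.com/ns"  # hypothetical namespace URI
doc = etree.fromstring('<r xmlns="%s"><item/></r>' % NS)
evaluator = etree.XPathEvaluator(doc)
evaluator.register_namespace("n", NS)
items = evaluator("//n:item")
```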

