[Lxml-checkins] r44624 - in lxml/branch/html: . doc doc/html src/lxml src/lxml/tests src/lxml/tests/include tools
ianb at codespeak.net
Fri Jun 29 19:05:44 CEST 2007
Author: ianb
Date: Fri Jun 29 19:05:43 2007
New Revision: 44624
Added:
lxml/branch/html/src/lxml/tests/include/
- copied from r44623, lxml/trunk/src/lxml/tests/include/
lxml/branch/html/src/lxml/tests/include/test_xinclude.xml
- copied unchanged from r44623, lxml/trunk/src/lxml/tests/include/test_xinclude.xml
lxml/branch/html/tools/
- copied from r44623, lxml/trunk/tools/
lxml/branch/html/tools/xpathgrep.py
- copied unchanged from r44623, lxml/trunk/tools/xpathgrep.py
Removed:
lxml/branch/html/Pyrex-0.9.4.1-public-api.patch
lxml/branch/html/src/lxml/tests/test_xinclude.xml
Modified:
lxml/branch/html/CHANGES.txt
lxml/branch/html/INSTALL.txt
lxml/branch/html/MANIFEST.in
lxml/branch/html/doc/FAQ.txt
lxml/branch/html/doc/compatibility.txt
lxml/branch/html/doc/html/style.css
lxml/branch/html/doc/intro.txt
lxml/branch/html/doc/main.txt
lxml/branch/html/doc/objectify.txt
lxml/branch/html/doc/parsing.txt
lxml/branch/html/doc/performance.txt
lxml/branch/html/doc/tutorial.txt
lxml/branch/html/setup.py
lxml/branch/html/setupinfo.py
lxml/branch/html/src/lxml/ElementInclude.py
lxml/branch/html/src/lxml/apihelpers.pxi
lxml/branch/html/src/lxml/builder.py
lxml/branch/html/src/lxml/etree.pyx
lxml/branch/html/src/lxml/iterparse.pxi
lxml/branch/html/src/lxml/objectify.pyx
lxml/branch/html/src/lxml/parser.pxi
lxml/branch/html/src/lxml/proxy.pxi
lxml/branch/html/src/lxml/python.pxd
lxml/branch/html/src/lxml/tests/test_elementtree.py
lxml/branch/html/src/lxml/tests/test_etree.py
lxml/branch/html/src/lxml/tests/test_htmlparser.py
lxml/branch/html/src/lxml/tests/test_objectify.py
lxml/branch/html/src/lxml/tree.pxd
lxml/branch/html/src/lxml/xmlerror.pxi
lxml/branch/html/src/lxml/xmlparser.pxd
lxml/branch/html/src/lxml/xmlschema.pxi
lxml/branch/html/version.txt
lxml/branch/html/versioninfo.py
Log:
svn merge -r44104:HEAD http://codespeak.net/svn/lxml/trunk
Modified: lxml/branch/html/CHANGES.txt
==============================================================================
--- lxml/branch/html/CHANGES.txt (original)
+++ lxml/branch/html/CHANGES.txt Fri Jun 29 19:05:43 2007
@@ -2,18 +2,18 @@
lxml changelog
==============
-Under Development
+Under development
=================
Features added
--------------
+* E-factory support for lxml.objectify (``objectify.E``)
+
* Entity support through an ``Entity`` factory and element classes. XML
parsers now have a ``resolve_entities`` keyword argument that can be set to
False to keep entities in the document.
-* ``parse()`` function in ``objectify``, corresponding to ``XML()`` etc.
-
* ``column`` field on error log entries to accompany the ``line`` field
* Error specific messages in XPath parsing and evaluation
@@ -23,75 +23,65 @@
* The regular expression functions in XPath now support passing a node-set
instead of a string
-* ``Element.addnext(el)`` and ``Element.addprevious(el)`` methods to support
- adding processing instructions and comments around the root node
-
-* ``Element.attrib`` was missing ``clear()`` and ``pop()`` methods
-
-* Extended type annotation in objectify: cleaner annotation namespace setup
- plus new ``xsiannotate()`` and ``deannotate()`` functions
-
-* Support for custom Element class instantiation in lxml.sax: passing a
- ``makeelement`` function to the ElementTreeContentHandler will reuse the
- lookup context of that function
-
-* '.' represents empty ObjectPath (identity)
+* Extended type annotation in objectify: new ``xsiannotate()`` function
* EXSLT RegExp support in standard XPath (not only XSLT)
-* ``lxml.pyclasslookup`` module that can access the entire tree in read-only
- mode to help determining a suitable Element class
-
-* ``Element.values()`` to accompany the existing ``.keys()`` and ``.items()``
-
-* ``collectAttributes()`` C-function to build a list of attribute
- keys/values/items for a libxml2 node
-
Bugs fixed
----------
* ``Element.getiterator(tag)`` did not accept ``Comment`` and
``ProcessingInstruction`` as tags
+* Reference-counting bug in ``Element.attrib.pop()``
+
* The XML parser did not report undefined entities as error
* The text in exceptions raised by XML parsers, validators and XPath
evaluators now reports the first error that occurred instead of the last
-* XSLT parsing failed to pass resolver context on to imported documents
+* passing '' as XPath namespace prefix did not raise an error
-* ``ETXPath`` was missing the ``regexp`` keyword argument
+* Thread safety in XPath evaluators
-* passing '' as XPath namespace prefix did not raise an error
+Other changes
+-------------
-* passing '' as namespace prefix in nsmap could be passed through to libxml2
+* major refactoring in XPath/XSLT extension function code
-* Objectify couldn't handle prefixed XSD type names in ``xsi:type``
-* More ET compatible behaviour when writing out XML declarations or not
+1.3 (2007-06-24)
+================
-* More robust error handling in ``iterparse()``
+Features added
+--------------
-* Documents lost their top-level PIs and comments on serialisation
+* New ``lxml.pyclasslookup`` module implementing an Element class lookup
+ scheme that can access the entire tree in read-only mode to help determine
+ a suitable Element class
-* lxml.sax failed on comments and PIs. Comments are now properly ignored and
- PIs are copied.
+* Parsers take a ``remove_comments`` keyword argument that skips over comments
-* Thread safety in XPath evaluators
+* ``parse()`` function in ``objectify``, corresponding to ``XML()`` etc.
-* Raise AssertionError when passing strings containing '\0' bytes
+* ``Element.addnext(el)`` and ``Element.addprevious(el)`` methods to support
+ adding processing instructions and comments around the root node
-Other changes
--------------
+* ``Element.attrib`` was missing ``clear()`` and ``pop()`` methods
-* major refactoring in XPath/XSLT extension function code
+* Extended type annotation in objectify: cleaner annotation namespace setup
+ plus new ``deannotate()`` function
+
+* Support for custom Element class instantiation in lxml.sax: passing a
+ ``makeelement`` function to the ElementTreeContentHandler will reuse the
+ lookup context of that function
+* '.' represents empty ObjectPath (identity)
-1.3beta (2007-02-27)
-====================
+* ``Element.values()`` to accompany the existing ``.keys()`` and ``.items()``
-Features added
---------------
+* ``collectAttributes()`` C-function to build a list of attribute
+ keys/values/items for a libxml2 node
* ``DTD`` validator class (like ``RelaxNG`` and ``XMLSchema``)
@@ -109,6 +99,35 @@
Bugs fixed
----------
+* Removing Elements from a tree could make them lose their namespace
+ declarations
+
+* ``ElementInclude`` didn't honour base URL of original document
+
+* Replacing the children slice of an Element would cut off the tails of the
+ original children
+
+* ``Element.getiterator(tag)`` did not accept ``Comment`` and
+ ``ProcessingInstruction`` as tags
+
+* API functions now check incoming strings for XML conformity. Zero bytes or
+ low ASCII characters are no longer accepted (AssertionError).
+
+* XSLT parsing failed to pass resolver context on to imported documents
+
+* passing '' as namespace prefix in nsmap could be passed through to libxml2
+
+* Objectify couldn't handle prefixed XSD type names in ``xsi:type``
+
+* More ET compatible behaviour when writing out XML declarations or not
+
+* More robust error handling in ``iterparse()``
+
+* Documents lost their top-level PIs and comments on serialisation
+
+* lxml.sax failed on comments and PIs. Comments are now properly ignored and
+ PIs are copied.
+
* Possible memory leaks in namespace handling when moving elements between
documents
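The ``Element.attrib`` entries in the changelog above (``clear()`` and ``pop()``) can be illustrated with a short sketch. It assumes that ``attrib`` behaves like a dictionary for these two calls. The fallback to the standard library's ``xml.etree.ElementTree`` is an addition for this illustration only, so that the snippet also runs where lxml is not installed.

```python
# Sketch only: prefer lxml, fall back to the standard library's ElementTree.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

root = etree.fromstring('<root a="1" b="2" c="3"/>')

# pop() removes a single attribute and returns its value.
removed = root.attrib.pop("a")

# keys() lists the attribute names that are left.
remaining = sorted(root.keys())

# clear() drops all remaining attributes at once.
root.attrib.clear()
after_clear = len(root.attrib)
```

The names ``removed``, ``remaining`` and ``after_clear`` are only there to make the intermediate results easy to inspect.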
Modified: lxml/branch/html/INSTALL.txt
==============================================================================
--- lxml/branch/html/INSTALL.txt (original)
+++ lxml/branch/html/INSTALL.txt Fri Jun 29 19:05:43 2007
@@ -8,10 +8,12 @@
You need libxml2 and libxslt, in particular:
-* libxml 2.6.16 or later. It can be found here:
+* libxml 2.6.20 or later. It can be found here:
http://xmlsoft.org/downloads.html
-* libxslt 1.1.12 or later. It can be found here:
+ If you want to use XPath reliably, try to avoid libxml2 2.6.27.
+
+* libxslt 1.1.15 or later. It can be found here:
http://xmlsoft.org/XSLT/downloads.html
Newer versions generally contain fewer bugs and are therefore recommended. The
@@ -19,30 +21,31 @@
parsing horribly broken HTML. XML Schema support is also still worked on in
libxml2, so newer versions will give you better compliance with the W3C spec.
-For Windows, there is a `binary distribution`_ of libxml2 and libxslt. Note
-that you need both libxml2 and libxslt, as well as iconv and zlib. You can
-then install the `binary egg distribution`_ of lxml (see below).
-.. _`binary distribution`: http://www.zlatkovic.com/libxml.en.html
-.. _`binary egg distribution`: http://cheeseshop.python.org/pypi/lxml
+Installation
+------------
-On MacOS-X 10.4, you can use the installed system libraries and the binary egg
-distribution of lxml. Note that the libxslt version on this system is older
-than the required version above. While there were not any bug reports so far,
-you may still encounter certain differences in behaviour in rare cases.
-
-If you want to build lxml from SVN, you also need Pyrex_. Please read `how to
-build lxml from source`_ in this case. If you are using a released version of
-lxml, it should come with the generated C file in the source distribution, so
-no Pyrex is needed in that case.
+If you have easy_install_, you can run the following as super-user (or
+administrator)::
+
+ easy_install lxml
+
+.. _easy_install: http://peak.telecommunity.com/DevCenter/EasyInstall
+
+This has been reported to work on Linux, MacOS-X 10.4 and Windows, as long as
+libxml2 and libxslt are properly installed (including development packages,
+i.e. header files etc.).
-.. _Pyrex: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/
-.. _`how to build lxml from source`: build.html
-Note that Pyrex up to and including version 0.9.4 has known problems when
-compiling lxml with gcc 4.0 or Python 2.4. Do not use it. If you want to
-build lxml from non-release sources, please install Pyrex version 0.9.4.1 or
-later.
+Building lxml from sources
+--------------------------
+
+If you want to build lxml from SVN you should read `how to build lxml from
+source`_ (or the file ``build.txt`` in the ``doc`` directory of the source
+tree). Both the subversion sources and the source distribution ship with an
+adapted version of Pyrex, so you do not need Pyrex installed.
+
+.. _`how to build lxml from source`: build.html
If you have read these instructions and still cannot manage to install lxml,
you can check the archives of the `mailing list`_ to see if your problem is
@@ -51,16 +54,30 @@
.. _`mailing list`: http://codespeak.net/mailman/listinfo/lxml-dev
-Installation
-------------
+MS Windows
+----------
-If you have easy_install_, you can run the following as super-user::
+For MS Windows, the `binary egg distribution of lxml`_ is statically built
+against the libraries, i.e. it already includes them. There is no need to
+install the external libraries if you use an official lxml build from
+cheeseshop.
+
+If you want to upgrade the libraries and/or compile lxml from sources, you
+should install a `binary distribution`_ of libxml2 and libxslt. You need both
+libxml2 and libxslt, as well as iconv and zlib.
- easy_install lxml
+.. _`binary distribution`: http://www.zlatkovic.com/libxml.en.html
+.. _`binary egg distribution of lxml`: http://cheeseshop.python.org/pypi/lxml
-.. _easy_install: http://peak.telecommunity.com/DevCenter/EasyInstall
-This has been reported to work on Linux, MacOS-X 10.4 and Windows, as long as
-libxml2 and libxslt are properly installed. To compile and install lxml
-without easy_install, please read `how to build lxml from source`_ (or the
-file ``build.txt`` in the ``doc`` directory of the source tree).
+MacOS-X
+-------
+
+On MacOS-X 10.4, you can try to use the installed system libraries when you
+build lxml yourself. However, the library versions on this system are older
+than the required versions, so you may encounter certain differences in
+behaviour or even crashes. A number of users reported success with updated
+libraries (e.g. using fink_), but needed to set the environment variable
+``DYLD_LIBRARY_PATH`` to the directory where fink keeps the libraries.
+
+.. _fink: http://finkproject.org/
Modified: lxml/branch/html/MANIFEST.in
==============================================================================
--- lxml/branch/html/MANIFEST.in (original)
+++ lxml/branch/html/MANIFEST.in Fri Jun 29 19:05:43 2007
@@ -5,10 +5,12 @@
include MANIFEST.in version.txt
include CHANGES.txt CREDITS.txt INSTALL.txt LICENSES.txt README.txt TODO.txt
recursive-include src *.pyx *.pxd *.pxi *.py
-recursive-include src/lxml etree.c objectify.c etree.h etree_defs.h
+recursive-include src/lxml etree.c objectify.c pyclasslookup.c etree.h etree_defs.h
recursive-include src/lxml/tests *.rng *.xslt *.xml *.dtd
recursive-include benchmark *.py
recursive-include doc *.txt *.html *.css *.xml *.mgp pubkey.asc
-recursive-include Pyrex *.py
+include Pyrex/__init__.py
+recursive-include Pyrex/Compiler *.py
+recursive-include Pyrex/Distutils *.py
include doc/mkhtml.py doc/rest2html.py
exclude doc/pyrex.txt src/lxml/etree.pxi
Deleted: /lxml/branch/html/Pyrex-0.9.4.1-public-api.patch
==============================================================================
--- /lxml/branch/html/Pyrex-0.9.4.1-public-api.patch Fri Jun 29 19:05:43 2007
+++ (empty file)
@@ -1,239 +0,0 @@
-Index: Pyrex/Compiler/Nodes.py
-===================================================================
---- Pyrex/Compiler/Nodes.py (Revision 151)
-+++ Pyrex/Compiler/Nodes.py (Arbeitskopie)
-@@ -114,24 +114,28 @@
- self.generate_h_code(env, result)
-
- def generate_h_code(self, env, result):
-- public_vars_and_funcs = []
-+ public_vars = []
-+ public_funcs = []
- public_extension_types = []
- for entry in env.var_entries:
- if entry.visibility == 'public':
-- public_vars_and_funcs.append(entry)
-+ public_vars.append(entry)
- for entry in env.cfunc_entries:
- if entry.visibility == 'public':
-- public_vars_and_funcs.append(entry)
-+ public_funcs.append(entry)
- for entry in env.c_class_entries:
- if entry.visibility == 'public':
- public_extension_types.append(entry)
-- if public_vars_and_funcs or public_extension_types:
-+ if public_vars or public_funcs or public_extension_types:
- result.h_file = replace_suffix(result.c_file, ".h")
- result.i_file = replace_suffix(result.c_file, ".pxi")
- h_code = Code.CCodeWriter(result.h_file)
- i_code = Code.PyrexCodeWriter(result.i_file)
-+ header_barrier = "__HAS_PYX_" + env.module_name
-+ h_code.putln("#ifndef %s" % header_barrier)
-+ h_code.putln("#define %s" % header_barrier)
- self.generate_extern_c_macro_definition(h_code)
-- for entry in public_vars_and_funcs:
-+ for entry in public_vars:
- h_code.putln("%s %s;" % (
- Naming.extern_c_macro,
- entry.type.declaration_code(
-@@ -141,7 +145,23 @@
- for entry in public_extension_types:
- self.generate_cclass_header_code(entry.type, h_code)
- self.generate_cclass_include_code(entry.type, i_code)
-+ if public_funcs:
-+ for entry in public_funcs:
-+ h_code.putln(
-+ 'static %s;' %
-+ entry.type.declaration_code("(*%s)" % entry.cname))
-+ i_code.putln("cdef extern %s" %
-+ entry.type.declaration_code(entry.cname, pyrex = 1))
-+ h_code.putln(
-+ "static struct {char *s; void **p;} _%s_API[] = {" %
-+ env.module_name)
-+ for entry in public_funcs:
-+ h_code.putln('{"%s", &%s},' % (entry.cname, entry.cname))
-+ h_code.putln("{0, 0}")
-+ h_code.putln("};")
-+ self.generate_c_api_import_code(env, h_code)
- h_code.putln("PyMODINIT_FUNC init%s(void);" % env.module_name)
-+ h_code.putln("#endif /* %s */" % header_barrier)
-
- def generate_cclass_header_code(self, type, h_code):
- #h_code.putln("extern DL_IMPORT(PyTypeObject) %s;" % type.typeobj_cname)
-@@ -180,6 +200,7 @@
- self.body.generate_function_definitions(env, code)
- self.generate_interned_name_table(env, code)
- self.generate_py_string_table(env, code)
-+ self.generate_c_api_table(env, code)
- self.generate_typeobj_definitions(env, code)
- self.generate_method_table(env, code)
- self.generate_filename_init_prototype(code)
-@@ -437,10 +458,12 @@
- dll_linkage = None
- header = entry.type.declaration_code(entry.cname,
- dll_linkage = dll_linkage)
-- if entry.visibility <> 'private':
-+ if entry.visibility == 'private':
-+ storage_class = "static "
-+ elif entry.visibility == 'extern':
- storage_class = "%s " % Naming.extern_c_macro
- else:
-- storage_class = "static "
-+ storage_class = ""
- code.putln("%s%s; /*proto*/" % (
- storage_class,
- header))
-@@ -1090,6 +1113,63 @@
- code.putln(
- "};")
-
-+ def generate_c_api_table(self, env, code):
-+ public_funcs = []
-+ for entry in env.cfunc_entries:
-+ if entry.visibility == 'public':
-+ public_funcs.append(entry.cname)
-+ if public_funcs:
-+ env.use_utility_code(c_api_import_code);
-+ code.putln(
-+ "static __Pyx_CApiTabEntry %s[] = {" %
-+ Naming.c_api_tab_cname)
-+ public_funcs.sort()
-+ for entry_cname in public_funcs:
-+ code.putln('{"%s", %s},' % (entry_cname, entry_cname))
-+ code.putln(
-+ "{0, 0}")
-+ code.putln(
-+ "};")
-+
-+ def generate_c_api_import_code(self, env, h_code):
-+ # this is written to the header file!
-+ h_code.put("""
-+ /* Return -1 and set exception on error, 0 on success. */
-+ static int
-+ import_%(name)s(PyObject *module)
-+ {
-+ if (module != NULL) {
-+ PyObject *c_api_init = PyObject_GetAttrString(
-+ module, "_import_c_api");
-+ if (!c_api_init)
-+ return -1;
-+ if (PyCObject_Check(c_api_init))
-+ {
-+ int (*init)(struct {const char *s; const void **p;}*) =
-+ PyCObject_AsVoidPtr(c_api_init);
-+ if (!init) {
-+ PyErr_SetString(PyExc_RuntimeError,
-+ "module returns NULL pointer for C API call");
-+ return -1;
-+ }
-+ init(_%(name)s_API);
-+ }
-+ Py_DECREF(c_api_init);
-+ }
-+ return 0;
-+ }
-+ """.replace('\n ', '\n') % {'name' : env.module_name})
-+
-+ def generate_c_api_init_code(self, env, code):
-+ public_funcs = []
-+ for entry in env.cfunc_entries:
-+ if entry.visibility == 'public':
-+ public_funcs.append(entry)
-+ if public_funcs:
-+ code.putln('if (__Pyx_InitCApi(%s) < 0) %s' % (
-+ Naming.module_cname,
-+ code.error_goto(self.pos)))
-+
- def generate_filename_init_prototype(self, code):
- code.putln("");
- code.putln("static void %s(void); /*proto*/" % Naming.fileinit_cname)
-@@ -1109,6 +1189,8 @@
- self.generate_intern_code(env, code)
- #code.putln("/*--- String init code ---*/")
- self.generate_string_init_code(env, code)
-+ #code.putln("/*--- External C API setup code ---*/")
-+ self.generate_c_api_init_code(env, code)
- #code.putln("/*--- Global init code ---*/")
- self.generate_global_init_code(env, code)
- #code.putln("/*--- Type import code ---*/")
-@@ -1862,10 +1944,12 @@
- dll_linkage = None
- header = self.return_type.declaration_code(entity,
- dll_linkage = dll_linkage)
-- if self.visibility <> 'private':
-+ if self.visibility == 'private':
-+ storage_class = "static "
-+ elif self.visibility == 'extern':
- storage_class = "%s " % Naming.extern_c_macro
- else:
-- storage_class = "static "
-+ storage_class = ""
- code.putln("%s%s {" % (
- storage_class,
- header))
-@@ -3550,6 +3634,7 @@
-
- utility_function_predeclarations = \
- """
-+typedef struct {const char *s; const void **p;} __Pyx_CApiTabEntry; /*proto*/
- typedef struct {PyObject **p; char *s;} __Pyx_InternTabEntry; /*proto*/
- typedef struct {PyObject **p; char *s; long n;} __Pyx_StringTabEntry; /*proto*/
- static PyObject *__Pyx_UnpackItem(PyObject *, Py_ssize_t); /*proto*/
-@@ -3572,6 +3657,8 @@
- static PyObject *__Pyx_CreateClass(PyObject *bases, PyObject *dict, PyObject *name, char *modname); /*proto*/
- static int __Pyx_InternStrings(__Pyx_InternTabEntry *t); /*proto*/
- static int __Pyx_InitStrings(__Pyx_StringTabEntry *t); /*proto*/
-+static int __Pyx_InitCApi(PyObject *module); /*proto*/
-+static int __Pyx_ImportModuleCApi(__Pyx_CApiTabEntry *t); /*proto*/
- """
-
- get_name_predeclaration = \
-@@ -4056,3 +4143,37 @@
- """;
-
- #------------------------------------------------------------------------------------
-+
-+c_api_import_code = \
-+"""
-+static int __Pyx_ImportModuleCApi(__Pyx_CApiTabEntry *t) {
-+ __Pyx_CApiTabEntry *api_t;
-+ while (t->s) {
-+ if (*t->s == '\0')
-+ continue; /* shortcut for erased string entries */
-+ api_t = %(API_TAB)s;
-+ while ((api_t->s) && (strcmp(api_t->s, t->s) < 0))
-+ ++api_t;
-+ if ((!api_t->p) || (strcmp(api_t->s, t->s) != 0)) {
-+ PyErr_Format(PyExc_ValueError,
-+ "Unknown function name in C API: %%s", t->s);
-+ return -1;
-+ }
-+ *t->p = api_t->p;
-+ ++t;
-+ }
-+ return 0;
-+}
-+
-+static int __Pyx_InitCApi(PyObject *module) {
-+ int result;
-+ PyObject* cobj = PyCObject_FromVoidPtr(&__Pyx_ImportModuleCApi, NULL);
-+ if (!cobj)
-+ return -1;
-+
-+ result = PyObject_SetAttrString(module, "_import_c_api", cobj);
-+ Py_DECREF(cobj);
-+ return result;
-+}
-+""" % {'API_TAB' : Naming.c_api_tab_cname}
-+#------------------------------------------------------------------------------------
-Index: Pyrex/Compiler/Naming.py
-===================================================================
---- Pyrex/Compiler/Naming.py (Revision 151)
-+++ Pyrex/Compiler/Naming.py (Arbeitskopie)
-@@ -50,5 +50,6 @@
- self_cname = pyrex_prefix + "self"
- stringtab_cname = pyrex_prefix + "string_tab"
- vtabslot_cname = pyrex_prefix + "vtab"
-+c_api_tab_cname = pyrex_prefix + "c_api_tab"
-
- extern_c_macro = pyrex_prefix.upper() + "EXTERN_C"
Modified: lxml/branch/html/doc/FAQ.txt
==============================================================================
--- lxml/branch/html/doc/FAQ.txt (original)
+++ lxml/branch/html/doc/FAQ.txt Fri Jun 29 19:05:43 2007
@@ -6,8 +6,8 @@
:description: Frequently Asked Questions about lxml (FAQ)
:keywords: lxml, lxml.etree, FAQ, frequently asked questions
-
-See also the notes on compatibility_ to ElementTree_.
+Frequently asked questions on lxml. See also the notes on compatibility_ to
+ElementTree_.
.. _compatibility: compatibility.html
.. _ElementTree: http://effbot.org/zone/element-index.htm
@@ -18,30 +18,32 @@
1.1 Is there a tutorial?
1.2 Where can I find more documentation about lxml?
1.3 What standards does lxml implement?
- 1.4 Where are the Windows binaries?
- 1.5 What is the difference between lxml.etree and lxml.objectify?
- 1.6 How can I make my application run faster?
- 1.7 Why do I get errors about missing UCS4 symbols when installing lxml?
- 2 Contributing
- 2.1 Why is lxml not written in Python?
- 2.2 How can I contribute?
- 3 Bugs
- 3.1 My application crashes! Why does lxml.etree do that?
- 3.2 I think I have found a bug in lxml. What should I do?
- 4 Threading
- 4.1 Can I use threads to concurrently access the lxml API?
- 4.2 Does my program run faster if I use threads?
- 4.3 Would my single-threaded program run faster if I turned off threading?
- 5 Parsing and Serialisation
- 5.1 Why doesn't the ``pretty_print`` option reformat my XML output?
- 5.2 Why can't lxml parse my XML from unicode strings?
- 5.3 What is the difference between str(xslt(doc)) and xslt(doc).write() ?
- 5.4 Why can't I just delete parents or clear the root node in iterparse()?
- 6 XPath and Document Traversal
- 6.1 What are the ``findall()`` and ``xpath()`` methods on Element(Tree)?
- 6.2 Why doesn't ``findall()`` support full XPath expressions?
- 6.3 How can I find out which namespace prefixes are used in a document?
- 6.4 How can I specify a default namespace for XPath expressions?
+ 1.4 What is the difference between lxml.etree and lxml.objectify?
+ 1.5 How can I make my application run faster?
+ 2 Installation
+ 2.1 Which version of libxml2 and libxslt should I use or require?
+ 2.2 Where are the Windows binaries?
+ 2.3 Why do I get errors about missing UCS4 symbols when installing lxml?
+ 3 Contributing
+ 3.1 Why is lxml not written in Python?
+ 3.2 How can I contribute?
+ 4 Bugs
+ 4.1 My application crashes!
+ 4.2 I think I have found a bug in lxml. What should I do?
+ 5 Threading
+ 5.1 Can I use threads to concurrently access the lxml API?
+ 5.2 Does my program run faster if I use threads?
+ 5.3 Would my single-threaded program run faster if I turned off threading?
+ 6 Parsing and Serialisation
+ 6.1 Why doesn't the ``pretty_print`` option reformat my XML output?
+ 6.2 Why can't lxml parse my XML from unicode strings?
+ 6.3 What is the difference between str(xslt(doc)) and xslt(doc).write() ?
+ 6.4 Why can't I just delete parents or clear the root node in iterparse()?
+ 7 XPath and Document Traversal
+ 7.1 What are the ``findall()`` and ``xpath()`` methods on Element(Tree)?
+ 7.2 Why doesn't ``findall()`` support full XPath expressions?
+ 7.3 How can I find out which namespace prefixes are used in a document?
+ 7.4 How can I specify a default namespace for XPath expressions?
General Questions
@@ -50,10 +52,16 @@
Is there a tutorial?
--------------------
-There is a `tutorial for ElementTree`_ which also works for ``lxml.etree``.
+Read the `lxml.etree Tutorial`_. While this is still work in progress (just
+as any good documentation), it provides an overview of the most important
+concepts in ``lxml.etree``. If you want to help out, the tutorial is a very
+good place to start.
+
+There is also a `tutorial for ElementTree`_ which works for ``lxml.etree``.
The `API documentation`_ also contains many examples for ``lxml.etree``. To
learn using ``lxml.objectify``, read the `objectify documentation`_.
+.. _`lxml.etree Tutorial`: tutorial.html
.. _`tutorial for ElementTree`: http://effbot.org/zone/element.htm
.. _`API documentation`: api.html
.. _`objectify documentation`: objectify.html
@@ -83,7 +91,7 @@
strictly compliant way. As of release 2.4.16, libxml2 passed all 1800+ tests
from the OASIS XML Tests Suite.
-lxml currently supports libxml2 2.6.16 or later, which has even better support
+lxml currently supports libxml2 2.6.20 or later, which has even better support
for various XML standards. Some of the more important ones are: HTML, XML
namespaces, XPath, XInclude, XSLT, XML catalogs, canonical XML, RelaxNG,
XML:ID. Support for XML Schema and Schematron is currently incomplete in
@@ -91,32 +99,6 @@
supports loading documents through HTTP and FTP.
-Where are the Windows binaries?
--------------------------------
-
-Short answer: If you want to contribute a binary build, we are happy to put it
-up on the Cheeseshop.
-
-Long answer: Two of the bigger problems with the Windows system are the lack
-of a pre-installed standard compiler and the missing package management. Both
-make it non-trivial to build lxml on this platform. We are trying hard to
-make lxml as platform-independent as possible and it is regularly tested on
-Windows systems. However, we currently cannot provide Windows binary
-distributions ourselves.
-
-From time to time, users of different environments kindly contribute binary
-builds of lxml, most frequently for Windows or Mac-OS X. We put these on the
-Cheeseshop to make it as easy as possible for others to use lxml on their
-platform.
-
-If there is not currently a binary distribution of the most recent lxml
-release for your platform available from the Cheeseshop, please look through
-the older versions to see if they provide a binary build. This is done by
-appending the version number to the cheeseshop URL, e.g.:
-
- http://cheeseshop.python.org/pypi/lxml/1.1.2
-
-
What is the difference between lxml.etree and lxml.objectify?
-------------------------------------------------------------
@@ -159,6 +141,63 @@
.. _threading: #threading
+Installation
+============
+
+Which version of libxml2 and libxslt should I use or require?
+-------------------------------------------------------------
+
+It really depends on your application, but the rule of thumb is: more recent
+versions contain fewer bugs and provide more features.
+
+* Try to use versions of both libraries that were released together. At least
+ the libxml2 version should not be older than the libxslt version.
+
+* If you use XML Schema or Schematron which are still under development, the
+ most recent version of libxml2 is usually a good bet.
+
+* The same applies to XPath, where a substantial number of bugs and memory
+ leaks were fixed over time. If you encounter crashes or memory leaks in
+ XPath applications, try a more recent version of libxml2.
+
+* For parsing and fixing broken HTML, lxml requires at least libxml2 2.6.21.
+
+* For the normal tree handling, however, any libxml2 version starting with
+ 2.6.20 should do.
+
+Read the `release notes of libxml2`_ and the `release notes of libxslt`_ to
+see when (or if) a specific bug has been fixed.
+
+.. _`release notes of libxml2`: http://xmlsoft.org/news.html
+.. _`release notes of libxslt`: http://xmlsoft.org/XSLT/news.html
+
+
+Where are the Windows binaries?
+-------------------------------
+
+Short answer: If you want to contribute a binary build, we are happy to put it
+up on the Cheeseshop.
+
+Long answer: Two of the bigger problems with the Windows system are the lack
+of a pre-installed standard compiler and the missing package management. Both
+make it non-trivial to build lxml on this platform. We are trying hard to
+make lxml as platform-independent as possible and it is regularly tested on
+Windows systems. However, we currently cannot provide Windows binary
+distributions ourselves.
+
+From time to time, users of different environments kindly contribute binary
+builds of lxml, most frequently for Windows or Mac-OS X. We put these on the
+Cheeseshop to make it as easy as possible for others to use lxml on their
+platform.
+
+If there is not currently a binary distribution of the most recent lxml
+release for your platform available from the Cheeseshop, please look through
+the older versions to see if they provide a binary build. This is done by
+appending the version number to the cheeseshop URL, e.g.:
+
+ http://cheeseshop.python.org/pypi/lxml/1.1.2
+
+
Why do I get errors about missing UCS4 symbols when installing lxml?
--------------------------------------------------------------------
@@ -228,6 +267,11 @@
.. _ReST: http://docutils.sourceforge.net/rst.html
.. _`text files`: http://codespeak.net/svn/lxml/trunk/doc/
+* help with the tutorial. A tutorial is the most important starting point for
+ new users, so it is important for us to provide an easy to understand guide
+ into lxml. As with all documentation, the tutorial is work in progress, so we
+ appreciate every helping hand.
+
* improve the docstrings. lxml uses docstrings to support Python's integrated
online ``help()`` function. However, sometimes these are not sufficient to
grasp the details of the function in question. If you find such a place,
@@ -238,45 +282,68 @@
Bugs
====
-My application crashes! Why does lxml.etree do that?
-----------------------------------------------------
+My application crashes!
+-----------------------
One of the goals of lxml is "no segfaults", so if there is no clear warning in
the documentation that you were doing something potentially harmful, you have
found a bug and we would like to hear about it. Please report this bug to the
`mailing list`_. See the next section on how to do that.
+However, there are a few things to try first, to make sure the problem is
+really within lxml (or libxml2 or libxslt):
-I think I have found a bug in lxml. What should I do?
------------------------------------------------------
-
-a) First, you should look at the `current developer changelog`_ to see if this
- is a known problem that has already been fixed in the SVN trunk.
+a) If your application (or e.g. your web container) uses threads, please see
+ the FAQ section on threading to check if you touch on one of the
+ potential pitfalls.
+
+b) If you are on Mac OS X, make sure lxml uses the correct libraries. If you
+ have updated the old system libraries (e.g. through fink), this is best
+ achieved by building lxml statically to prevent the different library
+ versions from interfering. If you choose to use a dynamically linked
+ version, make sure the ``DYLD_LIBRARY_PATH`` environment variable
+ contains the directory where you installed the libraries.
+
+In any case, try to reproduce the problem with the latest versions of
+libxml2 and libxslt. From time to time, bugs and race conditions are found
+in these libraries, so a more recent version might already contain a fix for
+your problem.
- .. _`current developer changelog`: http://codespeak.net/svn/lxml/trunk/CHANGES.txt
-b) If you are using threads, please see the following section to check if
- you touch on one of the potential pitfalls.
+I think I have found a bug in lxml. What should I do?
+-----------------------------------------------------
-c) Try to reproduce the problem with the latest versions of libxml2 and
- libxslt. From time to time, bugs and race conditions are found in these
- libraries, so a more recent version might already contain a fix for your
- problem.
-
-d) Otherwise, we would really like to hear about it. Please report it to the
- `mailing list`_ so that we can fix it. It is very helpful in this case if
- you can come up with a short code snippet that demonstrates your problem.
- Please also report the version of lxml, libxml2 and libxslt that you are
- using by calling this::
-
- from lxml import etree
- print "lxml.etree: ", etree.LXML_VERSION
- print "libxml used: ", etree.LIBXML_VERSION
- print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION
- print "libxslt used: ", etree.LIBXSLT_VERSION
- print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION
+First, you should look at the `current developer changelog`_ to see if this
+is a known problem that has already been fixed in the SVN trunk since the
+release you are using.
+
+.. _`current developer changelog`: http://codespeak.net/svn/lxml/trunk/CHANGES.txt
+
+Also, the 'crash' section above has some good advice on what to try to see if
+the problem is really in lxml - and not in your setup. Believe it or not,
+that happens more often than you might think, especially when old libraries
+or even multiple library versions are installed.
+
+You should always try to reproduce the problem with the latest versions of
+libxml2 and libxslt - and make sure they are used (``lxml.etree`` can tell
+you what it runs with, see below).
+
+Otherwise, we would really like to hear about it. Please report it to the
+`mailing list`_ so that we can fix it. It is very helpful in this case if
+you can come up with a short code snippet that demonstrates your problem.
+If others can reproduce and see the problem, it is much easier for them to
+fix it - and maybe even easier for you to describe it and get people
+convinced that it really is a problem to fix. Please also report the
+version of lxml, libxml2 and libxslt that you are using by calling this::
+
+ from lxml import etree
+ print "lxml.etree: ", etree.LXML_VERSION
+ print "libxml used: ", etree.LIBXML_VERSION
+ print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION
+ print "libxslt used: ", etree.LIBXSLT_VERSION
+ print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION
- .. _`mailing list`: http://codespeak.net/mailman/listinfo/lxml-dev
+.. _`mailing list`: http://codespeak.net/mailman/listinfo/lxml-dev
Threading
Modified: lxml/branch/html/doc/compatibility.txt
==============================================================================
--- lxml/branch/html/doc/compatibility.txt (original)
+++ lxml/branch/html/doc/compatibility.txt Fri Jun 29 19:05:43 2007
@@ -1,3 +1,4 @@
+=============================
lxml.etree versus ElementTree
=============================
@@ -25,12 +26,8 @@
# use
from lxml import etree as ElementTree
-* Some minor parts of the API of ElementTree have not yet been implemented and
- are thus missing in lxml.etree. Feel free to help out!
-
-* Then again, lxml.etree offers a lot more functionality, such as
- XPath, XSLT, Relax NG, and XML Schema support, which (c)ElementTree
- does not offer.
+* lxml.etree offers a lot more functionality, such as XPath, XSLT, Relax NG,
+ and XML Schema support, which (c)ElementTree does not offer.
* etree has a different idea about Python unicode strings than ElementTree.
In most parts of the API, ElementTree uses plain strings and unicode strings
@@ -77,32 +74,40 @@
<c><b/></c>
- Unfortunately this is a rather fundamental difference in behavior, which
- will be hard to solve. It won't affect some applications, but if you want
- to port code you must unfortunately make sure that it doesn't.
+ Unfortunately this is a rather fundamental difference in behavior, which is
+ hard to change. It won't affect some applications, but if you want to port
+ code you must unfortunately make sure that it doesn't affect yours.
+
+* etree allows navigation to the parent of a node by the ``getparent()``
+ method and to the siblings by calling ``getnext()`` and ``getprevious()``.
+ This is not possible in ElementTree as the underlying tree model does not
+ have this information.
* When trying to set a subelement using __setitem__ that is in fact not an
Element but some other object, etree raises a TypeError, and ElementTree
raises an AssertionError. This also applies to some other places of the
- API. In general, etree tries to avoid AssertionErrors in favour of being
+ API. In general, etree tries to avoid AssertionErrors in favour of being
more specific about the reason for the exception.
-* When parsing fails in ``iterparse()``, ElementTree raises an ExpatError
- instead of a SyntaxError. lxml.etree follows the other parts of the parser
- API and raises an (XML)SyntaxError.
+* When parsing fails in ``iterparse()``, ElementTree raises a low-level
+ ExpatError instead of a SyntaxError as the other parsers do. lxml.etree
+ follows the other parts of the parser API and raises an (XML)SyntaxError.
* The ``iterparse()`` function in lxml is implemented based on the libxml2
- parser. This means that modifications of the document root or the ancestors
- of the current element during parsing can irritate the parser and even
- segfault. While this is not a problem in the Python object structure used
- by ElementTree, the C tree underlying lxml suffers from it. The golden rule
- for ``iterparse()`` on lxml therefore is: do not touch anything that will
- have to be touched again by the parser later on. See the lxml API
- documentation on this.
+ parser and tree generator. This means that modifications of the document
+ root or the ancestors of the current element during parsing can irritate the
+ parser and even segfault. While this is not a problem in the Python object
+ structure used by ElementTree, the C tree underlying lxml suffers from it.
+ The golden rule for ``iterparse()`` on lxml therefore is: do not touch
+ anything that will have to be touched again by the parser later on. See the
+ lxml parser documentation on this.
* ElementTree ignores comments and processing instructions when parsing XML,
while etree will read them in and treat them as Comment or
- ProcessingInstruction elements respectively.
+ ProcessingInstruction elements respectively. This is especially visible
+ where comments are found inside text content, which is then split by the
+ Comment element. You can disable this behaviour by passing the boolean
+ ``remove_comments`` keyword argument to the parser you use.
* ElementTree has a bug when serializing an empty Comment (no text argument
given) to XML, etree serializes this successfully.
@@ -113,18 +118,19 @@
* ElementTree merges the target of a processing instruction into ``PI.text``,
while lxml.etree puts it into the ``.target`` property and leaves it out of
- the ``.text`` property.
+ the ``.text`` property. The ``pi.text`` in ElementTree therefore
+ corresponds to ``pi.target + " " + pi.text`` in lxml.etree.
* Because etree is built on top of libxml2, which is namespace prefix aware,
etree preserves namespaces declarations and prefixes while ElementTree tends
to come up with its own prefixes (ns0, ns1, etc). When no namespace prefix
- is given however, etree creates ElementTree style prefixes as well.
+ is given, however, etree creates ElementTree style prefixes as well.
* etree has a 'prefix' attribute (read-only) on elements giving the Element's
prefix, if this is known, and None otherwise (in case of no namespace at
all, or default namespace).
- etree further allows passing an 'nsmap' dictionary to the Element and
+* etree further allows passing an 'nsmap' dictionary to the Element and
SubElement element factories to explicitly map namespace prefixes to
namespace URIs. These will be translated into namespace declarations on
that element. This means that in the probably rare case that you need to
@@ -132,13 +138,9 @@
ElementTree, you cannot pass it as a keyword argument to the Element and
SubElement factories directly.
-* etree elements can be copied using copy.deepcopy() and copy.copy(), just
- like ElementTree's. copy.copy() however does *not* create a shallow copy
- where elements are shared between trees, as this makes no sense in the
- context of libxml2 trees. Note that lxml can deep-copy trees considerably
- faster than ElementTree.
-
-* etree allows navigation to the parent of a node by the ``getparent()``
- method and to the siblings by calling ``getnext()`` and ``getprevious()``.
- This is not possible in ElementTree as the underlying tree model does not
- have this information.
+* etree elements can be copied using ``copy.deepcopy()`` and ``copy.copy()``,
+ just like ElementTree's. However, ``copy.copy()`` does *not* create a
+ shallow copy where elements are shared between trees, as this makes no sense
+ in the context of libxml2 trees. Note that lxml can deep-copy trees
+ considerably faster than ElementTree, so a deep copy might still be fast
+ enough to replace a shallow copy in your case.
Modified: lxml/branch/html/doc/html/style.css
==============================================================================
--- lxml/branch/html/doc/html/style.css (original)
+++ lxml/branch/html/doc/html/style.css Fri Jun 29 19:05:43 2007
@@ -205,6 +205,12 @@
font-style: italic;
}
+div.line-block {
+ font-family: Times, "Times New Roman", serif;
+ text-align: center;
+ font-size: 140%;
+}
+
code {
color: Black;
background-color: #cccccc;
Modified: lxml/branch/html/doc/intro.txt
==============================================================================
--- lxml/branch/html/doc/intro.txt (original)
+++ lxml/branch/html/doc/intro.txt Fri Jun 29 19:05:43 2007
@@ -14,21 +14,20 @@
To explain the motto:
-"Programming with libxml2 is like the thrilling embrace of an exotic
-stranger. It seems to have the potential to fulfill your wildest
-dreams, but there's a nagging voice somewhere in your head warning you
-that you're about to get screwed in the worst way." (`a quote by Mark
-Pilgrim`_)
-
-Mark Pilgrim was describing in particular the experience a Python
-programmer has when dealing with libxml2. libxml2's default Python
-bindings are fast, thrilling, powerful, and your code might fail in
-some horrible way that you really shouldn't have to worry about when
-writing Python code. lxml tries to combine the power of libxml2 with
-the ease of use of Python.
+"Programming with libxml2 is like the thrilling embrace of an exotic stranger.
+It seems to have the potential to fulfill your wildest dreams, but there's a
+nagging voice somewhere in your head warning you that you're about to get
+screwed in the worst way." (`a quote by Mark Pilgrim`_)
+
+Mark Pilgrim was describing in particular the experience a Python programmer
+has when dealing with libxml2. The default Python bindings of libxml2 are
+fast, thrilling, powerful, and your code might fail in some horrible way that
+you really shouldn't have to worry about when writing Python code. lxml
+combines the power of libxml2 with the ease of use of Python.
.. _`a quote by Mark Pilgrim`: http://diveintomark.org/archives/2004/02/18/libxml2
+
Aims
----
@@ -36,6 +35,8 @@
* Standards-compliant XML support.
+* Support for (broken) HTML.
+
* Full-featured.
* Actively maintained by XML experts.
@@ -46,8 +47,9 @@
.. _libxslt: http://xmlsoft.org/XSLT
-These libraries already ship with Python bindings, but these Python
-bindings have problems. In particular:
+
+These libraries already ship with Python bindings, but these Python bindings
+mimic the C-level interface. This yields a number of problems:
* very low level and C-ish (not Pythonic).
@@ -55,12 +57,13 @@
* UTF-8 in API, instead of Python unicode strings.
-* can cause segfaults from Python.
+* Can easily cause segfaults from Python.
+
+* Require manual memory management!
-* have to do manual memory management!
-lxml is a new Python binding for libxml2 and libxslt, completely
-independent from these existing Python bindings. Its aim:
+lxml is a new Python binding for libxml2 and libxslt, completely independent
+from these existing Python bindings. Its aims:
* Pythonic API.
@@ -72,9 +75,8 @@
* No manual memory management!
-lxml aims to provide a Pythonic API by following as much as possible
-the `ElementTree API`_. We're trying to avoid having to invent too
-many new APIs, or you having to learn new things -- XML is complicated
-enough.
+lxml aims to provide a Pythonic API by following as much as possible the
+`ElementTree API`_. We're trying to avoid inventing too many new APIs, or you
+having to learn new things -- XML is complicated enough.
.. _`ElementTree API`: http://effbot.org/zone/element-index.htm
Modified: lxml/branch/html/doc/main.txt
==============================================================================
--- lxml/branch/html/doc/main.txt (original)
+++ lxml/branch/html/doc/main.txt Fri Jun 29 19:05:43 2007
@@ -1,7 +1,15 @@
lxml
====
-.. contents::
+.. meta::
+ :description: lxml - the most feature-rich and easy-to-use library for working with XML and HTML in the Python language
+ :keywords: lxml, etree, objectify, Python, XML, HTML
+
+| lxml is the most feature-rich
+| and easy-to-use library
+| for working with XML and HTML
+| in the Python language.
+
..
1 Introduction
2 Documentation
@@ -14,9 +22,11 @@
Introduction
------------
-lxml is a Pythonic binding for the libxml2_ and libxslt_ libraries. See the
-introduction_ for more information about background and goals. Some common
-questions are answered in the FAQ_.
+lxml is a Pythonic binding for the libxml2_ and libxslt_ libraries. It is
+unique in that it combines the speed and feature completeness of these
+libraries with the simplicity of a native Python API. See the introduction_
+for more information about background and goals. Some common questions are
+answered in the FAQ_.
.. _libxml2: http://xmlsoft.org
.. _libxslt: http://xmlsoft.org/XSLT
@@ -119,11 +129,9 @@
.. _`lxml at the Python cheeseshop`: http://cheeseshop.python.org/pypi/lxml/
.. _`this key`: pubkey.asc
-The latest version is `lxml 1.3beta`_, released 2007-02-27 (`changes for 1.3beta`_).
+The latest version is `lxml 1.3`_, released 2007-06-24 (`changes for 1.3`_).
`Older versions`_ are listed below.
-.. _`lxml 1.3beta`: lxml-1.3beta.tgz
-.. _`CHANGES for 1.3beta`: changes-1.3beta.html
.. _`Older versions`: #old-versions
Please take a look at the `installation instructions`_!
@@ -150,16 +158,23 @@
Questions? Suggestions? Code to contribute? We have a `mailing list`_.
+You can search the archive with Gmane_ or Google_.
+
.. _`mailing list`: http://codespeak.net/mailman/listinfo/lxml-dev
+.. _Gmane: http://blog.gmane.org/gmane.comp.python.lxml.devel
+.. _Google: http://www.google.com/webhp?q=site:codespeak.net/mailman/listinfo/lxml-dev%20
License
-------
-The lxml library is shipped under a BSD license. libxml2 and libxslt2
-itself are shipped under the MIT license. There should therefore be no
+The lxml library is shipped under a `BSD license`_. libxml2 and libxslt
+themselves are shipped under the `MIT license`_. There should therefore be no
obstacle to using lxml in your codebase.
+.. _`BSD license`: http://codespeak.net/svn/lxml/trunk/doc/licenses/BSD.txt
+.. _`MIT license`: http://www.opensource.org/licenses/mit-license.html
+
Old Versions
------------
@@ -200,6 +215,7 @@
* `lxml 0.5`_, released 2005-04-08
+.. _`lxml 1.3`: lxml-1.3.tgz
.. _`lxml 1.2.1`: lxml-1.2.1.tgz
.. _`lxml 1.2`: lxml-1.2.tgz
.. _`lxml 1.1.2`: lxml-1.1.2.tgz
@@ -219,6 +235,7 @@
.. _`lxml 0.5.1`: lxml-0.5.1.tgz
.. _`lxml 0.5`: lxml-0.5.tgz
+.. _`CHANGES for 1.3`: changes-1.3.html
.. _`changes for 1.2.1`: changes-1.2.1.html
.. _`changes for 1.2`: changes-1.2.html
.. _`changes for 1.1.2`: changes-1.1.2.html
Modified: lxml/branch/html/doc/objectify.txt
==============================================================================
--- lxml/branch/html/doc/objectify.txt (original)
+++ lxml/branch/html/doc/objectify.txt Fri Jun 29 19:05:43 2007
@@ -267,6 +267,28 @@
notB
+Tree generation with the E-factory
+----------------------------------
+
+To simplify the generation of trees even further, you can use the E-factory::
+
+ >>> E = objectify.E
+ >>> root = E.root(
+ ... E.a(5),
+ ... E.b(6.1),
+ ... E.c(True),
+ ... E.d("how", tell="me")
+ ... )
+
+ >>> print etree.tostring(root, pretty_print=True)
+ <root>
+ <a>5</a>
+ <b>6.1</b>
+ <c>true</c>
+ <d tell="me">how</d>
+ </root>
+
+
Namespace handling
------------------
Modified: lxml/branch/html/doc/parsing.txt
==============================================================================
--- lxml/branch/html/doc/parsing.txt (original)
+++ lxml/branch/html/doc/parsing.txt Fri Jun 29 19:05:43 2007
@@ -1,9 +1,10 @@
-=====================
-Parsing XML with lxml
-=====================
-
-lxml provides a very simple and powerful API for parsing XML. It supports
-one-step parsing as well as step-by-step parsing using an event-driven API.
+==============================
+Parsing XML and HTML with lxml
+==============================
+
+lxml provides a very simple and powerful API for parsing XML and HTML. It
+supports one-step parsing as well as step-by-step parsing using an
+event-driven API (currently only for XML).
.. contents::
..
@@ -64,6 +65,10 @@
* remove_blank_text - discard blank text nodes between tags
+* remove_comments - discard comments
+
+* compact - use compact storage for short text content (on by default)
+
Parsing HTML
------------
Modified: lxml/branch/html/doc/performance.txt
==============================================================================
--- lxml/branch/html/doc/performance.txt (original)
+++ lxml/branch/html/doc/performance.txt Fri Jun 29 19:05:43 2007
@@ -466,8 +466,11 @@
Since then, lxml has matured a lot and has gotten much faster. The iterparse
variant now runs in 0.14 seconds, and if you remove the ``v.clear()``, it is
-even a little faster (which isn't the case for cElementTree). When you move
-the whole thing to a pure XPath implementation, it will look like this::
+even a little faster (which isn't the case for cElementTree).
+
+One of the many great tools in lxml is XPath, a Swiss army knife for finding
+things in XML documents. It is possible to move the whole thing to a pure
+XPath implementation, which looks like this::
def bench_lxml_xpath_all():
tree = etree.parse("ot.xml")
@@ -523,6 +526,11 @@
started with ``getiterator("v")`` or ``iterparse()``. Either of them would
already have been the most efficient, depending on which library is used.
+* It's important to know your tool. lxml and cElementTree are both very fast
+ libraries, but they do not have the same performance characteristics. The
+ fastest solution in one library can be comparatively slow in the other. If
+ you optimise, optimise for the specific target platform.
+
* It's not always worth optimising. After all that hassle we got from 0.12
seconds for the initial implementation to 0.11 seconds. Switching over to
cElementTree and writing an ``iterparse()`` based version would have given
Modified: lxml/branch/html/doc/tutorial.txt
==============================================================================
--- lxml/branch/html/doc/tutorial.txt (original)
+++ lxml/branch/html/doc/tutorial.txt Fri Jun 29 19:05:43 2007
@@ -31,8 +31,8 @@
>>> from lxml import etree
If your code only uses the ElementTree API and does not rely on any
-functionality that is specific to ``lxml.etree``, you can also use the
-following import chain as a fall-back to the original ElementTree::
+functionality that is specific to ``lxml.etree``, you can also use (any part
+of) the following import chain as a fall-back to the original ElementTree::
try:
from lxml import etree
@@ -108,7 +108,7 @@
------------------
To make the access to these subelements as easy and straight forward as
-possible, elements behave exactly like normal Python lists::
+possible, elements behave like normal Python lists::
>>> child = root[0]
>>> print child.tag
@@ -133,7 +133,7 @@
>>> print end[0].tag
child3
- >>> root[0] = root[-1]
+ >>> root[0] = root[-1] # this moves the element!
>>> for child in root:
... print child.tag
child3
@@ -239,9 +239,9 @@
>>> print etree.tostring(root)
<root>TEXT</root>
-In many XML documents (so-called *data-centric* documents), this is the only
-place where text can be found. It is encapsulated by a leaf tag at the very
-bottom of the tree hierarchy.
+In many XML documents (*data-centric* documents), this is the only place where
+text can be found. It is encapsulated by a leaf tag at the very bottom of the
+tree hierarchy.
However, if XML is used for tagged text documents such as (X)HTML, text can
also appear between different elements, right in the middle of the tree::
@@ -249,9 +249,9 @@
<html><body>Hello<br/>World</body></html>
Here, the ``<br/>`` tag is surrounded by text. This is often referred to as
-*document-style* XML. Elements support this through their ``tail`` property.
-It contains the text that directly follows the element, up to the next element
-in the XML tree::
+*document-style* or *mixed-content* XML. Elements support this through their
+``tail`` property. It contains the text that directly follows the element, up
+to the next element in the XML tree::
>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
@@ -280,8 +280,8 @@
If you want to use this more often, you can wrap it in a function::
- >>> buildTextList = etree.XPath("//text()") # lxml.etree only!
- >>> print buildTextList(html)
+ >>> build_text_list = etree.XPath("//text()") # lxml.etree only!
+ >>> print build_text_list(html)
['TEXT', 'TAIL']
.. _XPath: xpathxslt.html#xpath
@@ -344,9 +344,148 @@
The parse() function
--------------------
+
Namespaces
==========
+The ElementTree API avoids `namespace prefixes`_ wherever possible and deploys
+the real namespaces instead::
+
+ >>> xhtml = etree.Element("{http://www.w3.org/1999/xhtml}html")
+ >>> body = etree.SubElement(xhtml, "{http://www.w3.org/1999/xhtml}body")
+ >>> body.text = "Hello World"
+
+ >>> print etree.tostring(xhtml, pretty_print=True)
+ <ns0:html xmlns:ns0="http://www.w3.org/1999/xhtml">
+ <ns0:body>Hello World</ns0:body>
+ </ns0:html>
+
+.. _`namespace prefixes`: http://www.w3.org/TR/xml-names/#ns-qualnames
+
+As you can see, prefixes only become important when you serialise the result.
+However, the above code becomes somewhat verbose due to the lengthy namespace
+names. And retyping or copying a string over and over again is error prone.
+It is therefore common practice to store a namespace URI in a global variable.
+To adapt the namespace prefixes for serialisation, you can also pass a mapping
+to the Element factory, e.g. to define the default namespace::
+
+ >>> XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"
+ >>> XHTML = "{%s}" % XHTML_NAMESPACE
+
+ >>> NSMAP = {None : XHTML_NAMESPACE} # the default namespace (no prefix)
+
+ >>> xhtml = etree.Element(XHTML + "html", nsmap=NSMAP) # lxml only!
+ >>> body = etree.SubElement(xhtml, XHTML + "body")
+ >>> body.text = "Hello World"
+
+ >>> print etree.tostring(xhtml, pretty_print=True)
+ <html xmlns="http://www.w3.org/1999/xhtml">
+ <body>Hello World</body>
+ </html>
+
+Namespaces on attributes work alike::
+
+ >>> body.set(XHTML + "bgcolor", "#CCFFAA")
+
+ >>> print etree.tostring(xhtml, pretty_print=True)
+ <html xmlns="http://www.w3.org/1999/xhtml">
+ <body bgcolor="#CCFFAA">Hello World</body>
+ </html>
+
+ >>> print body.get("bgcolor")
+ None
+ >>> body.get(XHTML + "bgcolor")
+ '#CCFFAA'
+
+You can also use XPath in this way::
+
+ >>> find_xhtml_body = etree.ETXPath( # lxml only !
+ ... "//{%s}body" % XHTML_NAMESPACE)
+ >>> results = find_xhtml_body(xhtml)
+
+ >>> print results[0].tag
+ {http://www.w3.org/1999/xhtml}body
+
+
+The E-factory
+=============
+
+The ``E-factory`` provides a simple and compact syntax for generating XML and
+HTML::
+
+ >>> from lxml.builder import E
+
+ >>> def CLASS(*args): # class is a reserved word in Python
+ ... return {"class":' '.join(args)}
+
+ >>> html = page = (
+ ... E.html( # create an Element called "html"
+ ... E.head(
+ ... E.title("This is a sample document")
+ ... ),
+ ... E.body(
+ ... E.h1("Hello!", CLASS("title")),
+ ... E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
+ ... E.p("This is another paragraph, with a ",
+ ... E.a("link", href="http://www.python.org"), "."),
+ ... E.p("Here are some reservered characters: <spam&egg>."),
+ ... etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
+ ... )
+ ... )
+ ... )
+
+ >>> print etree.tostring(page, pretty_print=True)
+ <html>
+ <head>
+ <title>This is a sample document</title>
+ </head>
+ <body>
+ <h1 class="title">Hello!</h1>
+ <p>This is a paragraph with <b>bold</b> text in it!</p>
+ <p>This is another paragraph, with a <a href="http://www.python.org">link</a>.</p>
+ <p>Here are some reservered characters: &lt;spam&amp;egg&gt;.</p>
+ <p>And finally an embedded XHTML fragment.</p>
+ </body>
+ </html>
+
+The Element creation based on attribute access makes it easy to build up a
+simple vocabulary for an XML language::
+
+ >>> DOC = E.doc
+ >>> TITLE = E.title
+ >>> SECTION = E.section
+ >>> PAR = E.par
+
+ >>> my_doc = DOC(
+ ... TITLE("The dog and the hog"),
+ ... SECTION(
+ ... TITLE("The dog"),
+ ... PAR("Once upon a time, ..."),
+ ... PAR("And then ...")
+ ... ),
+ ... SECTION(
+ ... TITLE("The hog"),
+ ... PAR("Sooner or later ...")
+ ... )
+ ... )
+
+ >>> print etree.tostring(my_doc, pretty_print=True)
+ <doc>
+ <title>The dog and the hog</title>
+ <section>
+ <title>The dog</title>
+ <par>Once upon a time, ...</par>
+ <par>And then ...</par>
+ </section>
+ <section>
+ <title>The hog</title>
+ <par>Sooner or later ...</par>
+ </section>
+ </doc>
+
+One such example is the module ``lxml.html.builder``, which provides a
+vocabulary for HTML.
+
ElementPath
===========
Modified: lxml/branch/html/setup.py
==============================================================================
--- lxml/branch/html/setup.py (original)
+++ lxml/branch/html/setup.py Fri Jun 29 19:05:43 2007
@@ -1,12 +1,21 @@
-from ez_setup import use_setuptools
-use_setuptools(version="0.5")
-
-from setuptools import setup
import sys, os
-# need to insert this to python path so we're sure we can import
-# versioninfo and setupinfo even if we start setup.py from another
-# location (such as a buildout)
+try:
+ try:
+ import pkg_resources
+ pkg_resources.require("setuptools>=0.6c5")
+ except pkg_resources.VersionConflict, e:
+ from ez_setup import use_setuptools
+ use_setuptools(version="0.6c5")
+ raise ImportError
+ from setuptools import setup
+except ImportError:
+ # setuptools not installed
+ from distutils.core import setup
+
+# need to insert this to python path so we're sure we can import versioninfo,
+# setupinfo and Pyrex (!) even if we start setup.py from another location
+# (such as a buildout)
sys.path.insert(0, os.path.dirname(__file__))
import versioninfo
Modified: lxml/branch/html/setupinfo.py
==============================================================================
--- lxml/branch/html/setupinfo.py (original)
+++ lxml/branch/html/setupinfo.py Fri Jun 29 19:05:43 2007
@@ -1,5 +1,8 @@
import sys, os
-from setuptools.extension import Extension
+try:
+ from setuptools.extension import Extension
+except ImportError:
+ from distutils.extension import Extension
try:
from Pyrex.Distutils import build_ext as build_pyx
Modified: lxml/branch/html/src/lxml/ElementInclude.py
==============================================================================
--- lxml/branch/html/src/lxml/ElementInclude.py (original)
+++ lxml/branch/html/src/lxml/ElementInclude.py Fri Jun 29 19:05:43 2007
@@ -46,6 +46,8 @@
##
import copy, etree
+from urlparse import urljoin
+from urllib2 import urlopen
try:
set
@@ -95,7 +97,12 @@
if parse == "xml":
data = etree.parse(href, parser).getroot()
else:
- data = open(href).read()
+ if "://" in href:
+ f = urlopen(href)
+ else:
+ f = open(href)
+ data = f.read()
+ f.close()
if encoding:
data = data.decode(encoding)
return data
@@ -121,15 +128,20 @@
# @throws IOError If the function fails to load a given resource.
# @returns the node or its replacement if it was an XInclude node
-def include(elem, loader=None):
- if hasattr(elem, 'getroot'):
- #if hasattr(elem, 'docinfo'):
- # base_url = elem.docinfo.URL
- _include(elem.getroot(), loader)
- else:
- _include(elem, loader)
+def include(elem, loader=None, base_url=None):
+ if base_url is None:
+ if hasattr(elem, 'getroot'):
+ tree = elem
+ elem = elem.getroot()
+ else:
+ tree = elem.getroottree()
+ if hasattr(tree, 'docinfo'):
+ base_url = tree.docinfo.URL
+ elif hasattr(elem, 'getroot'):
+ elem = elem.getroot()
+ _include(elem, loader, base_url=base_url)
-def _include(elem, loader=None, _parent_hrefs=None):
+def _include(elem, loader=None, _parent_hrefs=None, base_url=None):
if loader is not None:
load_include = _wrap_et_loader(loader)
else:
@@ -146,7 +158,7 @@
for e in include_elements:
if e.tag == XINCLUDE_INCLUDE:
# process xinclude directive
- href = e.get("href")
+ href = urljoin(base_url, e.get("href"))
parse = e.get("parse", "xml")
parent = e.getparent()
if parse == "xml":
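Note on the ElementInclude.py change above: each XInclude ``href`` is now
resolved against the base URL of the including document through ``urljoin``.
The following is a minimal sketch of that resolution step, assuming Python 3,
where the function lives in ``urllib.parse`` rather than ``urlparse``. The
``resolve_href`` helper, its explicit ``None`` check and the example URLs are
hypothetical and not part of this commit.

```python
from urllib.parse import urljoin


def resolve_href(base_url, href):
    """Resolve an XInclude href against the including document's URL.

    Hypothetical helper for illustration only. The explicit None check
    is an addition of this sketch, not taken from the commit.
    """
    if base_url is None:
        # No base URL known: use the href exactly as written.
        return href
    return urljoin(base_url, href)


# A relative href is resolved next to the including document.
print(resolve_href("http://example.com/docs/main.xml", "parts/intro.xml"))
# Plain file system paths are joined the same way.
print(resolve_href("/srv/xml/main.xml", "../shared/footer.xml"))
# An absolute href replaces the base URL entirely.
print(resolve_href("http://example.com/a/b.xml", "http://other.org/c.xml"))
```

Resolving against the base URL means that a document loaded from a URL or
from another directory can still find includes that are written relative to
its own location, instead of relative to the current working directory.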
Modified: lxml/branch/html/src/lxml/apihelpers.pxi
==============================================================================
--- lxml/branch/html/src/lxml/apihelpers.pxi (original)
+++ lxml/branch/html/src/lxml/apihelpers.pxi Fri Jun 29 19:05:43 2007
@@ -459,6 +459,8 @@
* its name string equals the c_name string
"""
cdef char* c_node_href
+ if c_node is NULL:
+ return 0
if c_node.type != tree.XML_ELEMENT_NODE:
# not an element, only succeed if we match everything
return c_name is NULL and c_href is NULL
@@ -485,11 +487,17 @@
else:
return 0
-cdef void _removeNode(xmlNode* c_node):
- """Unlink and free a node and subnodes if possible.
+cdef void _removeNode(_Document doc, xmlNode* c_node):
+ """Unlink and free a node and subnodes if possible. Otherwise, make sure
+ it's self-contained.
"""
+ cdef xmlNode* c_next
+ c_next = c_node.next
tree.xmlUnlinkNode(c_node)
- attemptDeallocation(c_node)
+ _moveTail(c_next, c_node)
+ if not attemptDeallocation(c_node):
+ # make namespaces absolute
+ moveNodeToDocument(doc, c_node)
cdef void _moveTail(xmlNode* c_tail, xmlNode* c_target):
cdef xmlNode* c_next
@@ -517,7 +525,8 @@
c_target = c_new_tail
c_tail = _textNodeOrSkip(c_tail.next)
-cdef xmlNode* _deleteSlice(xmlNode* c_node, Py_ssize_t start, Py_ssize_t stop):
+cdef xmlNode* _deleteSlice(_Document doc, xmlNode* c_node,
+ Py_ssize_t start, Py_ssize_t stop):
"""Delete slice, starting with c_node, start counting at start, end at stop.
"""
cdef xmlNode* c_next
@@ -529,9 +538,9 @@
while c_node is not NULL and c < stop:
c_next = c_node.next
if _isElement(c_node):
- _removeText(c_node.next)
- c_next = c_node.next
- _removeNode(c_node)
+ while c_next is not NULL and not _isElement(c_next):
+ c_next = c_next.next
+ _removeNode(doc, c_node)
c = c + 1
c_node = c_next
return c_node
@@ -550,7 +559,7 @@
_moveTail(c_next, c_node)
# uh oh, elements may be pointing to different doc when
# parent element has moved; change them too..
- moveNodeToDocument(child, parent._doc)
+ moveNodeToDocument(parent._doc, c_node)
cdef void _appendSibling(_Element element, _Element sibling):
"""Append a new child to a parent element.
@@ -566,7 +575,7 @@
_moveTail(c_next, c_node)
# uh oh, elements may be pointing to different doc when
# parent element has moved; change them too..
- moveNodeToDocument(sibling, element._doc)
+ moveNodeToDocument(element._doc, c_node)
cdef void _prependSibling(_Element element, _Element sibling):
"""Append a new child to a parent element.
@@ -582,7 +591,7 @@
_moveTail(c_next, c_node)
# uh oh, elements may be pointing to different doc when
# parent element has moved; change them too..
- moveNodeToDocument(sibling, element._doc)
+ moveNodeToDocument(element._doc, c_node)
cdef int isutf8(char* s):
cdef char c
@@ -598,16 +607,20 @@
cdef char* s
cdef char* c_end
cdef char c
+ cdef int is_non_ascii
s = _cstr(pystring)
c_end = s + python.PyString_GET_SIZE(pystring)
+ is_non_ascii = 0
while s < c_end:
c = s[0]
- if c == c'\0':
- return -1 # invalid!
if c & 0x80:
- return 1 # non-ASCII
+ is_non_ascii = 1
+ elif c == c'\0':
+ return -1 # invalid!
+ elif is_non_ascii == 0 and not tree.xmlIsChar_ch(c):
+ return -1 # invalid!
s = s + 1
- return 0 # plain 7-bit ASCII
+ return is_non_ascii
cdef object funicode(char* s):
cdef Py_ssize_t slen
@@ -628,12 +641,15 @@
cdef object _utf8(object s):
if python.PyString_Check(s):
assert not isutf8py(s), \
- "All strings must be Unicode or ASCII"
- return s
+ "All strings must be XML compatible, either Unicode or ASCII"
elif python.PyUnicode_Check(s):
- return python.PyUnicode_AsUTF8String(s)
+ # FIXME: we should test these strings, too ...
+ s = python.PyUnicode_AsUTF8String(s)
+ assert isutf8py(s) != -1, \
+ "All strings must be XML compatible, either Unicode or ASCII"
else:
raise TypeError, "Argument must be string or unicode."
+ return s
cdef object _encodeFilename(object filename):
if filename is None:
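
Note on the isutf8py() change above: the function now distinguishes three outcomes (-1 invalid, 0 plain ASCII, 1 contains non-ASCII bytes) and rejects control bytes that are not legal XML characters. The following is a pure-Python transcription of that loop, offered as a sketch only. The names classify_bytes and is_xml_char_byte are hypothetical, and the single-byte character test is my reading of libxml2's xmlIsChar_ch (tab, LF, CR, or any byte from 0x20 up):

```python
def is_xml_char_byte(c):
    """Assumed equivalent of xmlIsChar_ch for a single byte value."""
    return c in (0x09, 0x0A, 0x0D) or c >= 0x20

def classify_bytes(data):
    """Return -1 (invalid), 0 (plain 7-bit ASCII) or 1 (has non-ASCII bytes)."""
    is_non_ascii = 0
    for c in data:  # iterating bytes yields ints in Python 3
        if c & 0x80:
            is_non_ascii = 1
        elif c == 0:
            return -1  # NUL byte is never allowed
        elif is_non_ascii == 0 and not is_xml_char_byte(c):
            return -1  # control byte that XML does not permit
    return is_non_ascii

print(classify_bytes(b"hello"))
print(classify_bytes(b"ha\x07ho"))
print(classify_bytes("caf\u00e9".encode("utf-8")))
```

In this transcription the XML character test is only applied while no non-ASCII byte has been seen, which mirrors the condition in the diff.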
Modified: lxml/branch/html/src/lxml/builder.py
==============================================================================
--- lxml/branch/html/src/lxml/builder.py (original)
+++ lxml/branch/html/src/lxml/builder.py Fri Jun 29 19:05:43 2007
@@ -1,10 +1,37 @@
-"""
-Element generator factory by Fredrik Lundh.
-
-Source:
- http://online.effbot.org/2006_11_01_archive.htm#et-builder
- http://effbot.python-hosting.com/file/stuff/sandbox/elementlib/builder.py
-"""
+#
+# Element generator factory by Fredrik Lundh.
+#
+# Source:
+# http://online.effbot.org/2006_11_01_archive.htm#et-builder
+# http://effbot.python-hosting.com/file/stuff/sandbox/elementlib/builder.py
+#
+# --------------------------------------------------------------------
+# The ElementTree toolkit is
+#
+# Copyright (c) 1999-2004 by Fredrik Lundh
+#
+# By obtaining, using, and/or copying this software and/or its
+# associated documentation, you agree that you have read, understood,
+# and will comply with the following terms and conditions:
+#
+# Permission to use, copy, modify, and distribute this software and
+# its associated documentation for any purpose and without fee is
+# hereby granted, provided that the above copyright notice appears in
+# all copies, and that both that copyright notice and this permission
+# notice appear in supporting documentation, and that the name of
+# Secret Labs AB or the author not be used in advertising or publicity
+# pertaining to distribution of the software without specific, written
+# prior permission.
+#
+# SECRET LABS AB AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD
+# TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANT-
+# ABILITY AND FITNESS. IN NO EVENT SHALL SECRET LABS AB OR THE AUTHOR
+# BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY
+# DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
+# WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
+# ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
+# OF THIS SOFTWARE.
+# --------------------------------------------------------------------
import etree as ET
@@ -113,7 +140,10 @@
elem[-1].tail = (elem[-1].tail or "") + item
else:
elem.text = (elem.text or "") + item
- typemap[str] = typemap[unicode] = add_text
+ if str not in typemap:
+ typemap[str] = add_text
+ if unicode not in typemap:
+ typemap[unicode] = add_text
def add_dict(elem, item):
attrib = elem.attrib
@@ -122,7 +152,8 @@
attrib[k] = v
else:
attrib[k] = typemap[type(v)](None, v)
- typemap[dict] = add_dict
+ if dict not in typemap:
+ typemap[dict] = add_dict
self._typemap = typemap
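
Note on the typemap change above: the built-in handlers are now installed only when the caller has not already supplied one for that type, so a custom typemap entry is no longer overwritten. A minimal sketch of that pattern, with hypothetical placeholder handlers and a hypothetical helper name build_typemap (the real code does this inside ElementMaker.__init__):

```python
def default_add_text(elem, item):
    """Placeholder for the built-in text handler."""
    return "default text"

def default_add_dict(elem, item):
    """Placeholder for the built-in attribute handler."""
    return "default dict"

def build_typemap(typemap=None):
    """Copy the caller's typemap and fill in defaults for missing types only."""
    typemap = dict(typemap) if typemap else {}
    if str not in typemap:
        typemap[str] = default_add_text
    if dict not in typemap:
        typemap[dict] = default_add_dict
    return typemap

def custom_text(elem, item):
    """Caller-supplied handler that must survive construction."""
    return "custom text"

merged = build_typemap({str: custom_text})
print(merged[str].__name__, merged[dict].__name__)
```
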
Modified: lxml/branch/html/src/lxml/etree.pyx
==============================================================================
--- lxml/branch/html/src/lxml/etree.pyx (original)
+++ lxml/branch/html/src/lxml/etree.pyx Fri Jun 29 19:05:43 2007
@@ -243,8 +243,8 @@
#displayNode(<xmlNode*>self._c_doc, 0)
#print <long>self._c_doc, self._c_doc.dict is __GLOBAL_PARSER_CONTEXT._c_dict
#print <long>self._c_doc, canDeallocateChildNodes(<xmlNode*>self._c_doc)
- #tree.xmlFreeDoc(c_doc)
- _deallocDocument(self._c_doc)
+ tree.xmlFreeDoc(self._c_doc)
+ #_deallocDocument(self._c_doc)
cdef getroot(self):
cdef xmlNode* c_node
@@ -453,8 +453,9 @@
_removeText(c_node.next)
tree.xmlReplaceNode(c_node, element._c_node)
_moveTail(c_next, element._c_node)
- moveNodeToDocument(element, self._doc)
- attemptDeallocation(c_node)
+ moveNodeToDocument(self._doc, element._c_node)
+ if not attemptDeallocation(c_node):
+ moveNodeToDocument(self._doc, c_node)
def __delitem__(self, Py_ssize_t index):
"""Deletes the given subelement.
@@ -464,14 +465,14 @@
if c_node is NULL:
raise IndexError, index
_removeText(c_node.next)
- _removeNode(c_node)
+ _removeNode(self._doc, c_node)
def __delslice__(self, Py_ssize_t start, Py_ssize_t stop):
"""Deletes a number of subelements.
"""
cdef xmlNode* c_node
c_node = _findChild(self._c_node, start)
- _deleteSlice(c_node, start, stop)
+ _deleteSlice(self._doc, c_node, start, stop)
def __setslice__(self, Py_ssize_t start, Py_ssize_t stop, value):
"""Replaces a number of subelements with elements
@@ -486,8 +487,8 @@
else:
c_node = _findChild(self._c_node, start)
# now delete the slice
- if start != stop:
- c_node = _deleteSlice(c_node, start, stop)
+ if c_node is not NULL and start != stop:
+ c_node = _deleteSlice(self._doc, c_node, start, stop)
# if the insertion point is at the end, append there
if c_node is NULL:
for element in value:
@@ -500,12 +501,11 @@
# store possible text tail
c_next = element._c_node.next
# now move node previous to insertion point
- tree.xmlUnlinkNode(element._c_node)
tree.xmlAddPrevSibling(c_node, element._c_node)
# and move tail just behind his node
_moveTail(c_next, element._c_node)
# move it into a new document
- moveNodeToDocument(element, self._doc)
+ moveNodeToDocument(self._doc, element._c_node)
def __deepcopy__(self, memo):
return self.__copy__()
@@ -597,9 +597,9 @@
while c_node is not NULL:
c_node_next = c_node.next
if _isElement(c_node):
- _removeText(c_node_next)
- c_node_next = c_node.next
- _removeNode(c_node)
+ while c_node_next is not NULL and not _isElement(c_node_next):
+ c_node_next = c_node_next.next
+ _removeNode(self._doc, c_node)
c_node = c_node_next
def insert(self, index, _Element element not None):
@@ -614,7 +614,7 @@
c_next = element._c_node.next
tree.xmlAddPrevSibling(c_node, element._c_node)
_moveTail(c_next, element._c_node)
- moveNodeToDocument(element, self._doc)
+ moveNodeToDocument(self._doc, element._c_node)
def remove(self, _Element element not None):
"""Removes a matching subelement. Unlike the find methods, this
@@ -629,6 +629,8 @@
c_next = element._c_node.next
tree.xmlUnlinkNode(c_node)
_moveTail(c_next, c_node)
+ # fix namespace declarations
+ moveNodeToDocument(self._doc, c_node)
def replace(self, _Element old_element not None,
_Element new_element not None):
@@ -647,7 +649,9 @@
tree.xmlReplaceNode(c_old_node, c_new_node)
_moveTail(c_new_next, c_new_node)
_moveTail(c_old_next, c_old_node)
- moveNodeToDocument(new_element, self._doc)
+ moveNodeToDocument(self._doc, c_new_node)
+ # fix namespace declarations
+ moveNodeToDocument(self._doc, c_old_node)
# PROPERTIES
property tag:
@@ -1424,7 +1428,7 @@
FTP.
Note that XInclude does not support custom resolvers in Python space
- due to restrictions of libxml2 <= 2.6.28.
+ due to restrictions of libxml2 <= 2.6.29.
"""
cdef python.PyThreadState* state
cdef int result
@@ -1496,10 +1500,11 @@
if python.PyTuple_GET_SIZE(default) == 0:
raise KeyError, key
else:
- return python.PyTuple_GET_ITEM(default, 0)
+ result = python.PyTuple_GET_ITEM(default, 0)
+ python.Py_INCREF(result)
else:
_delAttribute(self._element, key)
- return result
+ return result
def clear(self):
cdef xmlNode* c_node
Modified: lxml/branch/html/src/lxml/iterparse.pxi
==============================================================================
--- lxml/branch/html/src/lxml/iterparse.pxi (original)
+++ lxml/branch/html/src/lxml/iterparse.pxi Fri Jun 29 19:05:43 2007
@@ -3,7 +3,7 @@
cdef object __ITERPARSE_CHUNK_SIZE
__ITERPARSE_CHUNK_SIZE = 32768
-ctypedef enum IterparseEventFilter:
+ctypedef enum _IterparseEventFilter:
ITERPARSE_FILTER_START = 1
ITERPARSE_FILTER_END = 2
ITERPARSE_FILTER_START_NS = 4
@@ -234,13 +234,15 @@
* load_dtd - use DTD for parsing
* no_network - prevent network access
* remove_blank_text - discard blank text nodes
+ * remove_comments - discard comments
"""
cdef object _source
cdef object _filename
cdef readonly object root
def __init__(self, source, events=("end",), tag=None,
attribute_defaults=False, dtd_validation=False,
- load_dtd=False, no_network=False, remove_blank_text=False):
+ load_dtd=False, no_network=False, remove_blank_text=False,
+ remove_comments=False):
cdef _IterparseContext context
cdef char* c_filename
cdef int parse_options
@@ -257,7 +259,7 @@
c_filename = NULL
self._source = source
- _BaseParser.__init__(self, _IterparseContext)
+ _BaseParser.__init__(self, remove_comments, _IterparseContext)
parse_options = _XML_DEFAULT_PARSE_OPTIONS
if load_dtd:
Modified: lxml/branch/html/src/lxml/objectify.pyx
==============================================================================
--- lxml/branch/html/src/lxml/objectify.pyx (original)
+++ lxml/branch/html/src/lxml/objectify.pyx Fri Jun 29 19:05:43 2007
@@ -65,6 +65,8 @@
cdef object islice
from itertools import islice
+cdef object _ElementMaker
+from builder import ElementMaker as _ElementMaker
# namespace/name for "pytype" hint attribute
cdef object PYTYPE_NAMESPACE
@@ -759,7 +761,7 @@
return self.__nonzero__()
def __checkBool(s):
- if s != 'true' and s != 'false':
+ if s != 'true' and s != 'false' and s != '1' and s != '0':
raise ValueError
cdef object _strValueOf(obj):
@@ -903,7 +905,7 @@
pytype.register()
pytype = PyType('float', float, FloatElement)
- pytype.xmlSchemaTypes = ("float", "double")
+ pytype.xmlSchemaTypes = ("double", "float")
pytype.register()
pytype = PyType('bool', __checkBool, BoolElement)
@@ -1455,8 +1457,6 @@
# Type annotations
cdef PyType _check_type(tree.xmlNode* c_node, PyType pytype):
- # StrType does not have a typecheck but is the default anyway,
- # so just accept it if given as type information
if pytype is None:
return None
value = textOf(c_node)
@@ -1468,34 +1468,114 @@
pass
return None
-def annotate(element_or_tree, ignore_old=True):
+def annotate(element_or_tree, ignore_old=True, ignore_xsi=False,
+ empty_pytype=None):
"""Recursively annotates the elements of an XML tree with 'pytype'
attributes.
If the 'ignore_old' keyword argument is True (the default), current 'pytype'
attributes will be ignored and replaced. Otherwise, they will be checked
and only replaced if they no longer fit the current text value.
+
+ Setting the keyword argument ``ignore_xsi`` to True makes the function
+ additionally ignore existing ``xsi:type`` annotations. The default is to
+ use them as a type hint.
+
+ The default annotation of empty elements can be set with the
+ ``empty_pytype`` keyword argument. The default is not to annotate empty
+ elements. Pass 'str', for example, to make string values the default.
"""
cdef _Element element
+ element = cetree.rootNodeOrRaise(element_or_tree)
+ _annotate(element, 0, 1, bool(ignore_xsi), bool(ignore_old),
+ None, empty_pytype)
+
+def xsiannotate(element_or_tree, ignore_old=True, ignore_pytype=False,
+ empty_type=None):
+ """Recursively annotates the elements of an XML tree with 'xsi:type'
+ attributes.
+
+ If the 'ignore_old' keyword argument is True (the default), current
+ 'xsi:type' attributes will be ignored and replaced. Otherwise, they will be
+ checked and only replaced if they no longer fit the current text value.
+
+ Note that the mapping from Python types to XSI types is usually ambiguous.
+ Currently, only the first XSI type name in the corresponding PyType
+ definition will be used for annotation. Thus, you should consider naming
+ the widest type first if you define additional types.
+
+ Setting the keyword argument ``ignore_pytype`` to True makes the function
+ additionally ignore existing ``pytype`` annotations. The default is to
+ use them as a type hint.
+
+ The default annotation of empty elements can be set with the
+ ``empty_type`` keyword argument. The default is not to annotate empty
+ elements. Pass 'string', for example, to make string values the default.
+ """
+ cdef _Element element
+ element = cetree.rootNodeOrRaise(element_or_tree)
+ _annotate(element, 1, 0, bool(ignore_old), bool(ignore_pytype),
+ empty_type, None)
+
+cdef _annotate(_Element element, int annotate_xsi, int annotate_pytype,
+ int ignore_xsi, int ignore_pytype,
+ empty_type_name, empty_pytype_name):
cdef _Document doc
- cdef int ignore
cdef tree.xmlNode* c_node
cdef tree.xmlNs* c_ns
cdef python.PyObject* dict_result
- cdef PyType pytype
- element = cetree.rootNodeOrRaise(element_or_tree)
+ cdef PyType pytype, empty_pytype, StrType, NoneType
+
+ if not annotate_xsi and not annotate_pytype:
+ return
+
doc = element._doc
- ignore = bool(ignore_old)
+
+ if empty_type_name is not None:
+ dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, empty_type_name)
+ elif empty_pytype_name is not None:
+ dict_result = python.PyDict_GetItem(_PYTYPE_DICT, empty_pytype_name)
+ else:
+ dict_result = NULL
+ if dict_result is not NULL:
+ empty_pytype = <PyType>dict_result
+ else:
+ empty_pytype = None
StrType = _PYTYPE_DICT.get('str')
NoneType = _PYTYPE_DICT.get('none')
c_node = element._c_node
tree.BEGIN_FOR_EACH_ELEMENT_FROM(c_node, c_node, 1)
if c_node.type == tree.XML_ELEMENT_NODE:
+ typename = None
pytype = None
value = None
- if not ignore:
- # check that old value is valid
+ istree = 0
+ # if element is defined as xsi:nil, represent it as None
+ if cetree.attributeValueFromNsName(
+ c_node, _XML_SCHEMA_INSTANCE_NS, "nil") == "true":
+ pytype = NoneType
+
+ if pytype is None and not ignore_xsi:
+ # check that old xsi type value is valid
+ typename = cetree.attributeValueFromNsName(
+ c_node, _XML_SCHEMA_INSTANCE_NS, "type")
+ if typename is not None:
+ dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, typename)
+ if dict_result is NULL and ':' in typename:
+ prefix, typename = typename.split(':', 1)
+ dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, typename)
+ if dict_result is not NULL:
+ pytype = <PyType>dict_result
+ if pytype is not StrType:
+ # StrType does not have a typecheck but is the default anyway,
+ # so just accept it if given as type information
+ pytype = _check_type(c_node, pytype)
+ if pytype is None:
+ typename = None
+
+ if pytype is None and not ignore_pytype:
+ # check that old pytype value is valid
old_value = cetree.attributeValueFromNsName(
c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME)
if old_value is not None and old_value != TREE_PYTYPE:
@@ -1508,43 +1588,73 @@
pytype = _check_type(c_node, pytype)
if pytype is None:
- # if element is defined as xsi:nil, represent it as None
- if cetree.attributeValueFromNsName(
- c_node, _XML_SCHEMA_INSTANCE_NS, "nil") == "true":
- pytype = NoneType
-
- if pytype is None:
- # check for XML Schema type hint
- value = cetree.attributeValueFromNsName(
- c_node, _XML_SCHEMA_INSTANCE_NS, "type")
-
- if value is not None:
- dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, value)
- if dict_result is NULL and ':' in value:
- prefix, value = value.split(':', 1)
- dict_result = python.PyDict_GetItem(_SCHEMA_TYPE_DICT, value)
- if dict_result is not NULL:
- pytype = <PyType>dict_result
-
- if pytype is None:
# try to guess type
if cetree.findChildForwards(c_node, 0) is NULL:
# element has no children => data class
pytype = _guessPyType(textOf(c_node), StrType)
+ else:
+ istree = 1
if pytype is None:
- # delete attribute if it exists
- cetree.delAttributeFromNsName(
- c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME)
- else:
- # update or create attribute
- c_ns = cetree.findOrBuildNodeNsPrefix(
- doc, c_node, _PYTYPE_NAMESPACE, 'py')
- tree.xmlSetNsProp(c_node, c_ns, _PYTYPE_ATTRIBUTE_NAME,
- _cstr(pytype.name))
+ # use default type for empty elements
+ if textOf(c_node) is None:
+ pytype = empty_pytype
+ if typename is None:
+ typename = empty_type_name
+ else:
+ pytype = StrType
+
+ if pytype is not None:
+ if typename is None:
+ if not istree:
+ if python.PyList_GET_SIZE(pytype._schema_types) > 0:
+ # pytype->xsi:type is a 1:n mapping
+ # simply take the first
+ typename = pytype._schema_types[0]
+ elif typename not in pytype._schema_types:
+ typename = pytype._schema_types[0]
+
+ if annotate_xsi:
+ if typename is None or istree:
+ cetree.delAttributeFromNsName(
+ c_node, _XML_SCHEMA_INSTANCE_NS, "type")
+ else:
+ # update or create attribute
+ c_ns = cetree.findOrBuildNodeNsPrefix(
+ doc, c_node, _XML_SCHEMA_NS, 'xsd')
+ if c_ns is not NULL:
+ if ':' in typename:
+ prefix, name = typename.split(':', 1)
+ if c_ns.prefix is NULL or c_ns.prefix[0] == c'\0':
+ typename = name
+ elif cstd.strcmp(_cstr(prefix), c_ns.prefix) != 0:
+ prefix = c_ns.prefix
+ typename = prefix + ':' + name
+ elif c_ns.prefix is not NULL and c_ns.prefix[0] != c'\0':
+ prefix = c_ns.prefix
+ typename = prefix + ':' + typename
+ c_ns = cetree.findOrBuildNodeNsPrefix(
+ doc, c_node, _XML_SCHEMA_INSTANCE_NS, 'xsi')
+ tree.xmlSetNsProp(c_node, c_ns, "type", _cstr(typename))
+
+ if annotate_pytype:
+ if pytype is None:
+ # delete attribute if it exists
+ cetree.delAttributeFromNsName(
+ c_node, _PYTYPE_NAMESPACE, _PYTYPE_ATTRIBUTE_NAME)
+ else:
+ # update or create attribute
+ c_ns = cetree.findOrBuildNodeNsPrefix(
+ doc, c_node, _PYTYPE_NAMESPACE, 'py')
+ tree.xmlSetNsProp(c_node, c_ns, _PYTYPE_ATTRIBUTE_NAME,
+ _cstr(pytype.name))
+ if pytype is NoneType:
+ c_ns = cetree.findOrBuildNodeNsPrefix(
+ doc, c_node, _XML_SCHEMA_INSTANCE_NS, 'xsi')
+ tree.xmlSetNsProp(c_node, c_ns, "nil", "true")
tree.END_FOR_EACH_ELEMENT_FROM(c_node)
-def xsiannotate(element_or_tree, ignore_old=True):
+def __xsiannotate(element_or_tree, ignore_old=True):
"""Recursively annotates the elements of an XML tree with 'xsi:type'
attributes.
@@ -1694,6 +1804,9 @@
objectify_parser = __DEFAULT_PARSER
def setDefaultParser(new_parser = None):
+ set_default_parser(new_parser)
+
+def set_default_parser(new_parser = None):
"""Replace the default parser used by objectify's Element() and
fromstring() functions.
@@ -1735,6 +1848,42 @@
parser = objectify_parser
return _parse(f, parser)
+class ElementMaker(_ElementMaker):
+ def __init__(self, typemap=None):
+ if typemap is None:
+ typemap = {}
+ else:
+ typemap = typemap.copy()
+
+ typemap[__builtin__.str] = __add_text
+ typemap[__builtin__.unicode] = __add_text
+ typemap[__builtin__.int] = __add_text
+ typemap[__builtin__.long] = __add_text
+ typemap[__builtin__.float] = __add_text
+ typemap[__builtin__.bool] = __add_text
+
+ _ElementMaker.__init__(self, typemap, objectify_parser.makeelement)
+
+def __add_text(_Element elem not None, text):
+ cdef tree.xmlNode* c_child
+ if isinstance(text, bool):
+ text = str(text).lower()
+ else:
+ text = str(text)
+ c_child = cetree.findChildBackwards(elem._c_node, 0)
+ if c_child is not NULL:
+ old = cetree.tailOf(c_child)
+ if old is not None:
+ text = old + text
+ cetree.setTailText(c_child, text)
+ else:
+ old = cetree.textOf(elem._c_node)
+ if old is not None:
+ text = old + text
+ cetree.setNodeText(elem._c_node, text)
+
+E = ElementMaker()
+
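
Note on __add_text() above: the value is converted to text (booleans become lowercase "true"/"false"), then appended to the tail of the last child if there is one, otherwise to the text of the element itself. A rough equivalent using the standard library's xml.etree.ElementTree instead of lxml's C-level helpers; the function name add_text and the sample tree are hypothetical:

```python
import xml.etree.ElementTree as ET

def add_text(elem, value):
    """Append a value as text, following the rule used by __add_text."""
    if isinstance(value, bool):
        text = str(value).lower()  # True -> "true", False -> "false"
    else:
        text = str(value)
    if len(elem):
        # There is a last child: the text goes into its tail.
        last = elem[-1]
        last.tail = (last.tail or "") + text
    else:
        # No children yet: the text goes into the element itself.
        elem.text = (elem.text or "") + text

root = ET.Element("root")
add_text(root, 42)
add_text(root, True)
ET.SubElement(root, "child")
add_text(root, 1.5)

print(ET.tostring(root))
```
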
cdef object _DEFAULT_NSMAP
_DEFAULT_NSMAP = { "py" : PYTYPE_NAMESPACE,
"xsi" : XML_SCHEMA_INSTANCE_NS,
Modified: lxml/branch/html/src/lxml/parser.pxi
==============================================================================
--- lxml/branch/html/src/lxml/parser.pxi (original)
+++ lxml/branch/html/src/lxml/parser.pxi Fri Jun 29 19:05:43 2007
@@ -367,7 +367,7 @@
cdef ElementClassLookup _class_lookup
cdef python.PyThread_type_lock _parser_lock
- def __init__(self, context_class=_ResolverContext):
+ def __init__(self, remove_comments, context_class=_ResolverContext):
cdef xmlParserCtxt* pctxt
if isinstance(self, HTMLParser):
self._parser_type = LXML_HTML_PARSER
@@ -384,8 +384,11 @@
if pctxt is NULL:
python.PyErr_NoMemory()
if pctxt.sax != NULL:
+ if remove_comments:
+ pctxt.sax.comment = NULL
# hard switch-off for CDATA nodes => makes them plain text
pctxt.sax.cdataBlock = NULL
+
if not config.ENABLE_THREADING or \
self._parser_type == LXML_ITERPARSE_PARSER:
# no threading
@@ -690,6 +693,7 @@
* ns_clean - clean up redundant namespace declarations
* recover - try hard to parse through broken XML
* remove_blank_text - discard blank text nodes
+ * remove_comments - discard comments
* compact - save memory for short text content (default: True)
* resolve_entities - replace entities by their text value (default: True)
@@ -700,9 +704,9 @@
def __init__(self, attribute_defaults=False, dtd_validation=False,
load_dtd=False, no_network=True, ns_clean=False,
recover=False, remove_blank_text=False, compact=True,
- resolve_entities=True):
+ resolve_entities=True, remove_comments=False):
cdef int parse_options
- _BaseParser.__init__(self)
+ _BaseParser.__init__(self, remove_comments)
parse_options = _XML_DEFAULT_PARSE_OPTIONS
if load_dtd:
@@ -823,15 +827,16 @@
* recover - try hard to parse through broken HTML (default: True)
* no_network - prevent network access (default: True)
* remove_blank_text - discard empty text nodes
+ * remove_comments - discard comments
* compact - save memory for short text content (default: True)
Note that you should avoid sharing parsers between threads for performance
reasons.
"""
def __init__(self, recover=True, no_network=True, remove_blank_text=False,
- compact=True):
+ compact=True, remove_comments=False):
cdef int parse_options
- _BaseParser.__init__(self)
+ _BaseParser.__init__(self, remove_comments)
parse_options = _HTML_DEFAULT_PARSE_OPTIONS
if remove_blank_text:
Modified: lxml/branch/html/src/lxml/proxy.pxi
==============================================================================
--- lxml/branch/html/src/lxml/proxy.pxi (original)
+++ lxml/branch/html/src/lxml/proxy.pxi Fri Jun 29 19:05:43 2007
@@ -27,6 +27,8 @@
#print "registering for:", <int>proxy._c_node
assert c_node._private is NULL, "double registering proxy!"
c_node._private = <void*>proxy
+ # additional INCREF to make sure _Document is GC-ed LAST!
+ python.Py_INCREF(proxy._doc)
cdef unregisterProxy(_Element proxy):
"""Unregister a proxy for the node it's proxying for.
@@ -35,6 +37,7 @@
c_node = proxy._c_node
assert c_node._private is <void*>proxy, "Tried to unregister unknown proxy"
c_node._private = NULL
+ python.Py_DECREF(proxy._doc)
################################################################################
# temporarily make a node the root node of its document
@@ -56,6 +59,7 @@
c_new_root = tree.xmlDocCopyNode(c_node, c_doc, 2) # non recursive!
tree.xmlDocSetRootElement(c_doc, c_new_root)
_copyParentNamespaces(c_node, c_new_root)
+ _copyParentNamespaces(c_node, c_root)
c_new_root.children = c_node.children
c_new_root.last = c_node.last
@@ -115,19 +119,21 @@
################################################################################
# support for freeing tree elements when proxy objects are destroyed
-cdef void attemptDeallocation(xmlNode* c_node):
+cdef int attemptDeallocation(xmlNode* c_node):
"""Attempt deallocation of c_node (or higher up in tree).
"""
cdef xmlNode* c_top
# could be we actually aren't referring to the tree at all
if c_node is NULL:
#print "not freeing, node is NULL"
- return
+ return 0
c_top = getDeallocationTop(c_node)
if c_top is not NULL:
#print "freeing:", c_top.name
_removeText(c_top.next) # tail
tree.xmlFreeNode(c_top)
+ return 1
+ return 0
cdef xmlNode* getDeallocationTop(xmlNode* c_node):
"""Return the top of the tree that can be deallocated, or NULL.
@@ -167,30 +173,30 @@
tree.END_FOR_EACH_ELEMENT_FROM(c_node)
return 1
-cdef void _deallocDocument(xmlDoc* c_doc):
- """We cannot rely on Python's GC to *always* dealloc the _Document *after*
- all proxies it contains => traverse the document and mark all its proxies
- as dead by deleting their xmlNode* reference.
- """
- cdef xmlNode* c_node
- c_node = c_doc.children
- tree.BEGIN_FOR_EACH_ELEMENT_FROM(<xmlNode*>c_doc, c_node, 1)
- if c_node._private is not NULL:
- (<_Element>c_node._private)._c_node = NULL
- tree.END_FOR_EACH_ELEMENT_FROM(c_node)
- tree.xmlFreeDoc(c_doc)
+## cdef void _deallocDocument(xmlDoc* c_doc):
+## """We cannot rely on Python's GC to *always* dealloc the _Document *after*
+## all proxies it contains => traverse the document and mark all its proxies
+## as dead by deleting their xmlNode* reference.
+## """
+## cdef xmlNode* c_node
+## c_node = c_doc.children
+## tree.BEGIN_FOR_EACH_ELEMENT_FROM(<xmlNode*>c_doc, c_node, 1)
+## if c_node._private is not NULL:
+## (<_Element>c_node._private)._c_node = NULL
+## tree.END_FOR_EACH_ELEMENT_FROM(c_node)
+## tree.xmlFreeDoc(c_doc)
################################################################################
# fix _Document references and namespaces when a node changes documents
-cdef void moveNodeToDocument(_Element node, _Document doc):
+cdef void moveNodeToDocument(_Document doc, xmlNode* c_element):
"""Fix the xmlNs pointers of a node and its subtree that were moved.
Mainly copied from libxml2's xmlReconciliateNs(). Expects libxml2 doc
pointers of node to be correct already, but fixes _Document references.
"""
+ cdef _Element element
cdef xmlDoc* c_doc
- cdef xmlNode* c_element
cdef xmlNode* c_start_node
cdef xmlNode* c_node
cdef xmlNs** c_ns_new_cache
@@ -201,12 +207,10 @@
cdef xmlNs* c_last_del_ns
cdef cstd.size_t i, c_cache_size, c_cache_last
- c_element = node._c_node
- c_doc = c_element.doc
-
if not tree._isElementOrXInclude(c_element):
return
+ c_doc = c_element.doc
c_start_node = c_element
c_ns_new_cache = NULL
c_ns_old_cache = NULL
@@ -300,7 +304,11 @@
# fix _Document reference (may dealloc the original document!)
if c_element._private is not NULL:
- (<_Element>c_element._private)._doc = doc
+ element = <_Element>c_element._private
+ if element._doc is not doc:
+ python.Py_INCREF(doc)
+ python.Py_DECREF(element._doc)
+ element._doc = doc
if c_element is c_start_node:
break
@@ -318,7 +326,11 @@
# fix _Document reference (may dealloc the original document!)
if c_element._private is not NULL:
- (<_Element>c_element._private)._doc = doc
+ element = <_Element>c_element._private
+ if element._doc is not doc:
+ python.Py_INCREF(doc)
+ python.Py_DECREF(element._doc)
+ element._doc = doc
if c_element is c_start_node:
break
Modified: lxml/branch/html/src/lxml/python.pxd
==============================================================================
--- lxml/branch/html/src/lxml/python.pxd (original)
+++ lxml/branch/html/src/lxml/python.pxd Fri Jun 29 19:05:43 2007
@@ -9,6 +9,7 @@
cdef int PY_SSIZE_T_MAX
cdef void Py_INCREF(object o)
+ cdef void Py_DECREF(object o)
cdef FILE* PyFile_AsFile(object p)
cdef int PyFile_Check(object p)
Modified: lxml/branch/html/src/lxml/tests/test_elementtree.py
==============================================================================
--- lxml/branch/html/src/lxml/tests/test_elementtree.py (original)
+++ lxml/branch/html/src/lxml/tests/test_elementtree.py Fri Jun 29 19:05:43 2007
@@ -1161,6 +1161,26 @@
self.assertXML('<b><bs></bs></b>', b)
self.assertXML('<c><cs></cs></c>', c)
+ def test_delslice_tail(self):
+ XML = self.etree.XML
+ a = XML('<a><b></b>B2<c></c>C2</a>')
+ b, c = a
+
+ del a[:]
+
+ self.assertEquals("B2", b.tail)
+ self.assertEquals("C2", c.tail)
+
+ def test_replace_slice_tail(self):
+ XML = self.etree.XML
+ a = XML('<a><b></b>B2<c></c>C2</a>')
+ b, c = a
+
+ a[:] = []
+
+ self.assertEquals("B2", b.tail)
+ self.assertEquals("C2", c.tail)
+
def test_delitem_tail(self):
ElementTree = self.etree.ElementTree
f = StringIO('<a><b></b>B2<c></c>C2</a>')
@@ -1305,6 +1325,22 @@
self.assertXML(
'<a><c></c></a>',
a)
+
+ def test_remove_ns(self):
+ Element = self.etree.Element
+ SubElement = self.etree.SubElement
+
+ a = Element('{http://test}a')
+ b = SubElement(a, '{http://test}b')
+ c = SubElement(a, '{http://test}c')
+
+ a.remove(b)
+ self.assertXML(
+ '<ns0:a xmlns:ns0="http://test"><ns0:c></ns0:c></ns0:a>',
+ a)
+ self.assertXML(
+ '<ns0:b xmlns:ns0="http://test"></ns0:b>',
+ b)
def test_remove_nonexisting(self):
Element = self.etree.Element
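
Note on test_delslice_tail and test_replace_slice_tail above: both tests assert that removed children keep their tail text. The standard library's ElementTree already behaves this way, so it can serve as a reference for the expected result. A small sketch using xml.etree.ElementTree (not lxml) with the same input document as the tests:

```python
import xml.etree.ElementTree as ET

a = ET.XML('<a><b></b>B2<c></c>C2</a>')
b, c = a

# Remove all children; the detached elements should keep their tails.
del a[:]

tails = (b.tail, c.tail)
print(len(a), tails)
```
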
Modified: lxml/branch/html/src/lxml/tests/test_etree.py
==============================================================================
--- lxml/branch/html/src/lxml/tests/test_etree.py (original)
+++ lxml/branch/html/src/lxml/tests/test_etree.py Fri Jun 29 19:05:43 2007
@@ -161,6 +161,18 @@
self.assertRaises(SyntaxError, parse, f)
f.close()
+ def test_parse_remove_comments(self):
+ parse = self.etree.parse
+ tostring = self.etree.tostring
+ XMLParser = self.etree.XMLParser
+
+ f = StringIO('<a><!--A--><b><!-- B --><c/></b><!--C--></a>')
+ parser = XMLParser(remove_comments=True)
+ tree = parse(f, parser)
+ self.assertEquals(
+ '<a><b><c/></b></a>',
+ tostring(tree))
+
def test_parse_parser_type_error(self):
# ET raises IOError only
parse = self.etree.parse
@@ -195,6 +207,30 @@
self.assertRaises(SyntaxError, parse, f)
f.close()
+ def test_iterparse_comments(self):
+ # ET removes comments
+ iterparse = self.etree.iterparse
+ tostring = self.etree.tostring
+
+ f = StringIO('<a><!--A--><b><!-- B --><c/></b><!--C--></a>')
+ events = list(iterparse(f))
+ root = events[-1][1]
+ self.assertEquals(3, len(events))
+ self.assertEquals(
+ '<a><!--A--><b><!-- B --><c/></b><!--C--></a>',
+ tostring(root))
+
+ def test_iterparse_remove_comments(self):
+ iterparse = self.etree.iterparse
+ tostring = self.etree.tostring
+
+ f = StringIO('<a><!--A--><b><!-- B --><c/></b><!--C--></a>')
+ events = list(iterparse(f, remove_comments=True))
+ root = events[-1][1]
+ self.assertEquals(
+ '<a><b><c/></b></a>',
+ tostring(root))
+
def test_iterparse_broken(self):
iterparse = self.etree.iterparse
f = StringIO('<a><b><c/></a>')
@@ -1387,12 +1423,33 @@
def test_sourceline_parse(self):
parse = self.etree.parse
- tree = parse(fileInTestDir('test_xinclude.xml'))
+ tree = parse(fileInTestDir('include/test_xinclude.xml'))
self.assertEquals(
[1, 2, 3],
[ el.sourceline for el in tree.getiterator() ])
+ def test_sourceline_iterparse_end(self):
+ iterparse = self.etree.iterparse
+ lines = list(
+ el.sourceline for (event, el) in
+ iterparse(fileInTestDir('include/test_xinclude.xml')))
+
+ self.assertEquals(
+ [2, 3, 1],
+ lines)
+
+ def test_sourceline_iterparse_start(self):
+ iterparse = self.etree.iterparse
+ lines = list(
+ el.sourceline for (event, el) in
+ iterparse(fileInTestDir('include/test_xinclude.xml'),
+ events=("start",)))
+
+ self.assertEquals(
+ [1, 2, 3],
+ lines)
+
def test_sourceline_element(self):
Element = self.etree.Element
SubElement = self.etree.SubElement
@@ -1458,6 +1515,41 @@
self.assertRaises(AssertionError, Element, 'ha\0ho')
+ def test_unicode_byte_zero(self):
+ Element = self.etree.Element
+
+ a = Element('a')
+ self.assertRaises(AssertionError, setattr, a, "text", u'ha\0ho')
+ self.assertRaises(AssertionError, setattr, a, "tail", u'ha\0ho')
+
+ self.assertRaises(AssertionError, Element, u'ha\0ho')
+
+ def test_byte_invalid(self):
+ Element = self.etree.Element
+
+ a = Element('a')
+ self.assertRaises(AssertionError, setattr, a, "text", 'ha\x07ho')
+ self.assertRaises(AssertionError, setattr, a, "text", 'ha\x02ho')
+
+ self.assertRaises(AssertionError, setattr, a, "tail", 'ha\x07ho')
+ self.assertRaises(AssertionError, setattr, a, "tail", 'ha\x02ho')
+
+ self.assertRaises(AssertionError, Element, 'ha\x07ho')
+ self.assertRaises(AssertionError, Element, 'ha\x02ho')
+
+ def test_unicode_byte_invalid(self):
+ Element = self.etree.Element
+
+ a = Element('a')
+ self.assertRaises(AssertionError, setattr, a, "text", u'ha\x07ho')
+ self.assertRaises(AssertionError, setattr, a, "text", u'ha\x02ho')
+
+ self.assertRaises(AssertionError, setattr, a, "tail", u'ha\x07ho')
+ self.assertRaises(AssertionError, setattr, a, "tail", u'ha\x02ho')
+
+ self.assertRaises(AssertionError, Element, u'ha\x07ho')
+ self.assertRaises(AssertionError, Element, u'ha\x02ho')
+
def test_encoding_tostring_utf16(self):
# ElementTree fails to serialize this
tostring = self.etree.tostring
@@ -1588,12 +1680,11 @@
self.assertEquals(old_text + content + old_tail,
root.text)
-class ETreeXIncludeTestCase(XIncludeTestCase):
- def include(self, tree):
- tree.xinclude()
-
def test_xinclude(self):
- tree = etree.parse(fileInTestDir('test_xinclude.xml'))
+ tree = etree.parse(fileInTestDir('include/test_xinclude.xml'))
+ self.assertNotEquals(
+ 'a',
+ tree.getroot()[1].tag)
# process xincludes
self.include( tree )
# check whether we find it replaced with included data
@@ -1601,6 +1692,10 @@
'a',
tree.getroot()[1].tag)
+class ETreeXIncludeTestCase(XIncludeTestCase):
+ def include(self, tree):
+ tree.xinclude()
+
class ElementIncludeTestCase(XIncludeTestCase):
from lxml import ElementInclude
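
Note on the iterparse tests in test_etree.py above: the two sourceline tests expect [2, 3, 1] for the default "end" events and [1, 2, 3] for "start" events. This follows from event ordering alone, since an element's "end" is only reported after all of its children have closed. The sketch below reproduces that ordering with the standard library's xml.etree.ElementTree as a stand-in, so it runs without lxml. It reports tag names rather than sourceline, which the stdlib elements do not provide. The helper name local_names and the inline DOC bytes are illustrative and not part of the commit.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in, not lxml

# Same shape as include/test_xinclude.xml: <doc> on line 1, <foo/> on line 2,
# <xi:include/> on line 3.
DOC = (
    b'<doc xmlns:xi="http://www.w3.org/2001/XInclude">\n'
    b'<foo/>\n'
    b'<xi:include href="test.xml" />\n'
    b'</doc>'
)


def local_names(events):
    """Return local tag names in the order iterparse reports them."""
    return [
        el.tag.rsplit('}', 1)[-1]  # drop the '{namespace}' prefix if present
        for _event, el in ET.iterparse(io.BytesIO(DOC), events=events)
    ]


end_order = local_names(("end",))      # children close before their parent
start_order = local_names(("start",))  # plain document order
print(end_order, start_order)
```

With "end" events the root comes last, which matches line 1 appearing at the end of the expected list in test_sourceline_iterparse_end.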
Modified: lxml/branch/html/src/lxml/tests/test_htmlparser.py
==============================================================================
--- lxml/branch/html/src/lxml/tests/test_htmlparser.py (original)
+++ lxml/branch/html/src/lxml/tests/test_htmlparser.py Fri Jun 29 19:05:43 2007
@@ -30,7 +30,7 @@
def test_module_HTML_unicode(self):
element = self.etree.HTML(self.uhtml_str)
self.assertEqual(unicode(self.etree.tostring(element, 'UTF8'), 'UTF8'),
- self.uhtml_str)
+ unicode(self.uhtml_str.encode('UTF8'), 'UTF8'))
def test_module_parse_html_error(self):
parser = self.etree.HTMLParser(recover=False)
Modified: lxml/branch/html/src/lxml/tests/test_objectify.py
==============================================================================
--- lxml/branch/html/src/lxml/tests/test_objectify.py (original)
+++ lxml/branch/html/src/lxml/tests/test_objectify.py Fri Jun 29 19:05:43 2007
@@ -555,6 +555,26 @@
self.assertEquals("true", root.n.get(XML_SCHEMA_NIL_ATTR))
+ def test_pytype_annotation_empty(self):
+ XML = self.XML
+ root = XML(u'''\
+ <a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xmlns:py="http://codespeak.net/lxml/objectify/pytype">
+ <n></n>
+ </a>
+ ''')
+ objectify.annotate(root)
+
+ child_types = [ c.get(objectify.PYTYPE_ATTRIBUTE)
+ for c in root.iterchildren() ]
+ self.assertEquals(None, child_types[0])
+
+ objectify.annotate(root, empty_pytype="str")
+
+ child_types = [ c.get(objectify.PYTYPE_ATTRIBUTE)
+ for c in root.iterchildren() ]
+ self.assertEquals("str", child_types[0])
+
def test_pytype_annotation_use_old(self):
XML = self.XML
root = XML(u'''\
@@ -579,19 +599,19 @@
child_types = [ c.get(objectify.PYTYPE_ATTRIBUTE)
for c in root.iterchildren() ]
- self.assertEquals("int", child_types[0])
- self.assertEquals("str", child_types[1])
- self.assertEquals("float", child_types[2])
- self.assertEquals("str", child_types[3])
- self.assertEquals("bool", child_types[4])
- self.assertEquals("none", child_types[5])
- self.assertEquals(None, child_types[6])
- self.assertEquals("float", child_types[7])
- self.assertEquals("float", child_types[8])
- self.assertEquals("str", child_types[9])
- self.assertEquals("str", child_types[10])
+ self.assertEquals("int", child_types[ 0])
+ self.assertEquals("str", child_types[ 1])
+ self.assertEquals("float", child_types[ 2])
+ self.assertEquals("str", child_types[ 3])
+ self.assertEquals("bool", child_types[ 4])
+ self.assertEquals("none", child_types[ 5])
+ self.assertEquals(None, child_types[ 6])
+ self.assertEquals("float", child_types[ 7])
+ self.assertEquals("float", child_types[ 8])
+ self.assertEquals("str", child_types[ 9])
+ self.assertEquals("str", child_types[10])
self.assertEquals("float", child_types[11])
- self.assertEquals("long", child_types[12])
+ self.assertEquals("long", child_types[12])
self.assertEquals("true", root.n.get(XML_SCHEMA_NIL_ATTR))
@@ -619,18 +639,18 @@
child_types = [ c.get(XML_SCHEMA_INSTANCE_TYPE_ATTR)
for c in root.iterchildren() ]
- self.assertEquals("xsd:int", child_types[0])
- self.assertEquals("xsd:string", child_types[1])
- self.assertEquals("xsd:float", child_types[2])
- self.assertEquals("xsd:string", child_types[3])
- self.assertEquals("xsd:boolean", child_types[4])
- self.assertEquals(None, child_types[5])
- self.assertEquals(None, child_types[6])
- self.assertEquals("xsd:int", child_types[7])
- self.assertEquals("xsd:int", child_types[8])
- self.assertEquals("xsd:int", child_types[9])
+ self.assertEquals("xsd:int", child_types[ 0])
+ self.assertEquals("xsd:string", child_types[ 1])
+ self.assertEquals("xsd:double", child_types[ 2])
+ self.assertEquals("xsd:string", child_types[ 3])
+ self.assertEquals("xsd:boolean", child_types[ 4])
+ self.assertEquals(None, child_types[ 5])
+ self.assertEquals(None, child_types[ 6])
+ self.assertEquals("xsd:int", child_types[ 7])
+ self.assertEquals("xsd:int", child_types[ 8])
+ self.assertEquals("xsd:int", child_types[ 9])
self.assertEquals("xsd:string", child_types[10])
- self.assertEquals("xsd:float", child_types[11])
+ self.assertEquals("xsd:double", child_types[11])
self.assertEquals("xsd:integer", child_types[12])
self.assertEquals("true", root.n.get(XML_SCHEMA_NIL_ATTR))
@@ -659,18 +679,18 @@
child_types = [ c.get(XML_SCHEMA_INSTANCE_TYPE_ATTR)
for c in root.iterchildren() ]
- self.assertEquals("xsd:int", child_types[0])
- self.assertEquals("xsd:string", child_types[1])
- self.assertEquals("xsd:float", child_types[2])
- self.assertEquals("xsd:string", child_types[3])
- self.assertEquals("xsd:boolean", child_types[4])
- self.assertEquals(None, child_types[5])
- self.assertEquals(None, child_types[6])
- self.assertEquals("xsd:double", child_types[7])
- self.assertEquals("xsd:float", child_types[8])
- self.assertEquals("xsd:string", child_types[9])
+ self.assertEquals("xsd:int", child_types[ 0])
+ self.assertEquals("xsd:string", child_types[ 1])
+ self.assertEquals("xsd:double", child_types[ 2])
+ self.assertEquals("xsd:string", child_types[ 3])
+ self.assertEquals("xsd:boolean", child_types[ 4])
+ self.assertEquals(None, child_types[ 5])
+ self.assertEquals(None, child_types[ 6])
+ self.assertEquals("xsd:double", child_types[ 7])
+ self.assertEquals("xsd:float", child_types[ 8])
+ self.assertEquals("xsd:string", child_types[ 9])
self.assertEquals("xsd:string", child_types[10])
- self.assertEquals("xsd:float", child_types[11])
+ self.assertEquals("xsd:double", child_types[11])
self.assertEquals("xsd:integer", child_types[12])
self.assertEquals("true", root.n.get(XML_SCHEMA_NIL_ATTR))
@@ -730,7 +750,7 @@
for c in root.iterchildren() ]
self.assertEquals("xsd:int", child_types[ 0])
self.assertEquals("xsd:string", child_types[ 1])
- self.assertEquals("xsd:float", child_types[ 2])
+ self.assertEquals("xsd:double", child_types[ 2])
self.assertEquals("xsd:string", child_types[ 3])
self.assertEquals("xsd:boolean", child_types[ 4])
self.assertEquals(None, child_types[ 5])
@@ -739,7 +759,7 @@
self.assertEquals("xsd:int", child_types[ 8])
self.assertEquals("xsd:int", child_types[ 9])
self.assertEquals("xsd:string", child_types[10])
- self.assertEquals("xsd:float", child_types[11])
+ self.assertEquals("xsd:double", child_types[11])
self.assertEquals("xsd:integer", child_types[12])
self.assertEquals("true", root.n.get(XML_SCHEMA_NIL_ATTR))
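
Note on the xsd:float to xsd:double changes in test_objectify.py above: the log does not state the motivation. One plausible reading is that a Python float is a double-precision value, so xsd:double is the schema type that can hold it without loss. The sketch below only illustrates that precision difference with the struct module. The helper name roundtrip is hypothetical and the code does not touch objectify.

```python
import struct


def roundtrip(fmt, value):
    """Pack a number into the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


value = 0.1
survives_double = roundtrip('<d', value) == value  # 64-bit IEEE 754
survives_single = roundtrip('<f', value) == value  # 32-bit IEEE 754
print(survives_double, survives_single)
```

The value 0.1 comes back unchanged through the 64-bit format but not through the 32-bit one.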
Deleted: /lxml/branch/html/src/lxml/tests/test_xinclude.xml
==============================================================================
--- /lxml/branch/html/src/lxml/tests/test_xinclude.xml Fri Jun 29 19:05:43 2007
+++ (empty file)
@@ -1,4 +0,0 @@
-<doc xmlns:xi="http://www.w3.org/2001/XInclude">
-<foo/>
-<xi:include href="test.xml" />
-</doc>
\ No newline at end of file
Modified: lxml/branch/html/src/lxml/tree.pxd
==============================================================================
--- lxml/branch/html/src/lxml/tree.pxd (original)
+++ lxml/branch/html/src/lxml/tree.pxd Fri Jun 29 19:05:43 2007
@@ -41,6 +41,9 @@
cdef xmlCharEncoding xmlDetectCharEncoding(char* text, int len)
cdef char* xmlGetCharEncodingName(xmlCharEncoding enc)
+cdef extern from "libxml/chvalid.h":
+ cdef int xmlIsChar_ch(char c)
+
cdef extern from "libxml/hash.h":
ctypedef struct xmlHashTable
ctypedef void xmlHashScanner(void* payload, void* data, char* name)
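
Note on xmlIsChar_ch, declared above from libxml/chvalid.h: the new test_byte_invalid and test_unicode_byte_invalid tests reject '\x07' and '\x02' in tag names, text and tail. For single-byte values, the XML 1.0 Char production allows only tab, line feed, carriage return, and anything from 0x20 upward. The pure-Python sketch below mirrors that rule as I understand the libxml2 macro. How lxml actually calls it is not visible in this part of the diff, and the function names are hypothetical.

```python
def is_xml_char_byte(byte):
    """True if a single byte value is allowed by the XML 1.0 Char production."""
    return 0x09 <= byte <= 0x0A or byte == 0x0D or byte >= 0x20


def find_invalid_bytes(data):
    """Return the byte values in `data` that XML 1.0 does not allow."""
    return [b for b in bytearray(data) if not is_xml_char_byte(b)]


print(find_invalid_bytes(b'ha\x07ho'))  # the BEL byte used in the tests
print(find_invalid_bytes(b'ha\tho'))    # tab is a legal XML character
```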
Modified: lxml/branch/html/src/lxml/xmlerror.pxi
==============================================================================
--- lxml/branch/html/src/lxml/xmlerror.pxi (original)
+++ lxml/branch/html/src/lxml/xmlerror.pxi Fri Jun 29 19:05:43 2007
@@ -480,7 +480,9 @@
# Constants are stored in tuples of strings, for which Pyrex generates very
# efficient setup code. To parse them, iterate over the tuples and parse each
-# line in each string independently.
+# line in each string independently. Tuples of strings (instead of a plain
+# string) are required as some C-compilers of a certain well-known OS vendor
+# cannot handle strings that are a few thousand bytes in length.
cdef object __ERROR_LEVELS
__ERROR_LEVELS = ("""\
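
Note on the comment above in xmlerror.pxi: it says the constants live in tuples of strings and are parsed line by line, with each string kept short enough for compilers that reject very long literals. A minimal sketch of that parsing pattern follows. The NAME=VALUE line format, the sample values and the function name parse_constants are assumptions for illustration, since the actual table contents are cut off in this hunk.

```python
def parse_constants(blocks):
    """Parse 'NAME=VALUE' lines from a tuple of multi-line strings."""
    constants = {}
    for block in blocks:                 # each string is handled independently
        for line in block.splitlines():
            line = line.strip()
            if not line:
                continue                 # skip blank lines between entries
            name, _, value = line.partition('=')
            constants[name] = int(value)
    return constants


# Hypothetical table, split into two short strings instead of one long one.
EXAMPLE_LEVELS = ("""\
NONE=0
WARNING=1
""", """\
ERROR=2
FATAL=3
""")

levels = parse_constants(EXAMPLE_LEVELS)
print(levels)
```

Splitting the table this way keeps every literal small while the parsed result is the same as for one long string.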
Modified: lxml/branch/html/src/lxml/xmlparser.pxd
==============================================================================
--- lxml/branch/html/src/lxml/xmlparser.pxd (original)
+++ lxml/branch/html/src/lxml/xmlparser.pxd Fri Jun 29 19:05:43 2007
@@ -23,6 +23,9 @@
char* value,
int len)
+ ctypedef void (*commentSAXFunc)(void* ctx,
+ char* value)
+
cdef extern from "libxml/tree.h":
ctypedef struct xmlParserInput
ctypedef struct xmlParserInputBuffer:
@@ -34,6 +37,7 @@
startElementNsSAX2Func startElementNs
endElementNsSAX2Func endElementNs
cdataBlockSAXFunc cdataBlock
+ commentSAXFunc comment
cdef extern from "libxml/xmlIO.h":
cdef xmlParserInputBuffer* xmlAllocParserInputBuffer(int enc)
Modified: lxml/branch/html/src/lxml/xmlschema.pxi
==============================================================================
--- lxml/branch/html/src/lxml/xmlschema.pxi (original)
+++ lxml/branch/html/src/lxml/xmlschema.pxi Fri Jun 29 19:05:43 2007
@@ -38,12 +38,12 @@
root_node = _rootNodeOrRaise(etree)
# work around for libxml2 bug if document is not XML schema at all
- if _LIBXML_VERSION_INT < 20624:
- c_node = root_node._c_node
- c_href = _getNs(c_node)
- if c_href is NULL or \
- cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0:
- raise XMLSchemaParseError, "Document is not XML Schema"
+ #if _LIBXML_VERSION_INT < 20624:
+ c_node = root_node._c_node
+ c_href = _getNs(c_node)
+ if c_href is NULL or \
+ cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0:
+ raise XMLS