[lxml-dev] Forced attribute value escaping

Stefan Behnel stefan_ml at behnel.de
Fri Jun 6 13:48:58 CEST 2008


Hi again,

Stefan Behnel wrote:
> RommeDeSerieux wrote:
>> #! /usr/bin/env python
>> ## vim: fileencoding=utf-8
>> from lxml import etree
>>
>> node = etree.Element(u'tag_тег')
>> node.attrib[u'attribute_атрибут'] = u'value_значение'
>> node.text = u'text_текст'
>>
>> # what i'm getting (with some linebreaks for email):
>> # <tag_тег attribute_атрибут="value_&#x437;&#x43D;&#x430;
>> # &#x447;&#x435;&#x43D;&#x438;&#x435;">text_текст
>> # </tag_тег>
>> #
>> # expected result:
>> # <tag_тег attribute_атрибут="value_значение">text_текст
>> # </tag_тег>
>> print etree.tostring(node, encoding='utf-8')
> 
> the serialisation is done by libxml2

Taking a deeper look, it seems there's actually some legacy code in libxml2
that triggers this behaviour (non-ASCII attribute values being written as
character references) when the document's encoding is not set. We can work
around that by setting it to "UTF-8" whenever a new or freshly parsed document
has no encoding. Here's a patch.

Stefan

=== src/lxml/parser.pxi
==================================================================
--- src/lxml/parser.pxi (revision 4485)
+++ src/lxml/parser.pxi (local)
@@ -588,8 +588,11 @@
             _raiseParseError(c_ctxt, filename, context._error_log)
         else:
             _raiseParseError(c_ctxt, filename, None)
-    elif result.URL is NULL and filename is not None:
-        result.URL = tree.xmlStrdup(_cstr(filename))
+    else:
+        if result.URL is NULL and filename is not None:
+            result.URL = tree.xmlStrdup(_cstr(filename))
+        if result.encoding is NULL:
+            result.encoding = tree.xmlStrdup("UTF-8")
     return result

 cdef int _fixHtmlDictNames(tree.xmlDict* c_dict, xmlDoc* c_doc) nogil:
@@ -1366,6 +1369,8 @@
     result = tree.xmlNewDoc(NULL)
     if result is NULL:
         python.PyErr_NoMemory()
+    if result.encoding is NULL:
+        result.encoding = tree.xmlStrdup("UTF-8")
     __GLOBAL_PARSER_CONTEXT.initDocDict(result)
     return result
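
A side note for anyone comparing the two outputs in the report: the escaped
form is still equivalent XML, since a parser resolves numeric character
references back to the same characters. The difference is in how readable the
serialised bytes are, not in the data. Below is a minimal sketch of that
check using only the standard library's xml.etree.ElementTree (Python 3). The
tag and attribute names are simplified ASCII stand-ins chosen for
illustration, not the ones from the report, and attr_value is a hypothetical
helper.

```python
import xml.etree.ElementTree as ET

# Attribute value written with numeric character references,
# as in the "what i'm getting" output from the report.
ESCAPED = (
    '<tag attr="value_&#x437;&#x43D;&#x430;&#x447;'
    '&#x435;&#x43D;&#x438;&#x435;">text</tag>'
)

# The same attribute value written with literal characters,
# as in the "expected result" from the report.
LITERAL = '<tag attr="value_значение">text</tag>'


def attr_value(xml_text):
    """Parse the snippet and return the decoded value of 'attr'."""
    return ET.fromstring(xml_text).get("attr")


# Both serialisations carry the same attribute value once parsed.
print(attr_value(ESCAPED) == attr_value(LITERAL))
```

Both strings parse to the same attribute value, so consumers of the XML see
no difference; the patch only changes which spelling ends up in the output.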


