FRUIT
"). In lxml 2.2.0, on Windows, this worked just fine, and elements could be processed based on this data.

In lxml 2.2.2, on Linux, this fails. The above example becomes "FRUIT
" as soon as it is parsed by lxml.html (or lxml.etree.HTMLParser()).

I don't know if this is caused by the switch to Linux, or the upgrade to 2.2.2. I don't have control over the installation, so I can't switch to 2.2.2 under Windows, or 2.2.0 under Linux to check.

I did find this reference (the only reference to this I could find) to the HTML ignoring namespaces:

http://codespeak.net/lxml/lxmlhtml.html#running-html-doctests

...however, it wasn't doing that before, and it seems odd that this is only mentioned in the doctests section.

Is there a way to work around this? Are custom namespaces simply not possible in lxml's HTML?

Notes:

1. The XML parser will not work. Some documents will have legal HTML that breaks an XML parser, like "My namespaces are going to disappear!
FRUIT
""")
>>> print parser.tostring(document)
-----

The output:
-----
My namespaces are going to disappear!
FRUIT
-----

Thomas Weigel

From jholg at gmx.de Tue Jun 23 09:33:41 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 23 Jun 2009 09:33:41 +0200
Subject: [lxml-dev] Converting an objectified lxml tree to a standard etree one.
In-Reply-To: <1245697070.23804.6.camel@localhost.localdomain>
References: <1245442706.28204.32.camel@localhost.localdomain> <4A3C78DB.5030100@behnel.de> <1245697070.23804.6.camel@localhost.localdomain>
Message-ID: <20090623073341.69330@gmx.net>

Hi,

> [ snipped for length ]
> > Hmm, yes, that looks weird. It works with lxml.etree, but not with
> > lxml.objectify. Could you please file a bug report on this?

This seems to be the villain:

>>> for (name, obj) in objectify.__dict__.items():
...     if hasattr(obj, '__bases__'):
...         try:
...             i = iter(getattr(obj, '__bases__'))
...         except:
...             print name, obj, getattr(obj, '__bases__')
...
E??
I am wondering why I have an extra character (?) in my output. What should I do to avoid that?

Thanks,

Francesco

From stefan_ml at behnel.de Wed Jun 24 14:10:16 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 24 Jun 2009 14:10:16 +0200 (CEST)
Subject: [lxml-dev] clean_html
In-Reply-To:????
>
> I am wondering why I have an extra character (??) in my output.
> What should I do to avoid that?

That's just because the serialised HTML output is encoded as UTF-8. If you want to print the resulting byte string, use .decode('UTF-8') to decode it to unicode first. If you want to write it to a file (or send it through the network), keeping it in UTF-8 is the right thing, though.

Stefan

From kevin.p.dwyer at gmail.com Wed Jun 24 14:10:49 2009
From: kevin.p.dwyer at gmail.com (Kev Dwyer)
Date: Wed, 24 Jun 2009 13:10:49 +0100
Subject: [lxml-dev] clean_html
In-Reply-To:?
I suspect this is only a problem if the encoding of the html string passed to clean_html is undefined, or incorrectly defined.

Kevin

2009/6/24 Francesco??
>
> I am wondering why I have an extra character (?) in my output.
> What should I do to avoid that?
>
> Thanks,
>
> Francesco

From cattafra at hotmail.com Wed Jun 24 14:46:40 2009
From: cattafra at hotmail.com (Francesco)
Date: Wed, 24 Jun 2009 12:46:40 +0000 (UTC)
Subject: [lxml-dev] clean_html
References:
FRUIT
").
>
> In lxml 2.2.0, on Windows, this worked just fine, and elements could be
> processed based on this data.
>
> In lxml 2.2.2, on Linux, this fails. The above example becomes "content='fruit'>FRUIT
" as soon as it is parsed by lxml.html (or
> lxml.etree.HTMLParser()).

You forgot to mention which versions of libxml2 you are using on both systems. That's likely the reason for the difference.

http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do

Stefan

From cattafra at hotmail.com Fri Jun 26 11:48:57 2009
From: cattafra at hotmail.com (Francesco)
Date: Fri, 26 Jun 2009 09:48:57 +0000 (UTC)
Subject: [lxml-dev] clean_html
References:
FRUIT
").
>
> You forgot to mention which versions of libxml2 you are using on both
> systems. That's likely the reason for the difference.

Thank you for being kind.

> http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do

I have begun investigating down this path. I will not bother you again until I have finished there.

In the meantime, I am working around the problem with a regular expression to replace 'custom_namespace:' with 'custom_namespace_', depending on whether or not lxml deletes the custom namespace.

Thank you for your time.

Thomas Weigel

From stefan_ml at behnel.de Sat Jun 27 07:08:48 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 27 Jun 2009 07:08:48 +0200
Subject: [lxml-dev] XML file and XPath
In-Reply-To:
FRUIT
").
>
> Notes:
>
> 1. The XML parser will not work. Some documents will have legal HTML
> that breaks an XML parser, like "My namespaces are
> going to disappear!
FRUIT
""")
> >>> print parser.tostring(document)
> -----

That's an XHTML document, for which the XML parser would be the right tool. If you have XHTML documents that contain unterminatedMy namespaces are
> going to disappear!
FRUIT
> -----

That's because HTML parsers are not namespace aware. Namespaces are simply not defined for HTML. But if you get a difference on different systems, I'd still suspect the reason to be different libxml2 versions. There's nothing lxml can do about this.

Stefan

From stefan_ml at behnel.de Sat Jun 27 08:57:52 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 27 Jun 2009 08:57:52 +0200
Subject: [lxml-dev] Binary egg for Mac OS X
In-Reply-To: