[lxml-dev] html encoding
Sergio Monteiro Basto
sergio at sergiomb.no-ip.org
Tue Dec 9 21:41:27 CET 2008
When str holds the HTML as a UTF-8 byte string, I use:
    htmldecode(unicode(str, 'utf-8')).encode('utf-8')
with htmldecode defined as:
# Python 2 code (htmlentitydefs, unicode and unichr do not exist in Python 3).
import re
from htmlentitydefs import name2codepoint

# This pattern matches a character entity reference (a decimal numeric
# reference, a hexadecimal numeric reference, or a named reference).
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')

def htmldecode(text):
    """Decode HTML entities in the given text."""
    if type(text) is unicode:
        uchr = unichr
    else:
        # Byte strings: stay with chr() while the code point fits in one byte.
        uchr = lambda value: value > 255 and unichr(value) or chr(value)

    def entitydecode(match, uchr=uchr):
        entity = match.group(1)
        if entity.startswith('#x'):
            return uchr(int(entity[2:], 16))
        elif entity.startswith('#'):
            return uchr(int(entity[1:]))
        elif entity in name2codepoint:
            return uchr(name2codepoint[entity])
        else:
            # Unknown entity name: leave the text as it was.
            return match.group(0)

    return charrefpat.sub(entitydecode, text)
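
For readers on Python 3, the standard library's html.unescape does the same kind of decoding. It is a swapped-in equivalent, not the function above, and the sample string here is made up for illustration:

```python
import html

# Hypothetical sample input; not taken from the thread.
raw = "caf&eacute; &amp; cr&#232;me &#x263A; &bogus;"

# html.unescape handles named, decimal and hexadecimal references and
# leaves names it does not recognise untouched.
decoded = html.unescape(raw)
```

Because Python 3 strings are already Unicode, the unicode(...)/.encode('utf-8') wrapping is only needed if the input or output has to be bytes.
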
On Thu, 2008-12-04 at 12:57 +0100, Dirk Rothe wrote:
> On Thu, 04 Dec 2008 12:46:34 +0100, Daniel Jirku <nepi at gmx.ch> wrote:
>
> > hi...
> >
> > My problem is, I suppose, well known, but I couldn't find any solution
> > through my searches...
> >
> > I have a regular HTML link with a ? and an &. When I print the variable in
> > Python, it looks fine... (like:
> > http://www.somelink.com/site.html?param1=test&param2=hello), BUT when I
> > add it to my root XML element with:
> > adId1 = etree.SubElement(tagAd, "originalAdUrl")
> > adId1.text = adUrl
> >
> > and then later write the xml to a file with this:
> > toStringValue = etree.tostring(xmlTagRoot, encoding="utf-8",
> > method="xml", xml_declaration=True, pretty_print=True)
> > ...
> >
> > the tag has as its value the link with an &amp; instead of & !!
> > How can I use the correct signs for persistent storage in an XML file...?
>
> The XML processor has correctly escaped your "&" character as "&amp;". If you
> deserialise (aka load) the file with an XML parser of your choice, it will
> restore your "&" character.
>
> see
> http://en.wikipedia.org/wiki/Character_encodings_in_HTML#XML_character_entity_references
>
> --dirk
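
The round trip Dirk describes can be checked in a few lines. This sketch uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml installed (lxml.etree offers the same Element, SubElement, tostring and fromstring calls). The "ad" root element name is made up; the URL is the one from the question:

```python
import xml.etree.ElementTree as etree

ad_url = "http://www.somelink.com/site.html?param1=test&param2=hello"

# Build a tiny tree; "ad" is a hypothetical root element name.
root = etree.Element("ad")
link = etree.SubElement(root, "originalAdUrl")
link.text = ad_url

# Serialising escapes the ampersand, as XML requires.
serialized = etree.tostring(root, encoding="utf-8")

# Parsing the serialised bytes restores the original character.
restored = etree.fromstring(serialized).find("originalAdUrl").text
```

The file on disk contains &amp;, but any XML parser hands back the plain & when the element text is read, so no manual unescaping is needed.
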
--
Sérgio M. B.