[lxml-dev] html encoding

Sergio Monteiro Basto sergio at sergiomb.no-ip.org
Tue Dec 9 21:41:27 CET 2008


When str holds the HTML as a UTF-8 byte string, I use:

htmldecode(unicode(str, 'utf-8')).encode('utf-8')

import re
from htmlentitydefs import name2codepoint
# This pattern matches a character entity reference (a decimal numeric
# reference, a hexadecimal numeric reference, or a named reference).
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')

def htmldecode(text):
    """Decode HTML entities in the given text."""
    if type(text) is unicode:
        uchr = unichr
    else:
        uchr = lambda value: unichr(value) if value > 255 else chr(value)
    def entitydecode(match, uchr=uchr):
        entity = match.group(1)
        if entity.startswith('#x'):
            return uchr(int(entity[2:], 16))
        elif entity.startswith('#'):
            return uchr(int(entity[1:]))
        elif entity in name2codepoint:
            return uchr(name2codepoint[entity])
        else:
            return match.group(0)
    return charrefpat.sub(entitydecode, text)
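
For anyone on Python 3, here is a rough sketch of the same function,
assuming text is already a str (so the unicode/unichr split goes away).
The try/except for out-of-range code points and the sample string are my
own additions for illustration; htmlentitydefs lives in html.entities
there.

```python
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs

# Same pattern as above: decimal, hexadecimal, or named reference.
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')

def htmldecode(text):
    """Decode HTML entities in a str (Python 3 strings are Unicode)."""
    def entitydecode(match):
        entity = match.group(1)
        try:
            if entity.startswith('#x'):
                return chr(int(entity[2:], 16))
            elif entity.startswith('#'):
                return chr(int(entity[1:]))
            elif entity in name2codepoint:
                return chr(name2codepoint[entity])
        except (ValueError, OverflowError):
            pass  # code point out of range: leave the reference untouched
        return match.group(0)  # unknown names are returned unchanged
    return charrefpat.sub(entitydecode, text)

# Sample input (made up for illustration): named, decimal, hex, unknown.
decoded = htmldecode('caf&eacute; &amp; &#169; &#xA9; &bogus;')
```

Python 3.4 and later also ship html.unescape() in the standard library,
which may be all you need if you do not want to carry your own regex.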
                                               

On Thu, 2008-12-04 at 12:57 +0100, Dirk Rothe wrote:
> On Thu, 04 Dec 2008 12:46:34 +0100, Daniel Jirku <nepi at gmx.ch> wrote:
> 
> > hi...
> >
> > My problem is, I suppose, well known, but I couldn't find any solution
> > through my searches...
> >
> > I have a regular HTML link with a ? and an &. When I print the variable in
> > Python, it looks fine... (like:
> > http://www.somelink.com/site.html?param1=test&param2=hello), BUT when I
> > add it to my root XML element with:
> > adId1 = etree.SubElement(tagAd, "originalAdUrl")
> > adId1.text = adUrl
> >
> > and then later write the xml to a file with this:
> > toStringValue = etree.tostring(xmlTagRoot, encoding="utf-8",  
> > method="xml", xml_declaration=True, pretty_print=True)
> > ...
> >
> > the tag has as its value the link with an &amp; instead of & !!
> > How can I use the correct signs for persistent storage in an XML file...?
> 
> The XML processor has correctly escaped your "&" character. If you
> deserialise (aka load) the file with an XML parser of your choice, it will
> restore your "&" character.
> 
> see  
> http://en.wikipedia.org/wiki/Character_encodings_in_HTML#XML_character_entity_references
> 
> --dirk
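
To see the round trip Dirk describes, here is a small sketch. It uses the
standard library's xml.etree.ElementTree as a stand-in for lxml.etree so
it runs without lxml installed (the SubElement/tostring/fromstring calls
used here have the same shape in both, as far as I know). The root
element name "ad" is made up for the example.

```python
import xml.etree.ElementTree as etree  # stand-in for lxml.etree

ad_url = "http://www.somelink.com/site.html?param1=test&param2=hello"

root = etree.Element("ad")  # hypothetical root element name
node = etree.SubElement(root, "originalAdUrl")
node.text = ad_url

# Serialising escapes the bare "&" as "&amp;", as XML requires.
serialized = etree.tostring(root, encoding="utf-8")

# Parsing the serialised bytes gives back the original "&".
restored = etree.fromstring(serialized).find("originalAdUrl").text
```

So the &amp; in the file is the correct stored form; it only needs
decoding by hand if the file is read with something other than an XML
parser.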
-- 
Sérgio M. B.