From dpritsos at extremepro.gr Thu Jan 6 17:46:05 2011
From: dpritsos at extremepro.gr (Dimitrios Pritsos)
Date: Thu, 06 Jan 2011 18:46:05 +0200
Subject: [lxml-dev] memory leak in lxml.html.parse()
In-Reply-To: <4CF80C0A.2070803@behnel.de>
References: <4CF78066.8050706@extremepro.gr> <4CF79596.30206@behnel.de> <4CF7EF5A.9090706@behnel.de> <4CF80C0A.2070803@behnel.de>
Message-ID: <4D25F1CD.2050905@extremepro.gr>

On 02/12/10 23:13, Stefan Behnel wrote:
> Stefan Behnel, 02.12.2010 20:11:
>> Stefan Behnel, 02.12.2010 13:48:
>>> Dimitrios Pritsos, 02.12.2010 12:17:
>>>> I am sorry that I am sending this as a response
>>>
>>> No need to do so if you want to start a new topic. Just send a message
>>> directly to the list address. Replies are for replying.
>>>
>>>
>>>> There is a memory leakage using lxml.html.parse (or etree) while you
>>>> do that constantly in a loop. In particular creating etrees in a loop
>>>> leaves the trees there and does not delete them properly when you
>>>> reuse the same python variable to store the results.
>>>
>>> I can reproduce this. I'll take a look ASAP.
>>
>> It's easily reproducible. I can parse a document repeatedly in a loop
>> using lxml.html.parse() and see the memory consumption of the Python
>> process grow. I reproduced it with 2.3-pre, don't know if 2.2 suffers
>> from the same problem. I'll see about that once I've figured out what
>> happens.
>>
>> It's only a problem with the HTML parser, and it's not related to
>> lxml.html. This is enough to reproduce it:
>>
>> from lxml import etree
>>
>> p = etree.HTMLParser()
>> while True:
>>     etree.parse("somefile.html", p)
>
> I think it may be an issue with libxml2. The memory consumption seems
> to be stable with 2.7.7 and 2.7.8 but not with my system's 2.7.6.
>
> What's the version you use? Could you try the latest one?
>
> http://codespeak.net/lxml/dev/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do
>
>
> Stefan
>

Hello All,

I am sorry for the late response.

I've tried it with 2.7.7 and 2.7.8. The memory leakage persists,
even if you do this:

import lxml.html

xhtml_tree = lxml.html.parse( open( 'myhtmlfile.html', 'r') )
del xhtml_tree

HAPPY NEW YEAR

Regards,

Dimitrios

From dcramer at gmail.com Mon Jan 10 20:30:12 2011
From: dcramer at gmail.com (David Cramer)
Date: Mon, 10 Jan 2011 11:30:12 -0800
Subject: [lxml-dev] XSD Validation
Message-ID: <02A5A375AB88457C80A2CCFC1B378029@gmail.com>

We're having some problems attempting to validate an XML file against an
XSD schema. To summarize, lxml is complaining about whiteSpace and date
validation, when as far as I can tell, the spec says it's valid.

Full context is on StackOverflow; I don't want to throw everything into
here, but after dealing with this for a couple of days it feels like
maybe there's an issue with the validation layer in lxml.

http://stackoverflow.com/questions/4631897/datetime-complaining-about-whitespace-in-xsd-validation-lxml

Per the spec, whiteSpace is collapse by default, which should remove all
indentation space, yet lxml complains about the dateTime value not being
valid.

--
David Cramer
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20110110/51baf490/attachment.htm

From stefan_ml at behnel.de Tue Jan 11 09:42:42 2011
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 11 Jan 2011 09:42:42 +0100
Subject: [lxml-dev] XSD Validation
In-Reply-To: <02A5A375AB88457C80A2CCFC1B378029@gmail.com>
References: <02A5A375AB88457C80A2CCFC1B378029@gmail.com>
Message-ID: <4D2C1802.9090305@behnel.de>

David Cramer, 10.01.2011 20:30:
> We're having some problems attempting to validate an XML file against an
> XSD schema. To summarize, lxml is complaining about whiteSpace and date
> validation, when as far as I can tell, the spec says it's valid.
>
> Full context is on StackOverflow; I don't want to throw everything into
> here, but after dealing with this for a couple of days it feels like
> maybe there's an issue with the validation layer in lxml.
>
> http://stackoverflow.com/questions/4631897/datetime-complaining-about-whitespace-in-xsd-validation-lxml
>
> Per the spec, whiteSpace is collapse by default, which should remove all
> indentation space, yet lxml complains about the dateTime value not being
> valid.

Have you tried validating the file directly with libxml2's xmllint?

Stefan

From brandonlewis at anvilrockroad.com Thu Jan 13 05:35:20 2011
From: brandonlewis at anvilrockroad.com (Brandon M Lewis)
Date: Wed, 12 Jan 2011 21:35:20 -0700
Subject: [lxml-dev] xslt transformations and django templates
Message-ID: <1294893320.3263.48.camel@fa7ll7en-laptop>

I am not sure if this is the proper place for this, but here's my
question. I have about 4000 HTML documents that I am trying to transform
into Django templates. Everything has been working well, but I am trying
to add some Django template variables using XSLT. As I understand it, a
curly brace is treated specially by XSLT in general, and so I have to do
something like `div class="{{{{ somevariable }}}}"` to end up with a
Django template tag `div class="{{ somevariable }}"` in the transformed
document. No big deal, but when I try `a name="{{ somevariable }}"` I end
up with escaped curly braces.

I have some code posted at
http://stackoverflow.com/questions/4674217/building-django-template-files-with-xslt
that will probably make more sense than I do, but I am trying to find
out if there is a way to turn off the urlencoding for anchor tags. Does
that make sense? I am using lxml 2.2.8.

thanks
brandon

From brandonlewis at anvilrockroad.com Thu Jan 13 21:01:22 2011
From: brandonlewis at anvilrockroad.com (Brandon M Lewis)
Date: Thu, 13 Jan 2011 20:01:22 +0000 (UTC)
Subject: [lxml-dev] xslt transformations and django templates
References: <1294893320.3263.48.camel@fa7ll7en-laptop>
Message-ID:

I have updated my question here:
http://stackoverflow.com/questions/4684614/is-there-a-way-to-disable-urlencoding-of-anchor-attributes-in-lxml

Hopefully this is a little clearer.

thanks
brandon

From miromintal at gmail.com Fri Jan 28 15:45:09 2011
From: miromintal at gmail.com (Miro Mintal)
Date: Fri, 28 Jan 2011 15:45:09 +0100
Subject: [lxml-dev] html parsing, .text
Message-ID:

When I try to get the text from a tag in HTML, it returns the text only
if no tag comes before that text.

Here is code demonstrating this:

import lxml.html
html = """another text
some text
"""
doc = lxml.html.fromstring(html)
print doc.text_content()   # "some text" is here, but when I try to get the text for this tag:
print doc.text             # returns None, but it has the text "some text"
for a in doc:
    a.text                 # no subtag has the text "some text"

It only works if the text is before the tags:

html = """some textanother text
"""

But I need to parse web pages with text after tags. Can you help me?

Versions:
lxml.etree:        (2, 2, 6, 0)
libxml used:       (2, 7, 7)
libxml compiled:   (2, 7, 6)
libxslt used:      (1, 1, 26)
libxslt compiled:  (1, 1, 26)

From joaquin at cuencaabela.com Fri Jan 28 15:59:41 2011
From: joaquin at cuencaabela.com (Joaquin Cuenca Abela)
Date: Fri, 28 Jan 2011 15:59:41 +0100
Subject: [lxml-dev] html parsing, .text
In-Reply-To:
References:
Message-ID:

you need to use also the "tail" property. "text" is for the text inside
the element, tail is for the text after the element is closed.

for a in doc:
    print a.text, a.tail

Cheers,

On Fri, Jan 28, 2011 at 3:45 PM, Miro Mintal wrote:
> When I try to get the text from a tag in HTML, it returns the text only
> if no tag comes before that text.
>
> Here is code demonstrating this:
>
> import lxml.html
> html = """another text
> some text
> """
> doc = lxml.html.fromstring(html)
> print doc.text_content()   # "some text" is here, but when I try to get the text for this tag:
> print doc.text             # returns None, but it has the text "some text"
> for a in doc:
>     a.text                 # no subtag has the text "some text"
>
> It only works if the text is before the tags:
> html = """some textanother text
> """
>
> But I need to parse web pages with text after tags. Can you help me?
>
> Versions:
> lxml.etree:        (2, 2, 6, 0)
> libxml used:       (2, 7, 7)
> libxml compiled:   (2, 7, 6)
> libxslt used:      (1, 1, 26)
> libxslt compiled:  (1, 1, 26)
> _______________________________________________
> lxml-dev mailing list
> lxml-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/lxml-dev
>

--
Joaquin Cuenca Abela
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20110128/ec9a8cd4/attachment.htm

From john at nmt.edu Fri Jan 28 16:51:38 2011
From: john at nmt.edu (John W. Shipman)
Date: Fri, 28 Jan 2011 08:51:38 -0700 (MST)
Subject: [lxml-dev] html parsing, .text
In-Reply-To:
References:
Message-ID:

On Fri, 28 Jan 2011, Joaquin Cuenca Abela wrote:

+--
| you need to use also the "tail" property. "text" is for the
| text inside the element, tail is for the text after the element
| is closed.
+--

In the lxml/ElementTree world, the way mixed content works is, in my
experience, the hardest thing to understand. Perhaps a picture might
help:

http://www.nmt.edu/tcc/help/pubs/pylxml/etree-view.html

Because I came from a DOM background, when I first saw how the .tail
attribute works, I completely rejected the entire framework because I
thought it was ugly.

What changed my mind was performance. With Python's minidom it took one
program about 35 seconds to read a half-megabyte XML file; lxml read
that same file in 600 milliseconds.

Once I started actually using lxml, I found that handling mixed content
is not that bad at all. Appended below my .signature is a little
function that I use everywhere to append text as the child of an element
without having to worry about where it goes.

Forgive me for promoting my own work, but the document containing the
above link describes how to use lxml for reading, writing, and updating
XML.
It also includes an annotated version of Fredrik Lundh's builder.py
module which makes code to generate XML much more straightforward and
compact.

http://www.nmt.edu/tcc/help/pubs/pylxml/etree-view.html

Best regards,
John Shipman (john at nmt.edu), Applications Specialist, NM Tech Computer Center,
Speare 119, Socorro, NM 87801, (575) 835-5735, http://www.nmt.edu/~john
  ``Let's go outside and commiserate with nature.''  --Dave Farber
================================================================
def addText ( node, s ):
    '''Add text content to an element.

      [ (node is an Element) and (s is a string) ->
          if node has any children ->
            last child's .tail  :=  last child's tail + s
          else ->
            node.text  :=  node.text + s ]
    '''
    if len(node) == 0:
        node.text = (node.text or "") + s
    else:
        lastChild = node[-1]
        lastChild.tail = (lastChild.tail or "") + s
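
For readers following the .text/.tail discussion in this thread, here is
a minimal sketch of how mixed content is split between the two
attributes. It is not taken from any of the messages: the sample markup
and the add_text name are made up for illustration, and it uses the
standard library's xml.etree.ElementTree (which follows the same
.text/.tail model) so that it runs without lxml installed. The same
attribute accesses should behave the same way on lxml elements.

```python
# Illustrative sketch only; the markup below is an assumption, not the
# HTML from the original message (whose tags were lost in the archive).
import xml.etree.ElementTree as ET

root = ET.fromstring("<div><a>another text</a>some text<br/>more text</div>")

# Nothing precedes the first child <a>, so root.text is None.
# "some text" is stored as the tail of <a>, "more text" as the tail of <br/>.
pairs = [(child.tag, child.text, child.tail) for child in root]


def add_text(node, s):
    """Append s after whatever the element already contains.

    Hypothetical helper with the same logic as addText above: with no
    children the string extends node.text, otherwise it extends the
    tail of the last child.
    """
    if len(node) == 0:
        node.text = (node.text or "") + s
    else:
        last = node[-1]
        last.tail = (last.tail or "") + s


add_text(root, "!")

# itertext() walks .text and .tail in document order, so it recovers
# all of the text, including the parts stored as tails.
all_text = "".join(root.itertext())
```

Walking the children and reading both attributes, as in the pairs list,
is what the "for a in doc: print a.text, a.tail" loop earlier in the
thread does; itertext() is a convenient alternative when only the
concatenated text is needed.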