[lxml-dev] Getting 'user-visible' text from HTML

Ted Dziuba ted at milo.com
Fri Jul 24 18:00:41 CEST 2009


A user will likely never see meta keywords displayed, because they live in
the content attribute of a <meta> tag rather than in the page text.  However,
if you just want all of the text contained within a document, try this:

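from lxml import html
# text_of_html is the page source as a string
tree = html.fromstring(text_of_html)
# text_content() returns the text of the element and all of its descendants
all_text = tree.text_content()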

And then run your regexes against all_text.  One caveat is that
text_content() recursively collects the text of every descendant node, so it
will also pull in JavaScript contained within <script> tags (and CSS within
<style> tags).  If that's a problem, you can come up with some minor hackery
to pull those nodes out of the tree before calling text_content().
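
For what it's worth, here is a rough sketch of that hackery, not a
definitive recipe.  The function names visible_text and meta_keywords are
made up for illustration, and the exact-match lookup on name="keywords" is
an assumption about how the pages spell that attribute.  Since
text_content() never looks at attributes, the keywords are fetched
separately with an XPath query.

```python
from lxml import html


def visible_text(text_of_html):
    """Return the document text with <script> and <style> content removed."""
    tree = html.fromstring(text_of_html)
    # xpath() returns a list, so it is safe to drop nodes while looping.
    # The leading "." limits the search to descendants of the root element.
    for node in tree.xpath(".//script | .//style"):
        # drop_tree() removes the element and its children but keeps
        # any tail text that follows the closing tag.
        node.drop_tree()
    return tree.text_content()


def meta_keywords(text_of_html):
    """Return the content attribute of each <meta name="keywords"> tag."""
    tree = html.fromstring(text_of_html)
    # Assumption: the name attribute is spelled exactly "keywords"
    # (lowercase); pages using other casing would need a looser match.
    return tree.xpath(".//meta[@name='keywords']/@content")
```

You could then run the regexes against visible_text(page) plus each string
in meta_keywords(page).  Note that this still includes things like the
<title> text, and it knows nothing about elements hidden with CSS.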

Ted


On Fri, Jul 24, 2009 at 7:30 AM, Adam Nelson <adam at varud.com> wrote:

> Is there a shortcut method (or even a pasted script) that allows lxml to
> get all the 'user-visible' text?
>
> I'm writing a screen scraper that then takes that text and looks for
> banned words next to an advertiser's content - and therefore I need to run
> a regular expression on everything a user might see (including meta
> keywords, etc...) but I don't care about the actual tags themselves, or
> urls, etc...
>
> Right now, I'm just doing the regex on the entire HTML block.
>
> Thanks,
> Adam


