[lxml-dev] Getting 'user-visible' text from HTML
Ted Dziuba
ted at milo.com
Fri Jul 24 18:00:41 CEST 2009
So, a user will likely never see meta keywords displayed, because they live
in the content attribute of the <meta> tag rather than in the page text.
However, if you just want all of the text contained within a document, try
this:
from lxml import html
tree = html.fromstring(text_of_html)
all_text = tree.text_content()
And then run your regexes against all_text. Note that text_content() only
returns text nodes, never attribute values, so the meta keywords will not be
in there. One more caveat is that text_content() recursively gives you all
text children of the nodes, so it will pull in JavaScript contained within
<script> tags (and CSS within <style> tags). If that's a problem, you can
remove those nodes from the tree before calling text_content().
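
A rough sketch of that cleanup, in case it helps. The function name
visible_text is made up for illustration, and treating <style> the same way
as <script> and appending the meta keywords are my assumptions about what
you want, not something lxml does for you:

```python
from lxml import html


def visible_text(text_of_html):
    """Return page text without script/style bodies, plus meta keywords.

    Hypothetical helper: assumes text_of_html is a full HTML document.
    """
    tree = html.fromstring(text_of_html)

    # drop_tree() removes the element and its children from the tree
    # but keeps any tail text that follows the element.
    for node in tree.xpath('//script | //style'):
        node.drop_tree()

    parts = [tree.text_content()]

    # Meta keywords live in an attribute, so text_content() never sees
    # them; pull them out separately. The match on name is case-sensitive.
    parts.extend(tree.xpath('//meta[@name="keywords"]/@content'))

    return ' '.join(parts)
```

You would then run your regexes against visible_text(text_of_html) instead
of all_text.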
Ted
On Fri, Jul 24, 2009 at 7:30 AM, Adam Nelson <adam at varud.com> wrote:
> Is there a shortcut method (or even a pasted script) that allows lxml to
> get all the 'user-visible' text?
>
> I'm writing a screen scraper that then takes that text and looks for
> banned words next to an advertiser's content - and therefore I need to
> run a regular expression on everything a user might see (including meta
> keywords, etc...) but I don't care about the actual tags themselves, or
> urls, etc...
>
> Right now, I'm just doing the regex on the entire HTML block.
>
> Thanks,
> Adam
--
Ted Dziuba
Co-Founder and Engineer
Milo.com, Inc.
165 University Avenue
Palo Alto, CA, 94301
http://milo.com
Cell: (609)-665-2639