[lxml-dev] Getting 'user-visible' text from HTML
Terry Brown
terry_n_brown at yahoo.com
Fri Jul 24 18:05:16 CEST 2009
On Fri, 24 Jul 2009 14:30:03 +0000 (UTC)
Adam Nelson <adam at varud.com> wrote:
> Is there a shortcut method (or even a pasted script) that allows lxml
> to get all the 'user-visible' text?
doc.xpath("//text()") should return a list of every text node in
the HTML document.
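
A minimal sketch of that idea, assuming the page is parsed with
lxml.html.fromstring. Two caveats apply: //text() also returns the
contents of script and style elements, and it does not return attribute
values such as meta keywords. The sketch therefore filters out the
former and fetches the latter with a second query. The helper name
visible_text and the SAMPLE page are made up for illustration.

```python
import lxml.html

# Hypothetical sample page, used only to exercise the helper below.
SAMPLE = """
<html>
  <head>
    <title>Demo page</title>
    <meta name="keywords" content="alpha, beta">
    <style>body { color: red; }</style>
    <script>var hidden = 1;</script>
  </head>
  <body>
    <p>Hello <b>world</b></p>
  </body>
</html>
"""


def visible_text(html_source):
    """Return a list of non-empty text strings a reader might see."""
    doc = lxml.html.fromstring(html_source)
    # All text nodes except those directly inside <script> or <style>.
    nodes = doc.xpath("//text()[not(parent::script or parent::style)]")
    # Attribute values are not text nodes, so query meta content separately.
    nodes += doc.xpath(
        "//meta[@name='keywords' or @name='description']/@content"
    )
    # Trim each string and drop the whitespace-only ones.
    return [s.strip() for s in nodes if s.strip()]


if __name__ == "__main__":
    print(visible_text(SAMPLE))
```

Joining the returned list with spaces gives a single string that a
regular expression can be run against, instead of the raw HTML block.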
Cheers -Terry
> I'm writing a screen scraper that then takes that text and looks for
> banned words next to an advertiser's content - and therefore I need to
> run a regular expression on everything a user might see (including
> meta keywords, etc.) but I don't care about the actual tags
> themselves, or URLs, etc.
>
> Right now, I'm just doing the regex on the entire HTML block.
>
> Thanks,
> Adam