[lxml-dev] Getting 'user-visible' text from HTML

Terry Brown terry_n_brown at yahoo.com
Fri Jul 24 18:05:16 CEST 2009


On Fri, 24 Jul 2009 14:30:03 +0000 (UTC)
Adam Nelson <adam at varud.com> wrote:

> Is there a shortcut method (or even a pasted script) that allows lxml
> to get all the 'user-visible' text?  

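doc.xpath("//text()") should return a list of every text node in the
document, where doc is the parsed HTML tree.  Attribute values such as
meta keywords are not text nodes, so they need their own query.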
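
A minimal sketch of how that could look for the use case in the question,
assuming the page is parsed with lxml.html. The sample markup, the
BANNED pattern and the helper name visible_text are made up for
illustration. Since //text() also matches the contents of <script> and
<style> elements, the sketch filters those out, and it picks up the meta
keywords/description through a separate attribute query:

```python
import re
import lxml.html

# Hypothetical sample page and banned-word pattern, for illustration only.
SAMPLE = """<html><head><title>Shop</title>
<meta name="keywords" content="cheap, deals">
<style>p { color: red }</style>
<script>var x = "hidden";</script>
</head><body><p>Hello <b>world</b></p></body></html>"""

BANNED = re.compile(r"\b(cheap|hidden)\b", re.IGNORECASE)


def visible_text(html):
    """Return the non-empty text pieces of an HTML string as a list."""
    doc = lxml.html.fromstring(html)
    # Every text node except those inside <script> or <style>.
    parts = doc.xpath(
        "//text()[not(ancestor::script) and not(ancestor::style)]"
    )
    # Attribute values are not text nodes, so query them separately.
    parts += doc.xpath(
        "//meta[@name='keywords' or @name='description']/@content"
    )
    # Drop whitespace-only pieces and trim the rest.
    return [p.strip() for p in parts if p.strip()]


texts = visible_text(SAMPLE)
hits = [t for t in texts if BANNED.search(t)]
print(texts)
print(hits)
```

Running the regex per piece rather than on the raw HTML keeps tag names,
URLs and script source out of the match.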

Cheers -Terry

> I'm writing a screen scraper that then takes that text and looks for
> banned words next to an advertiser's content - and therefore I need
> to run a regular expression on everything a user might see (including
> meta keywords, etc...) but I don't care about the actual tags
> themselves, or urls, etc...
> 
> Right now, I'm just doing the regex on the entire HTML block.
> 
> Thanks,
> Adam