[lxml-dev] Getting 'user-visible' text from HTML

Adam Nelson adam at varud.com
Fri Jul 24 16:30:03 CEST 2009


Is there a shortcut method (or even a pasted script) that allows lxml to
get all the 'user-visible' text?

I'm writing a screen scraper that takes that text and looks for banned
words next to an advertiser's content. I therefore need to run a regular
expression on everything a user might see (including meta keywords,
etc.), but I don't care about the actual tags themselves, URLs, and so on.

Right now, I'm just doing the regex on the entire HTML block.
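
For illustration, here is a minimal sketch of the kind of helper being
asked about. It is not a built-in lxml shortcut. The function name
user_visible_text is hypothetical, and the choice of what counts as
'user-visible' (text nodes outside script/style/noscript, plus meta
keywords/description and img alt attributes) is an assumption that would
need adjusting per site.

```python
import re

import lxml.html


def user_visible_text(html):
    """Hypothetical helper: collect text a user might plausibly see."""
    doc = lxml.html.document_fromstring(html)

    # Assumption: text inside these elements is never shown to the user.
    # drop_tree() removes the element but keeps the tail text after it.
    for el in doc.xpath("//script | //style | //noscript"):
        el.drop_tree()

    parts = []
    # Meta keywords/description and image alt text live in attributes,
    # so they have to be collected separately from the text nodes.
    parts.extend(
        doc.xpath("//meta[@name='keywords' or @name='description']/@content")
    )
    parts.extend(doc.xpath("//img/@alt"))

    # text() selects text nodes only, so HTML comments are skipped.
    parts.extend(doc.xpath("//text()"))

    # Join with spaces so adjacent blocks do not run together, then
    # collapse all runs of whitespace to a single space.
    return re.sub(r"\s+", " ", " ".join(parts)).strip()


if __name__ == "__main__":
    page = (
        "<html><head><title>Shop</title>"
        "<meta name='keywords' content='cheap, pills'></head>"
        "<body><p>Hello <b>world</b></p>"
        "<script>var x = 'hidden';</script></body></html>"
    )
    print(user_visible_text(page))
```

The banned-word regex can then run on the returned string instead of the
raw HTML. One caveat: joining text nodes with spaces also splits words
that are broken up by inline tags (e.g. ba<b>nn</b>ed), so a stricter
check might want to join inline text without a separator.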

Thanks,
Adam


