[lxml-dev] A quick and simple xpath solution for nasty HTML (was Re: Premature end of data in tag - but it looks well formed)

Mike MacCana mmaccana at au1.ibm.com
Tue Jul 1 11:12:58 CEST 2008


Ladies and gentlemen,

On Tue, 2008-07-01 at 07:24 +0200, Stefan Behnel wrote:
> Hi,
> 
> Mike MacCana wrote:
> > Hi gents,
> 
> Are you sure you don't want advice from any girls?
> 
> 
> > I'm a first time user of lxml attempting to etree.parse a document. My
> > code (below) works fine on some sample text, but libxml complains about
> > the real data with:
> > 
> > etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5
> > 
> > The data is below. Line 5 seems OK to me, but I'm new to XML coding so
> > maybe I'm missing something.
> 
> The problem is not in line 5 (where the html tag starts) but in line 196,
> where it apparently ends. Try validating it at the W3C validator if you
> don't believe lxml. ;)


Thanks Stefan.

I solved the crap HTML problem as follows. Hopefully the following will
be useful to anyone beginning XPath with lxml.

#!/usr/bin/env python
import os
import urllib

from lxml import etree
from StringIO import StringIO
from lxml.html.clean import Cleaner

## Point this at your XP VM used to get to Telstra
proxies = {'http': 'http://xpvm:3128'}
url='http://domain.com/page'

## Filter predicate: keep only printable ASCII characters (codes 32-126).
## Note that this also drops control characters such as newlines and tabs.
## See http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters
## for list
def onlyascii(char):
	return 32 <= ord(char) <= 126

## Open the URL and read its contents
filehandle = urllib.urlopen(url, proxies=proxies)
html=filehandle.read()
asciihtml=filter(onlyascii, html)

## Customer's HTML content is REALLY bad. Clean it.
## See http://codespeak.net/lxml/lxmlhtml.html#cleaning-up-html
## and 'pydoc lxml.html.clean.Cleaner'

## Clean HTML and strip a bunch of tags that are broken and that we don't
## care about.
badtags=['img','a','div','span','h2','h1','style','title','ul','li','col']
cleaner = Cleaner(page_structure=False, links=False,
                  remove_tags=badtags)

## We can now access our cleaned content as 'cleanedcontent'
cleanedcontent=cleaner.clean_html(asciihtml)

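## Save clean content to disk for debugging purposes.
## Mode 'w' truncates any existing file, so no os.remove is needed
## (os.remove would raise OSError on the first run, when there is no file yet).
outputfile = open('debug.html','w')
outputfile.write(cleanedcontent)
outputfile.close()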

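## Go parse our content. The recovering parser has to be passed to
## etree.parse, otherwise the default (strict) parser is used.
cleanedcontentstringio = StringIO(cleanedcontent)
parser = etree.XMLParser(recover=True)
tree = etree.parse(cleanedcontentstringio, parser)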

## XPath locations of what we're interested in (element zero is all we
## care about).
## .text is the text within the tags; strip() removes surrounding whitespace.
## You can find XPath locations by loading up 'debug.html' in Firefox
## with the Firebug extension.
name = tree.xpath('/html/body/table/tbody/tr/td')[0].text.strip()
email = tree.xpath('/html/body/table/tbody/tr[7]/td')[0].text.strip().lower()

print name+","+email
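
For anyone on Python 3, the core of the script above (strip unwanted bytes,
parse with a recovering parser, pull a value out with XPath) might look
roughly like the sketch below. It is not a port of the whole script: the
download and the Cleaner step are left out, and the function names
only_printable_ascii and first_cell_text and the sample markup are made up
for illustration. Unlike onlyascii above, it keeps tabs and newlines.

```python
from io import BytesIO

from lxml import etree  # third-party; the same library the script above uses


def only_printable_ascii(data: bytes) -> bytes:
    """Keep bytes 32..126 plus tab, LF and CR; drop everything else."""
    keep = {9, 10, 13}
    return bytes(b for b in data if 32 <= b <= 126 or b in keep)


def first_cell_text(markup: bytes, path: str) -> str:
    """Parse leniently and return the stripped text of the first XPath match."""
    parser = etree.XMLParser(recover=True)
    # The parser has to be handed to etree.parse explicitly.
    tree = etree.parse(BytesIO(markup), parser)
    matches = tree.xpath(path)
    if not matches or matches[0].text is None:
        return ""
    return matches[0].text.strip()


# Hypothetical sample: one stray non-ASCII byte, and </body></html> missing.
sample = b"<html><body><table><tr><td> Jane Doe\xa0</td></tr></table>"
name = first_cell_text(only_printable_ascii(sample), "/html/body/table/tr/td")
print(name)
```

With recover=True the parser builds whatever tree it can from the unclosed
tags instead of raising XMLSyntaxError, so the XPath lookup still has
something to work on.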

Cheers,

Mike

________________________________________________
Mike MacCana
Technical Specialist
Australia Linux and Virtualisation Services

IBM Global Services
Level 14, 60 City Rd
Southgate Vic 3000 

Phone: +61-3-8656-2138
Fax: +61-3-8656-2423
Email: mmaccana at au1.ibm.com


