[lxml-dev] Premature end of data in tag - but it looks well formed

Mike MacCana mmaccana at au1.ibm.com
Tue Jul 1 05:13:30 CEST 2008


Hi gents,

Firstly, thanks for lxml. It's by far the nicest tool for someone who
needs to do XPath in Python without being an XML god.

I'm a first-time user of lxml, trying to etree.parse a document. My
code (below) works fine on some sample text, but libxml2 complains about
the real data with:

etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5

The first five lines of the data are below. Line 5 seems OK to me, but
I'm new to XML coding so maybe I'm missing something.
__________________________________
1
2
3 <?xml version="1.0" encoding="iso-8859-1"?>
4 <!DOCTYPE html PUBLIC"-//W3C//DTD XHTML 1.0
Transitional//EN""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
5 <html xmlns="http://www.w3.org/1999/xhtml">
__________________________________

Any ideas? The full code is below.

Cheers,

Mike





#!/usr/bin/env python
import urllib
from StringIO import StringIO

from lxml import etree

# Use http://www.someproxy.com:3128 for http proxying
proxies = {'http': 'http://xpvm:3128'}
url='http://peoplesearch.in.telstra.com.au:8094/peoplesearch/userdetail.aspx?BaseDN=CN=d299061,OU=People,OU=eProfile,DC=PeopleSearch,DC=Telstra,DC=Com'

filehandle = urllib.urlopen(url, proxies=proxies)
print filehandle

## Real html
html=filehandle.read()
## Test html
#html="<foo><bar><baz>underpants</baz></bar></foo>"

print "--------------------------------"
print html
print '=========================='

f = StringIO(html)
tree = etree.parse(f)

## Real xpath
r = tree.xpath('/html/body/div[4]/form/div[3]/div/div/div/div/table/tbody/tr[6]/td')
## Test xpath
#r = tree.xpath('/foo/bar/baz')

print 'length:'
print len(r)
print 'tag:'
print r[0].tag
print 'contents:'
print r[0].text
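
For what it's worth, the error message itself says that the parser reached
the end of the input at line 196 while the html element opened on line 5 was
still unclosed. The sketch below is only a guess at how to narrow that down,
assuming the real page is not well-formed XML (for instance ordinary HTML
with a missing closing tag). The `broken` string and the helper names
`try_xml` and `parse_as_html` are made up for illustration; the real page is
not reproduced here. It shows the strict XML parser rejecting such input and
lxml's `etree.HTMLParser` accepting it.

```python
from __future__ import print_function

from io import BytesIO

from lxml import etree

# Hypothetical stand-in for the real page: the <html> element is never closed.
broken = b"<html><body><p>underpants</p></body>"


def try_xml(data):
    """Return None if data parses as strict XML, else the error message."""
    try:
        etree.parse(BytesIO(data))
    except etree.XMLSyntaxError as err:
        return str(err)
    return None


def parse_as_html(data):
    """Parse data with the lenient HTML parser instead of the XML parser."""
    return etree.parse(BytesIO(data), etree.HTMLParser())


xml_error = try_xml(broken)
print('XML parser said:')
print(xml_error)

tree = parse_as_html(broken)
texts = tree.xpath('/html/body/p/text()')
print('HTML parser found:')
print(texts)
```

If the real data behaves like the stand-in, passing an `etree.HTMLParser()`
to `etree.parse` may be enough to get a tree to run the XPath against,
though the path in the script above might still need adjusting.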



________________________________________________
Mike MacCana
Technical Specialist
Australia Linux and Virtualisation Services

IBM Global Services
Level 14, 60 City Rd
Southgate Vic 3000 

Phone: +61-3-8656-2138
Fax: +61-3-8656-2423
Email: mmaccana at au1.ibm.com


