<pre>2009-06-03, "Stefan Behnel" &lt;stefan_ml@behnel.de&gt; wrote:
>
>qhlonline wrote:
>> When I use lxml with a self-defined target parser, there is a method
>> that can be redefined, data: "def data(self, data):". When is it
>> used?
>
>when you want to receive character content from the document you parse.
>
>
>> And what will it do if we simply write a single line: "return"?
>
>nothing? actually, a "pass" will do in that case, as will not implementing
>the method (IIRC).
>
>
>> Is there any encoding conversion?
>
>You will get either ASCII encoded byte strings or unicode strings, just
>like everywhere else.
>
>BTW, it's sometimes faster to try these things out than to ask a mailing list.
>
>Stefan
<br>Hi Stefan,<br> My last mail mixed up two problems: the &lt;meta charset&gt; problem and the target parser <font color="#800080">data</font> function problem. I have run some tests, and the results show that they are separate. When I do not define the <font color="#800080">data</font> function in my target parser, the decoding error I got while parsing <font color="#800080">http://www.jiayuan.com/</font> goes away. That does not fix the partial parsing caused by the &lt;meta&gt; encoding declaration, though: for example, <font color="#800080">http://www.sina.com/</font> can be parsed, but the result is incomplete.<br> I have dealt with this problem in two ways. The first is to change the content being parsed. After reading the HTML string from <font color="#800080">http://www.sina.com/</font>, I changed every &lt;meta&gt; tag's <font color="#800080">content="charset **"</font> attribute value to <font color="#800080">content=""</font> so that libxml2 does not switch encodings. This method is somewhat dangerous, because in most cases the &lt;meta&gt; declaration should be honored. The second method is for Chinese websites only. The largest Chinese character set at present is GB18030, so I changed the libxml2 source code to always use GB18030 as the decoder. This only solves the problem for Chinese sites whose &lt;meta charset&gt; declaration is wrong (that is, the declared encoding differs from the actual encoding of the page content), and I don't know whether websites in other languages have similarly incorrect &lt;meta&gt; declarations.<br> Although the <font color="#800080">http://www.jiayuan.com/</font> decoding error has been solved, I don't know why. I found the workaround of leaving out the <font color="#800080">data</font> function of my target parser only through a lot of testing, and I am still looking for the reason. Could you give me some suggestions?<br></pre><br>
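The two workarounds described in the mail, plus a target that implements data(), could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the mail. All names (blank_meta_charset, decode_as_gb18030, TextCollector, collect_text) are hypothetical. The regex drops only the charset name instead of emptying the whole content attribute. The decoding is done in Python instead of patching libxml2. The parser is the standard library's xml.etree.ElementTree.XMLParser, used here as a stand-in because it takes the same kind of target object with a data() method; it parses XML, not real-world HTML.

```python
import re
from xml.etree.ElementTree import XMLParser

# Group 1 keeps everything up to "charset=" (plus an optional opening quote);
# the charset name that follows is what gets dropped.
_META_CHARSET = re.compile(
    rb'(<meta\b[^>]*?charset\s*=\s*["\']?)[\w.:-]+', re.IGNORECASE
)


def blank_meta_charset(raw: bytes) -> bytes:
    """First method: remove the charset name from every <meta> tag."""
    return _META_CHARSET.sub(rb"\1", raw)


def decode_as_gb18030(raw: bytes) -> str:
    """Second method: ignore any declaration and always decode as GB18030.

    Undecodable bytes become U+FFFD instead of raising an exception.
    """
    return raw.decode("gb18030", errors="replace")


class TextCollector:
    """Hypothetical parser target that only gathers character data."""

    def __init__(self):
        self.chunks = []

    def data(self, text):
        # May be called several times for one run of text.
        self.chunks.append(text)

    def close(self):
        # Whatever close() returns is what the parser's close() returns.
        return "".join(self.chunks)


def collect_text(markup: str) -> str:
    """Feed already-decoded markup to a parser that reports to TextCollector."""
    parser = XMLParser(target=TextCollector())
    parser.feed(markup)
    return parser.close()


if __name__ == "__main__":
    page = "<p>你好, <b>lxml</b></p>".encode("gb18030")
    print(collect_text(decode_as_gb18030(page)))
```

With lxml itself, the parser classes also take an encoding argument that is documented as overriding the document encoding, which might make patching libxml2 unnecessary. Whether it also cures the partial parsing seen on the sites above is not tested here.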