[Lxml-checkins] r43949 - lxml/branch/html/src/lxml/html

ianb at codespeak.net
Thu May 31 20:46:51 CEST 2007


Author: ianb
Date: Thu May 31 20:46:51 2007
New Revision: 43949

Modified:
   lxml/branch/html/src/lxml/html/clean.py
Log:
a few notes about things I should do

Modified: lxml/branch/html/src/lxml/html/clean.py
==============================================================================
--- lxml/branch/html/src/lxml/html/clean.py	(original)
+++ lxml/branch/html/src/lxml/html/clean.py	Thu May 31 20:46:51 2007
@@ -10,6 +10,16 @@
 #   expression(...)
 # Other on* attributes that aren't standard?
 # Try these tests: http://feedparser.org/tests/wellformed/sanitize/
+# Also http://code.sixapart.com/trac/livejournal/browser/trunk/cgi-bin/cleanhtml.pl
+# IE treats <image> like <img>
+# <layer>...?
+# <head> and <title> is fishy in a fragment
+# max width for words
+# max height?
+# autolink?
+# CSS stuff?
+# remove images?
+
 
 def clean_html(html, **kw):
     """


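The "max width for words" note in the diff above is not spelled out in the commit. One possible reading is that very long unbroken runs of characters should get break opportunities so they cannot stretch a page layout. The sketch below illustrates that reading only. The function name `break_long_words`, its parameters, and the choice of a zero-width space as the default break string are assumptions made for illustration and are not taken from clean.py. It works on plain text, not on parsed HTML.

```python
import re


def break_long_words(text, max_width=40, break_char="\u200b"):
    """Insert break_char into any whitespace-free run longer than max_width.

    Hypothetical helper for illustration; not part of lxml.
    """
    def _split(match):
        word = match.group(0)
        # Cut the run into pieces of at most max_width characters.
        chunks = [word[i:i + max_width] for i in range(0, len(word), max_width)]
        return break_char.join(chunks)

    # Only runs strictly longer than max_width are touched.
    pattern = re.compile(r"\S{%d,}" % (max_width + 1))
    return pattern.sub(_split, text)
```

A real cleaner would presumably apply something like this only to text nodes, so that tag names and attribute values (long URLs, for example) stay intact.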