If you are parsing HTML or XML (with Python) and looking for certain tags, it can hurt performance to lowercase or uppercase an entire document just so that your comparisons are accurate. Roughly what percentage of XML and HTML documents use any uppercase characters in their tags?

+5  A: 

XML (and XHTML) tags are case-sensitive ... so <this> and <tHis> would be different elements.

However, as a rough estimate, a lot of HTML (not XHTML) pages use random-case tags.
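
To make the difference concrete, here is a minimal sketch using only the standard library. It assumes `xml.etree.ElementTree` for the XML side and `html.parser.HTMLParser` for the HTML side; the `TagCollector` class is a hypothetical helper written for this illustration, not part of either module.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# XML: tag names are compared exactly, so <this> and <tHis> are distinct elements.
root = ET.fromstring("<root><this/><tHis/></root>")
xml_tags = [child.tag for child in root]
print(xml_tags)                    # ['this', 'tHis']
print(len(root.findall("this")))   # 1 (only the exact-case match)


class TagCollector(HTMLParser):
    """Hypothetical helper: records start-tag names as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes the tag name already converted to lowercase.
        self.tags.append(tag)


collector = TagCollector()
collector.feed("<DIV><p>Hi</P></Div>")
collector.close()
html_tags = collector.tags
print(html_tags)                   # ['div', 'p']
```

With the standard-library HTML parser, tag names arrive already lowercased, so there is no need to lowercase the whole document first.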

ChrisW
+2  A: 

This matters only if you're using XHTML, which is case-sensitive; HTML is not, so you can ignore case differences there. Test for the doctype before worrying about case.
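
A rough sketch of that doctype test might look like the following. The names `looks_like_xhtml` and `tag_matches` are hypothetical helpers made up for this example, and the check assumes the doctype appears within the first 1024 characters; a doctype is only a hint about how the page is really handled.

```python
import re

# Matches a <!DOCTYPE ...> declaration and captures everything up to the closing '>'.
_DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+([^>]*)>", re.IGNORECASE)


def looks_like_xhtml(document: str) -> bool:
    """Return True if the doctype declaration mentions XHTML."""
    match = _DOCTYPE_RE.search(document[:1024])
    return bool(match) and "xhtml" in match.group(1).lower()


def tag_matches(found: str, wanted: str, case_sensitive: bool) -> bool:
    """Compare two tag names, exactly or ignoring case."""
    if case_sensitive:
        return found == wanted
    return found.lower() == wanted.lower()


xhtml_doc = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html></html>'
)
html_doc = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"><HTML></HTML>'

# Only the XHTML document gets case-sensitive comparisons.
print(tag_matches("HTML", "html", looks_like_xhtml(xhtml_doc)))  # False
print(tag_matches("HTML", "html", looks_like_xhtml(html_doc)))   # True
```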

Ira Rainey
Of course, the web is filled with sites whose doctype says XHTML but whose root node lacks a namespace. If such a page isn't served as XML, any parser jumps to quirks mode anyway and just ignores case, which is probably what the author intended, since the code was leeched together from 20 other pages.
Lajla
+1  A: 

I think you're overly concerned about performance. If you're talking about arbitrary web pages, 90% of them will be HTML, not XHTML, so you should do case-insensitive comparisons. Lowercasing a string is extremely fast, and should be less than 1% of the total time of your parser. If you're not sure, carefully time your parser on a document that's already all lowercase, with and without the lowercase conversions.
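
One way to run that kind of timing, as a minimal sketch with the standard `timeit` module: the tag list and the two comparison functions below are made up for illustration, and they measure only the comparison step, not a whole parser, so the numbers should be read relative to your parser's total run time.

```python
import timeit

# Hypothetical sample: 7000 tag names that are already all lowercase.
sample_tags = ["div", "span", "p", "a", "table", "tr", "td"] * 1000


def compare_plain(tags, wanted="td"):
    """Count matches using exact comparison."""
    return sum(1 for tag in tags if tag == wanted)


def compare_lowered(tags, wanted="td"):
    """Count matches after lowercasing each tag name."""
    return sum(1 for tag in tags if tag.lower() == wanted)


plain_seconds = timeit.timeit(lambda: compare_plain(sample_tags), number=100)
lowered_seconds = timeit.timeit(lambda: compare_lowered(sample_tags), number=100)

print(f"exact comparison:   {plain_seconds:.4f} s")
print(f"with tag.lower():   {lowered_seconds:.4f} s")
```

Both functions return the same count on lowercase input, so any difference in the two timings is the cost of the `lower()` calls alone.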

Even a pure-Python implementation of lower() would be negligible compared to the rest of the parsing, but it's better than that - CPython implements lower() in C code, so it really is as fast as possible.

Remember, premature optimization is the root of all evil. Make your program correct first, then make it fast.

dmazzoni