ansaurus

Question

Python: find <title>

Answer 1

+7 A:

That URL returns a document with <TITLE>...</TITLE> and find is case-sensitive. I strongly suggest you use an HTML parser like Beautiful Soup.

Marcelo Cantos 2010-05-20 10:05:01

Answer 2

+1 A:

Let's analyze why we got that answer. If you open the website and view the source, we note that it doesn't have <title>...</title>. Instead we have <TITLE>...</TITLE>. So what happened to the 2 find calls? Both will be -1!

begin = html.find('<title>')   # Result: -1
end   = html.find('</title>')  # Result: -1

Then begin+len('<title>') will be -1 + 7 = 6. So your final line would be extracting html[6:-1]. It turns out that negative indices actually mean something legitimate in Python (for good reasons). It means to count from the back. Hence -1 here refers to the last character in html. So what you are getting is a substring from the 6th character (inclusive) to the last character (exclusive).

What can we do then? Well, for one, you can use regular expression matcher that ignore case or use a proper HTML parser. If this is a one-off thing and space/performance isn't much of a concern, the quickest approach might be to create a copy of html and lower-cased the entire string:

def get_title(html):
    html_lowered = html.lower();
    begin = html_lowered.find('<title>')
    end = html_lowered.find('</title>')
    if begin == -1 or end == -1:
        return None
    else:
        # Find in the original html
        return html[begin+len('<title>'):end].strip()

Chris Henry 2010-05-20 10:27:47

ansaurus

tags:

views:

answers:

Python: find <title>

related questions