I have this:

response = urllib2.urlopen(url)
html     = response.read()

begin = html.find('<title>')
end   = html.find('</title>',begin)
title = html[begin+len('<title>'):end].strip()

If the url is http://www.google.com, the title comes out fine as "Google",

but if the url is "http://www.britishcouncil.org/learning-english-gateway", the title becomes

"<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<base href="http://www.britishcouncil.org/" />
<META http-equiv="Content-Type" Content="text/html;charset=utf-8">
<meta name="WT.sp" content="Learning;Home Page Smart View" />
<meta name="WT.cg_n" content="Learn English Gateway" />
<META NAME="DCS.dcsuri" CONTENT="/learning-english-gateway.htm">..."

What is actually happening? Why don't I get the title back?

+7  A: 

That URL returns a document with <TITLE>...</TITLE>, and find is case-sensitive, so the search for the lowercase <title> fails. I strongly suggest you use an HTML parser like Beautiful Soup.
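
As a rough sketch of the parser approach, here is a version that uses the standard library's HTMLParser as a stand-in for Beautiful Soup (which is a third-party package). It assumes Python 3, where the module is html.parser; under Python 2 it is named HTMLParser. The names TitleParser and parse_title are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as 'title'.
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Only keep text that appears between the title tags.
        if self.in_title:
            self.parts.append(data)


def parse_title(html):
    """Return the stripped title text, or None if there is no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts).strip() or None
```

Because the parser normalizes tag names, the same function handles <title>, <TITLE>, and mixed-case variants without any extra work.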

Marcelo Cantos
+1  A: 

Let's analyze why you got that result. If you open the page and view its source, you will see that it doesn't contain <title>...</title>; it contains <TITLE>...</TITLE> instead. So what happens to the two find calls? Both return -1:

begin = html.find('<title>')   # Result: -1
end   = html.find('</title>')  # Result: -1

Then begin+len('<title>') is -1 + 7 = 6, so your final line extracts html[6:-1]. Negative indices are legitimate in Python (for good reasons): they count from the end of the sequence, so -1 here refers to the last character in html. What you are getting is therefore the substring from index 6 (inclusive, i.e. the seventh character) up to, but not including, the last character.
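
To make the arithmetic concrete, here is a small sketch that reproduces the behaviour on a made-up document. The html string is an assumption for illustration, not the real page.

```python
# A made-up document whose title tag is upper case, like the real page.
html = '<!doctype html><HTML><HEAD><TITLE>Learn English</TITLE></HEAD></HTML>'

begin = html.find('<title>')          # -1, because find is case-sensitive
end = html.find('</title>', begin)    # -1 as well

start = begin + len('<title>')        # -1 + 7 == 6
title = html[start:end]               # same slice as html[6:-1]

# The "title" is almost the whole document: everything from index 6
# up to, but not including, the last character.
print(title == html[6:len(html) - 1])
```

The slice silently succeeds instead of raising an error, which is why a failed find has to be checked explicitly before slicing.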

What can you do then? For one, you can use a regular expression that ignores case, or use a proper HTML parser. If this is a one-off and space/performance isn't much of a concern, the quickest approach might be to search a lower-cased copy of html:

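def get_title(html):
    # Search a lower-cased copy so <TITLE> and <title> both match.
    html_lowered = html.lower()
    begin = html_lowered.find('<title>')
    if begin == -1:
        return None
    # Look for the closing tag only after the opening tag.
    end = html_lowered.find('</title>', begin)
    if end == -1:
        return None
    # Slice the original html so the title keeps its original case.
    return html[begin + len('<title>'):end].strip()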
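
The regular-expression option mentioned above could look roughly like this. It is a minimal sketch, not a robust HTML parser, and the names TITLE_RE and get_title_regex are hypothetical.

```python
import re

# IGNORECASE matches <title>, <TITLE>, <Title>, ...
# DOTALL lets the title text span several lines.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>',
                      re.IGNORECASE | re.DOTALL)


def get_title_regex(html):
    """Return the stripped title text, or None if no title is found."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return match.group(1).strip()
```

Unlike the lower-casing approach, this does not need a second copy of the document, but it can still be fooled by unusual markup, such as a title-like string inside a comment.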
Chris Henry