I have written a Python program to find the carrier of a cell phone given its number. It downloads the source of http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1 (where 1112223333 is the phone number to look up) and saves it as carrier.html. In the source, the carrier is on the line after the <div class="carrier_result"> tag.

My program currently searches the file and finds the line containing the div tag, but now I need a way to store the next line after that as a string. My current code is: http://pastebin.com/MSDN0vbC

+2  A: 

You should be using an HTML parser such as BeautifulSoup or lxml instead.

Ignacio Vazquez-Abrams
Could you explain to me how to do this with either of those?
ErikT
`soup.find('div', {'class': 'carrier_result'}).text`
Ignacio Vazquez-Abrams
Thank you for the example
ErikT
+2  A: 

What you really want to be doing is parsing the HTML properly. Use the BeautifulSoup library - it's wonderful at doing so.

Sample code:

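import urllib2, BeautifulSoup

# Replace the default User-agent header so the site accepts the request.
opener = urllib2.build_opener()
opener.addheaders[0] = ('User-agent', 'Mozilla/5.1')

# Fetch the lookup page for the number 1112223333.
response = opener.open('http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1').read()

# The carrier name is the first node after the opening div tag.
bs = BeautifulSoup.BeautifulSoup(response)
print bs.findAll('div', attrs={'class': 'carrier_result'})[0].next.strip()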
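For readers without BeautifulSoup installed, the same "parse the HTML properly" idea can be sketched with Python 3's standard-library html.parser. This is only an illustration: the class name CarrierParser, the helper extract_carrier, and the sample markup are made up here, and the real page layout may differ.

```python
from html.parser import HTMLParser


class CarrierParser(HTMLParser):
    """Collects the text found inside <div class="carrier_result">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside the target div
        self.chunks = []    # text fragments seen inside the target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested div inside the target div
        elif "carrier_result" in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_carrier(html):
    """Return the whitespace-normalized text of the carrier_result div."""
    parser = CarrierParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical sample markup, not the actual whitepages.com response.
sample = '<html><body><div class="carrier_result">\n  Example Wireless\n</div></body></html>'
print(extract_carrier(sample))
```

Because the parser tracks tags rather than lines, it does not matter whether the carrier sits on the same line as the div tag or on the next one.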
MikeyB
Could you please explain how to do this with BeautifulSoup? I looked at their website and was confused.
ErikT
Be wary of 'working around' a website's controls - this may draw their ire.
MikeyB
Thank you, and thanks for the advice too. I'll keep that in mind, though I'll probably have to stick with it this way as I have yet to find another way to find a carrier given the cellphone number.
ErikT
+2  A: 

To get the next line, you can use:

with open('carrier.html', 'r') as htmlsource:
    for line in htmlsource:
        if '<div class="carrier_result">' in line:
            # The carrier is on the line right after the div tag.
            nextline = next(htmlsource, '')
            print(nextline)

A "better" way is to split on </div> and then pick out the part you want, because the tag and the carrier can sometimes appear on the same line, in which case next() gives the wrong result. For example:

data = open("carrier.html").read().split("</div>")
for item in data:
    if '<div class="carrier_result">' in item:
        # Keep only the text that follows the opening div tag.
        print(item.split('<div class="carrier_result">')[-1].strip())
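To see why splitting on </div> is more robust than reading the next line, here is a small self-contained comparison in Python 3. The function names and both sample strings are invented for illustration; they are not taken from the real page.

```python
TAG = '<div class="carrier_result">'


def carrier_by_next_line(lines):
    """Return the line following the one that contains TAG, or None."""
    lines = iter(lines)
    for line in lines:
        if TAG in line:
            return next(lines, "").strip()
    return None


def carrier_by_split(text):
    """Return the text between TAG and the following </div>, or None."""
    for item in text.split("</div>"):
        if TAG in item:
            return item.split(TAG)[-1].strip()
    return None


# Hypothetical markup: carrier on its own line vs. on the same line as the tag.
multi_line = '<div class="carrier_result">\nExample Wireless\n</div>\n'
single_line = '<div class="carrier_result">Example Wireless</div>\n<p>footer</p>\n'

print(carrier_by_next_line(multi_line.splitlines()))   # carrier found
print(carrier_by_next_line(single_line.splitlines()))  # picks up the footer line instead
print(carrier_by_split(multi_line))                    # carrier found
print(carrier_by_split(single_line))                   # carrier found
```

The next-line approach only works when the page keeps the tag and the carrier on separate lines, while the split approach handles both layouts.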

By the way, if possible, try to use Python's own web modules, such as urllib or urllib2, instead of calling an external wget.

ghostdog74
Thank you. Your answer was the only one that did not use BeautifulSoup, but since so many other answers did, I might try it both ways. I tried using urllib, but it did not work because the website only allows views from certain browsers (which is why I had to call wget with a specific user agent). If there is a way to use urllib while faking a user agent, please tell me, as I would much rather not have to call wget.
ErikT
Heh, I noticed that too and am about to post a workaround... be careful as this may piss them off.
MikeyB
If you look at the urllib2 documentation at http://docs.python.org/library/urllib2.html, near the bottom of the page there are examples of adding HTTP headers to your requests. I'm not sure it will work for you, but you can give it a try. As for BeautifulSoup and the like, I believe you should ideally use it, but I also believe that if the problem you are trying to solve is simple enough, there is no need for it. Python's built-ins will do.
ghostdog74
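As a rough sketch of the header approach mentioned above, this is how a custom User-Agent can be attached in Python 3, where urllib2's functionality lives in urllib.request. The helper name build_request and the "Mozilla/5.0" string are assumptions for illustration, and whether the site accepts the request (or still serves this URL) is not something this snippet can confirm.

```python
import urllib.request
from urllib.parse import urlencode


def build_request(number, user_agent="Mozilla/5.0"):
    """Build a request for the carrier lookup page with a custom User-Agent."""
    query = urlencode({"carrier": "other", "number_0": number, "response": 1})
    url = "http://www.whitepages.com/carrier_lookup?" + query
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("1112223333")
print(req.full_url)
print(req.get_header("User-agent"))  # Request stores header names capitalized this way

# Fetching needs network access and is left commented out here:
# html = urllib.request.urlopen(req).read().decode("utf-8", "replace")
```

Building the query string with urlencode also avoids hand-writing the separators between parameters.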
Your answer was good, but MikeyB's is more efficient and makes good use of BeautifulSoup.
ErikT