Hey guys, I'm not trying to do anything malicious here, I just need to do some homework. I'm a fairly new programmer using Python 3.0, and I'm having difficulty using recursion for problem-solving. I've been stuck on this question for quite a while. Here's the assignment:
Write a recursive method spam(url, n) that takes the URL of a web page and a non-negative integer n as input, collects all the email addresses contained in the web page and adds them to a global dictionary variable spam_dict, and then recursively calls itself on every http link contained in the web page.
You will use a dictionary so only one copy of every email address is saved; your dictionary will store (key,value) pairs (email, email). The recursive call should use the parameter n-1 instead of n. If n = 0, you should collect the email addresses but no recursive calls should be made. The parameter n is used to limit the recursion to at most depth n.
You will need to use the solutions of the two above problems; your method spam() will call the methods links2() and emails() and possibly other functions as well.
Notes:
- running spam() directly will produce no output on the screen; to find your spam_dict, you will need to read the value of spam_dict, and you will also need to reset it to the empty dictionary before every run of spam.
- Recall how global variables are used.
Usage:
>>> spam_dict = {}
>>> spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',0)
>>> spam_dict.keys()
dict_keys([])
>>> spam_dict = {}
>>> spam('http://reed.cs.depaul.edu/lperkovic/csc242/test1.html',1)
>>> spam_dict.keys()
dict_keys(['[email protected]', '[email protected]'])
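To make the role of n concrete, the assignment text above can be read roughly as the sketch below. It is only one possible reading, not the instructor's solution. So that it runs without network access, links2() and emails() are replaced here by hypothetical stand-ins that look pages up in a made-up FAKE_WEB dictionary and return lists; the example.com URLs and addresses are invented for illustration.

```python
# Hypothetical in-memory "web" so the sketch runs offline (not part of the assignment).
FAKE_WEB = {
    "http://example.com/a": {
        "emails": ["alice@example.com"],
        "links": ["http://example.com/b"],
    },
    "http://example.com/b": {
        "emails": ["bob@example.org", "alice@example.com"],
        "links": ["http://example.com/a"],
    },
}

spam_dict = {}  # global dictionary of (email, email) pairs


def links2(url):
    """Stand-in for the real links2(): assumed to RETURN the list of links."""
    return FAKE_WEB.get(url, {}).get("links", [])


def emails(url):
    """Stand-in for the real emails(): assumed to RETURN the list of addresses."""
    return FAKE_WEB.get(url, {}).get("emails", [])


def spam(url, n):
    # Declared for clarity; strictly required only if the function rebinds the name.
    global spam_dict
    # Always collect the addresses on the current page, even when n == 0.
    for address in emails(url):
        spam_dict[address] = address  # duplicate keys simply overwrite themselves
    # Base case: depth budget used up, so follow no links.
    if n == 0:
        return
    # Recursive case: visit every link with one less level of depth.
    for link in links2(url):
        spam(link, n - 1)
```

With this fake data, spam('http://example.com/a', 0) stores only the address found on page a, while a depth of 1 also picks up the addresses on page b. The recursion is over the links, not over the dictionary: the dictionary is just the shared place where every call records what it found, and n shrinks by one per hop, so the calls always stop even when pages link back to each other.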
So far, I've written a function that traverses a web page and puts all of its links in a nice little list, and what I wanted to do was call that function. But why would I use recursion on a dictionary? And how? I don't understand how n ties into all of this.
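from urllib.request import urlopen
from urllib.parse import urljoin

def links2(url):
    # MyHTMLParser is my parser class (not shown); get() returns the links it found
    content = str(urlopen(url).read())
    myparser = MyHTMLParser()
    myparser.feed(content)
    lst = myparser.get()
    mergelst = []
    for link in lst:
        # resolve relative links against the page's own URL
        mergelst.append(urljoin(url, link))
    print(mergelst)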
Any input (except why spam is bad) would be greatly appreciated. Also, I realize that the above function could probably look better; if you have a better way to do it, I'm all ears. However, all I really need is for the program to produce the proper output.
Added:
I wrote a function that collects emails from a page, but I'm not sure how to lump .com and .edu and .org all together.
from re import findall

def emails(url):
    links = str(links3(url))
    # how do I construct pattern?
    pattern = r'[A-Za-z0-9_.]+\@[A-Za-z0-9_.]+.com\.edu\.org'
    lst = findall(pattern, links)
    print(lst)
How do I tell python that? I can't find it in the documentation.
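One way to lump the endings together is regex alternation: the | operator inside a group means "any one of these". The sketch below is a minimal illustration limited to the three endings mentioned above, not a full email validator. The names EMAIL_PATTERN, find_emails, and the sample text are made up for the example.

```python
import re

# Assumed pattern: name, "@", domain, then exactly one of .com / .edu / .org.
# (?:...) is a non-capturing group, so findall() returns whole matches
# rather than just the part inside the parentheses.
# \b stops ".com" from matching inside a longer word such as ".company".
EMAIL_PATTERN = r'[A-Za-z0-9_.]+@[A-Za-z0-9_.]+\.(?:com|edu|org)\b'


def find_emails(text):
    """Return every substring of text that matches EMAIL_PATTERN."""
    return re.findall(EMAIL_PATTERN, text)


sample = ("Contact alice@example.com, bob@cs.example.edu "
          "or carol@example.org; not dave@example.net.")
print(find_emails(sample))
# prints ['alice@example.com', 'bob@cs.example.edu', 'carol@example.org']
```

The relevant part of the documentation is the re module's section on regular expression syntax, under the entries for | and (?:...).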