ansaurus

Question

Creating a spam list with a web crawler in python

Answer 1

+2 A:

Think about how recursion works. What you want is for your function to be able to call itself in some cases. Here, you need to add a parameter for the recursion level to your function, and then figure out what it should do in each case.

At the most basic level, what should it do with n=0? (hint: you've about got it already)

What should it do if n=1? You probably want to call your function again on each element of your existing list with n=0.

What about if n is greater than 1? You want to call your function again with n = n-1 on each element you've got so far.
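The three cases above can be sketched as follows. This is only an illustration under stated assumptions: `FAKE_WEB` and `collect_emails` are hypothetical names, and the dictionary stands in for real page fetching so the recursion can be run without a network.

```python
# Hypothetical in-memory "web": each URL maps to
# (emails found on the page, links found on the page).
# A real crawler would fetch and parse pages instead of reading this dict.
FAKE_WEB = {
    "a": (["a@x.com"], ["b", "c"]),
    "b": (["b@x.edu"], ["d"]),
    "c": (["c@x.org"], []),
    "d": (["d@x.com"], []),
}


def collect_emails(url, n, web=FAKE_WEB):
    """Return the set of emails on `url` and on pages up to n links away."""
    emails, links = web.get(url, ([], []))
    found = set(emails)  # n == 0: only this page's emails
    if n > 0:
        for link in links:
            # n >= 1: same function, one level less, on each linked page
            found |= collect_emails(link, n - 1, web)
    return found
```

With n=0 only the starting page is scanned; each extra level follows the links one step further, and the recursion always ends because n shrinks on every call.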

Paul McMillan 2010-05-01 01:02:18

quick question... I wrote a function that collects emails from a page, but I'm not sure how to lump .com and .edu and .org all together.

    from re import findall

    def emails(url):
        links = str(links3(url))
        pattern = '[A-Za-z0-9_.]+\@[A-Za-z0-9_.]+.com\.edu\.org'  # how do I do the above properly?
        lst = findall(pattern, links)
        print(lst)

How do I tell Python that? I can't find it in the documentation.

ptabatt 2010-05-01 01:26:26

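Simple regexp alternation is as easy as `...@[\w\.]+\.(?:com|edu|org)`. The group is written as non-capturing, `(?:...)`, because `findall` returns only the group's text when the pattern contains a capturing group.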
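A minimal sketch of that alternation with `re.findall`, reusing the character classes from the question. The names `EMAIL_PATTERN`, `find_emails`, and `sample` are made up for illustration, and the pattern is deliberately simple rather than a full email validator.

```python
import re

# (?:...) is a non-capturing group, so findall returns whole addresses.
# With a capturing group (com|edu|org), findall would return only the suffixes.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9_.]+@[A-Za-z0-9_.]+\.(?:com|edu|org)\b")


def find_emails(text):
    """Return every .com, .edu, or .org address found in text."""
    return EMAIL_PATTERN.findall(text)


sample = (
    "Write to alice@example.com, bob@cs.school.edu or "
    "carol@charity.org; not dave@example.net."
)
print(find_emails(sample))
# ['alice@example.com', 'bob@cs.school.edu', 'carol@charity.org']
```

The trailing `\b` keeps the pattern from stopping in the middle of a longer suffix such as `.community`.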

msw 2010-05-01 08:48:07

Also, it is better to add to your original question than to try to cram it into this little box. I did it for you (see "Added:" above)

msw 2010-05-01 08:57:32

Answer 2

+1 A:

n would play into it, as the problem states, by limiting the recursion to a maximum "call depth".

The idea is that since you're recursively invoking the scanning for emails from an already-running scan, you build up a call stack of what called what that gets deeper and deeper as you continue to recursively call the scanner.

You don't want it to go on forever, so as one of the arguments you pass an integer that you decrement each time you make a call. When it reaches 0, you stop doing recursive calls and let the sequence of recursions unwind itself.

call 1 (args..., n=3)
    call 2a (args..., n=2)
        call 3 (args..., n=1)
            call 4a (args..., n=0) <-- these calls won't call more scans
            call 4b (args..., n=0) <-- because n=0, so this is max depth
    call 2b (args..., n=2)
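To watch the counter shrink as the stack deepens, here is a small tracing sketch. `trace_calls` and its `branching` parameter are hypothetical and stand in for "scan each link found on the page"; nothing here touches the network.

```python
def trace_calls(n, branching=2, depth=0, log=None):
    """Record one line per call, indented by call depth; stop recursing at n == 0."""
    if log is None:
        log = []
    log.append("    " * depth + f"call (n={n})")
    if n > 0:
        for _ in range(branching):
            # Each nested call gets a smaller n and sits one level deeper.
            trace_calls(n - 1, branching, depth + 1, log)
    return log


print("\n".join(trace_calls(2)))
```

Calls with n=0 still record themselves but make no further calls, which is what lets the sequence of recursions unwind.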

Amber 2010-05-01 01:03:03

Another question... I wrote a function that collects emails from a page, but I'm not sure how to lump .com and .edu and .org all together.

    from re import findall

    def emails(url):
        links = str(links3(url))
        pattern = '[A-Za-z0-9_.]+\@[A-Za-z0-9_.]+.com\.edu\.org'  # how do I do the above properly?
        lst = findall(pattern, links)
        print(lst)

How do I tell Python that? I can't find it in the documentation.

ptabatt 2010-05-01 01:28:55
