views:

64

answers:

3

Working with a couple of lists, iterating over each. Here's a code segment:

self.links = []
self.iter=iter(self.links)
for tgt in self.links:
    for link in self.mal_list:
        print(link)
        if tgt == link:
           print("Found Suspicious Link: {0}".format(tgt))
           self.count += 1

        else:
           self.count += 1
           self.crawl(self.iter.next())

Its advancing to the next item in the link list, just fine. For the malware signature list I tried using a similar iter item, but I'm not entirely sure if thats even the best way, and if so were to place it in my code so that each link that is urlopened from the list is compared to every item in the malware list BEFORE the loop opens up the next item in the link list. Any suggestions?

+1  A: 

The essential way that you are doing it is fine, but it will be slow.

Try this instead:

 for tgt in links:
      if tgt in mal_links:
          # you know that it's a bad link
      else:
          crawl(tgt)

I don't see why you are keeping two iterators going over the list. This will introduce a bug because you don't call next on self.iter in the case that you detect a malware link. The next time tgt isn't a bad link, when you call next, it will advance to the previously detected bad link and you will crawl that. Is there some reason that you feel the need to step over two copies of the iterator instead of just one?

Also, your initial code will crawl page once for every time it is not determined to be equal to a given malware link. This might lead to some angry web masters depending on how big your list is.

aaronasterling
@aaronasterling: +1 Good to catch that bug !!
pyfunc
Well I made a self.iter because when call crawl(tgt) it just kept crawling the first item in the links list over and over. I understand what you have here about removing the second for loop, after thinking about what it was doing, this make way more sense, however I'm now left with the original problem of it not advancing items. is an iter the best way to go about this, and ensuring I call .next() in both blocks of the if-else statements?
Stev0
@Stev0: That would be a bug in your crawl(); you'd probably be better off to fix it there instead.
Lie Ryan
@Lie Ryan: How so? tgt is the first item in the list, by calling crawl(tgt), I just keep passing it the first item it iterated over. Should I move .next() up the beginning of the crawl function?
Stev0
@Stev0: self.crawl(link) should just collect the links contained in a single page, and put it somewhere for future checking; self.crawl() shouldn't call your malware checking function.
Lie Ryan
@Lie Ryan: I'm getting confused now. Why does that matter? And why would that be preventing the function from continuing on to the next link in the list? Essentially I just want this thing to be a web spider that checks each link against a list of known bad links before spidering to the next link. Am I making sense? Seemed like a good idea at the time :)
Stev0
@Stev0: sorry if that confuses you, basically, what I'm suggesting is to rethink through the design of the application and rethink through what each functions does. The reason you gave about why you need `self.iter` at all seems to indicate that your design is rather fishy.
Lie Ryan
+2  A: 

Not sure, what you are trying to ask but you could simplify your code. Though this is not necessary.

self.links = []
self.non_malware_link = [link for link in self.links if link not in self.mal_list]
results = map(self.crawl, self.non_malware_link)

On some issues with your code:

  1. self.count is exactly the same as len(self.links)

Apart from meaning of self.count, every thing else looks like it does what it needs to do.

pyfunc
@pyfunc: I'm assuming that he will be using self.count to inform user of the current progress of the scanning?
Lie Ryan
@Lie Ryan: Thanks, Thats nice!
pyfunc
@Stev0: @Lie Ryan: Who ever voted for me. My answer remains here for reference but I find Lie Ryan's answer as elegant as using sets removes unnecessary duplicate comparison. I am up voting his answer. Please have a look. The suggestion by aaronasterling makes a lot of sense too.
pyfunc
+1  A: 

Searching an item inside a list is slow, if this is what you're trying to do, then use a dict or a set instead of list for the self.mal_list:

mal_list = set(self.mal_list)
for tgt in self.links:
    if tgt in mal_list: 
        print("Found Suspicious Link: {0}".format(tgt))
        self.count += 1
    else:
        self.count += 1
        self.crawl(self.iter.next())

or, if you can have self.links as set as well:

mal_list = set(self.mal_list)
links = set(self.links)
detected = links.intersection(mal_list)
for malware in detected:
    print("Found Suspicious Link: {0}".format(tgt))
    self.count += 1
Lie Ryan
Good thought on using sets but, in stepping over two versions of `links`, you've carried over a bug from the OP's original code.
aaronasterling
Thanks for the info on the sets, anthing I can do to speed it up is great.
Stev0