views: 169 · answers: 5

I'm obviously missing something here. It's the same project I've been working on for a number of days. Stepping through it bit by bit, it seemed to be working fine. Then I added the portion of the main() function that actually creates the comparison lists, and suddenly it starts throwing a "cannot pop from empty list" error at me, even though a print call I've placed just ahead of the pop() clearly shows that the list is not empty. Any ideas what I'm doing wrong? And is this monstrosity actually going to work the way I intend? It's my first time working with threads. Here is the code in its entirety:

import urllib
import urllib2
import sys
from lxml.html import parse, tostring, fromstring
from urlparse import urlparse
import threading



class Crawler(threading.Thread):

 def __init__(self):
    self.links = []
    self.queue = []
    self.mal_list = []
    self.count = 0
    self.mal_set = set(self.mal_list)
    self.crawled = []
    self.crawled_set = set(self.crawled)
    self.links_set = set(self.links)
    self.queue.append(sys.argv[1])
    self.queue_set = set(self.queue)



def run(self, max_depth):
    print(self.queue)
    while self.count < max_depth:
        tgt = self.queue.pop(0)
        if tgt not in self.mal_set:
            self.crawl(tgt)
        else:
            print("Malicious Link Found: {0}".format(tgt)
            continue
    sys.exit("Finished!")


def crawl(self, tgt):
    url = urlparse(tgt)
    self.crawled.append(tgt)
    try:
        print("Crawling {0}".format(tgt))
        request = urllib2.Request(tgt)
        request.add_header("User-Agent", "Mozilla/5,0")
        opener = urllib2.build_opener()
        data = opener.open(request)
        self.count += 1

    except:
        return


    doc = parse(data).getroot()
    for tag in doc.xpath("//a[@href]"):
            old = tag.get('href')
            fixed = urllib.unquote(old)
            self.links.append(fixed)
            self.queue_links(self.links_set, url)


def queue_links(self, links, url):
        for link in links:
            if link.startswith('/'):
                link = "http://" + url.netloc + "/" + link

            elif link.startswith('#'):
                continue

            elif link.startswith('http'):

                link = 'http://' + url.netloc + '/' + link

            if link.decode('utf-8') not in self.crawled_set:
                self.queue.append(link)




def make_mal_list(self):
    """
    Open various malware and phishing related blacklists and create a list 
    of URLS from which to compare to the crawled links
    """
    hosts1 = "hosts.txt"
    hosts2 = "MH-sitelist.txt"
    hosts3 = "urls.txt"

    with open(hosts1) as first:
        for line1 in first.readlines():
            link = "http://" + line1.strip()
            self.mal_list.append(link)

    with open(hosts2) as second:
        for line2 in second.readlines():
            link = "http://" + line2.strip()
            self.mal_list.append(link)

    with open(hosts3) as third:
        for line3 in third.readlines():
            link = "http://" + line3.strip()
            self.mal_list.append(link)
def main():
    crawler = Crawler()
    crawler.make_mal_list()
    crawler.run(25)
if __name__ == "__main__":
  main()
+1  A: 

My primary language is C#, but the issue you are experiencing is because of threading. In thread #1 you check that the list is not empty, while thread #2 clears that list, and thus you get the exception.
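For illustration, a minimal sketch of that check-then-pop race (hypothetical names and URL, not the poster's code): the emptiness test and the pop() are two separate steps, so another thread can empty the list in between.

import threading

shared = ["http://example.com"]  # one item, two threads racing for it

def racy_pop(name):
    # The check and the pop are not atomic, so this can still raise IndexError
    # if the other thread pops the last item in between.
    if shared:
        try:
            print("%s got %r" % (name, shared.pop()))
        except IndexError:
            print("%s: pop from empty list" % name)

threads = [threading.Thread(target=racy_pop, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()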

Tomas Voracek
+1  A: 

list is not thread-safe. If you need a thread-safe data structure, use Queue.Queue (Python 2.x) or queue.Queue (Python 3.x).
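For example, a minimal sketch (Python 2 names, placeholder URL): Queue.Queue does its own locking, and an empty queue raises Queue.Empty (or blocks) instead of the IndexError you get from popping an empty list.

import Queue

q = Queue.Queue()
q.put("http://example.com")

print(q.get())            # "http://example.com"
try:
    q.get(block=False)    # non-blocking get on a now-empty queue
except Queue.Empty:
    print("queue is empty")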

Lie Ryan
That's all true, but the code he posted creates only a single thread ...
THC4k
+4  A: 

First of all, I got a bit lost reading your code, so let me offer a few remarks before getting to the problem:

  • Too many instance variables: you don't need to create a new instance variable just to hold a set() of another one, as in self.mal_set = set(self.mal_list), and you repeat that pattern several times.

  • If you want to use threading, then really use it: your code creates only one thread. Instead, create several threads (10 or so), each handling a batch of URLs to fetch, and don't forget to use a Queue.Queue to synchronize the work between them.

  • EDIT: Ah, I forgot: indent your code :)

Now, about your problem:

Where do you assign to self.queue? I don't see it. You are just calling the make_mal_list() method, which only initializes self.mal_list, so when you then run your own thread I think it's obvious that self.queue is empty and you can't pop() from it, right?

EDIT 2:

I know your real example is more complicated (the blacklists and all that stuff), but you can start with something like this:

import threading
import Queue
import sys
import urllib2
import urllib
from urlparse import urlparse

THREAD_NUMBER = 10


class Crawler(threading.Thread):

    def __init__(self, queue, mal_urls):
        self.queue = queue
        self.mal_list = mal_urls
        threading.Thread.__init__(self) # i forgot , thanks seriyPS :)

    def run(self):

        while True:
            # Grab a URL to fetch from the queue.
            url = self.queue.get()
            if url not in self.mal_list:
                self.crawl(url)
            else:
                print("Malicious Link Found: {0}".format(url))
                continue  # you can remove this too, I think
            # Signal to the queue that this job is done.
            self.queue.task_done()

    def crawl(self, tgt):
        try:
            url = urlparse(tgt)
            print("Crawling {0}".format(tgt))
            request = urllib2.Request(tgt)
            request.add_header("User-Agent", "Mozilla/5.0")
            opener = urllib2.build_opener()
            data = opener.open(request)
        except:  # TODO: catch explicit exceptions instead (URLError, ValueError, ...)
            return

        doc = parse(data).getroot()
        for tag in doc.xpath("//a[@href]"):
            old = tag.get('href')
            fixed = urllib.unquote(old)

            # I don't think you need this, but maybe I'm mistaken.
            # self.links.append(fixed)

            # Add more URLs to the queue.
            self.queue_links(fixed, url)


    def queue_links(self, link, url):
        """I guess this method allows recursive crawling of the URLs
        extracted from the fetched pages?
        """

        # for link in links:  # i changed the argument, so now it's just one url.
        if link.startswith('/'):
            link = "http://" + url.netloc + "/" + link

        elif link.startswith('#'):
            return  # `continue` is invalid here now that there is no loop

        elif link.startswith('http'):
            link = 'http://' + url.netloc + '/' + link

        # Add URLs extracted from the HTML text to the queue so they get fetched.
        if link.decode('utf-8') not in self.crawled_set:
            self.queue.put(link)


def get_make_mal_list():
    """Open various malware and phishing related blacklists and create a list 
    of URLS from which to compare to the crawled links
    """

    hosts1 = "hosts.txt"
    hosts2 = "MH-sitelist.txt"
    hosts3 = "urls.txt"

    mal_list = []

    with open(hosts1) as first:
        for line1 in first.readlines():
            link = "http://" + line1.strip()
            mal_list.append(link)

    with open(hosts2) as second:
        for line2 in second.readlines():
            link = "http://" + line2.strip()
            mal_list.append(link)

    with open(hosts3) as third:
        for line3 in third.readlines():
            link = "http://" + line3.strip()
            mal_list.append(link)

    return mal_list

def main():

    queue = Queue.Queue()

    # Get malicious URLs.
    mal_urls = set(get_make_mal_list())

    # Create THREAD_NUMBER threads and start them.
    for i in xrange(THREAD_NUMBER):
        cr = Crawler(queue, mal_urls)
        cr.start()

    # Put all the URLs you want to fetch into the queue.
    for url in sys.argv[1:]:
        queue.put(url)

    # Wait on the queue until everything has been processed.
    queue.join()


if __name__ == '__main__':
    main()
singularity
I'm calling self.queue.append(sys.argv[1]) in the __init__ method. Also, what would you suggest in terms of the instance variable issue? And can I have multiple threads all referencing the same list of things, or should I just get rid of the threading part? I was thinking that since this is supposed to be spidering, having multiple threads would be a feasible way to speed it up.
Stev0
@Stev0: Ah, OK, I didn't see it there. I'd suggest not doing it like that; if you want to pass an argument when the instance is created, make it explicit, like Crawler(list_url). But what is in sys.argv[1]? Are you passing one URL to your script, or several?
singularity
@singularity: Thanks for cleaning the code up a bit; that looks much more feasible and gives me a good understanding of the Queue class. The value I was passing to run() was arbitrary: I want to be able to specify the maximum number of links to follow from the command line, and looking at this now I'm not sure whether the max number of threads is the same thing as the maximum 'crawl depth'. What do you think?
Stev0
@Stev0: If I understood your problem correctly, you can create a queue for the links to follow and give it a maximum length, like __Queue.Queue(maxsize=40)__. The queue also has a bunch of useful methods you can use, see the docs: __http://docs.python.org/library/queue.html__, such as Queue.empty() to check whether the queue is empty and Queue.full() to check whether it is full... Would that work for you?
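For example, a minimal sketch of those methods (Python 2's Queue module; the URLs are placeholders):

import Queue

q = Queue.Queue(maxsize=2)       # put() will block once 2 items are queued
print(q.empty())                 # True
q.put("http://example.com/a")
q.put("http://example.com/b")
print(q.full())                  # True; another put() would block until a get()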
singularity
@singularity: Yes, this Queue class is incredibly helpful. I'm implementing the maxsize thing as maxsize=sys.argv[2], so I can set the max crawl depth from the command line. However, how do I tell it to stop looping once it has reached maxsize? Also, in the code you commented that the continue statement in the run function isn't needed; what I'm trying to do there is skip the malicious link and move on to the next one in the queue. Is that not the proper way to do that?
Stev0
@Stev0: The continue statement is just a small detail; I don't think it's necessary, because at the end of the loop body you call __self.queue.task_done()__, which signals to the Queue that the item is finished so the thread can move on to the next one, but you can keep it if you want. As for stopping the loop, you can just do __while not queue.empty(): url = queue.get() ...__.
singularity
Thanks for the .empty(). However, when I put that in the run function, changing while True: to while not self.queue.empty():, the program just sits there and never begins crawling, and I have to use Ctrl+Z to get out of it. Did I put it in the wrong place?
Stev0
@Stev0: You don't have to change the __while True__, because the queue knows when to stop. Let me explain in more detail: first we call __queue.get()__, which is like pop, so the URL is popped from the queue, and at the end of each loop iteration we call __queue.task_done()__, which signals to the queue that the item has been handled; once every item has been handled, the queue lets the program finish. Quoting the docs: __If a join() is currently blocking, it will resume when all items have been processed (meaning that a task_done() call was received for every item that had been put() into the queue)__
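A minimal sketch of that get()/task_done()/join() interplay, with the crawling stripped out (the worker function and URLs are placeholders; the setDaemon call is an extra assumption so the idle workers don't keep the process alive after join() returns):

import threading
import Queue

q = Queue.Queue()

def worker():
    while True:
        item = q.get()       # blocks until an item is available
        print("processing %s" % item)
        q.task_done()        # tell the queue this item has been handled

for _ in xrange(3):
    t = threading.Thread(target=worker)
    t.setDaemon(True)        # daemon threads don't keep the process alive after join() returns
    t.start()

for url in ["http://a.example", "http://b.example", "http://c.example"]:
    q.put(url)

q.join()                     # blocks until every put() item has been task_done()
print("all items processed")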
singularity
So where should I put the queue.empty() then? And should it still be a while statement, or an if statement, as in if not self.queue.empty(): # do some stuff?
Stev0
Just tried it both as a while statement and as an if; it's still causing the program to hang and never actually begin crawling. EDIT: I just took out the threading portion and it called my exit-loop function, so it is something to do with threading. Additionally, without threading.Thread in the class, it never placed the sys.argv URL in the queue.
Stev0
EDIT 2: The hanging has something to do with queue.join(); as soon as I commented it out, it started crawling. But the original problem remains: it is not stopping at the specified number of crawls. Perhaps I should put the self.count stuff back in and just check the count before each call to self.crawl?
Stev0
@Stev0: I think I misunderstood your question about how to stop when you reach max_size; for stopping you should stick with __while True__ like in my answer, and the queue will stop on its own. Hmm, it's not crawling with __.join()__? That's weird, because .join() is there to block the main thread and wait until all the work is done (i.e. the queue is empty). Can you print the queue each time you crawl?
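Note that printing a Queue.Queue only shows its repr, not its contents. A minimal sketch of ways to peek at it for debugging (placeholder URLs; q.queue is the internal deque, so relying on it is an implementation detail):

import Queue

q = Queue.Queue()
q.put("http://example.com/a")
q.put("http://example.com/b")

print(q)              # <Queue.Queue instance at 0x...>, not very informative
print(q.qsize())      # 2 (approximate number of items in the queue)
print(list(q.queue))  # peek at the underlying deque (debugging only)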
singularity
Well, it was crawling fine until I started trying to tell it when to stop using max_size. Basically what I want to do is call it from the command line like this: ./crawler.py http://www.test.com 50, and then it places www.test.com at the front of the queue and away it goes. But on each iteration of the loop it checks whether it has reached 50 yet, and once it has, it calls sys.exit(). That way it doesn't just crawl the entire internet forever.
Stev0
In light of that, is threading still a good way to go about this? My original intent in including threading was to make the process faster, as I would like to eventually tell it to crawl something like 5000 links, and I figured just doing it in one thread would be slow. Regardless, I am learning a lot about threading from this dialogue of ours, and I truly appreciate it. :) Any insight on what's the best way to incorporate the max crawl depth?
Stev0
@Stev0: __But for each iteration of the loop it checks to make sure it hasn't reached 50 yet, and once it has, calls sys.exit()__: no need to check; like I said, it should stop on its own once the queue is empty. About using threads for crawling: threads do make things faster, and using a Queue together with threads is called a thread pool, a design pattern that is very useful when working with threads. Here are some links that may be helpful: http://www.ibm.com/developerworks/aix/library/au-threadingpython/ and http://julien.herbin.ecranbleu.org/threadpool/ . If you need more help, let me know :)
singularity
Thanks for the continued help. What you've just said makes sense, but now it's still hanging. It gets to the "Crawling test.com..." part and then hangs. I placed a print(self.queue) in the method before that part and it gives me a Queue instance at <some random memory address>. Also, do I understand correctly that once I specify the maxsize of the queue, it will stop adding to the thread pool when it has reached that maxsize? And if that's the case, should I put the sys.exit("Finished Crawling") call outside the while True: loop?
Stev0
__should I put the sys.exit("Finished Crawling") call outside the while True: loop?__ No, you don't have to. __it will stop adding to the threadpool once it has reached that maxsize?__ Yes, by itself. You should try to debug; maybe something else is wrong. I just tried __queue.get()__ in my console while the queue was empty and it hung. Queues are a very powerful tool, but they are hard to debug, so try playing with the queue methods to debug, or use a real debugger like pdb or the PyDev debugger...
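For example, a minimal sketch of dropping into pdb just before a get(), so the queue can be inspected at the (Pdb) prompt (placeholder URL; purely illustrative):

import pdb
import Queue

q = Queue.Queue()
q.put("http://example.com")

pdb.set_trace()   # execution pauses here; try q.qsize() at the (Pdb) prompt, then 'c' to continue
url = q.get()
print(url)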
singularity
Found the problem: I had a residual self.count += 1 in there. Removed it and now it's working. One last problem and it'll be working perfectly :) The check I added to see whether a link has already been crawled isn't working: if it encounters a link it has already crawled, it crawls it again anyway, even though I've told the code to check it against the set of already-crawled links. Any insight?
Stev0
@Stev0: I think the easiest way is to create an instance variable holding all the URLs that have already been fetched, and when you get a new URL, check whether it already exists in that collection. Hope this helps :)
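For instance, a minimal sketch of that idea (the names are illustrative, not from the original code): one shared set of already-fetched URLs, guarded by a lock since several threads may check and add at the same time.

import threading

crawled = set()
crawled_lock = threading.Lock()

def should_crawl(url):
    """Return True (and remember the URL) only the first time it is seen."""
    with crawled_lock:
        if url in crawled:
            return False
        crawled.add(url)
        return True

print(should_crawl("http://example.com"))  # True, first time
print(should_crawl("http://example.com"))  # False, already fetched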
singularity
+1  A: 

A small off-topic note:

class Crawler(threading.Thread):
    def __init__(self):
        # your code
        threading.Thread.__init__(self)#!!!

Don't forget to call Thread.__init__(self) directly if you override the __init__ function. And, of course, you should use the Queue class (http://docs.python.org/library/queue.html) to implement your job queue in a thread-safe way.

seriyPS
A: 

Also, look at this fragment:

print(self.queue)
while self.count < max_depth:
    tgt = self.queue.pop(0)

You do print(self.queue) only once, before the first while iteration, so self.queue.pop() can run for many iterations (and fetch many links) and will raise "cannot pop from empty list" only when the queue has really become empty!

try this:

while self.count < max_depth:
    print(self.queue)
    tgt = self.queue.pop(0)

so you can see the state of the queue at the moment the exception is raised.

seriyPS