I'm nearing what I'd like to think is completion of a tool I've been working on. The code essentially does this:

Open several files and URLs that list known malware- and phishing-related websites/domains, and build a list from each. Parse the HTML of the URL passed to the method, pull out all the a href links, and place them in a separate list.

For every link in that new list, compile a regex from every item in the malware and phishing lists, then compare the link against those regexes to determine whether any of the links parsed from the page are malicious.

The problem I've run into is in iterating over the items of all 3 lists. Obviously I'm doing it wrong, since it throws this error at me:

  File "./test.py", line 95, in <module>
    main()
  File "./test.py", line 92, in main
    crawler.crawl(url)
  File "./test.py", line 41, in crawl
    self.reg1 = re.compile(link1)
  File "/usr/lib/python2.6/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat

The following is the segment of code I'm having problems with. The creation of the malware-related lists is omitted, as that part is working fine for me:

    def crawl(self, url):
        try:
            doc = parse("http://" + url).getroot()
            doc.make_links_absolute("http://" + url, resolve_base_href=True)
            for tag in doc.xpath("//a[@href]"):
                old = tag.get('href')
                fixed = urllib.unquote(old)
                self.links.append(fixed)

        except urllib.error.URLERROR as err:
            print(err)

        for tgt in self.links:
            for link in self.mal_list:
                self.reg = re.compile(link)
            for link1 in self.phish_list:
                self.reg1 = re.compile(link1)

            found = self.reg.search(tgt)
            if found:
                print(found.group())
            else:
                print("No matches found...")

Can anyone spot what I've done wrong with the for loops and list iteration that would cause that regex error? How might I fix it? And, probably most importantly, is the way I'm going about this 'pythonic', or even efficient? Considering what I'm trying to do here, is there a better way of doing it?

+3  A: 

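It seems like your problem is that some of the URLs contain special regex characters, such as ? and +; for instance, the string ++ is really quite likely, and the regex compiler treats it as repetition operators rather than literal text. The other problem is that you keep overwriting the regex you're using to test, so after the inner loops only the last entry of each list is left in self.reg and self.reg1. If you just need to check if one string is contained in another, there's no need for a regex; just use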

for tgt in self.links:
    for link in (self.mal_list + self.phish_list):
        if link in tgt: print link

And if you're just comparing for equality, you can use == instead of in.
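
If a regex is still wanted for some reason, the two problems above could be handled roughly as in the following sketch: each list entry goes through re.escape so that ?, + and similar characters are matched literally, and all entries are joined into one alternation that is compiled once instead of being overwritten in a loop. The function names build_pattern and find_bad_links and the sample data are made up for illustration; they are not from the original code.

```python
import re


def build_pattern(bad_entries):
    """Compile one regex that matches any of the entries literally."""
    # re.escape neutralises regex metacharacters such as ? and +.
    escaped = [re.escape(entry) for entry in bad_entries if entry]
    if not escaped:
        return None
    return re.compile("|".join(escaped))


def find_bad_links(links, bad_entries):
    """Return (link, matched_entry) pairs for links containing a bad entry."""
    pattern = build_pattern(bad_entries)
    if pattern is None:
        return []
    hits = []
    for link in links:
        found = pattern.search(link)
        if found:
            hits.append((link, found.group()))
    return hits


# Hypothetical sample data, not taken from any real blocklist.
mal_list = ["evil.example/c++/payload", "bad.example/?id="]
phish_list = ["phish.example/login"]
links = [
    "http://good.example/index.html",
    "http://evil.example/c++/payload.exe",
    "http://phish.example/login?user=x",
]

for link, entry in find_bad_links(links, mal_list + phish_list):
    print("%s matches %s" % (link, entry))
```

For plain substring checks this does the same job as the in test above; the single compiled pattern mainly saves recompiling for every link.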

Antal S-Z
That actually makes more sense. If, however, I wanted to look for portions of the link that match entries in the malware list, would something like urlparse work? I'm just trying to picture links which are similar and lead to known malware domains, but aren't verbatim from the list, and how I would deal with those.
Stev0
I'm not sure—I don't use Python much—but the broad answer is "it depends what you mean by similar" :-) A regex could be a sensible tool, though generalizing from a single URL is not obviously straightforward. If you want to say "everything from this domain is bad", then it looks like urlparse would be a sensible approach.
Antal S-Z
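
For the "everything from this domain is bad" case, a minimal sketch with urlparse might look like the following. It assumes the blocklist holds bare domain names and that the links are absolute (as they would be after make_links_absolute); host_of, is_bad_host and the sample domains are hypothetical names chosen for illustration. The import fallback is only there so the sketch runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def host_of(url):
    """Return the lowercased host name of a URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()


def is_bad_host(url, bad_domains):
    """True if the URL's host is a listed domain or a subdomain of one."""
    host = host_of(url)
    for domain in bad_domains:
        # The leading dot stops "notevil.example" matching "evil.example".
        if host == domain or host.endswith("." + domain):
            return True
    return False


# Hypothetical blocklist of bare domain names.
bad_domains = set(["evil.example", "phish.example"])

for url in ["http://www.evil.example/a?b=1",
            "http://notevil.example/",
            "http://good.example/evil.example"]:
    print("%s -> %s" % (url, is_bad_host(url, bad_domains)))
```

Comparing only the host means a listed domain that merely appears in the path or query string of an unrelated site is not flagged, which a plain substring test would do.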