I'm nearing what I'd like to think is completion on a tool I've been working on. What I've got is code that does essentially this:

1. Open several files and URLs consisting of known malware/phishing websites/domains and build a list for each.
2. Parse the HTML of a URL passed to the method, pull out all the a href links, and place them in a separate list.
3. For every link placed in the new list, create a regex from every item in the malware and phishing lists, then compare against them to determine whether any of the links parsed from the passed URL are malicious.
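The list-building part isn't shown in the code further down since it works fine for me, but it's roughly along these lines (the file name and feed URL here are just placeholders, not the real feeds):

import urllib

def load_feed(source):
    # `source` is either a local file path or a URL to a blacklist feed;
    # both values used below are made up for illustration.
    if source.startswith("http://"):
        data = urllib.urlopen(source).read()
    else:
        data = open(source).read()
    # One domain/URL per line; skip blank lines.
    return [line.strip() for line in data.splitlines() if line.strip()]

# self.mal_list   = load_feed("malware_domains.txt")
# self.phish_list = load_feed("http://feeds.example.com/phishing.txt")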
The problem I've run into is in iterating over the items of all three lists. Obviously I'm doing it wrong, since it's throwing this error at me:
File "./test.py", line 95, in <module>
main()
File "./test.py", line 92, in main
crawler.crawl(url)
File "./test.py", line 41, in crawl
self.reg1 = re.compile(link1)
File "/usr/lib/python2.6/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.6/re.py", line 245, in _compile
raise error, v # invalid expression
sre_constants.error: multiple repeat
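For what it's worth, I can trigger the same exception on its own whenever the string being compiled contains regex repeat characters; the URL in this snippet is made up:

import re

# A blacklist entry containing "++" gets treated as a regex, and the
# second "+" has nothing of its own to repeat:
re.compile("http://example.com/c++/index.html")
# -> sre_constants.error: multiple repeat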
The following is the segment of code I'm having problems with; the malware-related list creation is omitted since that part is working fine for me:
def crawl(self, url):
    try:
        doc = parse("http://" + url).getroot()
        doc.make_links_absolute("http://" + url, resolve_base_href=True)
        for tag in doc.xpath("//a[@href]"):
            old = tag.get('href')
            fixed = urllib.unquote(old)
            self.links.append(fixed)
    except urllib.error.URLERROR as err:
        print(err)

    for tgt in self.links:
        for link in self.mal_list:
            self.reg = re.compile(link)
        for link1 in self.phish_list:
            self.reg1 = re.compile(link1)
        found = self.reg.search(tgt)
        if found:
            print(found.group())
        else:
            print("No matches found...")
Can anyone spot what I've done wrong with the for loops and list iteration that would be causing that regex error, and how I might fix it? And, probably most importantly, is the way I'm going about this 'pythonic' or even efficient? Considering what I'm trying to do here, is there a better way of doing it?