I'm nearing what I'd like to think is completion on a tool I've been working on. What I've got is code that does essentially this:

1. Open several files and URLs consisting of known malware/phishing websites/domains and build a list for each.
2. Parse the HTML of a URL passed to the method, pull out all the a href links, and place them in a separate list.
3. For every link placed in the new list, create a regex from every item in the malware and phishing lists, then compare against them to determine whether any of the links parsed from the passed URL are malicious.
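The list-building part isn't shown in the code further down since it works fine for me, but it's roughly along these lines (the file name and feed URL here are just placeholders, not the real feeds):

import urllib

def load_feed(source):
    # `source` is either a local file path or a URL to a blacklist feed;
    # both values used below are made up for illustration.
    if source.startswith("http://"):
        data = urllib.urlopen(source).read()
    else:
        data = open(source).read()
    # One domain/URL per line; skip blank lines.
    return [line.strip() for line in data.splitlines() if line.strip()]

# self.mal_list   = load_feed("malware_domains.txt")
# self.phish_list = load_feed("http://feeds.example.com/phishing.txt")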
The problem I've run into is in iterating over the items of all three lists. Obviously I'm doing it wrong, since it's throwing this error at me:
File "./test.py", line 95, in <module>
main()
File "./test.py", line 92, in main
crawler.crawl(url)
File "./test.py", line 41, in crawl
self.reg1 = re.compile(link1)
File "/usr/lib/python2.6/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.6/re.py", line 245, in _compile
raise error, v # invalid expression
sre_constants.error: multiple repeat
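For what it's worth, I can trigger the same exception on its own whenever the string being compiled contains regex repeat characters; the URL in this snippet is made up:

import re

# A blacklist entry containing "++" gets treated as a regex, and the
# second "+" has nothing of its own to repeat:
re.compile("http://example.com/c++/index.html")
# -> sre_constants.error: multiple repeat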
The following is the segment of code I'm having problems with; the malware-related list creation is omitted since that part is working fine for me:
def crawl(self, url):
    try:
        doc = parse("http://" + url).getroot()
        doc.make_links_absolute("http://" + url, resolve_base_href=True)
        for tag in doc.xpath("//a[@href]"):
            old = tag.get('href')
            fixed = urllib.unquote(old)
            self.links.append(fixed)
    except urllib.error.URLERROR as err:
        print(err)

    for tgt in self.links:
        for link in self.mal_list:
            self.reg = re.compile(link)
        for link1 in self.phish_list:
            self.reg1 = re.compile(link1)
        found = self.reg.search(tgt)
        if found:
            print(found.group())
        else:
            print("No matches found...")
Can anyone spot what I've done wrong with the for loops and list iteration that would be causing that regex error, and how I might fix it? And, probably most importantly, is the way I'm going about this 'pythonic' or even efficient? Considering what I'm trying to do here, is there a better way of doing it?