views:

195

answers:

2

I have a model like this:

class CampaignPermittedURL(models.Model):
    hostname = models.CharField(max_length=255)
    path = models.CharField(max_length=255,blank=True)

On a frequent basis, I will be handed a URL, which I can urlsplit into a hostname and a path. What I'd like is for the end user to be able to enter a hostname (yahoo.com) and possibly a path (weddings).

I'd like to find when a URL does not 'match' that hostname/path combination like so:

  • success: www.yahoo.com/weddings/newyork
  • success: yahoo.com/weddings
  • failure: cnn.com
  • failure: cnn.com/weddings

I think the best way to do this is:

url = urlsplit("http://www.yahoo.com/weddings/newyork")
### split hostname on . and path on /
matches = CampaignPermittedURL.objects.filter(hostname__regex=r'(com|yahoo.com|www.yahoo.com)'), \
    path__regex=r'(weddings|weddings/newyork)')

Does anybody have better ideas? I am using PostgreSQL and would otherwise want to try Django Full Text Search but I'm not sure if that's worth it or if it really fits my needs any better than this. Are there other methods that are equally fast?

Keep in mind that my method has the URL passed to it and that the CampaignPermittedURL object may have many hundred records. I am looking for extensible/maintainable solutions foremost, but it does also need to be efficient since this will be scaled to several hundred calls a second.

I'm also fine with using another back-end (Sphinx?) but I am most concerned about staying with standard Django to the highest degree possible.

+1  A: 

Regex: ^(http\:\/\/)?(www\.)?yahoo\.com(\/.+)?$

http://www.yahoo.com/weddings/newyork pass
www.yahoo.com/weddings/foo            pass
www.yahoo.com/weddings                pass
www.yahoo.com                         pass

yahoo.com/weddings/foo                pass
yahoo.com/weddings                    pass
yahoo.com                             pass

cnn.com/weddings/foo                  fail
cnn.com/weddings                      fail
cnn.com                               fail
macek
That basically simplifies my answer a bit to this - best answer so far though:(hostname__regex=r'(www\.)?(yahoo\.)?(com)?$'), \ path__regex=r'^(weddings\/)?(newyork)?')
Adam Nelson
A: 

I ended up constructing a 'verbose' regex and using the ORM as specified in the question. This should be quite fast while not departing from Django:

        # >>> url.hostname.split(".")
    # ["bakery", "yahoo", "com"]
    host_list = url.hostname.split(".")

    # Build regex like r"^$|^[.]?com$|^[.]?yahoo\.com$|^[.]?baking[.]yahoo[.]com$"
    # Remember that
    # >>> r'\'
    # '\\'
    host_list.reverse()

    # append_str2 might not be necessary
    append_str = r""
    append_str2 = r""
    host_regex = r"^$"
    for host in host_list:
        append_str = r"[.]" + host + append_str
        append_str2 = append_str[3:]
        host_regex = host_regex + r"|^[.]?" + append_str2 + r"$"
    # If nothing is in the filter at all, bypass the filter.
    if CampaignRequiredURL.objects.filter():
        if not CampaignRequiredURL.objects.filter(hostname__iregex=host_regex):
            #Do something based on a hit.
Adam Nelson