Say I have the following set of urls in a db

url                     data
^(.*)google.com/search   foobar
^(.*)google.com/alerts   barfoo
^(.*)blah.com/foo/(.*)   foofoo
... 100's more

Given any url in the wild, I would like to check to see if that url belongs to an existing set of urls and get the corresponding data field.

My questions are:

  1. How would I design the db to do this?
  2. Django does URL resolution by looping through each regex and checking for a match. Given that there may be 1000s of URLs, is this the best way to approach this?
  3. Are there any existing implementations I can look at?
A: 

Django has the advantage that its URLs are generally hierarchical. While the entire Django project may well have hundreds of URLs, it's probably dealing with only a dozen or fewer patterns at a time. Do you have any structure in your URLs that you could exploit this way?

Other than that, you could try creating some kind of heuristic. E.g. find the "fixed" parts of your patterns, eliminate most candidates by a simple substring search, and only then switch to regex matching.
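A rough sketch of that substring pre-filter in Python (the pattern table and helper names here are illustrative, not taken from the question's actual db):

```python
import re

# Hypothetical pattern table: regex -> data (values from the question's example).
PATTERNS = {
    r"^(.*)google\.com/search": "foobar",
    r"^(.*)google\.com/alerts": "barfoo",
    r"^(.*)blah\.com/foo/(.*)": "foofoo",
}

def literal_hint(pattern):
    """Pull the longest run of plain literal characters out of a regex,
    e.g. 'google.com/search' out of '^(.*)google\\.com/search'."""
    runs = re.findall(r"[\w./-]+", pattern.replace(r"\.", "."))
    # Drop fragments that are regex metacharacters rather than literals.
    runs = [r for r in runs if "*" not in r and r not in (".", "/")]
    return max(runs, key=len) if runs else ""

def find_data(url, patterns=PATTERNS):
    # Cheap substring test first; run the regex only for surviving candidates.
    for pattern, data in patterns.items():
        if literal_hint(pattern) in url and re.search(pattern, url):
            return data
    return None
```

The win is that `in` is far cheaper than a regex search, so most of the table is rejected without ever touching the regex engine.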

At the extreme end of the spectrum, you could create a product automaton. That would be super fast but the memory requirements would probably be impractical (and likely to remain so for the next few centuries).

oggy
A: 

Before determining that the Django approach could not possibly work, try implementing it and applying a typical workload. For a really thorough approach, you could actually time the cost of each regex, and that can guide you in improving the most costly and most frequently used regexes. In particular, you could move the most frequently used, inexpensive regexes to the front of the list. This is probably a better choice than inventing a new technology to fix a problem you don't even know you have yet.
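A sketch of what that timing-and-reordering could look like (the patterns, hit counts, and sample URL are all made up for illustration):

```python
import re
import timeit

# Illustrative (regex, data, hit count) triples; hit counts would come from logs.
PATTERNS = [
    (r"^(.*)google\.com/search", "google-search", 900),
    (r"^(.*)blah\.com/foo/(.*)", "foofoo", 10),
]
SAMPLE = "http://www.google.com/search?q=python"

def cost(pattern, sample=SAMPLE, n=1000):
    """Measure how long n searches with this regex take against a sample URL."""
    rx = re.compile(pattern)
    return timeit.timeit(lambda: rx.search(sample), number=n)

# Sort by measured cost divided by hit frequency, so cheap,
# frequently-matching patterns are tried first.
ordered = sorted(PATTERNS, key=lambda p: cost(p[0]) / p[2])

def resolve(url):
    for pattern, data, _hits in ordered:
        if re.search(pattern, url):
            return data
    return None
```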

TokenMacGuy
+1 I wish I could upvote it more than once!
wuub
-1 the above is a non-answer that sets up strawmen to debunk. Neither have I determined that the Django approach does not work, nor am I inventing something new for a problem that I apparently don't know exists. Given a large number of urls in a db, how would one design it such that one can identify a regex match given an arbitrary url? I am exploring different possible ways before settling on any one.
molicule
@molicule: well, that right there is how I would implement it. In particular, I would always start with the simplest thing that could possibly work; in this case, since I've already decided that regexes do some of what I want, I would just put them all in a list of some sort and check them until one matches. The only reason I would START with other considerations is if I thought that some other solution might be even simpler than a list of regexes.
TokenMacGuy
Well, are you suggesting that I read the db and load "all regexes" into a list and then scan for a match? Would this not be quite cumbersome, and given that App Engine has a 1000-record fetch limit, would this not seem a bit much to try and hunt down one match? Are there any implementations where one attempts to disassemble a url into a regex construct that one can then attempt to find a match for? This way one could take a million urls, disassemble them into regex patterns, save them in a db, and then, given an incoming url, disassemble that and try to find a matching db record. Or some such.
molicule
So, the thing about regular expressions is that you can't do much with them except match a string against them. You can't use them to define ranges. There is no way to simplify them algorithmically. You can't sort them by how specific they are. You can't deduce what sort of strings would match. In short, regular expressions are HORRIBLE database keys. Pick another key, or another storage model, or live with the consequences.
TokenMacGuy
A: 

You'll certainly need more care in your design of regular expressions. For example, the prefix ^(.*) will match any input - and while you may need the prefix to capture a group for various reasons, having it there will mean that you can't really eliminate any of the URLs in your database easily.

I sort of agree with TokenMacGuy's comment about the intractability of regexes, but the situation may not be completely hopeless, depending on the true scale of your problem. For example, for a URL to match, its first character should match; so you could pre-filter your URLs by saying which first character in the input can match that URL. So, you would have a secondary table MatchingFirstCharacters which is a lookup between initial characters and the URLs which match up to that initial character. (This will only work if you don't have lots of ambiguous prefixes, as I mentioned in the first paragraph of my answer.) Using this approach means you don't necessarily have to load all the regexes for full matching - just the ones where at least the first character matches. I suppose the idea could be generalised further, but that's an exercise for the reader ;-)
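A minimal sketch of that MatchingFirstCharacters idea (assuming, per the first paragraph, that the patterns are anchored without the `^(.*)` prefix; all names here are illustrative):

```python
import re
from collections import defaultdict

# Hypothetical anchored patterns - no leading ^(.*), per the caveat above.
PATTERNS = [
    (r"^google\.com/search", "foobar"),
    (r"^google\.com/alerts", "barfoo"),
    (r"^blah\.com/foo/(.*)", "foofoo"),
]

# Secondary lookup: first literal character -> patterns that can match it.
first_char_index = defaultdict(list)
WILDCARD = []  # patterns whose first character can't be pinned down
for pattern, data in PATTERNS:
    body = pattern.lstrip("^")
    if body and body[0] not in ".([\\":
        first_char_index[body[0]].append((pattern, data))
    else:
        WILDCARD.append((pattern, data))

def resolve(url):
    # Only consider the regexes whose first character can match the input.
    candidates = first_char_index.get(url[0], []) + WILDCARD
    for pattern, data in candidates:
        if re.match(pattern, url):
            return data
    return None
```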

Vinay Sajip
I expect to have named groups to pick off things like subdomains, modules, dates, slugs etc. The regex as shown was representational. TokenMacGuy's bloviations about regexes are not solutions. Regexes are what they are. Your MatchingFirstCharacters approach is on the same track as a possible solution that I'm leaning towards, where I use the domain+tld as a unique key; see answer below.
molicule
A: 

The plan I'm leaning towards is one which picks off the domain name + tld from a url, uses that as a key to fetch all the regexes for that domain, and then loops through each regex in this subset to find a match.

I use two tables for this

class Urlregex(db.Model):
    """
    The data field is structured as a newline-separated record list,
    where each record is a space-separated pair of a regex and a
    dispatch key. Example of one such record:

    domain_tld: google.com
    data:
        ^(.*)google.com/search(.*) google-search

    """
    domain_tld = db.StringProperty()
    data = db.TextProperty()

class Urldispatch(db.Model):
    urlkey = db.StringProperty()
    data = db.TextProperty()

So, for the cost of two db reads and a loop through a domain-specific url subset, any incoming url should be matchable against a large db of urls.
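A plain-Python sketch of the same two-step lookup, with dicts standing in for the two datastore reads (note the naive domain+tld extraction below would need a public-suffix list in practice to handle e.g. .co.uk):

```python
import re
from urllib.parse import urlsplit

# Stand-ins for the two tables: one datastore read each in the real app.
URLREGEX = {  # domain_tld -> newline-separated "regex dispatch-key" records
    "google.com": "^(.*)google\\.com/search(.*) google-search",
}
URLDISPATCH = {  # urlkey -> data
    "google-search": "foobar",
}

def domain_tld(url):
    """Naive domain+tld extraction: the last two host labels."""
    host = urlsplit(url).netloc.split(":")[0]
    return ".".join(host.split(".")[-2:])

def dispatch(url):
    records = URLREGEX.get(domain_tld(url), "")      # first "read"
    for record in records.splitlines():
        pattern, urlkey = record.rsplit(" ", 1)
        if re.search(pattern, url):
            return URLDISPATCH.get(urlkey)           # second "read"
    return None
```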

molicule
+1  A: 

"2. django does urlresolution by looping through each regex and checking for a match given that there maybe 1000's of urls is this the best way to approach this?"

"3. Are there any existing implementations I can look at?"

If running a large number of regular expressions does turn out to be a problem, you should check out esmre, which is a Python extension module for speeding up large collections of regular expressions. It works by extracting the fixed strings of each regular expression and putting them in an Aho-Corasick-inspired pattern matcher to quickly eliminate almost all of the work.

Will Harris