ansaurus

Question

Hashtable/dictionary/map lookup with regular expressions

Answer 1

A:

The fundamental assumption is flawed, I think. You can't map hashes to regular expressions.

Jimmy 2008-11-03 21:44:21

You can, at least in Python. It's not very useful, though (for me at least), as they'll only match the same regex object.

Jeff 2008-11-03 22:11:10

Answer 2

+4 A:

This is not possible to do with a regular hash table in any language. You'll either have to iterate through the entire keyset, attempting to match the key to your regex, or use a different data structure.

You should choose a data structure that is appropriate to the problem you're trying to solve. If you have to match against any arbitrary regular expression, I don't know of a good solution. If the class of regular expressions you'll be using is more restrictive, you might be able to use a data structure such as a trie or suffix tree.

Adam Rosenfield 2008-11-03 21:44:41

Answer 3

A:

I don't think it's even theoretically possible. What happens if someone passes in a string that matches more than one regular expression?

For example, what would happen if someone did:

>>> regex_dict['FileNfoo']

How can something like that possibly be O(1)?

Moe 2008-11-03 21:44:49

Answer 4

+1 A:

What happens if you have a dictionary such as

regex_dict = { re.compile("foo.*"): 5, re.compile("f.*"): 6 }

In this case regex_dict["food"] could legitimately return either 5 or 6.

Even ignoring that problem, there's probably no way to do this efficiently with the regex module. Instead, what you'd need is an internal directed graph or tree structure.

Eli Courtwright 2008-11-03 21:46:56

Answer 5

+3 A:

In the general case, what you need is a lexer generator. It takes a bunch of regular expressions and compiles them into a recognizer. "lex" will work if you are using C. I have never used a lexer generator in Python, but there seem to be a few to choose from. Google shows PLY, PyGgy and PyLexer.

If the regular expressions all resemble each other in some way, then you may be able to take some shortcuts. We would need to know more about the ultimate problem that you are trying to solve in order to come up with any suggestions. Can you share some sample regular expressions and some sample data?

Also, how many regular expressions are you dealing with here? Are you sure that the naive approach won't work? As Rob Pike once said, "Fancy algorithms are slow when n is small, and n is usually small." Unless you have thousands of regular expressions, and thousands of things to match against them, and this is an interactive application where a user is waiting for you, you may be best off just doing it the easy way and looping through the regular expressions.
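For reference, the "easy way" might look something like the sketch below. The rule table and the function names lookup_all and lookup_first are made up for illustration, and I'm assuming a pattern has to match the whole key rather than just a prefix.

```python
import re

# Hypothetical rule table: (compiled pattern, value) pairs, checked in order.
RULES = [
    (re.compile(r"foo.*"), 5),
    (re.compile(r"f.*"), 6),
    (re.compile(r"\d+"), 7),
]

def lookup_all(key):
    """Return the values of every rule whose pattern matches the whole key."""
    return [value for pattern, value in RULES if pattern.fullmatch(key)]

def lookup_first(key, default=None):
    """Return the value of the first matching rule, or default if none match."""
    for pattern, value in RULES:
        if pattern.fullmatch(key):
            return value
    return default
```

Each lookup costs one match attempt per rule, so it is O(n) in the number of regexes, but there is nothing to maintain.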

Glomek 2008-11-03 21:53:03

We anticipate having thousands of regexes soon. In all cases we have to match those regexes repeatedly, typically thousands of times per user operation. It may be okay to go with the naive solution and then rewrite the algorithm when performance degrades, as this does not need to run interactively.

Jeff 2008-11-03 22:08:11

Answer 6

+1 A:

As other respondents have pointed out, it's not possible to do this with a hash table in constant time.

One approximation that might help is to use a technique called "n-grams". Create an inverted index from n-character chunks of a word to the entire word. When given a pattern, split it into n-character chunks, and use the index to compute a scored list of matching words.

Even if you can't accept an approximation, in most cases this would still provide an accurate filtering mechanism so that you don't have to apply the regex to every key.
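A rough sketch of the filtering variant, under some assumptions of my own: trigrams (N = 3), and the caller passes a literal substring (required_literal) that every matching key must contain, rather than the code trying to extract it from the regex. The function names are hypothetical.

```python
import re
from collections import defaultdict

N = 3  # chunk length; an arbitrary choice for this sketch

def ngrams(text, n=N):
    """All overlapping n-character chunks of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_index(keys):
    """Map each n-gram to the set of keys that contain it."""
    index = defaultdict(set)
    for key in keys:
        for gram in ngrams(key):
            index[gram].add(key)
    return index

def search(index, keys, pattern, required_literal):
    """Keys matched by pattern, using required_literal to pre-filter.

    required_literal is a hint supplied by the caller: a substring that
    every matching key must contain. If it is shorter than N characters
    it yields no n-grams, and we fall back to scanning every key.
    """
    grams = ngrams(required_literal)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(keys)
    regex = re.compile(pattern)
    return sorted(k for k in candidates if regex.search(k))
```

The regex is still applied to every candidate, so the result is exact; the index only shrinks the set of keys that need checking.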

erickson 2008-11-03 21:54:01

Answer 7

A:

It may be possible to get the regex compiler to do most of the work for you by concatenating the search expressions into one big regexp, separated by "|". A clever regex compiler might search for commonalities in the alternatives in such a case, and devise a more efficient search strategy than simply checking each one in turn. But I have no idea whether there are compilers which will do that.
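In Python the concatenation could be sketched as below, wrapping each alternative in a named group so you can tell which one matched. The entries and the group names r0, r1, ... are invented for the example, and it assumes none of the patterns use numbered backreferences (the group numbers shift once they are combined).

```python
import re

# Hypothetical patterns and values; the patterns define no groups of their own.
ENTRIES = [
    (r"foo.*", 5),
    (r"f.*", 6),
    (r"\d+", 7),
]

# Wrap each pattern in a uniquely named group: (?P<r0>foo.*)|(?P<r1>f.*)|...
COMBINED = re.compile(
    "|".join("(?P<r%d>%s)" % (i, pat) for i, (pat, _) in enumerate(ENTRIES))
)
VALUES = {"r%d" % i: value for i, (_, value) in enumerate(ENTRIES)}

def lookup(key, default=None):
    """Value for the first alternative that matches the whole key."""
    m = COMBINED.fullmatch(key)
    return VALUES[m.lastgroup] if m else default
```

As far as I know, Python's re module is a backtracking engine that tries the alternatives in order, so this is mostly a convenience there. Any real speedup would have to come from an automaton-based engine.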

fivebells 2008-11-04 00:13:40

Answer 8

A:

It really depends on what these regexes look like. If few of them match almost anything (like '.*' or '\d+'), and most instead contain words, phrases, or other fixed patterns longer than 4 characters (e.g. 'a*b*c' in ^\d+a\*b\*c:\s+\w+), as in your examples, you can use this common trick that scales well to millions of regexes:

Build an inverted index for the regexes (rabin-karp-hash('fixed pattern') -> list of regexes containing 'fixed pattern'). Then, at matching time, use Rabin-Karp hashing to compute sliding hashes and look them up in the inverted index, advancing one character at a time. You now have O(1) lookup for inverted-index non-matches and a reasonable O(k) time for matches, where k is the average length of the lists of regexes in the inverted index. k can be quite small (less than 10) for many applications. The quality of the inverted index (a false positive means a bigger k, a false negative means missed matches) depends on how well the indexer understands the regex syntax. If the regexes are written by human experts, they can provide hints for the fixed patterns they contain as well.
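A simplified sketch of the idea, with two substitutions that should be stated plainly: the fixed pattern is supplied by the caller as a hint instead of being extracted from the regex, and each window is hashed with Python's ordinary dict hashing rather than a rolling Rabin-Karp hash (so each window costs O(K) instead of O(1)). The class name and K = 4 are my own choices.

```python
import re
from collections import defaultdict

K = 4  # window length; every hint is truncated to its first K characters

class LiteralIndexedRegexes:
    """Regexes indexed by a fixed substring that any match must contain."""

    def __init__(self):
        self._by_window = defaultdict(list)  # K-char literal -> [(regex, value)]
        self._unindexed = []                 # regexes with no usable hint

    def add(self, pattern, value, hint=None):
        """hint is a caller-supplied literal that every match must contain."""
        entry = (re.compile(pattern), value)
        if hint is not None and len(hint) >= K:
            self._by_window[hint[:K]].append(entry)
        else:
            self._unindexed.append(entry)

    def lookup(self, text):
        """Values of all regexes that match somewhere in text."""
        candidates = list(self._unindexed)
        seen = set()
        # Slide a K-character window over the input, one character at a time.
        for i in range(len(text) - K + 1):
            window = text[i:i + K]
            if window in self._by_window and window not in seen:
                seen.add(window)
                candidates.extend(self._by_window[window])
        # Only the candidates are run as real regexes.
        return [value for regex, value in candidates if regex.search(text)]
```

Regexes without a usable hint (like '\d+') end up in the unindexed list and are tried on every lookup, which is why this only pays off when such regexes are rare.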

ididak 2008-11-04 01:57:30

Answer 9

+1 A:

There is a Perl module that does just this: Tie::Hash::Regex.

use Tie::Hash::Regex;
my %h;

tie %h, 'Tie::Hash::Regex';

$h{key}   = 'value';
$h{key2}  = 'another value';
$h{stuff} = 'something else';

print $h{key};  # prints 'value'
print $h{2};    # prints 'another value'
print $h{'^s'}; # prints 'something else'

print tied(%h)->FETCH(k); # prints 'value' and 'another value'

delete $h{k};   # deletes $h{key} and $h{key2};

Brad Gilbert 2008-11-04 04:31:23

Yes, I know about this module, and the described behavior is exactly what I want, but I peeked at the source code for this and it's really just iterating over the keys for each lookup. So it's really just an O(n) solution, albeit a convenient one.

Jeff 2008-11-04 16:21:19

Answer 10

A:

A special case of this problem came up in the 70s AI languages oriented around deductive databases. The keys in these databases could be patterns with variables -- like regular expressions without the * or | operators. They tended to use fancy extensions of trie structures for indexes. See krep*.lisp in Norvig's Paradigms of AI Programming for the general idea.

Darius Bacon 2008-11-04 05:02:57

Answer 11

+2 A:

This is definitely possible, as long as you're using 'real' regular expressions. A textbook regular expression is something that can be recognized by a deterministic finite state machine, which primarily means you can't have back-references in there.

There's a property of regular languages that "the union of two regular languages is regular", meaning that you can recognize an arbitrary number of regular expressions at once with a single state machine. The state machine runs in O(1) time with respect to the number of expressions (it runs in O(n) time with respect to the length of the input string, but hash tables do too).

Once the state machine completes you'll know which expressions matched, and from there it's easy to look up values in O(1) time.

2008-11-04 06:30:01

I vaguely recall reading something about this in Higher Order Perl but can't find the location at the moment. Anyone else remember?

Michael Carman 2008-11-04 17:08:15

Answer 12

+1 A:

If you have a small set of possible inputs, you can cache the matches as they appear in a second dict and get O(1) for the cached values.

If the set of possible inputs is too big to cache but not infinite, either, you can just keep the last N matches in the cache (check Google for "LRU maps" - least recently used).

If you can't do this, you can try to chop down the number of regexps you have to try by checking a prefix or somesuch.
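In Python the bounded cache is close to free with functools.lru_cache. A minimal sketch, assuming a made-up rule table, a cache size of 1024, and whole-key matching:

```python
import re
from functools import lru_cache

# Hypothetical rule table; the slow path is a linear scan over it.
RULES = [
    (re.compile(r"foo.*"), 5),
    (re.compile(r"f.*"), 6),
]

@lru_cache(maxsize=1024)  # keep only the 1024 most recently used keys
def lookup(key):
    """First matching value for key, or None; repeated keys hit the cache."""
    for pattern, value in RULES:
        if pattern.fullmatch(key):
            return value
    return None
```

If the rule table changes, the cached answers can go stale, so lookup.cache_clear() would need to be called after any update.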

Aaron Digulla 2008-11-04 12:39:42

Answer 13

+1 A:

What you want to do is very similar to what is supported by xrdb. They only support a fairly minimal notion of globbing however.

Internally you can implement a larger family of regular languages than theirs by storing your regular expressions as a character trie.

single characters just become trie nodes.
.'s become wildcard insertions covering all children of the current trie node.
*'s become back links in the trie to the node at the start of the previous item.
[a-z] ranges insert the same subsequent child nodes repeatedly under each of the characters in the range. With care, while inserts/updates may be somewhat expensive, the search can be linear in the size of the string. With some placeholder stuff, the common combinatorial explosion cases can be kept under control.
(foo)|(bar) nodes become multiple insertions

This doesn't handle regexes that occur at arbitrary points in the string, but that can be modeled by wrapping your regex with .* on either side.
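To give a feel for it, here is a much smaller sketch that handles only literal characters and '.', not '*', ranges or alternation. It also differs from the description above in one respect: instead of expanding '.' across all children at insert time, it stores a single wildcard edge and follows it during lookup. The class name PatternTrie is hypothetical.

```python
class PatternTrie:
    """Trie over patterns made of literal characters and '.' (any one character)."""

    WILDCARD = "."

    def __init__(self):
        self.children = {}  # character (or WILDCARD) -> PatternTrie
        self.values = []    # values of patterns that end at this node

    def insert(self, pattern, value):
        node = self
        for ch in pattern:
            node = node.children.setdefault(ch, PatternTrie())
        node.values.append(value)

    def lookup(self, text):
        """Values of every stored pattern that matches the whole of text."""
        nodes = [self]
        for ch in text:
            next_nodes = []
            for node in nodes:
                if ch in node.children:
                    next_nodes.append(node.children[ch])
                # A stored '.' edge accepts any character; skip it here when
                # ch is itself '.', because the branch above already took it.
                if ch != self.WILDCARD and self.WILDCARD in node.children:
                    next_nodes.append(node.children[self.WILDCARD])
            if not next_nodes:
                return []
            nodes = next_nodes
        return [v for node in nodes for v in node.values]
```

The lookup walks the input once while tracking a set of live nodes, so the cost depends on how many stored patterns share the prefix seen so far rather than on the total number of patterns.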

Perl has a couple of Text::Trie -like modules you can raid for ideas. (Heck I think I even wrote one of them way back when)

Edward Kmett 2008-11-05 20:52:08

any implementation available?

bill 2009-06-16 10:16:16

Answer 14

A:

I created this exact data structure for a project once. I implemented it naively, as you suggested. I did make two immensely helpful optimizations, which may or may not be feasible for you, depending on the size of your data:

Memoizing the hash lookups
Pre-seeding the memoization table (not sure what to call this... warming up the cache?)

To avoid the problem of multiple keys matching the input, I gave each regex key a priority and the highest priority was used.
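Roughly, in Python terms, it could look like the sketch below. This is a reconstruction of the idea and not the original project code; the class name, the warm method and the whole-key matching are assumptions.

```python
import re

class PriorityRegexDict:
    """Regex-keyed lookup: highest priority wins, results are memoized."""

    def __init__(self):
        self._rules = []  # (priority, compiled regex, value)
        self._memo = {}   # input string -> value (or None)

    def add(self, pattern, value, priority=0):
        self._rules.append((priority, re.compile(pattern), value))
        # Highest priority first; the sort is stable, so ties keep insertion order.
        self._rules.sort(key=lambda rule: -rule[0])
        self._memo.clear()  # cached answers may be stale after adding a rule

    def warm(self, keys):
        """Pre-seed the memo table with keys expected to be looked up."""
        for key in keys:
            self[key]

    def __getitem__(self, key):
        if key not in self._memo:
            # Linear scan on a miss; the first hit is the highest priority match.
            self._memo[key] = next(
                (v for _, rx, v in self._rules if rx.fullmatch(key)), None
            )
        return self._memo[key]
```

A miss still costs a scan over all the regexes, so this only helps when the same inputs come back repeatedly.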

2008-11-06 05:55:42

Answer 15

+1 A:

Hey,

What about the following:

import re

class redict(dict):
    def __init__(self, d):
        dict.__init__(self, d)

    def __getitem__(self, regex):
        # Generator: yields the value of every key that matches the regex.
        r = re.compile(regex)
        mkeys = filter(r.match, self.keys())
        for i in mkeys:
            yield dict.__getitem__(self, i)

It's basically a subclass of the dict type in Python. With this you can supply a regular expression as a key, and the values of all keys that match this regex are returned in an iterable fashion using yield.

With this you can do the following:

>>> keys = ["a", "b", "c", "ab", "ce", "de"]
>>> vals = range(0,len(keys))
>>> red = redict(zip(keys, vals))
>>> for i in red[r"^.e$"]:
...     print i
... 
5
4
>>>

2009-05-03 01:18:14

Functionally, this is fine, but performance-wise, it's still O(n) because filter() is O(n) (well, actually it's worse than O(n) because we have to match the regular expression against each key, which has a non-constant cost, but I assume that will be part of any solution). I'd like to search the keys in a better-than-O(n) way, if possible. Others suggested data structures such as tries that might make this possible.

Jeff 2009-05-04 19:44:57
