Note: I am the author of RegexKit et al.
This is a fairly complicated answer.. :)
First, matching a thousand regexes with any of commonly available regex engine implementations is going to be fairly slow, save for perhaps the TCL and TRE regex engines. The reason why RegexKit.framework
greatly outperforms RegexKitLite
for this task is RegexKit.framework
has quite a bit of non-trivial, optimized code for just this task. The reason for this is because it's used in Safari AdBlock, which needs to perform bulk matches of regexes against URLs. It keeps the list of regexes in sorted order, based on the number of times they made a successful match. This is based on the observation that some regex patterns used in Safari AdBlock match much more frequently than others, and trying those first dramatically reduces the amount of regexes that need to be tried to determine if there's a 'hit'. There is also a small negative hit cache as well, along with a lot of multithreading code to do the matches in parallel. None of this will ever make it in to the Lite
version as it is definitely not a light-weight feature- there's probably 60-70KB of code just to implement this one feature alone, not to mention the huge memory footprint of keeping a thousand compiled regexes around.
Using RegexKitLite
to do this kind of pattern matching is bound to be very, very slow. The first problem is that it only keeps a small cache of compiled regexes that have recently been used. By default, the cache is set to just 23
, so tossing a thousand regexes at it is going to cause every regex to be compiled each time its used.
As others have pointed out, RegexKit.framework
isn't really set up to be used on the iPhone. Even if you got around the "linking to external frameworks" provision, the default build of RegexKit.framework
does not include the arm
architecture in its fat binary (it includes ppc
, ppc64
, i386
, and x86_64
). What you really need to do is set up a new build target that creates a static library. Not terribly hard to do, really.
I'm afraid that if this kind of pattern matching is something you need to do, you're probably going to have to roll your own regex engine. What you need is a regex engine that can take your thousand regexes and concatenate them together, such as "r1|r2|r3|r4
". Most regex engines, and in particular pcre
and ICU
(the ones used by RegexKit.framework
and RegexKitLite
, respectively), evaluate such a regex in an almost left to right manner. What's needed is an almost DFA like engine that evaluates all possible states concurrently. See this link for more information. I've built such a regex engine, one that even handles back-references (much easier to do than everyone says) in ~O(M*log2(N))
(M being the size of the text to match, N being the size of the regex) time, but it's not finished. If it was, it would cut through this kind of problem like a plasma torch through butter.
I am aware of at least one person porting RegexKit.framework
to the iPhone, though: Mobile Safari AdBlock. AFAIK, it's also a port of the desktop version of Safari AdBlock. I don't know many details, but I think it requires a jail-broken iPhone to install.
In summary, I don't think there's any turn-key solutions available for iPhone development that do anything close to what you need. Your best bet, other than creating your own regex engine, is to look in to the TRE regex engine and try some experiments using concatenated regexes. Be prepared to roll up your sleeves, though, as you're going to have to get your hands very dirty and deal with the guts of Cocoa's strings, Unicode encodings, and all kinds of other unpleasant stuff- the kind of stuff that RegexKitLite
takes care of for you behind the scenes.