views:

771

answers:

3

For my app I need to see if an url is Matched by a regex string. so I created an array with all the regex strings (about 1000+ strings) and check them using RegexKit lite:

for (NSString * aString in mainDelegate.whiteListArray) {

if (![urlString isMatchedByRegex:aString]) {

it works but sadly this operation takes very very long. at least 20 seconds for a webpage like google.com

I've tried using the "normal" RegexKit.framework, because it has an method called (BOOL)isMatchedByAnyRegexInArrayNSArray *)regexArray which is much faster. I can build the app, but whenever I try to launch it it crashes with the following error:

dyld: Library not loaded: @executable_path/../Frameworks/RegexKit.framework/Versions/A/RegexKit Referenced from: /Users/Reilly/Library/Application Support/iPhone Simulator/User/Applications/7E057EA8-5CD1-465B-8102-38A53A9B5F5B/Drowser.app/Drowser Reason: image not found

I guess it's because the RegexKit is not meant for arm? (to include the RegexKit I followed the how to which comes in the documentation)

so my question are:

  1. Do you know of any faster way to check a string if it's being matched by any of 1000 regexs.

  2. or do you know how to use the "normal" RegexKit on iPhone or any other regex framework which would do what I need in under a second?

thanks in advance

A: 

Are you copying RegexKit.framework into the frameworks folder of your iPhone app?

Dave DeLong
I followed the instructions of the documentation: http://regexkit.sourceforge.net/Documentation/index.htmlincluding Adding a Copy Files build phase to your executable target.
Yllier
+1  A: 

iPhone does not support embedded frameworks, so the directions there would not work even if it was built for arm. You can only use things that are statically linked, so you will either need to modify regexkit to build as a static library, or include it's source code directly in your project.

Louis Gerbarg
+6  A: 

Note: I am the author of RegexKit et al.

This is a fairly complicated answer.. :)

First, matching a thousand regexes with any of commonly available regex engine implementations is going to be fairly slow, save for perhaps the TCL and TRE regex engines. The reason why RegexKit.framework greatly outperforms RegexKitLite for this task is RegexKit.framework has quite a bit of non-trivial, optimized code for just this task. The reason for this is because it's used in Safari AdBlock, which needs to perform bulk matches of regexes against URLs. It keeps the list of regexes in sorted order, based on the number of times they made a successful match. This is based on the observation that some regex patterns used in Safari AdBlock match much more frequently than others, and trying those first dramatically reduces the amount of regexes that need to be tried to determine if there's a 'hit'. There is also a small negative hit cache as well, along with a lot of multithreading code to do the matches in parallel. None of this will ever make it in to the Lite version as it is definitely not a light-weight feature- there's probably 60-70KB of code just to implement this one feature alone, not to mention the huge memory footprint of keeping a thousand compiled regexes around.

Using RegexKitLite to do this kind of pattern matching is bound to be very, very slow. The first problem is that it only keeps a small cache of compiled regexes that have recently been used. By default, the cache is set to just 23, so tossing a thousand regexes at it is going to cause every regex to be compiled each time its used.

As others have pointed out, RegexKit.framework isn't really set up to be used on the iPhone. Even if you got around the "linking to external frameworks" provision, the default build of RegexKit.framework does not include the arm architecture in its fat binary (it includes ppc, ppc64, i386, and x86_64). What you really need to do is set up a new build target that creates a static library. Not terribly hard to do, really.

I'm afraid that if this kind of pattern matching is something you need to do, you're probably going to have to roll your own regex engine. What you need is a regex engine that can take your thousand regexes and concatenate them together, such as "r1|r2|r3|r4". Most regex engines, and in particular pcre and ICU (the ones used by RegexKit.framework and RegexKitLite, respectively), evaluate such a regex in an almost left to right manner. What's needed is an almost DFA like engine that evaluates all possible states concurrently. See this link for more information. I've built such a regex engine, one that even handles back-references (much easier to do than everyone says) in ~O(M*log2(N)) (M being the size of the text to match, N being the size of the regex) time, but it's not finished. If it was, it would cut through this kind of problem like a plasma torch through butter.

I am aware of at least one person porting RegexKit.framework to the iPhone, though: Mobile Safari AdBlock. AFAIK, it's also a port of the desktop version of Safari AdBlock. I don't know many details, but I think it requires a jail-broken iPhone to install.

In summary, I don't think there's any turn-key solutions available for iPhone development that do anything close to what you need. Your best bet, other than creating your own regex engine, is to look in to the TRE regex engine and try some experiments using concatenated regexes. Be prepared to roll up your sleeves, though, as you're going to have to get your hands very dirty and deal with the guts of Cocoa's strings, Unicode encodings, and all kinds of other unpleasant stuff- the kind of stuff that RegexKitLite takes care of for you behind the scenes.

johne
thank you very much for this detailed answer. now most things are clear for me. I don't think it's worth going through all that pain to code my own regex engine, especially since I'm still quite new to development :/but if AdBlock for mobile safari is really an port of Safari AdBlock (which is released under GPL) the developer will have to share his source with me when I buy it. then I can at least see how he ported RegexKit to the iPhone.
Yllier