ansaurus

Question

Answer 1

+1 A:

Can you explain why this was the case?

Easy. preg_match is implemented in C. The other solutions are implemented in PHP. Now, that doesn't mean a regex will always be faster than the equivalent PHP, but most of the time, it probably will be.

I recently had a similar situation, where I had a function (a CamelCase converter) that was being called tens of thousands of times and taking a fair amount of CPU (I profiled). I tried every PHP reimplementation I could dream up. The preg_replace version was always faster. In the end, I left the function as it was and memoized it, which did the trick.
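To illustrate the memoization idea, here is a minimal sketch. The function name `camelize_memoized` and the snake_case-to-camelCase rule are assumptions made up for the example; the converter described above is not shown in this thread.

```php
<?php
// Hypothetical converter: "some_snake_case" -> "someSnakeCase".
// The static array caches results, so repeated inputs skip the regex entirely.
function camelize_memoized(string $input): string
{
    static $cache = [];

    if (!isset($cache[$input])) {
        // Uppercase every letter that follows an underscore and drop the underscore.
        $cache[$input] = preg_replace_callback(
            '/_([a-z])/',
            function (array $m): string {
                return strtoupper($m[1]);
            },
            $input
        );
    }

    return $cache[$input];
}
```

This only pays off when the same inputs recur often, and the cache grows with the number of distinct inputs seen during the request.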

In many cases, the fewer PHP statements executed, the better. If you can replace a loop with a single call to a function that's implemented in C under the hood, that may be your best bet.

really it is less readable/maintainable than the loops

I disagree. One-liners are as simple as it gets. Although I'd probably go with something more like

function preg_check($rejectedStrs, $input) {
    // preg_match takes (pattern, subject); $rejectedStrs must be one pattern string
    return preg_match($rejectedStrs, $input);
}

Frank Farmer 2010-08-20 02:36:11

Note that preg_match wasn't viable for this, since it only accepts a single pattern. Still, your point holds. I hadn't considered that built-ins would be implemented in a compiled language. Would it change things if I were using something like APC to speed up my php?

JGB146 2010-08-20 02:41:24

As to the loops vs one-liners, I said loops were more readable/maintainable because the intention is more clear and understandable to the average programmer. As evidenced by the -1 my answer still holds on the question I'm referencing.

JGB146 2010-08-20 02:43:17

I was very tempted to downvote this. "preg_match is implemented in C" is a simplistic explanation. An algorithm with higher time complexity written in C will always be (asymptotically) slower than a lower-complexity one written in PHP.

Artefacto 2010-08-20 02:51:44

@Artefacto: Would you agree that this is the reason for the performance I saw in my original tests?

JGB146 2010-08-20 03:53:48

"Note that preg_match wasn't viable for this, since it only accepts a single pattern" -- you can always combine multiple words into a single pattern: `preg_match('/(loop|efficiency|explain)/', $str);`

Frank Farmer 2010-08-20 16:39:27
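A minimal sketch of building such a combined pattern from a word list. The helper name `contains_any_word` is hypothetical; like the example in the comment above, it matches substrings case-sensitively.

```php
<?php
// Hypothetical helper: returns true if $input contains any of $words.
function contains_any_word(array $words, string $input): bool
{
    if ($words === []) {
        return false; // an empty alternation would match every string
    }

    // preg_quote() escapes regex metacharacters; '/' is our pattern delimiter.
    $escaped = array_map(
        function (string $w): string {
            return preg_quote($w, '/');
        },
        $words
    );

    // e.g. ['loop', 'explain'] becomes '/(?:loop|explain)/'
    $pattern = '/(?:' . implode('|', $escaped) . ')/';

    return preg_match($pattern, $input) === 1;
}
```

Escaping each word matters when the list comes from user input, since an unescaped `.` or `(` would change what the pattern matches.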

Answer 2

+1 A:

Let's first look at preg_check and str_check. Both of them have to traverse the entire string, and they have to check each of the individual words in each traversal. So their behavior will be at least O(n*m), where n is the length of the string and m the number of bad words. You can test this by running the algorithm with increasing values of n and m and plotting the results as a 3D graph (although you may have to use very high values of n and m to see this behavior).

loop_check is more (asymptotically) efficient here. The reason is that the number of distinct words in a string is not proportional to its length -- I seem to recall it typically follows a logarithmic function. It probably uses a hash table to store the words it finds along the way, and each insertion takes constant time on average (if we ignore that the hash table may have to be rebuilt from time to time to accommodate more elements).

Therefore loop_check will have an asymptotic behavior that follows something like n + m * log(n), which is better than n*m.
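The hash-table approach described above can be sketched as follows. This is a hypothetical reconstruction (the question's actual loop_check is not shown in this thread), and the name `loop_check_sketch`, the lowercasing, and the `\W+` word split are all assumptions.

```php
<?php
// Hypothetical reconstruction of a hash-based word check.
function loop_check_sketch(array $badWords, string $input): bool
{
    // One pass over the input: split it into lowercase words.
    $words = preg_split('/\W+/', strtolower($input), -1, PREG_SPLIT_NO_EMPTY);

    // PHP arrays are hash tables, so using the words as keys gives
    // average constant-time lookups afterwards.
    $seen = array_flip($words);

    // One hash lookup per bad word instead of one string scan per bad word.
    foreach ($badWords as $bad) {
        if (isset($seen[strtolower($bad)])) {
            return true;
        }
    }

    return false;
}
```

Note that this matches whole words only, so it is not a drop-in equivalent of a substring search.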

Now, this refers to the asymptotic behavior of the algorithms, i.e., when m and n grow very (and it may require "very very") large. For small values of m and n the constants play a big part. In particular, executing PHP opcodes and PHP function calls is more costly than performing the same task in C, just one function call away. This doesn't make the regex algorithm asymptotically faster; it just makes it faster for small values of m and n.

Artefacto 2010-08-20 03:21:46

Interesting. I had assumed that the nested nature of `loop_check` would lead its performance to *worsen* as the size of inputs increased. I went ahead and tested this with the same code as in the question, but with inputs of ~20x `$input` along with ~50 bad words. And to your point, I got the results you expected: `loop_check` far outperformed the others, at approx. 14s vs 21s and 25s for `preg_check` and `str_check`. In the end, I guess it comes down to how long a string you expect to check on average, and how many words you will check against.

JGB146 2010-08-20 03:51:15

@JGB you can't just look at the loops you see in the PHP code; the internal functions' implementations can also loop, make recursive calls, etc. How many times they loop is also important.

Artefacto 2010-08-20 09:41:56

@JGB And I repeat, if you care about speed, you can write an algorithm that turns the bad words into a trie and is able to run in O(n). If you implement it in C in a PHP extension, you can be sure it'll also beat the other solutions for small values of m and n.

Artefacto 2010-08-20 09:44:58

@JGB Sorry, a trie implementation will be O(n+m), because building the trie takes linear time on `m`. But it's still better than O(n + m * log(n)).

Artefacto 2010-08-20 10:36:58
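A rough sketch of the trie idea, written in plain PHP for illustration rather than as a C extension. The function names and the `'$end'` marker key are hypothetical, and the sketch assumes the bad words consist only of ASCII letters and digits and are matched case-sensitively as whole words.

```php
<?php
// Build a trie as nested arrays keyed by character. The multi-character
// key '$end' (hypothetical marker) flags the end of a complete word.
function build_trie(array $badWords): array
{
    $root = [];
    foreach ($badWords as $word) {
        if ($word === '') {
            continue; // ignore empty entries
        }
        $node = &$root;
        $len = strlen($word);
        for ($i = 0; $i < $len; $i++) {
            $ch = $word[$i];
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['$end'] = true;
        unset($node); // break the reference before the next word
    }
    return $root;
}

// Scan the input once; each character advances the trie walk by one step.
function trie_check(array $trie, string $input): bool
{
    $node = $trie; // current trie node, or null once the word cannot match
    $len = strlen($input);
    for ($i = 0; $i <= $len; $i++) {
        // A sentinel space after the last character flushes the final word.
        $ch = $i < $len ? $input[$i] : ' ';
        if (ctype_alnum($ch)) {
            if ($node !== null) {
                $node = $node[$ch] ?? null;
            }
        } else {
            if ($node !== null && isset($node['$end'])) {
                return true; // a whole word ended on a terminal node
            }
            $node = $trie; // restart at the root for the next word
        }
    }
    return false;
}
```

Building the trie touches each character of the bad words once and the scan touches each input character once, which is where the O(n+m) estimate above comes from. In pure PHP the constant factors are likely to be high, so this is meant to show the structure, not to win benchmarks.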


Efficiency of Preg_replace
