ansaurus

Question

Answer 1

A:

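You could first sort the strings by length (O(N log N) with a comparison sort such as `usort()`, or O(N) if you bucket them by length) and then only check whether shorter strings are substrings of longer ones. Likewise, only run `levenshtein()` on string pairs whose length difference is not too large, since the edit distance can never be smaller than that difference.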

You already perform these checks, but at the moment you run them for all N x N pairs. Preselecting by length first reduces the number of pairs you have to look at in the first place. Avoid the full N x N loop, even if it contains only tests that will fail.
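
A minimal sketch of that length-first pruning, assuming PHP 5.4+ and non-empty strings. The function name `findNearDuplicates` and the `$maxDistance` threshold are made up for illustration; they are not taken from your code:

```php
<?php
// Hypothetical helper: the name and the $maxDistance threshold are illustrative.
function findNearDuplicates(array $values, $maxDistance = 2)
{
    // Shortest first, so each string is only compared with equal or longer ones.
    usort($values, function ($a, $b) {
        return strlen($a) - strlen($b);
    });

    $count = count($values);   // computed once, not on every iteration
    $matches = [];

    for ($i = 0; $i < $count; $i++) {
        $short = $values[$i];
        $shortLen = strlen($short);

        // Start past $i: every unordered pair is visited exactly once.
        for ($j = $i + 1; $j < $count; $j++) {
            $long = $values[$j];

            // Only the shorter string can be a substring of the longer one.
            if (stripos($long, $short) !== false) {
                $matches[] = [$short, $long, 'substring'];
                continue;
            }

            // The edit distance is at least the length difference,
            // so skip the expensive call when it cannot succeed.
            if (strlen($long) - $shortLen > $maxDistance) {
                continue;
            }

            $distance = levenshtein($short, $long);
            // Before PHP 8, levenshtein() returns -1 for strings over 255 characters.
            if ($distance >= 0 && $distance <= $maxDistance) {
                $matches[] = [$short, $long, 'levenshtein'];
            }
        }
    }

    return $matches;
}
```

The inner loop cannot simply stop at the first pair that is too far apart in length, because a much longer string may still contain the shorter one; only the `levenshtein()` call is skipped.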

For substring matching you could improve further by building an index of all shorter items and updating it as you process longer items. The index can be a tree structure that branches on letters, where each word (string) forms a path from the root to a node marked as the end of that word. This way you can find out whether any of the words in the index occurs inside some string you want to match. For each character of that string, try to advance every active pointer in the tree index, and start a new pointer at the root of the index. If a pointer cannot advance to the current character, you remove it. If any pointer reaches a node that ends a word (a leaf node, unless that word is a prefix of a longer one), you've found a substring match. Implementing this is, I think, not difficult, but not trivial either.
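
Roughly, the tree index could look like the sketch below. It assumes PHP 5.4+, non-empty single-byte words, and case-sensitive matching (lowercase everything with `strtolower()` first if you want `stripos`-like behaviour). The function names and the nested-array node layout are my own invention for illustration:

```php
<?php
// Hypothetical sketch: each node is ['next' => children by character, 'word' => string|null].
function trieNewNode()
{
    return ['next' => [], 'word' => null];
}

// Add one word to the index as a path from the root.
function trieInsert(array &$root, $word)
{
    $node = &$root;
    foreach (str_split($word) as $ch) {
        if (!isset($node['next'][$ch])) {
            $node['next'][$ch] = trieNewNode();
        }
        $node = &$node['next'][$ch];
    }
    $node['word'] = $word;   // mark the end of a word
}

// Return every indexed word that occurs somewhere inside $text.
function trieFindSubstrings(array $root, $text)
{
    $found = [];
    $active = [];            // nodes reached by partial matches so far
    $length = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $ch = $text[$i];
        $active[] = $root;   // a match may start at every position
        $advanced = [];

        foreach ($active as $node) {
            if (isset($node['next'][$ch])) {
                $child = $node['next'][$ch];
                if ($child['word'] !== null) {
                    $found[] = $child['word'];
                }
                $advanced[] = $child;
            }
            // Pointers that cannot advance are simply not carried over.
        }
        $active = $advanced;
    }

    return array_values(array_unique($found));
}
```

With items processed shortest first, you would call `trieFindSubstrings()` for the current item and then `trieInsert()` it, so the index only ever holds strings that are no longer than the one being matched.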

catchmeifyoutry 2010-05-20 23:03:41

I was thinking about ordering by length, but worried about the additional overhead for sorting on output, as it would need to use `usort()`. Due to the small result set though, this was minuscule compared to the time savings of reducing the loop iterations.

GApple 2010-05-31 22:56:38

Ordering by length cut the time almost exactly in half on my Windows development machine, as could be expected. Bizarrely, on the production Linux system the time actually increased 2 to 3 times initially. In transitioning from a `foreach` loop to `for`, I had been calculating the length of the values array on every iteration; calculating it once and storing it in a variable instead cut the time on Windows in half again, and on Linux brought it down to 5% of the original `foreach` loop's time.

GApple 2010-05-31 22:58:36

Hmm, that's weird. Were you running the exact same code on Windows and Linux?

catchmeifyoutry 2010-05-31 23:19:45

Answer 2

A:

You can get an instant 100% improvement (half as many comparisons) by tightening your inner loop so that each pair is only visited once. Aren't you getting duplicate matches in your results?

As a preprocessing step I'd go through and calculate character frequencies for each string (assuming your set of characters is small, like a-z0-9, which I think is likely given that you're using `stripos`). Then, rather than comparing sequences (expensive), compare frequencies (cheap). This will give you false positives, which you can either live with or feed into the test you've currently got to weed them out.
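
Something along these lines, as a sketch only. It assumes single-byte text (`count_chars()` counts bytes) and PHP 5.4+, and the three function names are hypothetical:

```php
<?php
// Hypothetical helpers; compute charFrequencies() once per string and cache the result.
function charFrequencies($s)
{
    // Mode 1 returns only the byte values that occur, mapped to their counts.
    return count_chars(strtolower($s), 1);
}

// Cheap necessary condition for "needle is a substring of haystack".
function couldContain(array $haystackFreq, array $needleFreq)
{
    foreach ($needleFreq as $byte => $n) {
        if (!isset($haystackFreq[$byte]) || $haystackFreq[$byte] < $n) {
            return false;
        }
    }
    return true;
}

// Lower bound on the edit distance: one edit removes at most one surplus
// character on each side, so at least max(surplusA, surplusB) edits are needed.
function editDistanceLowerBound(array $a, array $b)
{
    $surplusA = 0;
    $surplusB = 0;
    foreach ($a as $byte => $n) {
        $other = isset($b[$byte]) ? $b[$byte] : 0;
        if ($n > $other) {
            $surplusA += $n - $other;
        }
    }
    foreach ($b as $byte => $n) {
        $other = isset($a[$byte]) ? $a[$byte] : 0;
        if ($n > $other) {
            $surplusB += $n - $other;
        }
    }
    return max($surplusA, $surplusB);
}
```

Anagrams show where the false positives come from: "apple" and "appel" have identical frequencies, so the bound is 0 and the pair still has to go through the real `levenshtein()` test. A pair is only safe to skip when `couldContain()` is false and the lower bound already exceeds your threshold.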

CurtainDog 2010-05-20 23:27:31

All items in the values array are unique. There's also a comparison to skip over comparing items to themselves.

GApple 2010-05-20 23:40:45

Optimizing near-duplicate value search
