views:

70

answers:

2

I'd like to find the longest repeating string within a string, implemented in JavaScript and using a regular-expression based approach.

I have an PHP implementation that, when directly ported to JavaScript, doesn't work.

The PHP implementation is taken from an answer to the question "Find longest repeating strings?":

preg_match_all('/(?=((.+)(?:.*?\2)+))/s', $input, $matches, PREG_SET_ORDER);

This will populate $matches[0][X] (where X is the length of $matches[0]) with the longest repeating substring to be found in $input. I have tested this with many input strings and found am confident the output is correct.

The closest direct port in JavaScript is:

var matches = /(?=((.+)(?:.*?\2)+))/.exec(input);

This doesn't give correct results

input                  Excepted result   matches[0][X]
======================================================
inputinput             input             input
7inputinput            input             input
inputinput7            input             input
7inputinput7           input             7
XXinputinputYY         input             XX

I'm not familiar enough with regular expressions to understand what the regular expression used here is doing.

There are certainly algorithms I could implement to find the longest repeating substring. Before I attempt to do that, I'm hoping a different regular expression will produce the correct results in JavaScript.

Can the above regular expression be modified such that the expected output is returned in JavaScript? I accept that this may not be possible in a one-liner.

A: 

It seems JS regexes are a bit weird. I don't have a complete answer, but here's what I found.

Although I thought they did the same thing re.exec() and "string".match(re) behave differently. Exec seems to only return the first match it finds, whereas match seems to return all of them (using /g in both cases).

On the other hand, exec seems to work correctly with ?= in the regex whereas match returns all empty strings. Removing the ?= leaves us with

re = /((.+)(?:.*?\2)+)/g

Using that

"XXinputinputYY".match(re);

returns

["XX", "inputinput", "YY"]

whereas

re.exec("XXinputinputYY");

returns

["XX", "XX", "X"]

So at least with match you get inputinput as one of your values. Obviously, this neither pulls out the longest, nor removes the redundancy, but maybe it helps nonetheless.

One other thing, I tested in firebug's console which threw an error about not supporting $1, so maybe there's something in the $ vars worth looking at.

Sid_M
+1  A: 

Javascript matches only return the first match -- you have to loop in order to find multiple results. A little testing shows this gets the expected results:

function maxRepeat(input) {
 var reg = /(?=((.+)(?:.*?\2)+))/g;
 var sub = ""; //somewhere to stick temp results
 var maxstr = ""; // our maximum length repeated string
 reg.lastIndex = 0; // because reg previously existed, we may need to reset this
 sub = reg.exec(input); // find the first repeated string
 while (!(sub == null)){
  if ((!(sub == null)) && (sub[2].length > maxstr.length)){
   maxstr = sub[2];
  }
  sub = reg.exec(input);
  reg.lastIndex++; // start searching from the next position
 }
 return maxstr;
}

// I'm logging to console for convenience
console.log(maxRepeat("aabcd"));             //aa
console.log(maxRepeat("inputinput"));        //input
console.log(maxRepeat("7inputinput"));       //input
console.log(maxRepeat("inputinput7"));       //input
console.log(maxRepeat("7inputinput7"));      //input
console.log(maxRepeat("xxabcdyy"));          //x
console.log(maxRepeat("XXinputinputYY"));    //input

Note that for "xxabcdyy" you only get "x" back, as it returns the first string of maximum length.

Ben Doom