I'm trying to figure out how to filter duplicates out of a comma-separated string with a regular expression. I'd like to do this in JavaScript, but I'm getting caught up on how to use back-references.

For example:

1,1,1,2,2,3,3,3,3,4,4,4,5

Becomes:

1,2,3,4,5

Or:

a,b,b,said,said, t, u, ugly, ugly

Becomes:

a,b,said,t,u,ugly
A: 

Here's an example:

s/,([^,]+),\1/,$1/g;

Perl regex substitution, but should be convertible to JS-style by anyone who knows the syntax.
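
As a rough sketch of that conversion, the Perl substitution above might look like this in JavaScript. `collapseOnce` is a literal translation. `collapseRuns` is a variant that is not part of this answer: it assumes we anchor on start-of-string or a comma, collapse a whole run at once, and check the token end with a lookahead. Both function names are made up for illustration.

```javascript
// Literal translation of s/,([^,]+),\1/,$1/g.
// A single pass only merges pairs and never touches the first token.
function collapseOnce(s) {
  return s.replace(/,([^,]+),\1/g, ",$1");
}

// Hypothetical variant (an assumption, not from the answer):
// (^|,) lets the first token take part, (?:,\2)+ swallows the whole run,
// and (?=,|$) makes sure the last repeat is a complete token.
function collapseRuns(s) {
  return s.replace(/(^|,)([^,]+)(?:,\2)+(?=,|$)/g, "$1$2");
}

console.log(collapseOnce("1,1,1,2,2,3,3,3,3,4,4,4,5")); // 1,1,2,3,3,4,4,5
console.log(collapseRuns("1,1,1,2,2,3,3,3,3,4,4,4,5")); // 1,2,3,4,5
```

Neither version trims whitespace or removes duplicates that are not adjacent.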

Anon.
Note that this doesn't quite work correctly around the start of the string - I could fix that, but that would obscure how the core of the regex itself works. Which is a bad thing, because it ends up encouraging people to copy-paste without understanding.
Anon.
+6  A: 

Why use a regex when you can do it in plain JavaScript code? Here is some sample code (messy, though):

var input = 'a,b,b,said,said, t, u, ugly, ugly';
var splitted = input.split(',');
var collector = {};
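for (var i = 0; i < splitted.length; i++) {
   // trim leading and trailing whitespace from each token
   var key = splitted[i].replace(/^\s*/, "").replace(/\s*$/, "");
   collector[key] = true; // repeated tokens overwrite the same property
}
var out = [];
for (key in collector) {
   out.push(key);
}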
var output = out.join(','); // output will be 'a,b,said,t,u,ugly'

P.S.: the regexes in the for-loop only trim the tokens; they play no part in making them unique.
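
For comparison, here is a minimal sketch of the same split-and-collect idea using a `Set`, assuming an ES2015-capable environment (which this answer does not require). The function name `uniqueCsv` is hypothetical.

```javascript
// Split on commas, trim each token, and let a Set drop the repeats.
// A Set iterates in insertion order, so first-seen order is preserved.
function uniqueCsv(input) {
  const tokens = input.split(',').map(function (t) {
    return t.trim();
  });
  return Array.from(new Set(tokens)).join(',');
}

console.log(uniqueCsv('a,b,b,said,said, t, u, ugly, ugly')); // a,b,said,t,u,ugly
```

One difference worth knowing: with a plain object as the collector, integer-like keys such as `2` are enumerated before other keys in current engines, so the output order can differ from the input order. The `Set` version keeps tokens in the order they first appear.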

Lukman
+1 this has the added benefit of removing duplicates even if they aren't contiguous. That would be exceedingly difficult, if not impossible, to do in a regex.
Jeremy Wall
Regular expressions are often far more elegant for the problems they can solve easily, though. Which is preferable - a dozen lines of code, or a dozen characters of regex?
Anon.
I would recommend checking `collector.hasOwnProperty(key)` inside your `for...in` loop, because if someone extends `Object.prototype`, the inherited properties will show up in the loop and break your code.
CMS
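
A minimal sketch of what that guard might look like, wrapped in a hypothetical `uniqueTokens` function (the name and the wrapper are assumptions for illustration, not the answerer's code):

```javascript
function uniqueTokens(input) {
  var parts = input.split(',');
  var collector = {};
  var out = [];
  var i, key;
  for (i = 0; i < parts.length; i++) {
    // trim leading and trailing whitespace
    key = parts[i].replace(/^\s+|\s+$/g, '');
    collector[key] = true;
  }
  for (key in collector) {
    // skip anything inherited from Object.prototype
    if (collector.hasOwnProperty(key)) {
      out.push(key);
    }
  }
  return out.join(',');
}

// Simulate a library that extends Object.prototype.
Object.prototype.injected = function () {};
console.log(uniqueTokens('a,b,b, b')); // a,b
delete Object.prototype.injected;
```

Without the `hasOwnProperty` check, the enumerable `injected` property would be pushed into the result as well.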
Anon, fair point, but processing CSV is not one of those problems. Also, elegance is very subjective in programming.
Ash
+1  A: 

If you insist on RegExp, here's an example in JavaScript:

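"1,1,1,2,2,3,3,3,3,4,4,4,5".replace (
    // the lookahead (?=,|$) leaves the trailing comma for the next match;
    // consuming it would make every other run of duplicates get skipped
    /(^|,)([^,]+)(?:,\2)+(?=,|$)/ig, 
    function ($0, $1, $2) 
    { 
        return $1 + $2; 
    }
);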

To handle trimming of whitespace, modify slightly:

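"a,b,b,said,said, t, u, ugly, ugly".replace (
    // the token starts at a non-space character and is matched lazily,
    // so surrounding whitespace stays outside the captured group
    /(^|,)\s*([^,\s][^,]*?)\s*(?:,\s*\2\s*)+(?=,|$)/ig, 
    function ($0, $1, $2) 
    { 
        return $1 + $2; 
    }
); // "a,b,said, t, u,ugly" (only collapsed runs are trimmed)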

That said, it seems better to tokenise via split and handle duplicates.
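
A minimal sketch of that split-based alternative, assuming we only want to collapse adjacent duplicates (the same behaviour the regex aims for). The function name `dedupeAdjacent` is hypothetical.

```javascript
// Split into tokens, trim them, and keep a token only when it differs
// from the one immediately before it.
function dedupeAdjacent(input) {
  var tokens = input.split(',').map(function (t) {
    return t.trim();
  });
  return tokens.filter(function (t, i) {
    return i === 0 || t !== tokens[i - 1];
  }).join(',');
}

console.log(dedupeAdjacent("1,1,1,2,2,3,3,3,3,4,4,4,5"));         // 1,2,3,4,5
console.log(dedupeAdjacent("a,b,b,said,said, t, u, ugly, ugly")); // a,b,said,t,u,ugly
console.log(dedupeAdjacent("a,b,a,a"));                           // a,b,a
```

The last call shows the difference from a collect-all-unique approach: a value that reappears later, after a different value, is kept.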

K Prime
A: 

I don't use Regular Expressions for that.

Here's the function I use. It accepts a string containing comma separated values and returns an array of unique values regardless of position in the original string.

Note: If you pass a CSV string containing quoted values, `split` will not treat commas inside quoted values any differently. So if you want to handle real CSV, you are better off using a third-party CSV parser.
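
To illustrate the caveat, here is a small sketch with a made-up input string in which one field is quoted and contains a comma:

```javascript
// The quoted field "Smith, John" is one CSV value,
// but split(",") cuts it in two at the embedded comma.
var pieces = '"Smith, John",42'.split(",");

console.log(pieces.length); // 3
console.log(pieces);        // [ '"Smith', ' John"', '42' ]
```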

function GetUniqueItems(s)
{
    var items=s.split(",");

    var uniqueItems={};

    for (var i=0;i<items.length;i++)
    {           
        var key=items[i];
        var val=items[i];
        uniqueItems[key]=val;
    }

    var result=[];

    for(key in uniqueItems)
    {
        // Assign to output result field using hasOwnProperty so we only get 
        // relevant items
        if(uniqueItems.hasOwnProperty(key))
        {
            result[result.length]=uniqueItems[key];
        }
    }    
    return result;
}
Ash
A: 

With a JavaScript regex:

x="1,1,1,2,2,3,3,3,3,4,4,4,5"

while(/\b(\d+),\1\b/.test(x))
    x=x.replace(/\b(\d+),\1\b/g,"$1")

1,2,3,4,5


x="a,b,b,said,said, t, u, ugly, ugly"

while(/\s*([^,]+),\s*\1(?=,|$)/.test(x))
    x=x.replace(/\s*([^,]+),\s*\1(?=,|$)/g,"$1")

a,b,said, t, u,ugly

Not well tested, let me know if there is any issue.

S.Mark