Hello,

I have a column of data in an R data frame that has values such as

Blue-#105
Green-#8845
Yellow-#5454
Blue-#999

I want to remove the trailing number part (starting at -#) so that Blue-#999 and Blue-#105 are considered the same thing when plotting. How could I accomplish this?

Thanks so much

+7  A: 

Use regular expressions:

> DF <- data.frame(col=c("Blue-#105", "Green-#8845", "Blue-#999"))
> DF
          col
1   Blue-#105
2 Green-#8845
3   Blue-#999
> DF$col <- gsub("-\\#.*", "", DF$col)
> DF
    col
1  Blue
2 Green
3  Blue
> 

Here the pattern matches the first -# in each string together with everything after it, and the match gets replaced by the empty string, or in other words, removed. The "everything after it" part is .* in regular expression lingo: any char (the dot) repeated as many times as it fits (the star). The # is escaped here, though that is not strictly required, since # is not a special character in R's default regular expression syntax.
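A minimal sketch of the same replacement on a plain character vector, assuming a hypothetical vector `x` that holds the values from the question. It also compares the escaped and unescaped forms of the pattern, and `sub()` versus `gsub()`, which should agree here because the pattern can match at most once per string.

```r
# Hypothetical input: the values from the question
x <- c("Blue-#105", "Green-#8845", "Yellow-#5454", "Blue-#999")

# Remove "-#" and everything after it
stripped <- gsub("-\\#.*", "", x)

# The same pattern without escaping the "#"
stripped_unescaped <- gsub("-#.*", "", x)

# sub() replaces only the first match, which is all that is needed here
stripped_sub <- sub("-#.*", "", x)

print(stripped)

# Both "Blue" rows now fall into the same group
print(table(stripped))
```

After the replacement, both Blue rows count as one group, which is what the plotting in the question needs.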

Dirk Eddelbuettel
+3  A: 

Use the sub or gsub function. For your example you could do something like:

newcolors <- sub("^([^-]*)-.*$", "\\1", oldcolors )

This assumes that the colors are in a vector 'oldcolors' and puts the results into newcolors. The pattern starts at the beginning of the string (^), then matches 0 or more characters that are not dashes ([^-]*); the parens around that say to save what is matched. Then it matches a dash followed by any further characters (.*) until the end of the string ($). The matched portion (the entire string) is then replaced by whatever was matched within the parens (the color).
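A short sketch of the capture-group approach, assuming a hypothetical `oldcolors` vector. The last value is an addition that is not in the question; it has no dash, so the pattern does not match and `sub()` returns it unchanged.

```r
# Hypothetical input; the last value has no dash at all
oldcolors <- c("Blue-#105", "Green-#8845", "Red")

# Replace the whole string with the captured group:
# everything before the first dash
newcolors <- sub("^([^-]*)-.*$", "\\1", oldcolors)

print(newcolors)
```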

Greg Snow
Hey Greg, I like how concise your answer is, but I am getting an error: unexpected ',' in "newdatafr <- gsub("^([^-]*)-.*$")," newdatafr is equivalent to newcolors in your example.
Eric Brotto
@Eric : then I think you should copy-paste better. It works fine for me, and the error you provide does not show the same code as Greg posted here.
Joris Meys
FWIW my `gsub()` call is shorter and more concise than the `sub()` call shown here. Otherwise, they are of course essentially equivalent.
Dirk Eddelbuettel
Yes, the two regexes are equivalent for the example data given. The difference is that Dirk's focuses on what to throw away and mine focuses on what to keep. Which is better would depend on possible differences in future data.
Greg Snow
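To illustrate how the two patterns can diverge on data that goes beyond the example, here is a small sketch. The values in `x` are hypothetical and were chosen only to show the difference: one has a dash inside the color name, and one has a dash with no `#` after it.

```r
# Hypothetical values that go beyond the example data
x <- c("Blue-#105", "Dark-Blue-#105", "Blue-105")

# Dirk's pattern: drop "-#" and everything after it
drop_suffix <- gsub("-\\#.*", "", x)

# Greg's pattern: keep everything before the first dash
keep_prefix <- sub("^([^-]*)-.*$", "\\1", x)

print(data.frame(x, drop_suffix, keep_prefix))
```

The first pattern leaves a name like Dark-Blue intact but does nothing when the `#` is missing; the second always cuts at the first dash, whatever follows it.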