I could solve this using loops, but I am trying to think in vectors so my code will be more R-esque.

I have a list of names. The format is firstname_lastname. I want to get out of this list a separate list with only the first names. I can't seem to get my mind around how to do this. Here's some example data:

t <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")
tsplit <- strsplit(t,"_")

which looks like this:

> tsplit
[[1]]
[1] "bob"   "smith"

[[2]]
[1] "mary" "jane"

[[3]]
[1] "jose"  "chung"

[[4]]
[1] "michael" "marx"   

[[5]]
[1] "charlie" "ivan"

I could get out what I want using loops like this:

for (i in 1:length(tsplit)){
    if (i==1) {t_out <- tsplit[[i]][1]} else{t_out <- append(t_out, tsplit[[i]][1])} 
}

which would give me this:

t_out
[1] "bob"     "mary"    "jose"    "michael" "charlie"

So how can I do this without loops?

+3  A: 

You almost had it. It really is just a matter of

  1. using one of the *apply functions to loop over your existing list (I often start with lapply and sometimes switch to sapply)
  2. adding an anonymous function that operates on one list element at a time
  3. using what you already knew: that it was strsplit(string, splitterm) and that you need the odd [[1]][1] to pick off the first term of the answer
  4. putting it all together, starting with a preferred variable name (as we stay clear of t or c and friends)

which gives

> tlist <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan") 
> fnames <- sapply(tlist, function(x) strsplit(x, "_")[[1]][1]) 
> fnames 
  bob_smith    mary_jane   jose_chung michael_marx charlie_ivan   
      "bob"       "mary"       "jose"    "michael"    "charlie" 
>
Dirk Eddelbuettel
I really have struggled with getting my mind around properly using the apply functions in R. Some days it feels like learning to drive on the opposite side of the road.. it's really not hard but the simple round-a-bouts result in a mental log jam.
JD Long
I do it in a leg-alike fashion. You knew strsplit. You knew you needed an 'anon function' of one parameter for the apply family. Just stick'em together.... Lastly, and not to nit-pick, I posted this before the essentially identical but less verbose answer you accepted as 'the' answer.
Dirk Eddelbuettel
Typo: 'lego-alike', not 'leg-alike'
Dirk Eddelbuettel
Dirk, one of the things I have noticed about being a novice at R is that it is very hard to see that two given problems are similar. I think with expertise comes the ability to choose meaningful analogies quickly. I'm slowly getting to where I can see patterns. I appreciate your comment above about figuring out what the lego bricks are. I'm still growing in my ability to look at a problem and see that I need an anon function, for example.
JD Long
+2  A: 

You could use unlist():

> tsplit <- unlist(strsplit(t,"_"))
> tsplit
 [1] "bob"     "smith"   "mary"    "jane"    "jose"    "chung"   "michael"
 [8] "marx"    "charlie" "ivan"   
> t_out <- tsplit[seq(1, length(tsplit), by = 2)]
> t_out
[1] "bob"     "mary"    "jose"    "michael" "charlie"

There might be a better way to pull out only the odd-indexed entries, but in any case you won't have a loop.
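
One possible alternative for picking the odd-indexed entries, as a minimal sketch: a logical index c(TRUE, FALSE) is recycled along the vector, so it keeps every other element. Like the by = 2 version, this assumes every name splits into exactly two pieces. The variable names are only illustrative.

```r
# Illustrative names; assumes every entry contains exactly one underscore
tlist <- c("bob_smith", "mary_jane", "jose_chung", "michael_marx", "charlie_ivan")
flat <- unlist(strsplit(tlist, "_"))

# c(TRUE, FALSE) is recycled to the length of flat: keep, drop, keep, drop, ...
fnames <- flat[c(TRUE, FALSE)]
fnames
# [1] "bob"     "mary"    "jose"    "michael" "charlie"
```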

brentonk
Not ideal as you need to impose the 'by = 2' to pick the matching elements.
Dirk Eddelbuettel
+3  A: 

I doubt this is the most elegant solution, but it beats looping:

t.df <- data.frame(tsplit)
t.df[1, ]

Converting lists to data frames is about the only way I can get them to do what I want. I'm looking forward to reading answers by people who actually understand how to handle lists.

Matt Parker
I like this. I 'get' the data.frame structure. And since my real data has the same number of items in each "name" then this should not be less memory efficient. Why didn't I think of this!
JD Long
Note that this approach takes a hell of a long time with larger data - see my comment on William Doane's answer.
Matt Parker
+5  A: 

You can use apply (or sapply)

t <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")

f <- function(s) strsplit(s, "_")[[1]][1]

sapply(t, f)

   bob_smith    mary_jane   jose_chung michael_marx charlie_ivan
       "bob"       "mary"       "jose"    "michael"    "charlie"

David

liebke
that is exactly what I was trying to do. thank you. And welcome to Stack Overflow. I've enjoyed reading your blog.
JD Long
Thanks, I enjoy your blog (and tweets) too.
liebke
+7  A: 

How about:

tlist <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")
fnames <- gsub("(_.*)$", "", tlist)
# _.* matches the underscore followed by a string of characters
# the $ anchors the search at the end of the input string
# so, underscore followed by a string of characters followed by the end of the input string

for the RegEx approach?
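
A small sketch of how the pattern behaves, assuming a made-up name with two underscores (it is not in the original data): the match starts at the first underscore and .* is greedy, so everything from the first underscore onward is removed. The mirror-image pattern for the remainder is an addition here, not part of the answer above.

```r
# The last entry is hypothetical, added to show the multi-underscore case
tlist <- c("bob_smith", "mary_jane", "ana_maria_lopez")

# Drop everything from the first underscore to the end of the string
first <- gsub("(_.*)$", "", tlist)
first
# [1] "bob"  "mary" "ana"

# Drop everything up to and including the first underscore
rest <- sub("^[^_]*_", "", tlist)
rest
# [1] "smith"       "jane"        "maria_lopez"
```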

William Doane
+1 for being the fastest. With rep(t, 1e4), my approach took 83.23 seconds (81.41 of which were spent converting to a data frame!), David's took 4.39s, and yours took 0.81. I think it has the best output, too.
Matt Parker
Thanks, Matt... I was wondering about the efficiency of each of these solutions!
William Doane
that's really informative. I had just assumed the strsplit bit was a given. Wow. Good to see another way of doing it.
JD Long
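
Timings like those in Matt Parker's comment above can be checked with system.time. This is a rough sketch, assuming the sapply/strsplit and gsub approaches as written in the answers. The object names are illustrative, and actual timings depend on the machine and R version, so none are shown.

```r
tlist <- c("bob_smith", "mary_jane", "jose_chung", "michael_marx", "charlie_ivan")
big <- rep(tlist, 1e4)  # 50,000 names, as in the comment

# unname() drops the names that sapply attaches when given a character vector
t_split <- system.time(
  a <- unname(sapply(big, function(s) strsplit(s, "_")[[1]][1]))
)
t_regex <- system.time(
  b <- gsub("(_.*)$", "", big)
)

identical(a, b)     # both approaches should give the same first names
t_split["elapsed"]  # seconds for the sapply/strsplit version
t_regex["elapsed"]  # seconds for the gsub version
```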
+2  A: 

And one other approach, based on brentonk's unlist example...

tlist <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")
tsplit <- unlist(strsplit(tlist,"_"))
fnames <- tsplit[seq_along(tsplit) %% 2 == 1]

William Doane
+9  A: 

And one more approach:

t <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")
pieces <- strsplit(t,"_")
sapply(pieces, "[", 1)

In words, the last line extracts the first element of each component of the list and then simplifies it into a vector.

How does this work? Well, you need to realise an alternative way of writing x[1] is "["(x, 1), i.e. there is a function called [ that does subsetting. The sapply call calls this function once for each element of the original list, passing in two arguments, the list element and 1.

The advantage of this approach over the others is that you can extract multiple elements from the list without having to recompute the splits. For example, the last name would be sapply(pieces, "[", 2). Once you get used to this idiom, it's pretty easy to read.
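
As a minimal sketch of the idiom described above, the split is computed once and then reused for both first and last names. The vapply line is an addition for illustration, not part of the answer: it does the same extraction but also checks that each result is a single string.

```r
tlist <- c("bob_smith", "mary_jane", "jose_chung", "michael_marx", "charlie_ivan")
pieces <- strsplit(tlist, "_")   # split once, reuse below

first <- sapply(pieces, "[", 1)  # same as sapply(pieces, function(x) x[1])
last  <- sapply(pieces, "[", 2)

# Stricter variant (an addition here): errors if a result is not one string
first_checked <- vapply(pieces, "[", FUN.VALUE = character(1), 1)

data.frame(first = first, last = last)
```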

hadley
Hadley, I see this works, but I haven't the slightest idea why it works. Is there an implied "]" somehow? Can you elaborate a bit? My R-foo is clearly weak.
JD Long
I was a little shocked by this, too, JD... so after a little playing, I see that: > "["(pieces,1) yields [[1]] [1] "bob" "smith" ... an interesting notation, to be sure, and very useful!
William Doane
Just as a side note, if you are going to split on fixed strings instead of regexps, you might want to consider passing `fixed=TRUE` to `strsplit`. I've found that this can have a large impact on the speed of `strsplit`.
Jonathan Chang
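
A short sketch of the fixed = TRUE suggestion, using the example data from the question. For an underscore, the regex and fixed versions return the same pieces. The "bob.smith" string is hypothetical and only shows a separator where the two differ, because a dot is a regex metacharacter. No timings are shown, since any speed difference depends on the data and R version.

```r
tlist <- c("bob_smith", "mary_jane", "jose_chung", "michael_marx", "charlie_ivan")

pieces_regex <- strsplit(tlist, "_")                # "_" treated as a regex
pieces_fixed <- strsplit(tlist, "_", fixed = TRUE)  # "_" treated literally
same <- identical(pieces_regex, pieces_fixed)
same
# [1] TRUE

# Hypothetical input: with fixed = TRUE the dot is a literal separator
dot_split <- strsplit("bob.smith", ".", fixed = TRUE)[[1]]
dot_split
# [1] "bob"   "smith"
```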
All operators in R are functions - infix operators can be written in prefix notation. TRUE || FALSE can be written as `||`(TRUE,FALSE), a[b] can be written as `[`(a,b), and even assignment operators a[b] <- TRUE is `[<-`(a,b,value=TRUE). R is magic.
Stephen
Not sure if it came out correctly there but there should be quotes (I used backtick but regular quotes should also work) around the prefix functions.
Stephen
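
The prefix forms mentioned in these comments can be tried directly. A minimal sketch with illustrative values: each line compares the usual infix spelling with the backtick-quoted function call.

```r
x <- c("bob", "smith")

# Subsetting: x[1] is a call to the function `[`
sub_same <- identical(x[1], `[`(x, 1))

# Logical or: TRUE || FALSE is a call to the function `||`
or_same <- identical(TRUE || FALSE, `||`(TRUE, FALSE))

# Replacement: a[2] <- "jones" corresponds to `[<-`(a, 2, value = "jones")
a <- c("bob", "smith")
a[2] <- "jones"
b <- `[<-`(c("bob", "smith"), 2, value = "jones")
assign_same <- identical(a, b)

c(sub_same, or_same, assign_same)
# [1] TRUE TRUE TRUE
```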
thanks for posting an explanation. That makes sense to me now. The [ function was totally new to me.
JD Long
I love that this works, and I love Stephen's comment "R is magic". It's so true !
PaulHurleyuk
+3  A: 

what about:

t <- c("bob_smith","mary_jane","jose_chung","michael_marx","charlie_ivan")

sub("_.*", "", t)
Karsten
that totally works! Thanks.
JD Long