Hi,

I'm trying to extract a number from a string.

I want to apply a pattern like [0-9]+ to the string "aaaa12xxxx" and get "12" back.

I thought it would be something like:

> grep("[0-9]+","aaa12xxx", value=TRUE)
[1] "aaa12xxx"

And then I figured...

> sub("[0-9]+", "\\1", "aaa12xxxx")
[1] "aaa12xxx"

But I got some form of response doing:

> sub("[0-9]+", "ARGH!", "aaa12xxxx")
[1] "aaaARGH!xxxx"

There's a small detail I'm missing. Please advise :-)

I'm using R version 2.10.1 (2009-12-14)

Thanks !


Comments on the solution

The best solution is to ignore the standard functions and install Hadley Wickham's stringr package to get something that actually makes sense.

Kudos to Marek for figuring out how the standard library worked.

+2  A: 

Maybe

gsub("[^0-9]", "", "aaa12xxxx")
# [1] "12"
Marek
Thanks :) R isn't the most intuitive thing to use.
tovare
A: 

One way would be this:

test <- regexpr("[0-9]+","aaa12456xxx")

Now, notice that regexpr gives you the starting index of the match and, in the "match.length" attribute, its length:

> test
[1] 4
attr(,"match.length")
[1] 5

So you can use that info with the substr function:

substr("aaa12456xxx",test,test+attr(test,"match.length")-1)

I'm sure there is a more elegant way to do this, but this was the fastest way I could find. Alternatively, you can use sub/gsub to strip out what you don't want to leave what you do want.
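For what it's worth, later versions of base R ship a regmatches() helper that does the substr arithmetic for you. It is not available in R 2.10.1, so the sketch below assumes an upgraded R; the variable names x, m and num are only for illustration.

```r
x <- "aaa12456xxx"

# regexpr() records where the first match starts and how long it is
m <- regexpr("[0-9]+", x)

# regmatches() uses that position and length to pull out the matched text
num <- regmatches(x, m)
num
# [1] "12456"
```

The result is still a character string, so wrap it in as.numeric() if you need an actual number.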

Robert
+10  A: 

Use the new stringr package, which wraps all the existing regular expression functions in a consistent syntax and adds a few that are missing:

library(stringr)
str_locate("aaa12xxx", "[0-9]+")
#      start end
# [1,]     4   5
str_extract("aaa12xxx", "[0-9]+")
# [1] "12"
hadley
Consistency of syntax! what a freaking brilliant idea! :)
JD Long
Thanks, I wasn't aware of your package. Thanks for bringing it to my attention.
tovare
Wonderfully useful function Hadley - Thank you!
Tal Galili
+1  A: 

You can use lazy matching from Perl-style regular expressions:

> sub(".*?([0-9]+).*", "\\1", "aaa12xx99",perl=TRUE)
[1] "12"

Stripping out the non-digits gives the wrong result in this case ("1299"), because the string contains two separate numbers.

Jyotirmoy Bhattacharya
You do not need perl=TRUE if you are willing to use the slightly uglier pattern "[^0-9]*([0-9]+).*"
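Spelled out, that variant might look like the sketch below. The parentheses are what make "\\1" refer to something: the pattern consumes the whole string and the replacement keeps only the captured digits. The names x and first_num are just for illustration, and the sketch assumes the string contains at least one run of digits (when nothing matches, sub() returns the input unchanged).

```r
x <- "aaa12xx99"

# [^0-9]*   skips the leading non-digits
# ([0-9]+)  captures the first run of digits as group 1
# .*        swallows everything after it, including the second number
first_num <- sub("[^0-9]*([0-9]+).*", "\\1", x)
first_num
# [1] "12"

# With no digits at all there is no match, so the input comes back as is
sub("[^0-9]*([0-9]+).*", "\\1", "abc")
# [1] "abc"
```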
Jyotirmoy Bhattacharya
+1  A: 

Using strapply in the gsubfn package. strapply is like apply in that the args are object, modifier and function except that the object is a vector of strings (rather than an array) and the modifier is a regular expression (rather than a margin):

library(gsubfn)
x <- c("xy13", "ab 12 cd 34 xy")
strapply(x, "\\d+", as.numeric)
# list(13, c(12, 34))

This says to match one or more digits (\d+) in each component of x, passing each match through as.numeric. It returns a list whose components are vectors of matches of the respective components of x. Looking at the output, we see that the first component of x has one match, which is 13, and the second component of x has two matches, which are 12 and 34. See http://gsubfn.googlecode.com for more info.
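If installing a package is not an option, roughly the same result can be had in base R. This is only a sketch and not how strapply works internally; it assumes a version of R recent enough to have regmatches(), and the names matches and nums are made up for the example.

```r
x <- c("xy13", "ab 12 cd 34 xy")

# gregexpr() finds every match in each string; regmatches() extracts them
# as a list with one character vector per element of x
matches <- regmatches(x, gregexpr("[0-9]+", x))

# Convert each vector of matches to numbers
nums <- lapply(matches, as.numeric)
nums
# [[1]]
# [1] 13
#
# [[2]]
# [1] 12 34
```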

G. Grothendieck