I need to pick up some numbers from lines with irregular length, like this:

AAAAAAAAA 250.00
BBB 240.00
CCCCCCC 13.00

I need to capture 250.00, 240.00 and 13.00, but since both the numeric and character strings are irregular in length, I can't use "substr" for that. I think regex may be the solution, but I don't know much about it.

Can anyone help? Thanks!

EDIT #1

Thanks to Joshua for the speedy reply, but I'm afraid that is not exactly my case, as spaces and numbers may also be present in the AAAAAAAA, BBB, CCCCC strings.

Thanks again.

EDIT #2

Joshua, I've tried it, but it still fails. Here is a sample line:

 01 JUN 2003 02 JUN 2002 OCTOPUS CARDS LTD HONG KONG HK 250.00

Hope it helps, and thanks again.

EDIT #3

Joshua thanks again.

Now I have a more complicated case:

 01 JUN 2003 02 JUN 2002 OCTOPUS CARDS LTD HONG KONG HK 834591283405347 250.00

The "OCTOPUS CARDS LTD HONG KONG HK" and "834591283405347" are two separate columns, and I want to extract them as well. How can I do that?

A: 

If your actual problem is like your example:

> lapply(c("AAAAAAA 250.00","BBB 240.00"), function(x) strsplit(x, " ")[[1]][2]) 
[[1]]
[1] "250.00"

[[2]]
[1] "240.00"

EDIT 1: Since your actual problem isn't exactly like your example: ;-)

> y <- c("AAAAAAAAAAA 250.00","BBBBB 240.00","CC CC 120.00")
> FUN <- function(x) substr(x,regexpr("[0-9]",x),nchar(x))
> lapply(y,FUN)
[[1]]
[1] "250.00"

[[2]]
[1] "240.00"

[[3]]
[1] "120.00"

EDIT 2: Change FUN to:

> FUN <- function(x) tail(strsplit(x," ")[[1]],1)
Joshua Ulrich
When using `strsplit` it is always better to do `lapply(strsplit(vect), function(x) {...})` rather than `lapply(vect, function(x) {strsplit(x)[[1]]; ...})`
Marek
Indeed it is. Thanks for pointing that out.
Joshua Ulrich
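
To illustrate the pattern from the comment above, here is a minimal sketch. It assumes the amount is always the last space-separated field; the names `parts` and `last_field` are only illustrative.

```r
y <- c("AAAAAAAAAAA 250.00", "BBBBB 240.00", "CC CC 120.00")

# Split the whole vector once; strsplit() is already vectorized
parts <- strsplit(y, " ")

# Then apply a small function to each pre-split element
last_field <- vapply(parts, function(x) x[length(x)], character(1))
last_field
```

Here `vapply` is used instead of `lapply` only so that the result comes back as a plain character vector.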
A: 

I'd like a little more info to be sure. But is it always "arbitrarytext [space] number"?

If so you could do something like

> a <- read.csv("~/Desktop/test.txt", sep=" ", header=FALSE)
> a
         V1  V2
1 AAAAAAAAA 250
2       BBB 240
3   CCCCCCC  13

Assuming you've saved the text to a file (~/Desktop/test.txt). The numerical strings are automatically coerced to numbers.

> sum(a$V2)
[1] 503
I82Much
+2  A: 

If you have only two columns in each of your lines of data, you can use read.table() with textConnection():

x = "AAAAAAAAA 250.00
BBB 240.00
CCCCCCC 13.00"

data = read.table(textConnection(x))

data

             V1  V2
    1 AAAAAAAAA 250
    2       BBB 240
    3   CCCCCCC  13

and

data[2]
   V2
1 250
2 240
3  13
Greg
+2  A: 

Unfortunately tail is relatively slow. Actually indexing the final item is much faster.

FUN <- function(x) {ss <- strsplit(x,' ')[[1]];ss[length(ss)]}

On my machine this is well over twice as fast as the tail command.

y <- c("AAAAAAAAAAA 250.00",
    "01 JUN 2003 02 JUN 2002 OCTOPUS CARDS LTD HONG KONG HK 5.13",
    "01 JUN 2003 02 JUN 2002 OCTOPUS CARDS LTD HONG KONG HK 834591283405347 50.00")    

#make y bigger so that there's something to test
y <- rep(y, 1e5)

#testing tail
FUN <- function(x) {tail(strsplit(x,' ')[[1]],1)}
system.time( lapply(y,FUN) )       
   user  system elapsed 
 22.108   0.110  22.069 

#testing indexing
FUN <- function(x) {ss <- strsplit(x,' ')[[1]];ss[length(ss)]}    
system.time( lapply(y,FUN) )
  user  system elapsed 
 9.396   0.037   9.372 

But even more speed is gained by pulling strsplit out of the function and taking advantage of the fact that it is already vectorized. (The whole point of the apply family is not to replace looping but to allow simple syntax and to use vectorized commands as much as possible. The simplest possible functions should go into lapply and the like.)

#first let strsplit do its own vectorized magic
s <- strsplit(y, ' ')
#then define a simpler function
FUN <- function(x) x[length(x)]
lapply(s, FUN)

To time test this it's necessary to keep the strsplit inside the timing routine to make it fair

system.time( {s <- strsplit(y, ' ');lapply(s, FUN)} )

   user  system elapsed 
  5.281   0.048   5.305 

(I'm pretty sure I'm missing something on indexing lists and my function should be even simpler.)

One more thing, though. It would have sped things up all the way through, but I'll just add it here: strsplit() has a fixed argument. It works much faster if you set that to TRUE when you aren't using a regular expression.

system.time( {s <- strsplit(y, ' ', fixed = TRUE); lapply(s, FUN)} )
   user  system elapsed 
  1.256   0.007   1.253 

If you're doing this on a large dataset, or you have to do it frequently on even moderately sized datasets, you really ought to be using this last method. It's roughly 18x faster than the original tail version (1.25 s versus 22.07 s elapsed).

Here's the final solution that can just be copied over to accomplish the whole task, assuming that y is a vector of character strings formatted just as expected in Edit #3: the last item is a money value to save and the second-to-last item is some kind of ID value.

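s <- strsplit(y, ' ', fixed = TRUE)
#vapply returns character vectors; lists from lapply would be spread across columns by data.frame()
moneyVal <- vapply(s, function(x) x[length(x)], character(1))
   idVal <- vapply(s, function(x) x[length(x)-1], character(1))
#seq_len() avoids the x[1:0] trap if a line has only two fields
 restOfY <- vapply(s, function(x) paste(x[seq_len(length(x)-2)], collapse = ' '), character(1))
#These three values can be combined into a data frame
df <- data.frame(restOfY, idVal, moneyVal)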
John
Very nice answer. It should be the "accepted" answer.
Joshua Ulrich
+1 for `fixed` version.
Marek
One note: you should include post-edits: `unlist(lapply(s, FUN))` or faster version `unlist(lapply(s, FUN),recursive=FALSE, use.names=FALSE)`.
Marek
It's probably good to point that out, but the author doesn't say anything about the output format, so I wasn't sure a list wouldn't be best. In keeping with the speed assessment, unlist() seems to add no measurable time with a list of this size. Oh, and while commenting on the final format, an as.numeric() might be handy as well, since the result is in character format.
John
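
A short sketch of the post-processing discussed in these comments, using a two-line sample vector chosen here purely for illustration:

```r
s <- strsplit(c("AAAAAAAAAAA 250.00", "CC CC 120.00"), " ", fixed = TRUE)
FUN <- function(x) x[length(x)]

# Flatten the list to a character vector, then convert to numeric
amounts <- as.numeric(unlist(lapply(s, FUN), use.names = FALSE))
sum(amounts)
```

Once the values are numeric they can be summed or compared directly, which the character results cannot.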
A: 

Using gsub:

y <- c(
    "AAAAAAAAAAA 250.00",
    "BBBBB 240.00",
    "CC CC 120.00",
    "01 JUN 2003 02 JUN 2002 OCTOPUS CARDS LTD HONG KONG HK 250.00",
    "01 JUN 2003 02 JUN 2002 OCTOPUS CARDS LTD HONG KONG HK 834591283405347 250.00"
)

gsub("(^.* )([0-9\\.]*$)", "\\2", y)
# [1] "250.00" "240.00" "120.00" "250.00" "250.00"

or

gsub("^.* ", "", y)
# [1] "250.00" "240.00" "120.00" "250.00" "250.00"

Check also this thread about dealing with substrings.

Marek
That will fail on negative numbers, or anything with exponential notation (1.23E+10) in it.
Spacedman
@Spacedman The first one yes, but the second one doesn't. The first could be extended to handle negative or exponential numbers.
Marek
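
One possible way to extend the first pattern, as a sketch rather than a definitive regex: it assumes the number is the last space-separated field and allows an optional sign and an optional exponent. The sample vector `y2` is made up for illustration.

```r
y2 <- c("AAA -250.00", "BBB 1.23E+10", "CC CC 120.00")

# Optional sign, digits and dots, then an optional exponent such as E+10
pattern <- "^.* ([-+]?[0-9.]+([eE][-+]?[0-9]+)?)$"
last_num <- sub(pattern, "\\1", y2)
last_num
as.numeric(last_num)
```

The character class `[0-9.]+` is still permissive (it would accept "1.2.3"), so a stricter pattern may be needed for untrusted input.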
A: 

You should look at the gsubfn package, especially the strapply function.

The strapply function focuses on what you want to find (not what you want to split on, or what you want to remove).

A simple case is to just give it the pattern to match a number and it will give you all the numbers from the string, or you can include '$' to only get the numbers at the end of the string, or modify the pattern to match exactly what you want.
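
The same "match what you want" idea can be sketched in base R with regmatches(). This is not the gsubfn API, only an analogous approach that needs no extra package; the pattern and variable names are assumptions made for illustration.

```r
y <- c("AAAAAAAAAAA 250.00",
       "01 JUN 2003 02 JUN 2002 OCTOPUS CARDS LTD HONG KONG HK 834591283405347 250.00")

num <- "[0-9]+(\\.[0-9]+)?"

# Every number anywhere in each string (a list, one element per string)
all_nums <- regmatches(y, gregexpr(num, y))

# Only the number at the end of each string, thanks to the '$' anchor
end_num <- regmatches(y, regexpr(paste0(num, "$"), y))
end_num
```

Note that regmatches() with regexpr() silently drops strings that have no match, so the result can be shorter than the input.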

Greg Snow