Dear SOFers,

I would like to cut a vector of values ranging from 0 to 70 into x categories, and I need the upper limit of each category. So far I have used cut() and am trying to extract the limits from the factor levels. I have a vector of levels, and from each level I would like to extract the second number. How can I extract the value between the comma and the ] (which is the number I'm interested in)?

I have:

> levels(bins)
 [1] "(-0.07,6.94]" "(6.94,14]"    "(14,21]"      "(21,28]"      "(28,35]"     
 [6] "(35,42]"      "(42,49]"      "(49,56]"      "(56,63.1]"    "(63.1,70.1]" 

and would like to get:

[1] 6.94 14 21 28 35 42 49 56 63.1 70.1

Or is there a better way of calculating the upper bounds of categories?

+4  A: 

This could be one solution

k <- sub("^.*\\,","", levels(bins))
as.numeric(substr(k,1,nchar(k)-1))

gives

 [1]  6.94 14.00 21.00 28.00 35.00 42.00 49.00 56.00 63.10 70.10
gd047
So if I understand this correctly, the pattern string says "omit everything left of comma and trim spaces"?
Roman Luštrik
The first command replaces everything up to and including the "," with nothing (""). The second one takes a substring of length n-1 (to omit the trailing "]").
gd047
Actually, the '\\' in "^.*\\," is unnecessary. A full-regexp approach, which I wouldn't recommend if you are not familiar with regexps, is just: sub(".*,(.*)]","\\1", levels(bins))
kohske
@kohske you are right, I should have used grouping
gd047
+1  A: 

If you want the exact values of the breaks, you should compute them yourself, because cut rounds the interval limits in its labels:

x <- seq(0,1,by=.023)
levels(cut(x, 4))
# [1] "(-0.000989,0.247]" "(0.247,0.494]"     "(0.494,0.742]"     "(0.742,0.99]"     
levels(cut(x, 4, dig.lab=10))
# [1] "(-0.000989,0.2467555]" "(0.2467555,0.4945]"    "(0.4945,0.7422445]"   
# [4] "(0.7422445,0.989989]" 

You can look at the code of cut.default to see how the breaks are computed:

if (length(breaks) == 1L) {
    if (is.na(breaks) | breaks < 2L) 
        stop("invalid number of intervals")
    nb <- as.integer(breaks + 1)
    dx <- diff(rx <- range(x, na.rm = TRUE))
    if (dx == 0) 
        dx <- abs(rx[1L])
    breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000, 
        length.out = nb)
}

So an easy way is to grab this code and put it into a function:

compute_breaks <- function(x, breaks) {
    # a single number means "number of intervals": build equal-width breaks
    if (length(breaks) == 1L) {
        if (is.na(breaks) || breaks < 2L) 
            stop("invalid number of intervals")
        nb <- as.integer(breaks + 1)
        dx <- diff(rx <- range(x, na.rm = TRUE))
        if (dx == 0) 
            dx <- abs(rx[1L])
        breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000, 
            length.out = nb)
    }
    # a vector of breaks is returned unchanged
    breaks
}

Result is

compute_breaks(x,4)
# [1] -0.000989  0.246755  0.494500  0.742244  0.989989
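
To get the upper bounds asked for in the question, one option is to drop the first break and to pass the same breaks to cut explicitly. This is only a sketch: the names n_bins, brk and upper are illustrative, x is assumed not to be constant, and the break computation mirrors the cut.default snippet quoted above, which may differ between R versions.

```r
x <- seq(0, 1, by = 0.023)
n_bins <- 4

# Equal-width breaks, widened by 0.1% of the range at both ends
# (same construction as the cut.default snippet quoted above)
rx  <- range(x, na.rm = TRUE)
dx  <- diff(rx)
brk <- seq(rx[1] - dx / 1000, rx[2] + dx / 1000, length.out = n_bins + 1)

# Every break except the first is the upper bound of one bin
upper <- brk[-1]

# Passing the breaks explicitly keeps the bins consistent with `upper`
bins <- cut(x, breaks = brk)

upper
table(bins)
```

Because the bounds come from the numeric breaks rather than from the labels, they are not affected by the rounding controlled by dig.lab.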
Marek