ansaurus

Question

R only: Frequencies of all subsequences of size 3 in a given 0-1 sequence?

Answer 1

+4 A:

Well, it seems like you would first need to generate n-tuples from your vector. The following function should accomplish that:

makeTuples <- function( x, n ){

  # Very inefficient way to loop... but what the heck
  tuples <- list()

  for( i in 1:n ){

    tuples[[i]] <- x[i:(length(x)-n+i)]

  }

  return(tuples)

}
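To make the shape of the result concrete, here is a minimal sketch that runs makeTuples() on a short vector. The vector x is made up for illustration and is not from the question. Element i of the returned list holds the i-th member of every tuple, so reading across the list at a fixed index gives one tuple.

```r
# Same function as above, repeated so the sketch runs on its own.
makeTuples <- function(x, n) {
  tuples <- list()
  for (i in 1:n) {
    tuples[[i]] <- x[i:(length(x) - n + i)]
  }
  tuples
}

# Hypothetical example input, not taken from the original question.
x <- c(1, 0, 0, 1, 1)
tuples <- makeTuples(x, 3)

# tuples[[1]] holds the first element of each 3-tuple, tuples[[2]] the
# second, and tuples[[3]] the third.
print(tuples)

# The j-th tuple is recovered by taking element j from each list entry.
firstTuple <- sapply(tuples, `[`, 1)
lastTuple <- sapply(tuples, `[`, 3)
print(firstTuple)
print(lastTuple)
```

A vector of length 5 yields 5 - 3 + 1 = 3 overlapping tuples, which is why each list entry has three elements.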

Then you could feed the results of makeTuples() to table() using do.call(), where s is the 0-1 sequence being analysed:

do.call( table, makeTuples(s,3) )

, ,  = 0


    0 1
  0 4 1
  1 3 1

, ,  = 1


    0 1
  0 2 1
  1 0 1

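This works because makeTuples() returns the tuples as a list of vectors, one per position, and do.call() passes each vector to table() as a separate argument. The output isn't quite as nice as you wanted, but you could write a function to reformat each slice, for example turning the `, , = 0` slice above into: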

     0 1
  00 4 1
  01 3 1

It would require looping over the outer n-2 dimensions of the n-dimensional array returned by table, creating row names and concatenating things together.
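One possible way to do that flattening without an explicit loop is sketched below. It relies on the array being stored in column-major order, so reshaping it into a matrix with one column per level of the last dimension leaves one row per combination of the first n-1 dimensions. The helper name flattenTable is hypothetical, the sequence s is the one used elsewhere in this thread, and rows are labelled by the first n-1 elements with the last element as the column.

```r
# Sequence and a compact equivalent of makeTuples(), repeated here so the
# sketch runs on its own.
s <- c(1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0)
makeTuples <- function(x, n) {
  lapply(seq_len(n), function(i) x[i:(length(x) - n + i)])
}

# flattenTable() is a hypothetical helper name, not part of base R.
flattenTable <- function(tab) {
  d <- dim(tab)
  n <- length(d)
  labels <- unname(dimnames(tab))
  # All combinations of the first n-1 dimensions. expand.grid() varies its
  # first argument fastest, matching the column-major layout of the array.
  prefixes <- do.call(expand.grid,
                      c(labels[-n], list(stringsAsFactors = FALSE)))
  # One row per prefix, one column per level of the last dimension.
  out <- matrix(as.vector(tab), ncol = d[n])
  rownames(out) <- apply(prefixes, 1, paste, collapse = "")
  colnames(out) <- labels[[n]]
  out
}

flat <- flattenTable(do.call(table, makeTuples(s, 3)))
print(flat)
```

Rows come out in the order produced by expand.grid(), so they may need sorting if a particular order matters.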

Update

So, I was just sitting in a stochastic processes class when I figured out a more or less straightforward way to produce the output you want without trying to unwind the output of table(). First you will need a function that generates all possible permutations of n selections (with repetition allowed) from your population. The generation of permutations can be done with expand.grid(), but it needs a little sugar-coating:

permute <- function( population, n ){

  permutations <- do.call( expand.grid, rep( list(population), n ) )

  permutations <- apply( permutations, 1, paste, collapse = '' )

  return( permutations )

}
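A quick check of what permute() returns, as a minimal sketch on the population c(0, 1). Note that expand.grid() varies its first column fastest, which determines the order of the strings.

```r
# Same function as above, repeated so the sketch runs on its own.
permute <- function(population, n) {
  permutations <- do.call(expand.grid, rep(list(population), n))
  apply(permutations, 1, paste, collapse = "")
}

pairs <- permute(c(0, 1), 2)
print(pairs)

# There are length(population)^n strings in total.
triples <- permute(c(0, 1), 3)
print(length(triples))
```

For a population of size 2 this gives 4 strings of length 2 and 8 strings of length 3.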

The basic idea is to iterate over the list of permutations and count the number of tuples that match the given permutation. Since you want the results split out into a table, we should select a permutation of n-1 elements from the population and let the last position form the columns of the table. Here's a function that takes a permutation of size n-1, a list of tuples, and the population the tuples were drawn from and produces a named vector of match counts:

countFrequency <- function(permutation,tuples,population){

  permutations <- paste( permutation, population, sep = '' )

  # Inner lapply applies the equality operator `==` to each
  # permutation and returns a list of TRUE/FALSE vectors.
  # Outer lapply sums the number of TRUE values in each vector. 
  frequencies <- lapply(lapply(permutations,`==`,tuples),sum)

  names( frequencies ) <- as.character( population )

  return( unlist(frequencies) )

}
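As an illustration of a single call, the sketch below counts how often the prefix "10" is followed by each member of the population. The tuple strings in exampleTuples are made up for the example.

```r
# Same function as above, repeated so the sketch runs on its own.
countFrequency <- function(permutation, tuples, population) {
  permutations <- paste(permutation, population, sep = "")
  frequencies <- lapply(lapply(permutations, `==`, tuples), sum)
  names(frequencies) <- as.character(population)
  unlist(frequencies)
}

# Hypothetical tuples, already collapsed to strings.
exampleTuples <- c("100", "000", "001", "100")

# Builds the candidates "101" and "100", then counts matches for each.
counts <- countFrequency("10", exampleTuples, population = c(1, 0))
print(counts)
```

The names of the result are the candidate last elements, which later become the column headers of the frequency table.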

Finally, all three functions can be combined into a bigger function that takes a vector, splits it into n-tuples and returns a frequency table. The final aggregation is done with ldply() from Hadley Wickham's plyr package, because it keeps track of which permutation each row of the output corresponds to:

permutationFrequency <- function( vector, n, population = unique( vector ) ){

  # Split the vector into tuples.
  tuples <- makeTuples( vector, n )

  # Coerce and compact the tuples to a vector of strings.
  tuples <- do.call(cbind,tuples)
  tuples <- apply( tuples, 1, paste, collapse = '' )

  # Generate permutations of n-1 elements from the population.
  # Turn into a named list for ldply() to work its magic.
  permutations <- permute( population, n-1 )
  names( permutations ) <- permutations

  frequencies <- ldply( permutations, countFrequency,
    tuples = tuples, population = population )

  return( frequencies )

}

And there you go:

require( plyr )
permutationFrequency( s, 2 )
  .id 1 0
1   1 2 3
2   0 2 7

permutationFrequency( s, 3 )
  .id 1 0
1  11 1 1
2  01 1 1
3  10 0 3
4  00 2 4

permutationFrequency( s, 4 )
  .id 1 0
1 111 0 1
2 011 1 0
3 101 0 0
4 001 1 1
5 110 0 1
6 010 0 1
7 100 0 2
8 000 2 2

permutationFrequency( sample( -1:1, 10, replace = TRUE ), 2 )
  .id 1 -1 0
1   1 1  2 0
2  -1 0  1 2
3   0 1  0 2
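For readers who would rather avoid the plyr dependency, the same kind of table can be sketched in base R. This is an alternative technique, not the code used above: it builds the sliding windows with embed() and counts with table() on factors whose levels list every possible prefix, so that zero-count rows are kept. The function name tupleTable and the dimension labels prefix and last are made up for this sketch, and s is the sequence used elsewhere in this thread.

```r
s <- c(1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0)

# tupleTable() is a hypothetical name for this base-R sketch.
tupleTable <- function(x, n, population = unique(x)) {
  stopifnot(n >= 2, length(x) >= n)
  # embed() returns one window per row with the most recent value first,
  # so the columns are reversed to restore the original order.
  windows <- embed(x, n)[, n:1, drop = FALSE]
  prefix <- apply(windows[, -n, drop = FALSE], 1, paste, collapse = "")
  last <- windows[, n]
  # Every possible prefix of length n-1, so unseen prefixes still get a row.
  grid <- do.call(expand.grid, rep(list(population), n - 1))
  prefixLevels <- apply(grid, 1, paste, collapse = "")
  table(prefix = factor(prefix, levels = prefixLevels),
        last = factor(last, levels = population))
}

tab <- tupleTable(s, 3)
print(tab)
```

The counts should agree with the permutationFrequency( s, 3 ) output shown above; the result here is a table object rather than a data frame.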

Apologies to my stochastic processes teacher, but functional programming problems in R were just more interesting than the Gambler's Ruin today...

Sharpie 2010-02-17 20:38:50

Thanks very much for this, but the .id column appears to be missing in my output. Or am I missing something? The rest is exactly what I needed.

knot 2010-02-18 01:30:39

Hmm, I noticed the `.id` column didn't show up if I gave an unnamed list or vector to `ldply()`. Did you include `names(permutations) <- permutations`?

Sharpie 2010-02-18 01:40:13

Yes, to start with, I copypasted your code.

knot 2010-02-18 08:26:47

Interesting. Could be a version thing: I'm using R 2.10.1 and plyr 0.1.9.

Sharpie 2010-02-18 09:31:44

sessionInfo() reported that I was using plyr 0.1.3, and update.packages() did not help. But upgrading from R 2.9.2 did help :)

knot 2010-02-19 00:11:25

Right on! Glad it works!

Sharpie 2010-02-19 01:55:18

Answer 2

+1 A:

One approach is to create a data frame of the subsequences and then use the table function:

s<-c(1,0,0,0,1,0,0,0,0,0,1,1,1,0,0)
n<-length(s)
k<-3
subseqs<-t(sapply(1:(n-k+1),function(i){s[i:(i+k-1)]}))
colnames(subseqs)<-paste('Y',1:k,sep="")
subseqs<-data.frame(subseqs)
table(subseqs)

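This produces a three-dimensional contingency table, with one dimension for each position (Y1, Y2, Y3) in the subsequence.

Use ftable() instead of table(), or apply it to the output of table(), for a display similar to the one in your question: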

ftable(subseqs)
          Y3 0 1
    Y1 Y2       
    0  0     4 2
       1     1 1
    1  0     3 0
       1     1 1
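Both routes to the flat display can be compared in a short sketch. The row.vars and col.vars arguments of ftable() are shown only to make explicit which positions label the rows and which the columns; the values used here match the default layout.

```r
s <- c(1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0)
n <- length(s)
k <- 3
subseqs <- t(sapply(1:(n - k + 1), function(i) s[i:(i + k - 1)]))
colnames(subseqs) <- paste("Y", 1:k, sep = "")
subseqs <- data.frame(subseqs)

tab <- table(subseqs)    # 2 x 2 x 2 array of counts
ft1 <- ftable(subseqs)   # flatten directly from the data frame
ft2 <- ftable(tab)       # or flatten an existing table

# Explicitly choose which positions form the rows and which the columns.
ft3 <- ftable(tab, row.vars = c("Y1", "Y2"), col.vars = "Y3")
print(ft3)
```

Changing row.vars and col.vars, for example to row.vars = "Y1", gives other layouts of the same counts.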

Jyotirmoy Bhattacharya 2010-02-18 09:13:15
