I'd like to write an R function that accepts a formula as its first argument, similar to lm(), glm(), and friends. In this case, it's a function that takes a data frame and writes out a file in SVMLight format, which has this general form:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float> 
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>

For example, the following data frame:

  result qid     f1     f2     f3     f4   f5     f6     f7     f8
1     -1   1 0.0000 0.1253 0.0000 0.1017 0.00 0.0000 0.0000 0.9999
2     -1   1 0.0098 0.0000 0.0000 0.0000 0.00 0.0316 0.0000 0.3661
3      1   1 0.0000 0.0000 0.1941 0.0000 0.00 0.0000 0.0509 0.0000
4     -1   2 0.0000 0.2863 0.0948 0.0000 0.34 0.0000 0.7428 0.0608
5      1   2 0.0000 0.0000 0.0000 0.4347 0.00 0.0000 0.9539 0.0000
6      1   2 0.0000 0.7282 0.9087 0.0000 0.00 0.0000 0.0000 0.0355

would be represented as follows:

-1 qid:1 2:0.1253 4:0.1017 8:0.9999
-1 qid:1 1:0.0098 6:0.0316 8:0.3661
1  qid:1 3:0.1941 7:0.0509
-1 qid:2 2:0.2863 3:0.0948 5:0.3400 7:0.7428 8:0.0608
1  qid:2 4:0.4347 7:0.9539
1  qid:2 2:0.7282 3:0.9087 8:0.0355

The function I'd like to write would be called something like this:

write.svmlight(result ~ f1+f2+f3+f4+f5+f6+f7+f8 | qid, data=mydata, file="out.txt")

Or even

write.svmlight(result ~ . | qid, data=mydata, file="out.txt")

But I can't figure out how to use model.matrix() and/or model.frame() to know what columns it's supposed to write. Are these the right things to be looking at?

Any help much appreciated!

A: 

Hi Ken.

I used

formu.names <- all.vars(formu)
Y.name <- formu.names[1]
X.name <- formu.names[2]
block.name <- formu.names[3]

in the code I wrote for a post hoc analysis of Friedman's test:

http://www.r-statistics.com/2010/02/post-hoc-analysis-for-friedmans-test-r-code/

But it will only work for formulas of the form Y ~ X | block.
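
As a rough sketch, the same all.vars() idea might stretch to several predictors if you assume the response always comes first and the grouping variable always comes last, after the |. The formula and the name X.names below are made up for illustration:

```r
# Hypothetical formula with several predictors and a grouping term
formu <- result ~ f1 + f2 + f3 | qid

# all.vars() returns the variable names in order of appearance
formu.names <- all.vars(formu)

n <- length(formu.names)
Y.name     <- formu.names[1]         # response
block.name <- formu.names[n]         # assumes the grouping variable is last
X.names    <- formu.names[-c(1, n)]  # everything in between
```

This still relies on position only, so it cannot tell a missing | term from a present one.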

I hope others will give a better answer.

Cheers,

Tal

Tal Galili
A: 

Partial answer. You can subscript a formula object to get a parse tree of the formula:

> f<-a~b+c|d
> f[[1]]
`~`
> f[[2]]
a
> f[[3]]
b + c | d
> f[[3]][[1]]
`|`
> f[[3]][[2]]
b + c
> f[[3]][[3]]
d

Now all you need is code to walk this tree.

UPDATE: Here is an example of a function that walks the tree.

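walker <- function(formu) {
  # Require a two-sided formula: `~`, lhs, rhs
  if (!inherits(formu, "formula") || length(formu) != 3)
    stop("Want a two-sided formula")
  lhs <- formu[[2]]
  rhs <- formu[[3]]

  # The right-hand side must be a call to `|`, i.e. vars | condi
  if (!is.call(rhs) || !identical(rhs[[1]], as.name("|")))
    stop("Want conditional part")

  condi <- rhs[[3]]

  # Recursively flatten a tree of binary operators (e.g. a + b + c)
  # into a list of its leaf symbols
  flattener <- function(f) {
    if (length(f) < 3) return(f)
    c(Recall(f[[2]]), Recall(f[[3]]))
  }
  vars <- flattener(rhs[[2]])

  list(lhs = lhs, condi = condi, vars = vars)
}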

walker(y~a+b|c)
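
Building on the same idea of subscripting the formula, here is a rough sketch of what the write.svmlight() from the question might look like. It is only one possible design: it assumes all feature columns are numeric, numbers features by their position in the formula, and treats a lone "." as "every column except the target and qid". The tiny data frame at the end is invented for the example.

```r
# Sketch only: write.svmlight() is the hypothetical function from the question
write.svmlight <- function(formula, data, file) {
  rhs <- formula[[3]]
  if (!is.call(rhs) || !identical(rhs[[1]], as.name("|")))
    stop("Want a formula of the form y ~ features | qid")

  target   <- all.vars(formula[[2]])
  features <- all.vars(rhs[[2]])
  qid      <- all.vars(rhs[[3]])

  # Assumption: "." means every column not used as target or qid
  if (identical(features, "."))
    features <- setdiff(names(data), c(target, qid))

  lines <- vapply(seq_len(nrow(data)), function(i) {
    vals <- unlist(data[i, features, drop = FALSE])
    keep <- which(vals != 0)  # sparse format: zero-valued features are skipped
    paste(c(data[[target]][i],
            paste0("qid:", data[[qid]][i]),
            if (length(keep)) paste0(keep, ":", vals[keep])),
          collapse = " ")
  }, character(1))

  writeLines(lines, file)
  invisible(lines)
}

# Made-up two-row example
mydata <- data.frame(result = c(-1, 1), qid = c(1, 1),
                     f1 = c(0, 0.5), f2 = c(0.1253, 0))
out <- tempfile()
write.svmlight(result ~ f1 + f2 | qid, data = mydata, file = out)
readLines(out)
```

Building the lines row by row is slow for large data frames, but it keeps the mapping from columns to feature numbers easy to follow.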

Also look at the documentation for terms.formula and terms.object. Looking at the code for functions that take conditional formulas can help too, e.g. the lmer function in the lme4 package.
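
As a small illustration of what terms() offers for an ordinary formula (one without a | part), it can expand "." against a data frame and report the predictor terms. The data frame below is invented for the example:

```r
# Made-up data frame for illustration
mydata <- data.frame(result = c(-1, 1), f1 = c(0, 0.5), f2 = c(0.1, 0))

# terms() expands "." using the columns of `data`
tt <- terms(result ~ ., data = mydata)

attr(tt, "term.labels")  # labels of the predictor terms
attr(tt, "response")     # index of the response among the variables
```

A formula containing | is not split this way, which is why the conditional part still has to be pulled off by hand first.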

Jyotirmoy Bhattacharya
Heh - yeah, walking the tree is indeed "all I need". =) I did have a look at lmer(), but I'd need to grab the source package because its `lmerFrames()` method, which seems to do the parsing, is hidden. Or is there a way to see the code for S4 methods like this?
Ken Williams
OK, looking at the source for `lmerFrames()` etc. now. How yucky. I guess I always assumed there was some slicker way to manipulate formulae.
Ken Williams
You can refer to non-exported functions in namespaces by using :::, e.g. lme4:::lmerFrames. I've updated my answer to include a function that walks the tree; it does not do much input checking at the moment.
Jyotirmoy Bhattacharya