views: 89
answers: 2

I have millions of keywords in a column labeled Keyword.text. Each factor, or keyword, can contain multiple words (or shall we say tokens). Here is an example with 4 keywords:

Keyword.text
The quick brown fox the
.8 .crazy lazy dog
dog
jumps over+the 9

I'd like to count the number of tokens in each Keyword, so as to obtain:

Keyword.length
5
4
1
4

I installed the tau package but I haven't gotten very far...

 textcnt(Mydf$Keyword.text, split = "[[:space:][:punct:]]+", method = "string", n = 1L)

returns an error I don't understand. Maybe it's due to the column being a factor; it worked fine when I practiced on a plain string.

I know how to do it in Excel, but it doesn't work for the last line. If A2 holds the keyword, then =LEN(TRIM(A2))-LEN(SUBSTITUTE(A2," ",""))+1 would do it.
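
For reference, a rough R translation of that Excel logic might look like this (assuming the Mydf$Keyword.text column from above); it shares the same limitation on the last line, since it only counts spaces:

# count spaces in the trimmed string and add 1
# (so "jumps over+the 9" comes out as 3, not 4, just like the Excel formula)
kw <- as.character(Mydf$Keyword.text)
nchar(trimws(kw)) - nchar(gsub(" ", "", trimws(kw))) + 1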

A: 

Please show the error.

Also try textcnt(as.character(Mydf$Keyword.text), split = ..., ...) to force character mode.

Or load your data with `stringsAsFactors=FALSE` -- the same question has come up here before.

Dirk Eddelbuettel
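
For illustration, the two suggestions could look roughly like this (the file name, separator, and header arguments are placeholders):

# read the data keeping strings as character rather than factor
Mydf <- read.table("keywords.txt", header = TRUE, sep = "\t",
                   stringsAsFactors = FALSE)

# or coerce an existing factor column on the fly; note that a single call like
# this gives corpus-level counts, not one count per row
library(tau)
textcnt(as.character(Mydf$Keyword.text),
        split = "[[:space:][:punct:]]+", method = "string", n = 1L)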
Won't loading the data that way slow performance? I have a really huge file (1.8 GB).
datayoda
So make a copy of the file: retain only the first few hundred lines and debug that way.
Dirk Eddelbuettel
OK, as.character fixes that, but getting the count per row is still non-obvious to me. I can get the total sum of words for the whole corpus, but not per row...
datayoda
One step at a time. If you have a row, then the split approach should count words for that row (using `as.character()`). So `sapply()` that call over all elements of the vector and you should get a vector of token counts per row (a sketch follows this comment thread).
Dirk Eddelbuettel
No can do, captain... I must be missing something...
datayoda
If it's a factor I would go with `levels(Mydf$Keyword.text)` instead of `as.character`. It could save time if there is some repetition among the keywords.
Marek
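
Putting the suggestions in this thread together, a per-row count along the lines Dirk describes might look like this (a sketch, assuming the tau package and the Keyword.text column from the question):

library(tau)

# one textcnt() call per keyword; summing its entries gives the total token count
Mydf$Keyword.length <- sapply(as.character(Mydf$Keyword.text), function(x)
  sum(textcnt(x, split = "[[:space:][:punct:]]+", method = "string", n = 1L)),
  USE.NAMES = FALSE)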
+1  A: 

Edit: For a data frame and the number of tokens per keyword, just use strsplit. There's no need for textcnt if you're not interested in the counts of the individual tokens. That's where I got you wrong:

# token count per keyword: split on whitespace/punctuation and take the length
tt <- data.frame(
    a = rnorm(3),
    b = rnorm(3),
    c = c("the quick fox lazy", "rbrown+fr even", "what what goes & around"),
    stringsAsFactors = FALSE
)
sapply(tt$c, function(n) {
  length(strsplit(n, split = "[[:space:][:punct:]]+")[[1]])
})

To read the data, also take a look at ?readLines and/or ?scan. These preserve the string format and allow you to process the file line by line (or row by row). If you use a file connection, you can even load the file in parts (a chunked-reading sketch follows the example below), which helps when you hit memory limits.

A simple example using readLines:

library(tau)  # for textcnt()

con <- textConnection(c(
  "The lazy fog+fog fog",
  "never ended for fog jumping over the",
  "fog whatever . $ plus."
))
# For a real file you would use con <- file("myfile.txt")
Text <- readLines(con)
close(con)
sapply(Text, textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
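
To read a really big file in parts, as mentioned above, one could work through an open connection along these lines (a sketch; the file name and chunk size are placeholders):

con <- file("myfile.txt", open = "r")
repeat {
  chunk <- readLines(con, n = 100000)   # next 100,000 lines
  if (length(chunk) == 0) break
  # process 'chunk' here, e.g. count tokens per line as in the sapply call above
}
close(con)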

On a side note, using the option Dirk mentioned (stringsAsFactors=FALSE) won't slow down performance compared to the usual read.table command. On the contrary, actually. You should use the sapply as shown above, but replace Text with as.character(Mydf$Keyword.text) (or use the stringsAsFactors=FALSE option and drop the as.character()).

Joris Meys
I'd really like a solution at the data frame level rather than at the line-by-line reading level. I'm shocked that this problem does not have an easy solution... I still can't sum tokens per keyword...
datayoda
Your code does not provide the sum, i.e. 15 in your example. Any ideas?
datayoda
I think I'll just create a function since it works per line only.
datayoda
Myfun = function(n) { sum(sapply(Mydf$Keyword.text[n], textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L)) }
datayoda
@datayoda: At the data frame level it's easy; Dirk gave you that solution. I generally construct my data frame *after* doing all the string work. It's a matter of style I guess, but I like the functionality of scan and readLines for very big sets.
Joris Meys