views: 89
answers: 2

I have millions of keywords in a column labeled Keyword.text. Each factor, or keyword, can contain multiple words (or shall we say tokens). Here is an example with 4 keywords:

Keyword.text
The quick brown fox the
.8 .crazy lazy dog
dog
jumps over+the 9

I'd like to count the number of tokens in each Keyword, so as to obtain:

Keyword.length
5
4
1
4

I installed the tau package but I haven't gotten very far...

 textcnt(Mydf$Keyword.text, split = "[[:space:][:punct:]]+", method = "string", n = 1L)

returns an error I don't understand. Maybe it's due to the column being a factor; it worked fine when I practiced on a plain string.

I know how to do it in Excel, but it doesn't work for the last line. If A2 holds the keyword, then =LEN(TRIM(A2))-LEN(SUBSTITUTE(A2," ",""))+1 would do it.
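
For reference, a rough R translation of that Excel logic might look like this (assuming the Mydf$Keyword.text column from above); it shares the same limitation on the last line, since it only counts spaces:

# count spaces in the trimmed string and add 1
# (so "jumps over+the 9" comes out as 3, not 4, just like the Excel formula)
kw <- as.character(Mydf$Keyword.text)
nchar(trimws(kw)) - nchar(gsub(" ", "", trimws(kw))) + 1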

A: 

Please show the error.

Also try textcnt(as.character(Mydf$Keyword.text), split = ..., ...) to force character mode.

Or load your data with `stringsAsFactors=FALSE` -- the same question has come up here before.

Dirk Eddelbuettel
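
For illustration, the two suggestions could look roughly like this (the file name, separator, and header arguments are placeholders):

# read the data keeping strings as character rather than factor
Mydf <- read.table("keywords.txt", header = TRUE, sep = "\t",
                   stringsAsFactors = FALSE)

# or coerce an existing factor column on the fly; note that a single call like
# this gives corpus-level counts, not one count per row
library(tau)
textcnt(as.character(Mydf$Keyword.text),
        split = "[[:space:][:punct:]]+", method = "string", n = 1L)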
Won't loading the data that way slow performance? I have a really huge file (1.8 GB).
datayoda
So make a copy of the file: retain only the first few hundred lines and debug that way.
Dirk Eddelbuettel
OK, as.character fixes that, but getting the count per row is still non-obvious to me. I can get the total sum of words for the whole corpus, but not per row...
datayoda
One step at a time. If you have a row, then the split approach should count words for that row (using `as.character()`). So `sapply()` that call over all elements of the vector and you should get a vector of token counts per row (a sketch follows this comment thread).
Dirk Eddelbuettel
No can do, captain... I must be missing something...
datayoda
If it's a factor I would go with `levels(Mydf$Keyword.text)` instead of `as.character`. It could save time if there is some repetition among the keywords.
Marek
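
Putting the suggestions in this thread together, a per-row count along the lines Dirk describes might look like this (a sketch, assuming the tau package and the Keyword.text column from the question):

library(tau)

# one textcnt() call per keyword; summing its entries gives the total token count
Mydf$Keyword.length <- sapply(as.character(Mydf$Keyword.text), function(x)
  sum(textcnt(x, split = "[[:space:][:punct:]]+", method = "string", n = 1L)),
  USE.NAMES = FALSE)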
+1  A: 

Edit: For a data frame and the number of tokens per keyword, just use strsplit. There's no need for textcnt if you're not interested in the counts of the individual tokens. That's where I got you wrong:

# token count per keyword: split on whitespace/punctuation and take the length
tt <- data.frame(
    a = rnorm(3),
    b = rnorm(3),
    c = c("the quick fox lazy", "rbrown+fr even", "what what goes & around"),
    stringsAsFactors = FALSE
)
sapply(tt$c, function(n) {
  length(strsplit(n, split = "[[:space:][:punct:]]+")[[1]])
})

To read the data, also take a look at ?readLines and/or ?scan. These preserve the string format and allow you to process the file line by line (or row by row). If you use a file connection, you can even load the file in parts (a chunked-reading sketch follows the example below), which helps when you hit memory limits.

A simple example using readLines:

library(tau)  # for textcnt()

con <- textConnection(c(
  "The lazy fog+fog fog",
  "never ended for fog jumping over the",
  "fog whatever . $ plus."
))
# For a real file you would use con <- file("myfile.txt")
Text <- readLines(con)
close(con)
sapply(Text, textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
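
To read a really big file in parts, as mentioned above, one could work through an open connection along these lines (a sketch; the file name and chunk size are placeholders):

con <- file("myfile.txt", open = "r")
repeat {
  chunk <- readLines(con, n = 100000)   # next 100,000 lines
  if (length(chunk) == 0) break
  # process 'chunk' here, e.g. count tokens per line as in the sapply call above
}
close(con)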

On a side note, using the option Dirk mentioned (stringsAsFactors=FALSE) won't slow down performance compared to the usual read.table command. On the contrary, actually. You should use the sapply as shown above, but replace Text with as.character(Mydf$Keyword.text) (or use the stringsAsFactors=FALSE option and drop the as.character()).

Joris Meys
I'd really like a solution at the data frame level rather than at the line-by-line reading level. I'm shocked that this problem does not have an easy solution... I still can't sum tokens per keyword...
datayoda
Your code does not provide the sum, i.e. 15 in your example. Any ideas?
datayoda
I think I'll just create a function since it works per line only.
datayoda
Myfun = function(n) { sum(sapply(Mydf$Keyword.text[n], textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L)) }
datayoda
@datayoda: At the data frame level it's easy; Dirk gave you that solution. I generally construct my data frame *after* doing all the string work. It's a matter of style I guess, but I like the functionality of scan and readLines for very big sets.
Joris Meys