
I am working on a small project that involves dictionary-based text search within a collection of documents. My dictionary contains positive signal words (a.k.a. good words), but in the document collection simply finding such a word does not guarantee a positive result, because negative words (for example "not", "not significant") may appear in the proximity of these positive words. I want to construct a matrix that contains the document number, the positive word, and its proximity to negative words.

Can anyone please suggest a way to do this? My project is at a very early stage, so here is a basic example of my text.

No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.   

This is my example document, in which candesartan cilexetil, glyburide, nifedipine, digoxin, warfarin and hydrochlorothiazide are my positive words, and no significant is my negative word. I want to do a word-based proximity mapping between my positive and negative words.

Can anyone give some helpful pointers?

+3  A: 

First of all, I would suggest not using R for this task. R is great for many things, but text manipulation is not one of them. Python could be a good alternative.
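For illustration, a minimal Python sketch of the same idea might look like the following. It finds the character offset of each good/bad word with `str.find` and reports the closest good-bad pair; the word lists are just the example terms from the question.

```python
# Find character offsets of each good/bad word, then report the
# closest good-bad pair by absolute character distance.
good_words = ["candesartan cilexetil", "glyburide", "nifedipine",
              "digoxin", "warfarin", "hydrochlorothiazide"]
bad_words = ["no significant", "other drugs"]

text = ("No significant drug interactions have been reported in studies "
        "of candesartan cilexetil given with other drugs such as glyburide, "
        "nifedipine, digoxin, warfarin, hydrochlorothiazide.").lower()

def positions(words, text):
    # str.find returns -1 when the word is absent; store None instead
    return {w: (text.find(w) if text.find(w) != -1 else None) for w in words}

good_pos = positions(good_words, text)
bad_pos = positions(bad_words, text)

# All good-bad pairs with their absolute character distance
pairs = [(g, b, abs(gp - bp))
         for g, gp in good_pos.items() if gp is not None
         for b, bp in bad_pos.items() if bp is not None]

closest = min(pairs, key=lambda t: t[2])
print("Closest good-bad pair:", closest)  # ('glyburide', 'other drugs', 20)
```

Note that this measures distance in characters, not words, just like the R version below; a word-based measure would first tokenize the text.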

That said, if I were to implement this in R, I would probably do something like this (very, very rough):

# You will probably read these from an external file or a database
goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "blabla", "warfarin", "hydrochlorothiazide")
badWords <- c("no significant", "other drugs")

mytext <- "no significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide."
mytext <- tolower(mytext) # Let's make life a little bit easier...

goodPos <- NULL
badPos <- NULL

# First we find the good words
for (w in goodWords)
    {
    pos <- regexpr(w, mytext)
    if (pos != -1)
        {
        cat(paste(w, "found at position", pos, "\n"))
        }
    else    
        {
        pos <- NA
        cat(paste(w, "not found\n"))
        }

    goodPos <- c(goodPos, pos)
    }

# And then the bad words
for (w in badWords)
    {
    pos <- regexpr(w, mytext)
    if (pos != -1)
        {
        cat(paste(w, "found at position", pos, "\n"))
        }
    else    
        {
        pos <- NA
        cat(paste(w, "not found\n"))
        }

    badPos <- c(badPos, pos)
    }

# Note that we use -badPos so that we can calculate the distance with rowSums
comb <- expand.grid(goodPos, -badPos)
wordcomb <- expand.grid(goodWords, badWords)
dst <- cbind(wordcomb, abs(rowSums(comb)))

mn <- which.min(dst[,3])
cat(paste("The closest good-bad word pair is: ", dst[mn, 1],"-", dst[mn, 2],"\n"))
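Since the question asks for word-based rather than character-based proximity, a variant that works on token indices could be sketched as follows (in Python, as suggested above). The tokenization is a naive regex split and the document number is a hypothetical placeholder, but it produces exactly the kind of (document, positive word, negative word, distance) matrix the question describes.

```python
import re

good_words = ["candesartan cilexetil", "glyburide", "nifedipine",
              "digoxin", "warfarin", "hydrochlorothiazide"]
bad_words = ["no significant", "other drugs"]

text = ("No significant drug interactions have been reported in studies "
        "of candesartan cilexetil given with other drugs such as glyburide, "
        "nifedipine, digoxin, warfarin, hydrochlorothiazide.").lower()

# Naive tokenization: runs of letters only
tokens = re.findall(r"[a-z]+", text)

def token_index(phrase, tokens):
    """Return the index of the first token of `phrase`, or None if absent."""
    words = phrase.split()
    for i in range(len(tokens) - len(words) + 1):
        if tokens[i:i + len(words)] == words:
            return i
    return None

# Rows of the matrix: (doc id, good word, bad word, word distance)
doc_id = 1  # hypothetical document number
rows = []
for g in good_words:
    gi = token_index(g, tokens)
    if gi is None:
        continue
    for b in bad_words:
        bi = token_index(b, tokens)
        if bi is not None:
            rows.append((doc_id, g, b, abs(gi - bi)))

for row in rows:
    print(row)
```

Running this over each document in the collection and concatenating the rows would give the full proximity matrix.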
nico
I almost got what I was looking for. Thanks nico!
Neo_Me
+3  A: 

Did you look at either one of the text-mining packages for R, such as tm?

Dirk Eddelbuettel
Nice packages, didn't know them! Still, I don't think R is the best tool to do this kind of analysis.
nico
Yes, I use the tm package very frequently!
Neo_Me