I'm trying to use regular expressions in R to find one or more phrases within a vector of long sentences (which I'll call x).

So, for example, this works fine for one phrase:

grep("(phrase 1)",x)

But this doesn't work for two (or more) phrases:

grep("(phrase 1)+(phrase 2)+",x)

As I read it, this last one should give me every element of x that contains one or more occurrences of phrase 1 and one or more occurrences of phrase 2. But it returns nothing.

A: 

Hi Boris,

Full examples (e.g. with, you know, data ...) are always good.

The main key for regexps in R is to remember that there are three (!!) different engines. I tend to like the Perl regexps.

Next, it is important to remember that there are metacharacters, so if you want to match literal parens, you need to escape them.

With that, here is an example:

> txt <- c("The grey fox jumped", "The blue cat slept", "The sky was falling")
> grep("blue", txt)                       # finds sentence two
[1] 2
> grep("(grey|blue)", txt, perl=TRUE)     # finds one and two
[1] 1 2
> grep("(red|blue)", txt, perl=TRUE)      # finds only two (as it should)
[1] 2
> 

So with Perl regexps, you list alternatives inside parentheses, separated by a pipe symbol.
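To illustrate the point about metacharacters, here is a small sketch. The vector `notes` is made up for this example and is not from the question; it only shows the difference between grouping parentheses and escaped (literal) ones.

```r
# Hypothetical data: only the first element contains literal parentheses.
notes <- c("see (phrase 1) here", "see phrase 1 here")

# Unescaped parentheses only group, so both elements match.
grouped <- grep("(phrase 1)", notes)

# Escaped parentheses match the literal characters. The backslash is
# doubled because it sits inside an R string.
escaped <- grep("\\(phrase 1\\)", notes)

# fixed = TRUE switches off regex interpretation altogether.
literal <- grep("(phrase 1)", notes, fixed = TRUE)
```

So the parentheses in the question's pattern do isolate the phrase as a group, but they are not themselves searched for.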

Dirk Eddelbuettel
Got it, but what I wanted was "grey AND blue" (which in your example would match none). Or "(The grey) AND (jumped)", which would match sentence 1. I want to use the parens to isolate phrases.
bshor
Ahh -- that explanation is clearer, and Aniko showed you how to include 'fluff' in the middle.
Dirk Eddelbuettel
A: 

You have to tell it to skip over any intervening characters:

grep("(phrase 1)+.*(phrase 2)+",x)

Also note that this pattern will not match the phrases in the reverse order, so you might have to add that case explicitly. Overall, it might be simpler to search for each phrase separately (especially if there are more than two phrases), and then combine the results with intersect and union as needed.
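A rough sketch of both suggestions, using a made-up vector `x` and the placeholder phrases from the question. The names `hits`, `all_phrases` and `any_phrase` are just for illustration.

```r
# Hypothetical data using the placeholder phrases from the question.
x <- c("phrase 1 comes before phrase 2",
       "phrase 2 comes before phrase 1",
       "only phrase 1 here")

# Either order in a single pattern: spell out both orderings as alternatives.
both_orders <- grep("(phrase 1).*(phrase 2)|(phrase 2).*(phrase 1)", x)

# One search per phrase, then combine the index vectors.
phrases <- c("phrase 1", "phrase 2")
hits <- lapply(phrases, function(p) grep(p, x))
all_phrases <- Reduce(intersect, hits)  # elements containing every phrase
any_phrase  <- Reduce(union, hits)      # elements containing at least one
```

The single-pattern version needs one alternative per ordering, which grows quickly with the number of phrases; the per-phrase version stays the same length however many phrases there are.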

Aniko
This works! I see that you'd want to skip the characters in between. But I did want the reverse order matched, too.
bshor
Is this what you mean: sapply(phrase.list, function(x) grep(x, y))?
bshor
Followed by intersection of a list, found here: http://finzi.psych.upenn.edu/R/Rhelp02/archive/98525.html
bshor
A: 

Another way:

which(grepl("(phrase 1)+",x) & grepl("(phrase 2)+",x))
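For instance, on a made-up vector (not from the question), keeping the logical vectors around makes AND, OR and NOT combinations easy to write. The trailing + is left out of the patterns here, since a single occurrence already counts as a match.

```r
# Hypothetical data using the placeholder phrases from the question.
x <- c("phrase 1 then phrase 2", "phrase 2 then phrase 1",
       "phrase 1 only", "neither")

# One logical vector per phrase, the same length as x.
has1 <- grepl("phrase 1", x)
has2 <- grepl("phrase 2", x)

both       <- which(has1 & has2)   # both phrases, in any order
either     <- which(has1 | has2)   # at least one phrase
only_first <- which(has1 & !has2)  # phrase 1 without phrase 2
```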
Marek
A: 

There's a way to do it with a single regex using lookaheads, though most regex engines will execute it pretty slowly:

> txt <- c("The grey fox jumped", "The blue cat slept", "The fox is grey", "The cat is grey")
> grep("(?=.*fox)(?=.*grey)", txt, perl=TRUE)
[1] 1 3
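If the list of words is long, the lookahead pattern can be assembled rather than typed. This is only a sketch: the vector `words` is my own name, it assumes the words contain no regex metacharacters, and the leading ^ is an optional addition that should keep the engine from retrying the lookaheads at every position.

```r
txt <- c("The grey fox jumped", "The blue cat slept",
         "The fox is grey", "The cat is grey")

# Words that must all be present, in any order (assumed to be plain text).
words <- c("fox", "grey")

# One lookahead per word, glued together and anchored at the start.
lookaheads <- paste0("(?=.*", words, ")", collapse = "")
pattern <- paste0("^", lookaheads)

# Lookaheads are a Perl-style feature, so perl = TRUE is needed.
matches <- grep(pattern, txt, perl = TRUE)
```

With these inputs the pattern comes out as ^(?=.*fox)(?=.*grey) and the match indices are the same as in the transcript above.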
Ken Williams