I'd like to test whether a string contains "Kansas" followed by anything other than " State".

Examples:

"I am from Kansas"          true
"Kansas State is great"     false
"Kansas is a state"         true
"Kansas Kansas State"       true
"Kansas State vs Kansas"    true
"I'm from Kansas State"     false
"KansasState"               true

For PCRE, I believe the answer is this:

'Kansas(?! State)'

But MySQL's REGEXP doesn't seem to like that.

ADDENDUM: Thanks to David M for generalizing this question: http://stackoverflow.com/questions/2837706/how-to-convert-a-pcre-to-a-posix-re

+2  A: 

This should work, assuming look-ahead assertions are allowed in MySQL regexes.

/Kansas(?! State)/

Edit: OK, this is super ugly, but it works for me in Perl and doesn't use a look-ahead assertion:

/Kansas(([^ ]|$)| (([^S]|$)|S(([^t]|$)|t(([^a]|$)|a(([^t]|$)|t([^e]|$))))))/
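
To check that the expanded pattern really agrees with the look-ahead version, here is a small sketch in Python. It assumes Python's `re` module is an acceptable stand-in for Perl, since it supports both forms; the `verdicts` helper and the constant names are made up for illustration.

```python
import re

# Assumption: Python's re module stands in for Perl here; it accepts both forms.
LOOKAHEAD = re.compile(r"Kansas(?! State)")
EXPANDED = re.compile(
    r"Kansas(([^ ]|$)| (([^S]|$)|S(([^t]|$)|t(([^a]|$)|a(([^t]|$)|t([^e]|$))))))"
)

# The seven examples from the question.
EXAMPLES = [
    "I am from Kansas",
    "Kansas State is great",
    "Kansas is a state",
    "Kansas Kansas State",
    "Kansas State vs Kansas",
    "I'm from Kansas State",
    "KansasState",
]


def verdicts(pattern):
    """Return one True/False per example: does the pattern match anywhere?"""
    return [pattern.search(text) is not None for text in EXAMPLES]


for text, a, b in zip(EXAMPLES, verdicts(LOOKAHEAD), verdicts(EXPANDED)):
    print(f"{text!r:28} lookahead={a} expanded={b}")
```

Both columns should come out identical for all seven strings. This says nothing about how MySQL's own engine behaves.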
Kip
Thanks Kip, I was just editing the question to this effect when you added this answer!
dreeves
great answer, except that MySQL doesn't implement look-ahead assertions.
David M
Re: Edit: Wow, I'm impressed! :) Unfortunately MySQL responds to that with "Illegal variable name."
dreeves
@Kip, you basically constructed a (regular) regex that matches the complement language of " State". Since the complement of a regular language is regular (see [Regular language](http://en.wikipedia.org/wiki/Regular_language#Closure_properties)), this is always possible, albeit usually ugly.
Christian Semrau
@dreeves: I just tried this on a MySQL database and it worked: `SELECT 'This is Kansas State' REGEXP 'Kansas(([^ ]|$)| (([^S]|$)|S(([^t]|$)|t(([^a]|$)|a(([^t]|$)|t([^e]|$))))))' as REGEXRESULT`
Kip
@Kip: I just tested it, it works here too. +1
Mark Byers
It is important to note that Perl and MySQL have completely different regular expression implementations. For straightforward examples, like the one that Kip and I implemented separately, they should behave very much the same. However, don't count on Perl and MySQL to give the same answers or to perform similarly.
David M
@Kip: Ah, you're right. I was doing it with the -e switch on the command line. It does work as you describe at the MySQL prompt itself. And I've now confirmed that your regex matches perl's 'Kansas(?! State)' for the data set I'm working with. Nice! It's a little painful to generalize your solution though! :)
dreeves
Thanks everyone. Super helpful. Seems like MySQL should get with the program and just implement pcre, eh?
dreeves
+4  A: 

MySQL doesn't have lookaheads. A workaround is to make two tests:

WHERE yourcolumn LIKE '%Kansas%'
  AND yourcolumn NOT LIKE '%Kansas State%'

I used LIKE here instead of RLIKE because once you split the test up like this, regular expressions are no longer required. However, if you still need regular expressions for other reasons, the same technique applies.

Note that this does not match 'Kansas Kansas State' (or 'Kansas State vs Kansas') as you requested, because any occurrence of 'Kansas State' rejects the whole string.
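
To see exactly where the two-test approach diverges from the requested behaviour, here is a rough Python sketch. It assumes plain case-sensitive substring checks are a fair model of the two LIKE conditions (in MySQL, case sensitivity of LIKE depends on the column's collation); `two_test` is a hypothetical helper name.

```python
# The question's examples, mapped to the result the asker wants.
EXAMPLES = {
    "I am from Kansas": True,
    "Kansas State is great": False,
    "Kansas is a state": True,
    "Kansas Kansas State": True,
    "Kansas State vs Kansas": True,
    "I'm from Kansas State": False,
    "KansasState": True,
}


def two_test(text):
    """Model of: LIKE '%Kansas%' AND NOT LIKE '%Kansas State%'."""
    return "Kansas" in text and "Kansas State" not in text


# Strings where the two-test result differs from what the question asks for.
missed = [text for text, wanted in EXAMPLES.items() if two_test(text) != wanted]
print(missed)
```

The strings it reports are the ones that contain both a bare 'Kansas' and a 'Kansas State'.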

Update: If matching 'Kansas Kansas State' is that important then you can use this ugly regular expression that is supported by MySQL:

'Kansas($|[^ ]| ($|[^S])| S($|[^t])| St($|[^a])| Sta($|[^t])| Stat($|[^e]))'

Oops: I just noticed Kip already updated his comment with a solution very similar to this.

Mark Byers
Thanks Mark. The 'Kansas Kansas State' issue is the crux though. See the comment I added to the question about this.
dreeves
+1  A: 

This is ugly, but here you go:

You might not need to expand the regex all the way to the end, depending on whether your input might include something like 'I need to get this man to surgery in Kansas Stat!'

mysql> select x,x RLIKE 'Kansas($|[^ ]| ($|[^S])| S($|[^t])| St($|[^a])| Sta($|[^t])| Stat($|[^e]))' AS result from examples;
+------------------------+--------+
| x                      | result |
+------------------------+--------+
| I am from Kansas       |      1 |
| Kansas State is great  |      0 |
| Kansas is a state      |      1 |
| Kansas Kansas State    |      1 |
| Kansas State vs Kansas |      1 |
| I'm from Kansas State  |      0 |
| KansasState            |      1 |
+------------------------+--------+
7 rows in set (0.00 sec)
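
Since writing this expansion by hand gets painful for longer phrases, here is one possible way to generate it, sketched in Python. The function name `not_followed_by` is invented for this example, and it assumes both arguments contain only letters, digits and spaces so that nothing needs escaping. The output is checked with Python's `re`, not with MySQL.

```python
import re


def not_followed_by(prefix, forbidden):
    """Build a look-ahead-free pattern for `prefix` not followed by `forbidden`.

    Assumption: both arguments use only letters, digits and spaces.
    """
    if not forbidden:
        raise ValueError("forbidden must not be empty")
    if not all(c.isalnum() or c == " " for c in prefix + forbidden):
        raise ValueError("only letters, digits and spaces are supported")
    # One branch per position: the first i forbidden characters matched, and
    # then either the text ends or the next character breaks the forbidden run.
    branches = [f"{forbidden[:i]}($|[^{ch}])" for i, ch in enumerate(forbidden)]
    return f"{prefix}({'|'.join(branches)})"


pattern = not_followed_by("Kansas", " State")
print(pattern)

# Compare against the look-ahead form on a few strings.
lookahead = re.compile(r"Kansas(?! State)")
generated = re.compile(pattern)
for text in ["Kansas Kansas State", "I'm from Kansas State", "in Kansas Stat!"]:
    print(text, bool(generated.search(text)), bool(lookahead.search(text)))
```

The generated pattern wraps the first branch in its own parentheses, so it differs cosmetically from the hand-written one above but has the same branches.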
David M
Ha! Thanks! And I love how you demonstrated that it works on all the examples!
dreeves
+1  A: 

More efficient than that large regex (depending, of course, on your data and the quality of the engine) is

WHERE col LIKE '%Kansas%' AND
  (col NOT LIKE '%Kansas State%' OR
  REPLACE(col, 'Kansas State', '') LIKE '%Kansas%')

If Kansas usually appears in the form 'Kansas State', though, you may find this better:

WHERE col LIKE '%Kansas%' AND
  REPLACE(col, 'Kansas State', '') LIKE '%Kansas%'

This has the added advantage of being easier to maintain. It works less well if Kansas is common and text fields are large. Of course you can test these on your own data and tell us how they compare.
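
For a quick sanity check of the REPLACE idea outside the database, here is a Python sketch of the second query. It assumes `str.replace` and case-sensitive `in` checks model REPLACE and LIKE closely enough for these strings; `replace_test` is a hypothetical name.

```python
EXAMPLES = {
    "I am from Kansas": True,
    "Kansas State is great": False,
    "Kansas is a state": True,
    "Kansas Kansas State": True,
    "Kansas State vs Kansas": True,
    "I'm from Kansas State": False,
    "KansasState": True,
}


def replace_test(text):
    """Model of: col LIKE '%Kansas%' AND REPLACE(col, 'Kansas State', '') LIKE '%Kansas%'."""
    return "Kansas" in text and "Kansas" in text.replace("Kansas State", "")


for text, wanted in EXAMPLES.items():
    print(f"{text!r:28} got={replace_test(text)} wanted={wanted}")
```

One contrived caveat: removing 'Kansas State' can splice the surrounding text into a new 'Kansas', so a string like 'KanKansas Statesas' passes this test even though its only 'Kansas' is followed by ' State'.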

Charles
Wow, yes, this seems much better. (Performance was not an issue in my case but this is just much easier to type, not to mention to generalize.) Thank you!
dreeves