ansaurus

Question

Regex to match anything (including the empty string) except a specific given string.

Answer 1

+2 A:

This should work, assuming look-ahead assertions are allowed in MySQL regexes.

/Kansas(?! State)/

Edit: OK, this is super ugly, but it works for me in Perl and doesn't use a look-ahead assertion:

/Kansas(([^ ]|$)| (([^S]|$)|S(([^t]|$)|t(([^a]|$)|a(([^t]|$)|t([^e]|$))))))/
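As a rough check of the claim above, the sketch below runs both patterns through Python's `re` module (standing in for Perl here, which is an assumption; MySQL's engine is different) and lists any sample strings on which they disagree. The sample list and the helper name `disagreements` are made up for illustration.

```python
import re

# The look-ahead version and the expanded version from this answer.
LOOKAHEAD = re.compile(r"Kansas(?! State)")
EXPANDED = re.compile(
    r"Kansas(([^ ]|$)| (([^S]|$)|S(([^t]|$)|t(([^a]|$)|a(([^t]|$)|t([^e]|$))))))"
)

# Illustrative inputs (hypothetical, not the asker's real data).
SAMPLES = [
    "",
    "I am from Kansas",
    "Kansas ",
    "Kansas State is great",
    "Kansas is a state",
    "Kansas Kansas State",
    "Kansas State vs Kansas",
    "I'm from Kansas State",
    "KansasState",
    "Kansas Stat!",
]


def disagreements(samples):
    """Return the samples on which the two patterns give different answers."""
    return [
        s
        for s in samples
        if bool(LOOKAHEAD.search(s)) != bool(EXPANDED.search(s))
    ]


if __name__ == "__main__":
    for s in SAMPLES:
        print(repr(s), bool(LOOKAHEAD.search(s)), bool(EXPANDED.search(s)))
```

An empty result from `disagreements(SAMPLES)` only shows agreement on these inputs; it is not a proof of equivalence.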

Kip 2010-05-14 20:12:54

Thanks Kip, I was just editing the question to this effect when you added this answer!

dreeves 2010-05-14 20:16:31

great answer, except that MySQL doesn't implement look-ahead assertions.

David M 2010-05-14 20:33:49

Re: Edit: Wow, I'm impressed! :) Unfortunately MySQL responds to that with "Illegal variable name."

dreeves 2010-05-14 20:42:02

@Kip, you basically constructed a (regular) regex that matches the complement language of " State". Since the complement of a regular language is regular (see [Regular language](http://en.wikipedia.org/wiki/Regular_language#Closure_properties)), this is always possible, albeit usually ugly.

Christian Semrau 2010-05-14 20:43:57

@dreeves: I just tried this on a MySQL database and it worked: `SELECT 'This is Kansas State' REGEXP 'Kansas(([^ ]|$)| (([^S]|$)|S(([^t]|$)|t(([^a]|$)|a(([^t]|$)|t([^e]|$))))))' as REGEXRESULT`

Kip 2010-05-14 20:44:50

@Kip: I just tested it, it works here too. +1

Mark Byers 2010-05-14 20:57:58

It is important to note that Perl and MySQL have completely different regular expression implementations. For straightforward examples, like the one that Kip and I implemented separately, they should behave very much the same. However, don't count on Perl and MySQL to give the same answers or to perform similarly.

David M 2010-05-14 20:59:58

@Kip: Ah, you're right. I was doing it with the -e switch on the command line. It does work as you describe at the MySQL prompt itself. And I've now confirmed that your regex matches perl's 'Kansas(?! State)' for the data set I'm working with. Nice! It's a little painful to generalize your solution though! :)

dreeves 2010-05-14 21:01:52

Thanks everyone. Super helpful. Seems like MySQL should get with the program and just implement pcre, eh?

dreeves 2010-05-14 21:10:19

Answer 2

+4 A:

MySQL doesn't have lookaheads. A workaround is to make two tests:

WHERE yourcolumn LIKE '%Kansas%'
  AND yourcolumn NOT LIKE '%Kansas State%'

I used LIKE here instead of RLIKE because once you split it up like this, regular expressions are no longer required. However, if you still need regular expressions for other reasons, the same technique works with RLIKE.

Note that this fails to match 'Kansas Kansas State', which you asked it to match.

Update: If matching 'Kansas Kansas State' is that important then you can use this ugly regular expression that is supported by MySQL:

'Kansas($|[^ ]| ($|[^S])| S($|[^t])| St($|[^a])| Sta($|[^t])| Stat($|[^e]))'
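The pattern above follows a mechanical recipe: one branch per prefix of the forbidden suffix, each ending in "end of string, or a character that breaks the suffix". A minimal sketch of that recipe in Python follows. The function name `expand_negative_lookahead` is hypothetical, and the sketch deliberately accepts only letters, digits and spaces, because escaping rules for other characters differ between regex engines.

```python
import re


def expand_negative_lookahead(word: str, forbidden: str) -> str:
    """Build a look-ahead-free pattern equivalent to word(?!forbidden).

    Sketch only: inputs are limited to letters, digits and spaces so that
    no escaping is needed inside or outside the bracket expressions.
    """
    if not forbidden:
        raise ValueError("forbidden suffix must not be empty")
    for text in (word, forbidden):
        if not all(ch.isalnum() or ch == " " for ch in text):
            raise ValueError("only letters, digits and spaces are supported")

    branches = []
    for i, ch in enumerate(forbidden):
        prefix = forbidden[:i]  # part of the suffix matched so far
        tail = "$|[^" + ch + "]"  # end of string, or a character that breaks it
        branches.append(tail if not prefix else prefix + "(" + tail + ")")
    return word + "(" + "|".join(branches) + ")"


if __name__ == "__main__":
    pattern = expand_negative_lookahead("Kansas", " State")
    print(pattern)
    print(bool(re.search(pattern, "Kansas Kansas State")))
```

For `"Kansas"` and `" State"` this reproduces the expression in this answer character for character, which is what makes generalizing to other strings less painful than writing the alternation by hand.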

Oops: I just noticed Kip already updated his comment with a solution very similar to this.

Mark Byers 2010-05-14 20:24:18

Thanks Mark. The 'Kansas Kansas State' issue is the crux though. See the comment I added to the question about this.

dreeves 2010-05-14 20:31:14

Answer 3

+1 A:

This is ugly, but here you go:

You might not need to expand the regex all the way to the end, depending on whether your input might include something like 'I need to get this man to surgery in Kansas Stat!'

mysql> select x,x RLIKE 'Kansas($|[^ ]| ($|[^S])| S($|[^t])| St($|[^a])| Sta($|[^t])| Stat($|[^e]))' AS result from examples;
+------------------------+--------+
| x                      | result |
+------------------------+--------+
| I am from Kansas       |      1 |
| Kansas State is great  |      0 |
| Kansas is a state      |      1 |
| Kansas Kansas State    |      1 |
| Kansas State vs Kansas |      1 |
| I'm from Kansas State  |      0 |
| KansasState            |      1 |
+------------------------+--------+
7 rows in set (0.00 sec)

David M 2010-05-14 20:56:42

Ha! Thanks! And I love how you demonstrated that it works on all the examples!

dreeves 2010-05-14 21:16:45

Answer 4

+1 A:

More efficient than that large regex (depending, of course, on your data and the quality of the engine) is

WHERE col LIKE '%Kansas%' AND
  (col NOT LIKE '%Kansas State%' OR
  REPLACE(col, 'Kansas State', '') LIKE '%Kansas%')

If Kansas usually appears in the form 'Kansas State', though, you may find this better:

WHERE col LIKE '%Kansas%' AND
  REPLACE(col, 'Kansas State', '') LIKE '%Kansas%'

This has the added advantage of being easier to maintain. It works less well if Kansas is common and text fields are large. Of course you can test these on your own data and tell us how they compare.
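To try the second query without a MySQL server, here is a small sketch that runs it against an in-memory SQLite database from Python. SQLite stands in for MySQL, which is an assumption: both provide `LIKE` and `REPLACE`, but case sensitivity and collation behaviour can differ. The table name `examples`, the helper `kansas_not_state` and the sample rows are illustrative only.

```python
import sqlite3

# Illustrative rows (hypothetical sample data).
ROWS = [
    "I am from Kansas",
    "Kansas State is great",
    "Kansas is a state",
    "Kansas Kansas State",
    "Kansas State vs Kansas",
    "I'm from Kansas State",
    "KansasState",
]


def kansas_not_state(rows):
    """Return rows that still contain 'Kansas' once 'Kansas State' is removed."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE examples (col TEXT)")
        conn.executemany(
            "INSERT INTO examples (col) VALUES (?)", [(r,) for r in rows]
        )
        cur = conn.execute(
            """
            SELECT col FROM examples
            WHERE col LIKE '%Kansas%'
              AND REPLACE(col, 'Kansas State', '') LIKE '%Kansas%'
            ORDER BY rowid
            """
        )
        return [row[0] for row in cur]
    finally:
        conn.close()


if __name__ == "__main__":
    for row in kansas_not_state(ROWS):
        print(row)
```

On these rows the query keeps 'Kansas Kansas State' and 'Kansas State vs Kansas' while dropping the two rows where every 'Kansas' is followed by ' State'.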

Charles 2010-05-21 04:44:44

Wow, yes, this seems much better. (Performance was not an issue in my case but this is just much easier to type, not to mention to generalize.) Thank you!

dreeves 2010-05-21 18:59:55
