tags:

views:

29

answers:

2

So I am trying to incorporate a hashtag search into posts on my application. I have come up with a few close Regular Expressions, but none seem to work. Let's say I have a string:

#i keep #hashtags in my mom's attic. She says I am her number #1 son. Why I'm not num#ber #2: http://pics.com/pic#thepic

I would want the RegEx to match:

  • #i
  • #hashtags
  • #1
  • #2

I want it to specifically NOT match HTML entities, #anchors in URLs, hashtags that start in the middle of a word, and any trailing punctuation after a hashtag. I also want to make sure it matches a hashtag at the beginning of the string, as well as hashtags that are wholly numeric. I want it to allow hyphens and underscores in hashtags as well. This is for a MySQL query, so Lookaheads and Lookbacks won't work.

The current one I'm stuck on is:

#([A-Za-z0-9_]+)

But it's not really doing the job. RegEx is an area in which my knowledge is a bit lacking. Thanks for any help!

A: 

I think you're stuck with testing it in 3 rounds:

`txt` REGEXP '^#[0-9a-zA-Z]+[: ,]';
OR `txt` REGEXP '[: ,]#[0-9a-zA-Z]+[: ,]';
OR `txt` REGEXP '[: ,]#[0-9a-zA-Z]+$';

or you could of course do

CONCAT(" ",`txt`," ") REGEXP '[: ,]#[0-9a-zA-Z]+[: ,]';

either way, it will get messy. Of course the [: ,] may be expanded to what you think is not part of things like '...

mvds
+1  A: 

The following matches all the examples you want matched, and rejects all the ones you don't want matched:

WHERE mycolumn REGEXP '^#[[:alnum:]]' OR mycolumn REGEXP ' #[[:alnum:]]'

You don't have to match the whole hashtag, since SQL won't return that string anyway, it just returns 1 (match) or 0 (no match). To extract the hashtag string, you'll have to use some other string manipulation in you application programming language.

Bill Karwin