I'm trying to find all the hashtags in a string. The hashtags come from a stream like Twitter and could be anywhere in the text, for example:

this is a #awesome event, lets use the tag #fun

I'm using the .NET Framework (C#), and I was thinking this would be a suitable regex pattern to use:

#\w+

Is this the best regex for this purpose?
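For reference, here is a quick check of that pattern against the sample text. It is a minimal sketch in Python rather than C# (an assumption made for brevity; in .NET the equivalent call would be `Regex.Matches`), and `find_hashtags` is a hypothetical helper name.

```python
import re

# The pattern from the question: '#' followed by one or more word characters.
PATTERN = re.compile(r"#\w+")

def find_hashtags(text):
    """Return every substring that matches the question's pattern."""
    return PATTERN.findall(text)

sample = "this is a #awesome event, lets use the tag #fun"
print(find_hashtags(sample))  # ['#awesome', '#fun']
```

The pattern finds both tags here, but it places no restriction on what comes before the `#`.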

+2  A: 

It depends on whether you want to match hashtags inside other strings ("Some#Word") or things that probably aren't hashtags ("We're #1"). The regex you gave, `#\w+`, will match in both of these cases. If you modify your regex slightly to `\b#\w\w+`, you can eliminate these cases and match only hashtags of length greater than 1 on word boundaries.

bobbymcr
Thanks for that, I was a little worried the edge cases would cause me some grief.
David
Another note, this regex won't match "#tags-with-hyphens", so keep that in mind...
bobbymcr
Maybe `\b[^ .,)\]}]` would be a better choice. But that still requires a word character (letter/number, iirc) at the beginning for `\b` to work. I have absolutely no clue how "hashtags" are used on Twitter, though. Might be that I'm gravely mistaken here and that they regularly include punctuation except hyphens.
Joey
`\b#` will only match if the `#` **is** immediately preceded by a word character. If anything, you want the opposite: `\B#` (`\B` == "a position that is not a word boundary").
Alan Moore
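The difference between `\b#` and `\B#` raised in the comments can be checked directly. This is a small sketch in Python (an assumption; .NET's regex engine should treat `\b` and `\B` the same way for ASCII input like this), and the test sentence is made up for illustration.

```python
import re

# Suggestion from the answer: '#' must sit on a word boundary.
WORD_BOUNDARY = re.compile(r"\b#\w\w+")
# Suggestion from the comment: '#' must NOT sit on a word boundary.
NON_BOUNDARY = re.compile(r"\B#\w\w+")

# Illustrative input, not taken from a real tweet.
text = "Some#Word, We're #1, #fun and #tags-with-hyphens"

boundary_hits = WORD_BOUNDARY.findall(text)
non_boundary_hits = NON_BOUNDARY.findall(text)

print(boundary_hits)      # ['#Word']
print(non_boundary_hits)  # ['#fun', '#tags']
```

With `\b`, only the `#` glued to a preceding word matches. With `\B`, the standalone tags match, "#1" is skipped because it is too short, and the hyphenated tag is cut off at the first hyphen.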
A: 

This is the one I wrote. It looks for word boundaries and matches only the hash text, without the leading `#`: `(?<=#)\w*?(?=\W)`
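One thing worth checking with this pattern is a tag at the very end of the input, since the lookahead `(?=\W)` needs a following character. The sketch below is in Python (an assumption; .NET supports the same lookaround syntax). The second pattern, `(?<=#)\w+`, is a hypothetical variant added for comparison, not part of the answer.

```python
import re

# Pattern from the answer: text after '#', up to the next non-word character.
LOOKAROUND = re.compile(r"(?<=#)\w*?(?=\W)")
# Hypothetical variant for comparison: no lookahead, at least one word character.
VARIANT = re.compile(r"(?<=#)\w+")

sample = "this is a #awesome event, lets use the tag #fun"

lookaround_hits = LOOKAROUND.findall(sample)
variant_hits = VARIANT.findall(sample)

print(lookaround_hits)  # ['awesome']
print(variant_hits)     # ['awesome', 'fun']
```

The original pattern misses "#fun" because nothing follows it. Neither pattern checks what precedes the `#`.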

Carter Cole
A: 

I tweeted a string with randomly placed hash tags, saw what Twitter did with it, and then tried to match it with a regular expression. Here's what I got:

\B#\w*[a-zA-Z]+\w*

#face *#Fa!ce something #iam#1 #1 #919 #jifdosaj somethin#idfsjoa 9#9#98 9#9f9j#9jlasdjl #jklfdsajl34 #34239 #jkf *#a *#1j3rj3
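To see which pieces of that test string the pattern picks out, it can be run as-is. This is a minimal sketch in Python (an assumption; the pattern uses nothing that differs in .NET for ASCII input), and the variable names are illustrative.

```python
import re

# Pattern from the answer: '#' not on a word boundary, with at least one ASCII letter.
PATTERN = re.compile(r"\B#\w*[a-zA-Z]+\w*")

# The test string from the answer, copied verbatim.
tweet = ("#face *#Fa!ce something #iam#1 #1 #919 #jifdosaj somethin#idfsjoa "
         "9#9#98 9#9f9j#9jlasdjl #jklfdsajl34 #34239 #jkf *#a *#1j3rj3")

matches = PATTERN.findall(tweet)
print(matches)
# ['#face', '#Fa', '#iam', '#jifdosaj', '#jklfdsajl34', '#jkf', '#a', '#1j3rj3']
```

Purely numeric tags such as "#1" and "#919" are skipped, as is any `#` that directly follows a letter or digit.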

go minimal
A: 

I've tested some tweets and found that hashtags:

  • Are composed of alphanumeric characters plus the underscore.
  • Must have at least 1 letter or underscore.
  • May contain a dot, but then the hashtag is interpreted as a link to an external site. (I do not handle this case.)

So, that's what I've got:

\B#(\w*[A-Za-z_]+\w*)
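A short sketch of how the capture group behaves, written in Python (an assumption; in .NET the same text would be read from `Groups[1]` of each match). The input string and the helper name `tag_names` are hypothetical.

```python
import re

# Pattern from the answer: the group captures the tag text without the '#'.
PATTERN = re.compile(r"\B#(\w*[A-Za-z_]+\w*)")

def tag_names(text):
    """Return the captured tag text for every hashtag found."""
    # With exactly one group, findall returns the group contents, not the full match.
    return PATTERN.findall(text)

# Illustrative input covering the rules listed above.
print(tag_names("#fun #1 #_ #2day some#word #a_b."))  # ['fun', '_', '2day', 'a_b']
```

Unlike a letters-only check, this accepts a tag consisting only of underscores, which follows from the second rule above.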
Gabriel Magno