We currently use the JavaScript regex new RegExp('#[^,#=!\s][^,#=!\s]*') (see [1]). It mostly works, but it also matches URLs with anchors, such as http://this.is/no#hashtag, and we would rather not match foo#bar either.

We have made some attempts with look-ahead, but either it doesn't work here or I just don't get it.

With the below source text:

#public #writable #kommentarer-till-beta -- all these should be matched
Verkligen #bra jobbat! T ex #kommentarer till #artiklar och #blogginlägg, kool. -- mixed within text
http://this.is/no#hashtag -- problem
xxy#bar      -- We'd prefer not matching this one, and...
#foo=bar   =foo#bar  -- we probably shouldn't match any of those either.
#foo,bar #foo;bar #foo-bar #foo:bar   -- We're flexible on whether these get matched in part or in full


We'd like to get below output:

(showing $ instead of <a class=tag href=.....>...</a> for readability reasons)

$ $ $ -- all these should be matched
Verkligen $ jobbat! T ex $ till $ och $, kool. -- mixed within text
http://this.is/no$ -- problem
xxy$      -- We'd prefer not matching this one, and...
$=bar   =foo$  -- we probably shouldn't match any of those either.
$,bar $ $ $   -- We're flexible on whether these get matched in part or in full

[1] http://github.com/ether/pad/blob/master/etherpad/src/plugins/twitterStyleTags/hooks.js

+1  A: 

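I believe checking for a word boundary, or rather the lack of one, would do the trick here. That seems counterintuitive at first, but # is not a word character, so \B in front of it only matches when the # is at the start of the text or follows another non-word character such as a space.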

\B#[^,#=!\s]+ doesn't match anything on the third or fourth line. However, it DOES match the #foo in #foo=bar, and everything else covered by the $ signs in your example.
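To check that claim, here is a minimal sketch that runs the pattern over the first five sample lines from the question. The helper name findTags is made up for illustration, and the pattern is written as a regex literal, so \s and \B need no extra escaping.

```javascript
// Sample lines copied from the question (trailing comments dropped).
const sampleLines = [
  "#public #writable #kommentarer-till-beta",
  "Verkligen #bra jobbat! T ex #kommentarer till #artiklar och #blogginlägg, kool.",
  "http://this.is/no#hashtag",
  "xxy#bar",
  "#foo=bar   =foo#bar",
];

// \B in front of '#' rejects a '#' that directly follows a word
// character ([A-Za-z0-9_]), which rules out no#hashtag and xxy#bar.
const tagPattern = /\B#[^,#=!\s]+/g;

// Hypothetical helper: returns all tags found in one line of text.
function findTags(text) {
  return text.match(tagPattern) || [];
}

for (const line of sampleLines) {
  console.log(JSON.stringify(line), "->", findTags(line));
}
```

Lines three and four give an empty list, while the fifth line still yields #foo, as described above.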

EDIT: After a little fiddling around, \B#[^,#=!\s]+[\s,] matches everything on the first and second lines. Nothing is matched on lines 3-5, and on line 6 everything except #foo,bar is matched in full (#foo,bar is only matched up to the comma).

You'll likely want a capturing group to leave out the whitespace or comma at the end, so that'd be \B(#[^,#=!\s]+)[\s,].
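As a rough sketch of how the capturing group could be used in a replacement: the function name replaceTags is hypothetical, and it writes the $ placeholder from the question rather than the real link markup. The trailing separator gets its own group here so it can be put back unchanged.

```javascript
// Group 1 is the tag, group 2 is the whitespace or comma that ends it.
const tagWithSeparator = /\B(#[^,#=!\s]+)([\s,])/g;

// Hypothetical helper: swaps each tag for "$" and keeps the separator.
// A callback is used so "$" is not read as a replacement pattern.
function replaceTags(text) {
  return text.replace(tagWithSeparator, (match, tag, separator) => "$" + separator);
}

console.log(replaceTags("#public #writable #kommentarer-till-beta -- all these should be matched"));
console.log(replaceTags("http://this.is/no#hashtag -- problem"));
console.log(replaceTags("#foo,bar #foo;bar #foo-bar #foo:bar   -- flexible"));
```

In real use the callback would build the link from the tag argument instead of returning a fixed "$".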

(If you really want all of the tags on line 6 to be matched in full, remove the comma from the first of the character classes.)

Note that you might need something more in there for perfect coverage, but this at least meets your current test cases.
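One gap I'm aware of: because the pattern requires a whitespace character or comma after the tag, a tag at the very end of the text is not matched. A possible refinement, offered as an untested-in-your-plugin sketch rather than a definitive fix, is to turn the trailing class into a look-ahead that also accepts end of input. The name findTagsLookahead is made up for this example.

```javascript
// The look-ahead checks for whitespace, a comma, or end of input
// without consuming it, so no capturing group is needed.
const tagLookahead = /\B#[^,#=!\s]+(?=[\s,]|$)/g;

// Hypothetical helper: returns all tags found in the text.
function findTagsLookahead(text) {
  return text.match(tagLookahead) || [];
}

console.log(findTagsLookahead("tagged as #done"));           // tag at end of input
console.log(findTagsLookahead("#foo=bar   =foo#bar"));       // still nothing
console.log(findTagsLookahead("http://this.is/no#hashtag")); // still nothing
```

Since the separator is no longer part of the match, the replacement code does not have to put it back.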

Michael Madsen