Update: As per comments regarding the ambiguity of my question, I've increased the detail in the question.

(Terminology: by "words" I am referring to any succession of alphanumeric characters.)

I'm looking for a regex to match the following, verbatim:

  • Words.
  • Words with one apostrophe at the beginning.
  • Words with any number of non-contiguous apostrophes throughout the middle.
  • Words with one apostrophe at the end.

I would also like to match the following, though not verbatim: the apostrophes should be removed.

  • Words with an apostrophe at the beginning and at the end would be matched to the word, without the apostrophes. So 'foo' would be matched to foo.
  • Words with more than one contiguous apostrophe in the middle would be resolved to two different words: the fragment before the contiguous apostrophes and the fragment after the contiguous apostrophes. So, foo''bar would be matched to foo and bar.
  • Words with more than one contiguous apostrophe at the beginning or at the end would be matched to the word, without the apostrophes. So, ''foo would be matched to foo and ''foo'' to foo.

Examples: these would be matched verbatim:

  • 'bout
  • it's
  • persons'

But these would be ignored:

  • '
  • ''

And, for 'open', open would be matched.
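The rules above can be read as a two-step process: find each alphanumeric core together with the apostrophes around it, then decide which apostrophes to keep. The following is only a minimal Python sketch of that reading, not a definitive solution. The names `extract_words` and `_TOKEN` are made up for illustration, and the handling of mixed cases the question does not cover (for example `'foo''`, which is stripped to `foo` here) is an assumption.

```python
import re

# A core is a run of alphanumerics with single apostrophes allowed inside it;
# any apostrophes immediately before or after the core are captured separately.
_TOKEN = re.compile(r"('*)([0-9A-Za-z]+(?:'[0-9A-Za-z]+)*)('*)")


def extract_words(text):
    """Return the words in text, keeping or dropping edge apostrophes.

    Assumption: an edge apostrophe is kept only when it is a single
    apostrophe on exactly one side of the word.
    """
    words = []
    for lead, core, trail in _TOKEN.findall(text):
        if lead == "'" and not trail:
            words.append("'" + core)   # 'bout
        elif trail == "'" and not lead:
            words.append(core + "'")   # persons'
        else:
            words.append(core)         # 'open', ''foo, foo'' and plain words
    return words


print(extract_words("'bout it's persons' 'open' foo''bar ''foo ''foo'' ' ''"))
```

Because the core must contain at least one alphanumeric character, a lone `'` or `''` never produces a match, and `foo''bar` splits into `foo` and `bar` since the inner group accepts only one apostrophe at a time.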

+1  A: 
/('\w+)|(\w+'\w+)|(\w+')|(\w+)/
  • '\w+ Matches a ' followed by one or more word characters (letters, digits or underscore), OR
  • \w+'\w+ Matches one or more word characters followed by a ' followed by one or more word characters, OR
  • \w+' Matches one or more word characters followed by a ', OR
  • \w+ Matches one or more word characters
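As a quick check of how the alternation behaves, here is a small Python sketch. The helper name `first_match` is hypothetical, and Python's `re` module is an assumption, since the answer does not say which regex engine it targets.

```python
import re

# The four alternatives from the answer, tried left to right at each position.
ALTERNATION = re.compile(r"('\w+)|(\w+'\w+)|(\w+')|(\w+)")


def first_match(text):
    """Return the first substring the alternation matches, or None."""
    m = ALTERNATION.search(text)
    return m.group(0) if m else None


for sample in ["'bout", "it's", "persons'", "'a'", "'", "''"]:
    print(repr(sample), "->", repr(first_match(sample)))
```

Because the first alternative wins whenever the text starts with an apostrophe, `'a'` yields `'a` rather than `a`; the pattern has no way to notice the closing apostrophe and drop both.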
WhirlWind
Returns `'a` for `'a'` (should return `a`) but otherwise great.
Beau Martínez
+1  A: 

Try using this:

(?=.*\w)^(\w|')+$

'bout     # pass
it's      # pass
persons'  # pass
'         # fail
''        # fail

Regex Explanation

NODE      EXPLANATION
  (?=       look ahead to see if there is:
    .*        any character except \n (0 or more times
              (matching the most amount possible))
    \w        word characters (a-z, A-Z, 0-9, _)
  )         end of look-ahead
  ^         the beginning of the string
  (         group and capture to \1 (1 or more times
            (matching the most amount possible)):
    \w        word characters (a-z, A-Z, 0-9, _)
   |         OR
    '         '\''
  )+        end of \1 (NOTE: because you're using a
            quantifier on this capture, only the LAST
            repetition of the captured pattern will be
            stored in \1)
  $         before an optional \n, and the end of the
            string
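The pass/fail table above can be reproduced with a short Python sketch. The helper name `passes` is hypothetical and the use of Python's `re` module is an assumption. Note that the pattern is anchored with `^` and `$`, so it validates one whole string at a time rather than extracting words from running text, and it does not remove any apostrophes.

```python
import re

# Lookahead requires at least one word character somewhere in the string;
# the anchored group then allows only word characters and apostrophes.
WHOLE_STRING = re.compile(r"(?=.*\w)^(\w|')+$")


def passes(candidate):
    """Return True if the whole candidate string is accepted by the pattern."""
    return WHOLE_STRING.match(candidate) is not None


for sample in ["'bout", "it's", "persons'", "'", "''", "''foo''", "it's fine"]:
    print(repr(sample), "pass" if passes(sample) else "fail")
```

Strings such as `''foo''` still pass unchanged, so any stripping of surrounding apostrophes would have to happen in a separate step.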
macek
Cheers on the explanation too.
Beau Martínez
A: 

How about this?

'?\b[0-9A-Za-z']+\b'?

EDIT: the previous version doesn't include apostrophes on the sides.
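To see what the edited pattern does on the question's examples, here is a minimal Python sketch. The helper name `find_all` is hypothetical, and the behaviour shown is that of Python's `re` module, which is an assumption; other engines may treat `\b` slightly differently.

```python
import re

# Optional apostrophe, word boundary, a run of alphanumerics and apostrophes,
# word boundary, optional apostrophe.
EDGE_PATTERN = re.compile(r"'?\b[0-9A-Za-z']+\b'?")


def find_all(text):
    """Return every substring of text that the pattern matches."""
    return EDGE_PATTERN.findall(text)


for sample in ["'bout it's persons'", "'open'", "foo''bar", "'", "''"]:
    print(repr(sample), "->", find_all(sample))
```

In Python, `\b` needs a word character on one side, so a lone `'` or `''` produces no match. On the other hand, `'open'` and `foo''bar` come back verbatim, because the character class itself accepts apostrophes and nothing strips or splits them.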

shinkou
@shinkou, this matches `'` and `''` as well. The OP doesn't want it to match if there are not letters present.
macek
@smotchkkiss: are you sure it matches ' or ''? I just tested it and it seems to work well.
Claudio Redi
hey, my fault the down vote. Was playing with the arrows, could you edit your post so I can revert it?
Claudio Redi
@Claudio It's ok. I just wanted to share what I know.
shinkou
no, but I didn't mean to downvote you, it was an accident. But now it doesn't allow me to revert it. If you edit your post I would be able to do it. Sorry :-(
Claudio Redi
@shinkou, the edit is nice. Per the OP's updated question you might want to change the `'?` expressions to `'*` to match `''foo`, or `bar''`.
macek
+1  A: 

I submitted this second answer because the question has changed quite a bit and my previous answer is no longer valid. Anyway, if all the conditions are now listed, try this:

(((?<!')')?\b[0-9A-Za-z]+\b('(?!'))?|\b[0-9A-Za-z]+('[0-9A-Za-z]+)*\b)
shinkou
A: 

This works fine:

 ('*)(?:'')*('?(?:\w+'?)+\w+('\b|'?[^']))(\1)

It handles this data without problems:

    'bou
    it's
    persons'
    'open'
    open
    foo''bar
    ''foo
    bee''
    ''foo''
    '
    ''

On this data you should strip the results (remove spaces from the matches):

    'bou it's persons' 'open' open foo''bar ''foo ''foo'' ' ''

(tested in The Regulator, results in $2)

Vojtech R.