Hi

I would like to write a regex, using a Unix command, that identifies all strings that do not conform to the following format:

First letter is uppercase
Followed by any number of letters
Underscore
Followed by an uppercase letter
Followed by any number of letters
Underscore
and so on ...

The number of underscores is variable

So valid ones are                                     Invalid ones are
Alpha_Beta_Gamma                                      alph_Beta_Gamma
Alpha_Beta_Gamma_Delta                                Alpha_beta_Gamma
Alppha_Beta                                           Alpha_beta
Aliph_Theta_Pi_Chi_Ming                               Alpha_theta_Pi_Chi_Ming
+3  A: 

grep has a -v option, which inverts the match (i.e. returns non-matching lines). The -E option puts grep into extended-regexp mode, which allows + and parentheses to be used unescaped in the pattern.

The pattern you can use is (broken up for clarity):

^              # beginning of string
  [A-Z]        # a single uppercase letter
  [a-z]*       # zero or more lowercase letters
  (            # start a group
    _          # an underscore
    [A-Z]      # a single uppercase letter
    [a-z]*     # zero or more lowercase letters
  )+           # close the group and it can appear one or more times
$              # end of string

So, assuming you have a file test.dat that contains the 8 strings from your question:

grep -E -v "^[A-Z][a-z]*(_[A-Z][a-z]*)+$" test.dat

Returns:

alph_Beta_Gamma
Alpha_beta_Gamma
Alpha_beta
Alpha_theta_Pi_Chi_Ming
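
If you want to try this without creating test.dat, here is a minimal sketch that feeds a few names to the same pattern on standard input. The function name find_invalid is hypothetical, and the LC_ALL=C prefix is my assumption: in some locales a range like [A-Z] can also match lowercase letters, and the C locale avoids that.

```shell
# Hypothetical helper: print every line on stdin that does NOT
# match the Upper_Upper_... naming format from the question.
find_invalid() {
  # LC_ALL=C is an assumption, so that [A-Z] and [a-z] mean plain ASCII ranges.
  # -E: extended regexp, -v: print only the lines that do not match.
  LC_ALL=C grep -E -v '^[A-Z][a-z]*(_[A-Z][a-z]*)+$'
}

# Sample input: three names from the question plus A_B_C from the comments.
printf '%s\n' \
  Alpha_Beta_Gamma \
  alph_Beta_Gamma \
  Alpha_beta \
  A_B_C |
  find_invalid
```

This should print only alph_Beta_Gamma and Alpha_beta. Note that because the group ends in +, a name with no underscore at all (such as Alpha) is also reported as non-conforming; change the + to * if such names should be accepted. Also, grep exits with status 1 when it prints nothing, which matters if you run this under set -e.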
Daniel Vandersluis
Daniel, thank you very much.
lisa
Thanks for the wonderful explanation.
lisa
The `+` after the `[a-z]` needs to be `*` to match `A_B_C`.
Omnifarious
@Omnifarious: Missed that requirement; I updated my pattern.
Daniel Vandersluis