I am searching for words like the following in .todo files:

ZshTabCompletionBackward 
MacTerminalIterm

I wrote the following regex:

[A-Z]{1}[a-z]*[A-Z]{1}[a-z]*

However, it is not enough, since it matches only words of the following type:

ZshTab

In pseudocode, I am trying to write the following regex:

([A-Z]{1}[a-z]*[A-Z]{1}[a-z]*){1-9}

How can I write the above regex in Perl?

+13  A: 

I think you want something like this, written with the /x flag so you can add comments and insignificant whitespace:

/
   \b      # word boundary so you don't start in the middle of a word

   (          # open grouping
      [A-Z]      # initial uppercase
      [a-z]*     # any number of lowercase letters
   )          # end grouping

   {2,}    # quantifier: at least 2 instances, unbounded max  

   \b      # word boundary
/x

If you want it without the fancy formatting, just remove the whitespace and comments:

/\b([A-Z][a-z]*){2,}\b/
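As a quick check, the sketch below runs both forms against one line of text. The sample line and variable names are assumptions made for illustration, and the group is written as non-capturing `(?:...)` so that a list-context match returns whole words rather than the last captured chunk.

```perl
use strict;
use warnings;

# Hypothetical sample line, standing in for the contents of a .todo file.
my $text = 'fix ZshTabCompletionBackward and MacTerminalIterm in Perl, see FOO';

# Expanded form: /x lets the pattern carry whitespace and comments.
my $expanded = qr/
    \b
    (?: [A-Z] [a-z]* ){2,}   # two or more capitalized chunks
    \b
/x;

# Compact form of the same pattern.
my $compact = qr/\b(?:[A-Z][a-z]*){2,}\b/;

# In list context, /g returns every match of the single capture group.
my @from_expanded = $text =~ /($expanded)/g;
my @from_compact  = $text =~ /($compact)/g;

print "$_\n" for @from_expanded;
```

Both forms should return the same list: the two words from the question plus FOO, while Perl is skipped because it has only one capital. For the bounded repeat in the question's pseudocode, Perl writes the range with a comma, as in `{1,9}`.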

As j_random_hacker points out, this is a bit simple since it will match a word that is just consecutive capital letters. His solution, which I've expanded with /x to show some detail, ensures at least one lowercase letter:

/
    \b          # start at word boundary
    [A-Z]       # start with upper
    [a-zA-Z]*   # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       [a-z][a-zA-Z]*[A-Z]   # next bit is lower, then zero or more letters, ending with upper
          |                     # or 
       [A-Z][a-zA-Z]*[a-z]   # next bit is upper, then zero or more letters, ending with lower
    )

    [a-zA-Z]*   # anything that's left
    \b          # end at word boundary
/x
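To see what this stricter pattern picks out of running text, here is a minimal sketch. The sample line is an assumption for illustration; the pattern is the one above, stored with `qr//` so it can be reused.

```perl
use strict;
use warnings;

# The pattern from the answer above, kept in its /x form.
my $camel = qr/
    \b
    [A-Z]
    [a-zA-Z]*
    (?:
        [a-z][a-zA-Z]*[A-Z]   # a lowercase letter, later an uppercase one
      |
        [A-Z][a-zA-Z]*[a-z]   # an uppercase letter, later a lowercase one
    )
    [a-zA-Z]*
    \b
/x;

# Hypothetical line of text containing camel-cased and other words.
my $text = 'TODO: check ZshTabCompletionBackward, MacTerminalIterm, Perl and XXX';

# No capture groups in the pattern, so /g in list context returns whole matches.
my @found = $text =~ /$camel/g;

print "$_\n" for @found;
```

With this input only the two compound words should be returned: TODO and XXX contain no lowercase letter, and Perl has no second capital.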

I explain all of these features in Learning Perl.

brian d foy
Isn't a single capitalized word (such as Perl or Boing) also a valid CamelCase word? In that case, the quantifier should be {1,} or simply +
Barry Brown
@Barry: In many cases, it would cause more problems than it solves. I like Brian's versions. @Brian: What does the /x flag mean, which you do not use in your last command?
Masi
Perl or Boing are not camel-cased because they are not compound words.
brian d foy
You guys need to be more careful when you talk about something being camel case: do you mean ArabianCamelCase (also known as DromedaryCase, one word okay) or BactrianCamelCase (multiple words)?
Anon Guy
Not to mention AliceTheCamelCase (also known as lowercase).
Anon Guy
What about the third form, smallFirstLetter case? Isn't that also camel case? After all, no matter what kind of camel, the hump(s) are always in the middle, not at the ends.
AmbroseChapel
@Ambrose: That's what I know camel case as.
sharth
Note that this regex will also pick up words that consist of all capitals (depending on your precise definition of camel case, these words may or may not be considered camel cased). If you want to restrict to just camel cased words containing at least one lowercase letter, use: /\b([A-Z][a-z]*)+[A-Z][a-z]+([A-Z][a-z]*)*\b/
j_random_hacker
Yeah, consecutive capital letters is a definition problem. If I were going over source code, I'd pick up those XXX I litter everywhere.
brian d foy
I think somebody needs to make a Regexp::Common module to handle these cases.
Kent Fredric
+4  A: 

Assuming you aren't using the regex to do extraction, and just matching...

[A-Z][a-zA-Z]*

Isn't the only real requirement that it's all letters and starts with a capital letter?
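For pure matching of a single token, the pattern needs anchors so that it has to cover the whole string. A small sketch, assuming a hypothetical helper named `looks_capitalized` and a made-up token list:

```perl
use strict;
use warnings;

# Hypothetical helper: true when the whole token is letters only
# and starts with an uppercase letter.
sub looks_capitalized {
    my ($token) = @_;
    return $token =~ /\A[A-Z][a-zA-Z]*\z/ ? 1 : 0;
}

for my $token (qw(ZshTab Perl FOO zshTab Zsh_Tab)) {
    printf "%-8s %s\n", $token, looks_capitalized($token) ? 'match' : 'no match';
}
```

Under this loose definition Perl and FOO are accepted along with ZshTab; only the token starting in lowercase and the one containing an underscore are rejected.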

sharth
This is pretty much equivalent to Brian's regex except less complicated. You could detect words like HellotheRe, which obviously isn't correct CamelCase, but no regex can tell what is a word in there. Just put in the boundary marks and this should be good enough.
Unknown
EDIT: I corrected your regex by capitalising the final "z".
j_random_hacker
@j_random_hacker: whoops. Thanks for catching that.
sharth
+4  A: 

brian's and sharth's answers will also report words that consist entirely of uppercase letters (e.g. FOO). This may or may not be what you want. If you want to restrict to just camel-cased words that contain at least one lowercase letter, use:

/\b[A-Z][a-zA-Z]*[a-z][a-zA-Z]*\b/

If in addition you wish to exclude words that consist of a single uppercase letter followed by any number of lowercase letters (e.g. Perl), use:

/\b[A-Z][a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/

(Basically, we require the string to start with a capital letter and to contain at least one additional capital letter and one lowercase letter; these latter two can appear in either order.)
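The difference between the two patterns shows up on three words. This is only an illustrative sketch: the word list and the names `$has_lower`, `$compound` and `%verdict` are assumptions, and each pattern is wrapped in `\A ... \z` to test a whole word.

```perl
use strict;
use warnings;

# First pattern: a leading capital and at least one lowercase letter.
my $has_lower = qr/\b[A-Z][a-zA-Z]*[a-z][a-zA-Z]*\b/;

# Second pattern: additionally requires a second capital letter.
my $compound = qr/\b[A-Z][a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/;

my %verdict;
for my $word (qw(FOO Perl FooBar)) {
    $verdict{$word} = {
        has_lower => ($word =~ /\A$has_lower\z/ ? 1 : 0),
        compound  => ($word =~ /\A$compound\z/  ? 1 : 0),
    };
    printf "%-7s has_lower=%d compound=%d\n",
        $word, @{ $verdict{$word} }{qw(has_lower compound)};
}
```

FOO fails both patterns, Perl passes only the first, and FooBar passes both.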

j_random_hacker
Your first example matches things that aren't compound words, like "Foo". The second one is a bit hairy for early morning golfing. :)
brian d foy
@brian: As you know, with regexes it's often a case of "some hair required." :) I hope it's clear from the 2nd body text paragraph that the 1st regex will match "Foo" et al. (since the purpose of the 2nd regex is specifically to exclude those matches).
j_random_hacker
A: 

How about this one: /\b[A-Z]([a-z]+[A-Z]?)*\b/ ??
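One way to see how this pattern behaves compared with the earlier ones is to try it on a few whole words. The word list and the `%matched` hash are assumptions for illustration only.

```perl
use strict;
use warnings;

# The pattern from this answer, anchored to test a whole word.
my $pattern = qr/\A\b[A-Z]([a-z]+[A-Z]?)*\b\z/;

my %matched;
for my $word (qw(ZshTab Perl FOO XMLParser)) {
    $matched{$word} = $word =~ $pattern ? 1 : 0;
    printf "%-10s %s\n", $word, $matched{$word} ? 'match' : 'no match';
}
```

Unlike the compound-word patterns, this one accepts a single capitalized word such as Perl. It rejects any word with two capitals in a row, so FOO and XMLParser do not match.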

Jagmal
What is the key difference between your code and Brian's?
Masi