ansaurus

Question

What Perl regex can match CamelCase words?

Answer 1

+13 A:

I think you want something like this, written with the /x flag to add comments and insignificant whitespace:

/
   \b      # word boundary so you don't start in the middle of a word

   (          # open grouping
      [A-Z]      # initial uppercase
      [a-z]*     # any number of lowercase letters
   )          # end grouping

   {2,}    # quantifier: at least 2 instances, unbounded max  

   \b      # word boundary
/x

If you want it without the fancy formatting, just remove the whitespace and comments:

/\b([A-Z][a-z]*){2,}\b/
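A quick way to see what this pattern picks up is to run it over a sample string. This is only a sketch: the sub name simple_camel_words and the sample text are made up for illustration, and the grouping is adjusted (a non-capturing inner group inside one outer capture) so that $1 holds the whole word rather than just the last hump.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every word in $text that the simple
# pattern matches. The outer parentheses capture the whole word.
sub simple_camel_words {
    my ($text) = @_;
    my @words;
    while ($text =~ /\b((?:[A-Z][a-z]*){2,})\b/g) {
        push @words, $1;
    }
    return @words;
}

my @found = simple_camel_words('Perl has CamelCase, XXX markers and fooBar');
print join(', ', @found), "\n";    # prints "CamelCase, XXX"
```

Perl is skipped because it has only one hump, and fooBar because the match has to begin at a word boundary with an uppercase letter.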

As j_random_hacker points out, this is a bit simple since it will match a word that is just consecutive capital letters. His solution, which I've expanded with /x to show some detail, ensures at least one lowercase letter:

/
    \b          # start at word boundary
    [A-Z]       # start with upper
    [a-zA-Z]*   # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       [a-z][a-zA-Z]*[A-Z]   # next bit is lower, any zero or more, ending with upper
          |                     # or 
       [A-Z][a-zA-Z]*[a-z]   # next bit is upper, any zero or more, ending with lower
    )

    [a-zA-Z]*   # anything that's left
    \b          # end at word boundary
/x

I explain all of these features in Learning Perl.
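To check the stricter pattern against individual words, it can be precompiled with qr//x and anchored to the whole string. A minimal sketch, assuming whole-word input; the names $strict_camel and is_strict_camel and the word list are hypothetical, not part of the answer above.

```perl
use strict;
use warnings;

# The stricter pattern, compiled once. /x lets us keep the layout readable.
my $strict_camel = qr/
    \b [A-Z] [a-zA-Z]*                 # starts with an uppercase letter
    (?: [a-z][a-zA-Z]*[A-Z]            # a lowercase, later an uppercase
      | [A-Z][a-zA-Z]*[a-z] )          # or an uppercase, later a lowercase
    [a-zA-Z]* \b                       # whatever letters are left
/x;

# Hypothetical helper: true if the entire word matches the pattern.
sub is_strict_camel {
    my ($word) = @_;
    return $word =~ /\A$strict_camel\z/ ? 1 : 0;
}

for my $word (qw(CamelCase HTMLParser XXX Perl fooBar)) {
    printf "%-10s %s\n", $word, is_strict_camel($word) ? 'match' : 'no match';
}
```

With this word list, CamelCase and HTMLParser match, while XXX (no lowercase letter), Perl (no second uppercase letter) and fooBar (lowercase first letter) do not.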

brian d foy 2009-05-02 23:05:47

Isn't a single capitalized word (such as Perl or Boing) also a valid CamelCase word? In that case, the quantifier should be {1,} or simply +

Barry Brown 2009-05-02 23:16:53

@Barry: In many cases, it would cause more problems than it solves. I like Brian's versions. @Brian: What does the /x flag mean, which you do not use in your last command?

Masi 2009-05-03 00:08:29

Perl or Boing are not camel-cased because they are not compound words.

brian d foy 2009-05-03 00:27:11

You guys need to be more careful when you talk about something being camel case: do you mean ArabianCamelCase (also known as DromedaryCase, one word okay) or BactrianCamelCase (multiple words)?

Anon Guy 2009-05-03 00:42:31

Not to mention AliceTheCamelCase (also known as lowercase).

Anon Guy 2009-05-03 00:42:53

What about the third form, smallFirstLetter case? Isn't that also camel case? After all, no matter what kind of camel, the hump(s) are always in the middle, not at the ends.

AmbroseChapel 2009-05-03 01:16:32

@Ambrose: That's what I know camel case as.

sharth 2009-05-03 01:22:08

Note that this regex will also pick up words that consist of all capitals (depending on your precise definition of camel case, these words may or may not be considered camel cased). If you want to restrict to just camel cased words containing at least one lowercase letter, use: /\b([A-Z][a-z]*)+[A-Z][a-z]+([A-Z][a-z]*)*\b/

j_random_hacker 2009-05-03 08:23:17

Yeah, consecutive capital letters is a definition problem. If I were going over source code, I'd pick up those XXX I litter everywhere.

brian d foy 2009-05-03 11:58:27

I think somebody needs to make a Regexp::Common module to handle these cases.

Kent Fredric 2009-05-03 17:28:53

Answer 2

+4 A:

Assuming you aren't using the regex to do extraction, and just matching...

[A-Z][a-zA-Z]*

Isn't the only real requirement that it's all letters and starts with a capital letter?
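One caveat when this is applied to running text rather than to a single word: without word boundaries it can start in the middle of a word. A rough sketch of the difference, where the helper name words_matching and the sample text are assumptions made for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: return every substring of $text matched by $pattern.
# Called in list context, a /g match returns all captures of the outer group.
sub words_matching {
    my ($pattern, $text) = @_;
    return $text =~ /($pattern)/g;
}

my $text    = 'fooBar Perl XXX CamelCase';
my @loose   = words_matching(qr/[A-Z][a-zA-Z]*/,     $text);
my @bounded = words_matching(qr/\b[A-Z][a-zA-Z]*\b/, $text);

print "without \\b: @loose\n";      # Bar Perl XXX CamelCase
print "with \\b:    @bounded\n";    # Perl XXX CamelCase
```

The unbounded version picks Bar out of fooBar; adding \b on both sides stops that, though Perl and XXX are still accepted.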

sharth 2009-05-03 01:21:29

This is pretty much equivalent to Brian's regex except less complicated. You could detect words like HellotheRe, which obviously isn't correct CamelCase, but no regex can tell what is a word in there. Just put in the boundary marks and this should be good enough.

Unknown 2009-05-03 01:54:55

EDIT: I corrected your regex by capitalising the final "z".

j_random_hacker 2009-05-03 08:38:15

@j_random_hacker: whoops. Thanks for catching that.

sharth 2009-05-03 15:27:21

Answer 3

+4 A:

brian's and sharth's answers will also report words that consist entirely of uppercase letters (e.g. FOO). This may or may not be what you want. If you want to restrict to just camel-cased words that contain at least one lowercase letter, use:

/\b[A-Z][a-zA-Z]*[a-z][a-zA-Z]*\b/

If in addition you wish to exclude words that consist of a single uppercase letter followed by any number of lowercase letters (e.g. Perl), use:

/\b[A-Z][a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/

(Basically, we require the string to start with a capital letter and to contain at least one additional capital letter and one lowercase letter; these latter two can appear in either order.)
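To make the difference between the two patterns concrete, here is a small sketch that sorts whole words into three buckets. The variable names, the classify sub and its labels are invented for this example.

```perl
use strict;
use warnings;

# First pattern: at least one lowercase letter somewhere after the start.
my $has_lower = qr/\b[A-Z][a-zA-Z]*[a-z][a-zA-Z]*\b/;

# Second pattern: additionally requires a second uppercase letter.
my $compound = qr/\b[A-Z][a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/;

# Hypothetical helper: report the strictest pattern a whole word satisfies.
sub classify {
    my ($word) = @_;
    return 'both patterns'      if $word =~ /\A$compound\z/;
    return 'first pattern only' if $word =~ /\A$has_lower\z/;
    return 'neither';
}

printf "%-8s %s\n", $_, classify($_) for qw(FOO Perl FooBar foo);
```

Here FOO and foo match neither pattern, Perl matches only the first, and FooBar matches both.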

j_random_hacker 2009-05-03 08:36:55

Your first example matches things that aren't compound words, like "Foo". The second one is a bit hairy for early morning golfing. :)

brian d foy 2009-05-03 12:07:25

@brian: As you know, with regexes it's often a case of "some hair required." :) I hope it's clear from the 2nd body text paragraph that the 1st regex will match "Foo" et al. (since the purpose of the 2nd regex is specifically to exclude those matches).

j_random_hacker 2009-05-03 14:28:47

Answer 4

A:

How about this one: /\b[A-Z]([a-z]+[A-Z]?)*\b/ ??
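For comparison with the other answers, a sketch of what this pattern accepts when applied to whole words. The names $humps and matches_humps and the word list are made up for the example.

```perl
use strict;
use warnings;

# An uppercase letter, then any number of lowercase runs, each optionally
# followed by one uppercase letter.
my $humps = qr/\b[A-Z]([a-z]+[A-Z]?)*\b/;

# Hypothetical helper: true if the entire word matches the pattern.
sub matches_humps {
    my ($word) = @_;
    return $word =~ /\A$humps\z/ ? 1 : 0;
}

for my $word (qw(FooBar Perl A XXX HTMLParser)) {
    printf "%-10s %s\n", $word, matches_humps($word) ? 'match' : 'no match';
}
```

It accepts FooBar but also Perl and the single letter A, and it rejects anything with two adjacent capitals, such as XXX or HTMLParser.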

Jagmal 2009-07-15 20:29:48

What is the key difference between your code and Brian's?

Masi 2009-07-15 22:02:14
