Suppose I have strings like the following:

OneTwo
ThreeFour
AnotherString
DVDPlayer
CDPlayer

I know how to tokenize the camel-case ones, except for "DVDPlayer" and "CDPlayer". I know I could tokenize them manually, but maybe you can show me a regex that can handle all the cases?

EDIT: the expected tokens are:

OneTwo -> One Two
...
CDPlayer -> CD Player
DVDPlayer -> DVD Player
A: 

Try a non-greedy match with a lookahead. A token is one or more uppercase characters followed by zero or more lowercase characters. The token ends when the next two characters are an uppercase letter followed by a lowercase letter; the non-greedy quantifiers are what let the match stop at that point. This approach has limitations, but it should work for the examples you provided.
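
The description above does not give a concrete pattern, so here is a minimal Python sketch of one possible reading of it. The pattern and the name tokenize_lazy are assumptions made for illustration, not something taken from this answer.

```python
import re

# Hypothetical pattern: lazy run of uppercase letters, lazy run of
# lowercase letters, stopping as soon as the lookahead sees either
# "uppercase followed by lowercase" or the end of the string.
LAZY_TOKEN = re.compile(r"[A-Z]+?[a-z]*?(?=[A-Z][a-z]|$)")

def tokenize_lazy(s):
    """Return the list of tokens found in s."""
    return LAZY_TOKEN.findall(s)

for word in ["OneTwo", "ThreeFour", "AnotherString", "DVDPlayer", "CDPlayer"]:
    print(word, "->", " ".join(tokenize_lazy(word)))
```

This reading handles the five strings from the question, but it is not general: on an input such as "OneDVD", where an acronym follows a lowercase letter, it only finds "DVD".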

Benedict Cohen
+1 because you got there first -- though I guess an example might have pushed you up the "helpful" ranking :)
chrispy
+4  A: 

Look at my answer on the question, .NET - How can you split a “caps” delimited string into an array?.

The regex looks like this:

/([A-Z]+(?=$|[A-Z][a-z])|[A-Z]?[a-z]+)/g

It can be modified slightly to find camel-cased tokens inside longer text, by replacing the $ with \b:

/([A-Z]+(?=\b|[A-Z][a-z])|[A-Z]?[a-z]+)/g
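
For anyone who wants to try both variants, here is a small sketch in Python, assuming re.findall as the counterpart of the /g flag. The names WHOLE_STRING and IN_TEXT are made up for this example.

```python
import re

# Variant with $: an all-caps run may end only at the end of the string
# or right before an "Uppercase + lowercase" pair.
WHOLE_STRING = re.compile(r"([A-Z]+(?=$|[A-Z][a-z])|[A-Z]?[a-z]+)")

# Variant with \b: an all-caps run may also end at any word boundary,
# so tokens can be picked out of a longer piece of text.
IN_TEXT = re.compile(r"([A-Z]+(?=\b|[A-Z][a-z])|[A-Z]?[a-z]+)")

for word in ["OneTwo", "ThreeFour", "AnotherString", "DVDPlayer", "CDPlayer"]:
    print(word, "->", " ".join(WHOLE_STRING.findall(word)))

print(IN_TEXT.findall("my DVDPlayer and CDPlayer"))
```

With a single capturing group, findall returns the matched tokens directly as a list of strings.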
MizardX
The latter is almost equivalent to Gumbo's answer. The only difference is that this one also accepts words starting with a lower-case letter. "camelCase" -> ["camel", "Case"]
MizardX
+4  A: 

Try this regular expression:

[A-Z](?:[a-z]+|[A-Z]*?(?=[A-Z][a-z]|\b))
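
A quick way to check this expression, sketched in Python (the name CAPS_TOKEN is just for illustration):

```python
import re

# Each token starts with an uppercase letter, then is either a run of
# lowercase letters, or a lazy run of uppercase letters that stops
# before an "Uppercase + lowercase" pair or at a word boundary.
CAPS_TOKEN = re.compile(r"[A-Z](?:[a-z]+|[A-Z]*?(?=[A-Z][a-z]|\b))")

for word in ["OneTwo", "ThreeFour", "AnotherString", "DVDPlayer", "CDPlayer"]:
    print(word, "->", " ".join(CAPS_TOKEN.findall(word)))
```

Since every token has to begin with an uppercase letter, a leading lowercase word is skipped: "camelCase" yields only "Case".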
Gumbo
+1  A: 

The regex

([A-Z]+[a-z]*)([A-Z][a-z]*)

would do what you want, assuming that all your strings are exactly two words long and the second word is not an acronym like DVD.

I.e. it would work for your examples but maybe not for what you are actually trying to do.
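
To make the two-word assumption visible, here is a sketch in Python that applies the regex to the whole string. The helper name split_pair and the use of re.fullmatch are choices made for this example.

```python
import re

# Two groups: the first word (possibly an acronym), then a second word
# made of one uppercase letter plus optional lowercase letters.
TWO_WORDS = re.compile(r"([A-Z]+[a-z]*)([A-Z][a-z]*)")

def split_pair(s):
    """Return (first, second) if s is exactly two words, else None."""
    m = TWO_WORDS.fullmatch(s)
    return m.groups() if m else None

for word in ["OneTwo", "DVDPlayer", "CDPlayer", "OneTwoThree", "PlayerDVD"]:
    print(word, "->", split_pair(word))
```

The last two inputs come back as None: one has three words, and the other has an acronym in second position.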

JonahSan
+1  A: 

Here's my attempt:

([A-Z][a-z]+)|([A-Z]+(?=[A-Z][a-z]+))
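
A short Python sketch for trying it out. Because the pattern has two capturing groups, the example reads the whole match from finditer instead of using findall; the name tokens is made up for illustration.

```python
import re

# Either a capitalized word, or an all-caps run that is immediately
# followed by a capitalized word.
PATTERN = re.compile(r"([A-Z][a-z]+)|([A-Z]+(?=[A-Z][a-z]+))")

def tokens(s):
    """Return every full match of PATTERN in s, in order."""
    return [m.group(0) for m in PATTERN.finditer(s)]

for word in ["OneTwo", "ThreeFour", "AnotherString", "DVDPlayer", "CDPlayer"]:
    print(word, "->", " ".join(tokens(word)))
```

One thing to be aware of: an acronym is only matched when a capitalized word follows it, so "PlayerDVD" gives just "Player".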
Richard Nienaber