ansaurus

Question

Split a PascalCase string into separate words

Answer 1

A:

var regex = new Regex("([A-Z]+[^A-Z]+)");
var matches = regex.Matches("aCamelCaseWord")
    .Cast<Match>()
    .Select(match => match.Value);
foreach (var element in matches)
{
    Console.WriteLine(element);
}

Prints

Camel
Case
Word

(As you can see, it doesn't handle camelCase - it dropped the leading "a".)

Pat 2010-07-09 19:54:41

1) Compile the regexp for some speed. 2) It'll still be slower than doing it by hand.

Steven Sudit 2010-07-09 19:55:23
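A minimal sketch of what "compile the regexp" could look like, assuming the same pattern as the answer above. The class and method names (`CompiledSplitter`, `SplitWords`) are made up for illustration and do not come from the thread. Whether `RegexOptions.Compiled` actually pays off depends on how often the pattern is reused, so treat this as a starting point rather than a guaranteed speed-up.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CompiledSplitter
{
    // Same pattern as in the answer. RegexOptions.Compiled costs more at
    // construction time, so the instance is built once and reused.
    private static readonly Regex WordPattern =
        new Regex("([A-Z]+[^A-Z]+)", RegexOptions.Compiled);

    // Hypothetical helper: returns each match as a separate word.
    // Like the original, it drops any leading lowercase run.
    public static List<string> SplitWords(string text)
    {
        var words = new List<string>();
        foreach (Match match in WordPattern.Matches(text))
        {
            words.Add(match.Value);
        }
        return words;
    }
}
```

Calling `CompiledSplitter.SplitWords("aCamelCaseWord")` yields `Camel`, `Case`, `Word`, the same result as the uncompiled version.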

@Steven I agree that it should be compiled for speed, but it's the functionality I'm going after for now. What do you mean it will be "slower than doing it by hand"? If I reflect over an object with a bunch of public properties and convert the names from PascalCase to separate words, it will be much faster (development and maintenance time) doing it programmatically than by hand.

Pat 2010-07-09 20:00:35

I didn't see speed mentioned as a requirement. Also I think "doing it by hand" refers to writing your own string parsing code which *may* be faster but *will* be significantly more code and more testing.

Ron Warholic 2010-07-09 20:05:01

Where'd the "a" go?

Ken Bloom 2010-07-09 20:07:15

@Ken This method doesn't handle camelCase, so the "a" was dropped (see edit to the answer).

Pat 2010-07-09 20:14:11

@Pat: what Ron said is correct: "by hand" means writing your own code to loop over the string, character by character, building up each word into a StringBuilder and outputting as needed.

Steven Sudit 2010-07-09 20:19:51

Answer 2

A:

Start your regex with \W so that a non-word character must precede the match. That keeps each PascalCase string together as a single match, which you can then split into words.

Something like: \W([A-Z][A-Za-z]+)+

For the input:

sdcsds sd aCamelCaseWord as dasd as aSscdcacdcdc PascelCase DfsadSsdd sd

it outputs:

48: PascelCase
59: DfsadSsdd

Aaron Harun 2010-07-09 20:00:20

Hmmm. That doesn't work straight-up for .NET's regex, but maybe with a little documentation digging...

Pat 2010-07-09 20:06:05

Updated with an actual working regex.

Aaron Harun 2010-07-09 20:16:23

You should use `\b` (word boundary) to match the beginning of the word, not `\W`.

Alan Moore 2010-07-09 21:45:53
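A sketch of the `\b` suggestion, assuming the pattern from the answer with only the leading `\W` swapped out. The names `WordBoundaryFinder` and `FindPascalWords` are hypothetical. Because `\b` is zero-width, each match starts at the word itself rather than at the preceding space, and a PascalCase word at the very start of the input can match too.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordBoundaryFinder
{
    // \b replaces \W: it asserts a word boundary without consuming a character.
    private static readonly Regex PascalWord =
        new Regex(@"\b([A-Z][A-Za-z]+)+");

    // Hypothetical helper: returns every whole PascalCase word in the text.
    public static List<string> FindPascalWords(string text)
    {
        var found = new List<string>();
        foreach (Match match in PascalWord.Matches(text))
        {
            found.Add(match.Value);
        }
        return found;
    }
}
```

On the sample input from the answer this finds `PascelCase` and `DfsadSsdd`. Words such as `aCamelCaseWord` are still skipped, because there is no word boundary between the leading `a` and the `C`.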

Answer 3

A:

In Ruby:

"aCamelCaseWord".split /(?=[[:upper:]])/
=> ["a", "Camel", "Case", "Word"]

I'm using positive lookahead here, so that I can split the string right before each uppercase letter. This lets me save any initial lowercase part as well.

Ken Bloom 2010-07-09 20:02:37

That's a positive lookahead, isn't it? I can't get an equivalent to work for .NET, even when I replace `[[:upper:]]` with `[A-Z]` (http://en.wikipedia.org/wiki/Regular_expression).

Pat 2010-07-09 20:10:32

.NET regex doesn't support the POSIX character class syntax. You could use `\p{Lu}` instead, but `[A-Z]` will probably suffice. Anyway, this approach is way too simplistic. Check out the other question, especially the `split` regex @poly came up with. It really is that complicated.

Alan Moore 2010-07-09 22:02:30
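For reference, a rough .NET counterpart of the Ruby one-liner could look like the sketch below. It assumes `Regex.Split` (not `string.Split`) with `\p{Lu}` in the lookahead, as suggested in the comment above. The `(?<!^)` guard is an addition that is not in the Ruby version: it stops a string that begins with an uppercase letter from producing an empty first element. The names `LookaheadSplitter` and `SplitBeforeUpper` are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class LookaheadSplitter
{
    // Zero-width match before every uppercase letter, except at the start
    // of the input. Nothing is consumed, so no characters are lost.
    private static readonly Regex BeforeUpper =
        new Regex(@"(?<!^)(?=\p{Lu})");

    // Hypothetical helper: splits the text right before each uppercase letter.
    public static string[] SplitBeforeUpper(string text)
    {
        return BeforeUpper.Split(text);
    }
}
```

`LookaheadSplitter.SplitBeforeUpper("aCamelCaseWord")` returns `a`, `Camel`, `Case`, `Word`. As noted above, the approach is simplistic: `aCamelCaseXML` comes back with `X`, `M`, and `L` as separate elements.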

@Pat: that Wikipedia article is not meant to be used as a reference; too general and too theoretical. This site is much more useful: http://www.regular-expressions.info/

Alan Moore 2010-07-09 22:12:57

Answer 4

+3 A:

Answered in a different question:

void Main()
{
    "aCamelCaseWord".ToFriendlyCase().Dump();
}

public static class Extensions
{
    public static string ToFriendlyCase(this string PascalString)
    {
        return Regex.Replace(PascalString, "(?!^)([A-Z])", " $1");
    }
}

Outputs `a Camel Case Word` (`.Dump()` just prints to the console).

Pat 2010-07-09 20:03:22

What should happen for strings like `aCamelCaseXML`? Reading the question, I would expect `a Camel Case XML`. Instead, it gives `a Camel Case X M L`.

MainMa 2010-07-09 20:15:58

@MainMa That's true. Following .NET naming standards, any acronyms three letters or longer (e.g. XML) would be in proper case (i.e. Xml), but two-letter acronyms (e.g. IP for IPAddress) would still cause a problem. It would be better to have the algorithm handle this case.

Pat 2010-07-09 20:23:48

Is there any out-of-the-box function that does this?

Shimmy 2010-10-04 03:03:19

Answer 5

+6 A:

See this question: Is there a elegant way to parse a word and add spaces before capital letters? Its accepted answer covers what you want, including numbers and several uppercase letters in a row. While this sample has words starting in uppercase, it is equally valid when the first word is in lowercase.

string[] tests = {
   "AutomaticTrackingSystem",
   "XMLEditor",
   "AnXMLAndXSLT2.0Tool",
};


Regex r = new Regex(
    @"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[A-Za-z])(?=[^A-Za-z])"
  );

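// Replace returns a new string, so the result has to be used.
// "][" as the separator, wrapped in brackets, makes each word visible.
foreach (string s in tests)
  Console.WriteLine("[" + r.Replace(s, "][") + "]");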

The above will output:

[Automatic][Tracking][System]
[XML][Editor]
[An][XML][And][XSLT][2.0][Tool]

chilltemp 2010-07-09 20:11:44

The accepted answer is yet another RegExp-based solution.

Steven Sudit 2010-07-09 20:23:18

@Steven Sudit: Yes. RegEx is one of the best tools for this type of problem. The other question just got fleshed out with a larger set of sample use cases.

chilltemp 2010-07-09 20:37:29

@chilltemp, do you know of a built-in function for it?

Shimmy 2010-10-04 03:02:51

@Shimmy: No. I'd recommend that you use the information in the linked question to create a reusable library.

chilltemp 2010-10-04 15:53:59

I made my own function that doesn't use regex.

Shimmy 2010-10-04 18:22:15

@Shimmy: Ok, but why?

chilltemp 2010-10-04 18:31:50

@chilltemp, I think it costs less in performance. If I am wrong, correct me and I'll use the regex way.

Shimmy 2010-10-04 22:39:59

@Shimmy: Performance varies greatly depending upon many factors, including how complex the RegEx is and whether it is compiled, just as the performance of C# varies depending upon how you use it. That being said, I've always found RegEx in .NET to be fast enough for my needs (real-time transactional system with high throughput). The only way to really compare is to look at the generated IL and/or do timed test runs.

chilltemp 2010-10-05 18:53:44
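A timed test run along those lines could be sketched as follows. Everything here is an assumption for illustration: the class name `SplitTiming`, the hand-written loop (Shimmy's actual function is not shown in the thread), and the choice of the simple `(?!^)([A-Z])` pattern from Answer 4 as the regex under test. Results will vary with input, runtime, and iteration count, so no numbers are claimed.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class SplitTiming
{
    private static readonly Regex Spacer =
        new Regex("(?!^)([A-Z])", RegexOptions.Compiled);

    // Regex version: insert a space before every capital except the first character.
    public static string WithRegex(string text)
    {
        return Spacer.Replace(text, " $1");
    }

    // Hand-written version with the same behavior for A-Z input.
    public static string ByHand(string text)
    {
        var sb = new StringBuilder(text.Length + 8);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (i > 0 && c >= 'A' && c <= 'Z')
            {
                sb.Append(' ');
            }
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Runs the conversion repeatedly and reports the elapsed time.
    public static TimeSpan Time(Func<string, string> convert, string input, int iterations)
    {
        convert(input); // warm-up call so one-time setup is not measured
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            convert(input);
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

For example, compare `SplitTiming.Time(SplitTiming.WithRegex, "aCamelCaseWord", 1000000)` with `SplitTiming.Time(SplitTiming.ByHand, "aCamelCaseWord", 1000000)`, ideally in a release build and over several runs.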

Agreed. I went with your function. BTW, I edited your answer so users don't have to say "So?".

Shimmy 2010-10-05 22:37:31

@Shimmy: Ok, thanks.

chilltemp 2010-10-06 15:08:59

Answer 6

+2 A:

How about:

static IEnumerable<string> SplitPascalCase(this string text)
{
    var sb = new StringBuilder();
    using (var reader = new StringReader(text))
    {
        while (reader.Peek() != -1)
        {
            char c = (char)reader.Read();
            if (char.IsUpper(c) && sb.Length > 0)
            {
                yield return sb.ToString();
                sb.Length = 0;
            }

            sb.Append(c);
        }
    }

    if (sb.Length > 0)
        yield return sb.ToString();
}

Dan Tao 2010-07-09 20:12:23

This would be a "by hand" solution.

Steven Sudit 2010-07-09 20:20:16

@Steven Sudit: Yeah... was that forbidden or something?

Dan Tao 2010-07-10 21:23:22

@Dan: No, no, not at all. There was just some confusion about what "by hand" meant, when I suggested that to Pat as an alternative to RegExp. In fact, I think that RegExp, for all its power, is overused. For many jobs, it's a bad fit, leading to cryptic code and poor performance.

Steven Sudit 2010-07-11 04:29:21
