ansaurus

Question

Regular expression, split string by capital letter but ignore TLA

Answer 1

+6 A:

Match any uppercase character that is not followed by another uppercase character, and put a space in front of it:

Regex.Replace(input, "([A-Z])(?![A-Z])", " $1")

Edit:

I just noticed that you're using this for enumerations. I really do not encourage using string representations of enumerations like this, and the problem at hand is a good example of why. Have a look at this instead: http://www.refactoring.com/catalog/replaceTypeCodeWithClass.html

David Hedlund 2009-07-08 13:00:35

That doesn't handle "I", i.e. "IAmBored" will not be split as "I Am Bored" as I assume the OP would expect.

Brian Rasmussen 2009-07-08 13:16:30

I think you're mistaken. Try this JavaScript for yourself: `alert("IAmBored".replace(/([A-Z])(?![A-Z])/g, " $1"));` It matches "A" and "B", since neither is followed by an uppercase character, and replaces them with " A" and " B" respectively.

David Hedlund 2009-07-08 13:52:23

(Although I just realized that you only picked an unlucky example. The general point still holds when the "I" is in the middle of a sentence.)

David Hedlund 2009-07-08 13:57:09

It also inserts a space before the "A" in "BornInTheUSA".

Alan Moore 2009-12-19 10:43:45

Answer 2

+1 A:

You might think about changing the enumerations. MS coding guidelines suggest Pascal-casing acronyms as though they were words: XmlDocument, HtmlWriter, etc. Two-letter acronyms don't follow this rule, though: System.IO.

So you should be using UsaToday, and your problem will disappear.

Steve Cooper 2009-07-08 13:03:47

While I'm totally with you in general, this does not really solve the problem. If he'd written UsaToday, the split (i.e. human-readable) string would be "Usa Today", which is kind of strange since it's always written USA. Therefore I can understand the desire to retain capitalization. On the other hand, if one wants to show enum names to users, one should go with another solution. I tend to have string resources named like EnumName_ValueName, so the key can be easily generated in code, is searchable in the resource file, and can be easily localized.

OregonGhost 2009-07-08 14:23:33
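The EnumName_ValueName idea from this comment might look roughly like the following JavaScript sketch. Everything here is hypothetical: the `resources` table stands in for a real localized resource file, and `resourceKey`, `displayName`, and the `Newspaper` enum name are made up for illustration.

```javascript
// Hypothetical resource table; a real application would load this from a
// localized resource file. Keys follow the EnumName_ValueName convention.
const resources = {
  Newspaper_USAToday: "USA Today",
  Newspaper_NewYorkTimes: "The New York Times",
};

// Builds the lookup key from the enum's name and the value's name.
function resourceKey(enumName, valueName) {
  return `${enumName}_${valueName}`;
}

// Returns the display text, falling back to the raw value name when the
// table has no entry for the key.
function displayName(enumName, valueName) {
  const key = resourceKey(enumName, valueName);
  return Object.prototype.hasOwnProperty.call(resources, key)
    ? resources[key]
    : valueName;
}

console.log(displayName("Newspaper", "USAToday"));
```

With this approach the display text never depends on how the identifier is cased, so no regex splitting is needed.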

Answer 3

+3 A:

((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))

when replaced with

" $1"

handles

TodayILiveInTheUSAWithSimon
USAToday
IAmSOOOBored

yielding

 Today I Live In The USA With Simon
USA Today
I Am SOOO Bored

In a second step you'd have to trim the string, because a match on the very first letter leaves a leading space.

Tomalak 2009-07-08 13:21:30
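For readers who want to try the pattern outside .NET, here is a minimal JavaScript sketch of the two steps (replace, then trim). It assumes an engine with lookbehind support (ES2018 or later); the helper name `splitCamelCase` is not from the answer.

```javascript
// Hypothetical helper: inserts a space before an uppercase letter that either
// follows a lowercase letter or is followed by one, then removes the leading
// space produced when the very first letter matches.
function splitCamelCase(input) {
  return input
    .replace(/((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))/g, " $1")
    .trimStart();
}

const samples = [
  "TodayILiveInTheUSAWithSimon",
  "USAToday",
  "IAmSOOOBored",
  "BornInTheUSA",
];
for (const s of samples) {
  console.log(s, "->", splitCamelCase(s));
}
```

Unlike a plain negative lookahead, this keeps a trailing acronym such as "USA" in "BornInTheUSA" in one piece.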

Sorry, you lost me a bit! Like this: Replace(stringToSplit, "([A-Z])(?=[a-z])|(?<=[a-z])([A-Z])", " \1") ?

Simon 2009-07-08 13:33:10

The `\1` means back-reference #1. In .NET regexes, this is expressed as `$1`. Other than that, your statement seems correct.

Tomalak 2009-07-08 13:47:42

(Oh, and I have changed my regex a bit. You are using the one from an older version of the answer.)

Tomalak 2009-07-08 13:56:50

I've edited the answer so it uses the .NET style back-reference.

Tomalak 2009-07-08 14:01:43

`([A-Z])(?<=[a-z]\1|[A-Za-z]\1(?=[a-z]))` doesn't add the space at the beginning because it can never match the first letter. :)

Alan Moore 2009-12-19 05:18:40
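The idea in this comment is that the capital must always be preceded by a letter, so the first character can never match. A rough JavaScript sketch of the same condition follows. Note that it is a reformulation, not the expression from the comment: the back-reference inside the lookbehind is replaced by two separate lookbehinds, and `$&` (the whole match) is used instead of `$1`. The helper name is hypothetical, and lookbehind support (ES2018 or later) is assumed.

```javascript
// Hypothetical helper: a capital is split off when it is preceded by a
// lowercase letter, or preceded by any letter and followed by a lowercase
// letter. The first character has no predecessor, so it never matches and
// no trimming is needed.
function splitNoLeadingSpace(input) {
  return input.replace(
    /(?<=[a-z])[A-Z]|(?<=[A-Za-z])[A-Z](?=[a-z])/g,
    " $&"
  );
}

for (const s of ["TodayILiveInTheUSAWithSimon", "USAToday", "IAmSOOOBored"]) {
  // JSON.stringify shows that no leading space is produced.
  console.log(s, "->", JSON.stringify(splitNoLeadingSpace(s)));
}
```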

Answer 4

A:

Tomalak's expression worked for me, but not with the built-in Replace function, which does plain-text substitution rather than regular-expression matching. Regex.Replace(), however, did work.

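' Requires: Imports System.Text.RegularExpressions
For i As Integer = 0 To names.Length - 1
  ' Worked: Regex.Replace treats the pattern as a regular expression
  names(i) = Regex.Replace(names(i), "((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))", " $1").TrimStart()

  ' Didn't work: the built-in Replace searches for the pattern as literal text
  'names(i) = Replace(names(i), "([A-Z])(?=[a-z])|(?<=[a-z])([A-Z])", " $1").TrimStart()
Next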

BTW, I'm using this to split the words in enumeration names for display in the UI and it works beautifully.

Craig Boland 2009-12-19 00:34:52
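The same literal-versus-regex distinction exists in JavaScript, which makes for an easy way to see the failure mode. This is only an analogy to the VB behaviour described above, not the VB API itself; the variable names are illustrative, and lookbehind support (ES2018 or later) is assumed.

```javascript
const pattern = "((?<=[a-z])[A-Z]|[A-Z](?=[a-z]))";
const names = ["USAToday", "IAmSOOOBored"];

// A string pattern is searched for as literal text. The names do not contain
// the pattern text itself, so nothing is replaced.
const literal = names.map((n) => n.replace(pattern, " $1"));

// Compiling the same text as a regular expression gives the intended split.
const regex = new RegExp(pattern, "g");
const split = names.map((n) => n.replace(regex, " $1").trimStart());

console.log(literal);
console.log(split);
```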
