Hi,

I would like to convert strings like 'HDMWhoSomeThing' to 'HDM Who Some Thing' with a regex.

So I would like to extract words which start with an upper-case letter or consist of upper-case letters only. Notice that in the string 'HDMWho' the last upper-case letter is in fact the first letter of the word Who - and should not be included in the word HDM.

What is the correct regex to achieve this goal? I have tried many regexes similar to [A-Z][a-z]+ but without success. [A-Z][a-z]+ gives me 'Who Some Thing' - without 'HDM' of course.

Any ideas? Thanks, Rukki

+2  A: 

Try to split with this regular expression:

/(?=[A-Z][a-z])/

And if your regular expression engine does not support splitting empty matches, try this regular expression to put spaces between the words:

/([A-Z])(?![A-Z])/

Replace it with " $1" (space plus match of the first group). Then you can split at the space.
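A minimal Python sketch of the replace-then-split idea above, for illustration only. Python's re.sub writes the group reference as \1 rather than $1, and the function name split_words_via_spaces is made up for this example. As the comments below point out, this pattern does not separate an all-caps word from a preceding capitalised word.

```python
import re

def split_words_via_spaces(text):
    # Put a space before every capital letter that is NOT followed by
    # another capital letter, then split on whitespace.
    # (Hypothetical helper name; the pattern is the one from the answer.)
    spaced = re.sub(r'([A-Z])(?![A-Z])', r' \1', text)
    return spaced.split()

print(split_words_via_spaces('HDMWhoSomeThing'))
# ['HDM', 'Who', 'Some', 'Thing']
```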

Gumbo
I am using Python. Unfortunately the pattern (?=[A-Z][a-z]) matches nothing and the pattern ([A-Z])(?![A-Z]) gives "W S T " :(
Rukki Odds
@Rukki Odds: I wrote you have to replace the matches of `([A-Z])(?![A-Z])` with a space followed by the match of the first group to separate the words. And then you can use `\s+` to split the words.
Gumbo
`([A-Z])(?![A-Z])` won't help if there is an all-caps word in the middle of the string: e.g. on 'HDMWhoSomeMONKEYThing' it will only match W, S and T in 'Who', 'Some' and 'Thing'
Johrn
+1  A: 

Maybe '[A-Z]*?[A-Z][a-z]+'?

Edit: This seems to work: [A-Z]{2,}(?![a-z])|[A-Z][a-z]+

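import re

def find_stuff(text):
    # Either an all-caps run of 2+ letters not followed by a lowercase letter,
    # or a single capital followed by one or more lowercase letters.
    p = re.compile(r'[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
    words = p.findall(text)
    # Join with single spaces (avoids a trailing space).
    print(' '.join(words))

find_stuff('HDMWhoSomeThing')
find_stuff('SomeHDMWhoThing')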

Prints out:

HDM Who Some Thing

Some HDM Who Thing

Maxwell Troy Milton King
Almost correct - but it gives "HDMWho Some Thing " instead of "HDM Who Some Thing"
Rukki Odds
Sorry, I couldn't try it. What's your actual code?
Maxwell Troy Milton King
My exact code: str = 'HDMWhoSomeThing'\n p = re.compile('[A-Z]*?[A-Z][a-z]+')\n m = p.findall('HDMWhoSomeThing')\n result = ''\n for x in m:\n result += x + ' '\n print result\nI have replaced the line endings with \n
Rukki Odds
Misses a one-character uppercase word anywhere in the string: 'HDMWhoSomeAThingA' won't match either 'A' (change the {2,} to a + and it will grab the one-character words)
Johrn
+1  A: 

So 'words' in this case are:

  1. Any number of uppercase letters - unless the last uppercase letter is followed by a lowercase letter.
  2. One uppercase letter followed by any number of lowercase letters.

so try:

([A-Z]+(?![a-z])|[A-Z][a-z]*)

The first alternation includes a negative lookahead (?![a-z]), which handles the boundary between an all-caps word and an initial caps word.
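A short sketch of how this pattern could be used from Python with re.findall. The names WORD and space_out are just illustrative, not from the question:

```python
import re

# The pattern from above: an all-caps run that is not followed by a
# lowercase letter, or one capital followed by any number of lowercase letters.
WORD = re.compile(r'([A-Z]+(?![a-z])|[A-Z][a-z]*)')

def space_out(text):
    # With a single capturing group, findall returns the group's text
    # for each match, which here is the whole word.
    return ' '.join(WORD.findall(text))

print(space_out('HDMWhoSomeThing'))        # HDM Who Some Thing
print(space_out('HDMWhoSomeMONKEYThing'))  # HDM Who Some MONKEY Thing
print(space_out('HDMWhoSomeAThingA'))      # HDM Who Some A Thing A
```

The greedy [A-Z]+ first grabs 'HDMW', the lookahead fails because 'h' follows, and the engine backtracks one letter to 'HDM', which is what leaves the 'W' for 'Who'.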

Johrn
+2  A: 

One-liner:

' '.join(a or b for a,b in re.findall('([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))',s))

using the regexp

([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))

makapuf
This won't match a one-character uppercase word at the end of the string. Doesn't matter for the OP's application of inserting spaces between them, but might be a problem if something else needs to be done with all the words.
Johrn
+2  A: 
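#! /usr/bin/env python

import re
from collections import deque

# Split on all-caps runs (followed by a capital or end of string) and on
# single capitals that start a word; the capturing group keeps the delimiters.
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'HDMWhoSomeMONKEYThingXYZ'))

result = []
while chunks:
    buf = chunks.popleft()
    if not buf:
        continue  # skip empty strings produced by adjacent matches
    # A lone capital is the first letter of a word: glue it to the rest.
    if re.match(r'^[A-Z]$', buf) and chunks:
        buf += chunks.popleft()
    result.append(buf)

print(' '.join(result))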

Output:

HDM Who Some MONKEY Thing XYZ

Judging by lines of code, this task is a much more natural fit with re.findall:

import re

pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z][a-z]*)'
print(' '.join(re.findall(pattern, 'HDMWhoSomeMONKEYThingX')))

Output:

HDM Who Some MONKEY Thing X
Greg Bacon
I think this one runs into the problem with all-caps words in the middle of the string as well. The re.split called on the string 'HDMWhoSomeMONKEYThing' will give you [HDM, Who, SomeMONKEY, Thing]
Johrn
@Johrn Thanks and edited!
Greg Bacon
A: 

Big big big thanks to all! You are awesome!

Rukki Odds