I'm looking for a Perl regex that will capitalize any character which is preceded by whitespace (or the first char in the string).

I'm pretty sure there is a simple way to do this, but I don't have my Perl book handy and I don't do this often enough that I've memorized it...

+8  A: 
s/(\s\w)/\U$1\E/g;

I originally suggested:

s/\s\w/\U$&\E/g;

but alarm bells were going off at the use of '$&' (even before I read @Manni's comment). It turns out that they're fully justified: using the $&, $` and $' variables causes an overall inefficiency in regexes.

The \E is not critical for this regex; it ends the case modification started by \U (or by \L, for lower-case), and nothing follows it here.


As noted in the comments, matching the first character of the string requires:

s/((?:^|\s)\w)/\U$1\E/g;

Corrected position of second close parenthesis - thanks, Blixtor.
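
For readers who want to try the final substitution, here is a minimal sketch. The helper name capitalize_after_space and the sample input are made up for illustration; only the regex comes from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper name, for illustration only.
sub capitalize_after_space {
    my ($text) = @_;
    # (?:^|\s) matches the start of the string or one whitespace character;
    # \w is the character to upper-case. Both sit inside capture group 1,
    # and \U leaves the whitespace itself unchanged.
    $text =~ s/((?:^|\s)\w)/\U$1\E/g;
    return $text;
}

print capitalize_after_space("the quick\tbrown fox"), "\n";
```

The first word is handled by the ^ branch, and the tab counts as whitespace just like the spaces do.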

Jonathan Leffler
you forgot the first char of the string: s/(\s|^)\w/\U$
Node
I've never seen a regex like that - can you explain it?
Paul Tomblin
innaM
If we're looking for "following whitespace", wouldn't a lookahead be better?
rjh
I don't see why a lookahead would be helpful. If we were looking to capitalize the last letter in a word, then yes; we're looking to capitalize the first, though.
Jonathan Leffler
The regex given last in the answer is incorrect. It should be s/((?:^|\s)\w)/\U$1\E/g;
+6  A: 

Something like this should do the trick -

s!(^|\s)(\w)!$1\U$2!g

This simply splits the matched text into two capture groups: $1 for the whitespace/start of string and $2 for the first character of the word. We then substitute both $1 and $2 back, after making the start of the word upper-case.

I would change the \s to \b which makes more sense since we are checking for word-boundaries here.
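
A small sketch of how the two variants differ, assuming that "change \s to \b" means dropping the whitespace group and anchoring on a word boundary. The helper names and the sample string are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
sub cap_after_space {
    my ($text) = @_;
    # $1 is the whitespace (or empty at start of string), $2 the letter.
    $text =~ s!(^|\s)(\w)!$1\U$2!g;
    return $text;
}

sub cap_at_boundary {
    my ($text) = @_;
    # \b is zero-width, so nothing has to be put back in the replacement.
    $text =~ s!\b(\w)!\U$1!g;
    return $text;
}

print cap_after_space("don't stop"), "\n";   # Don't Stop
print cap_at_boundary("don't stop"), "\n";   # Don'T Stop
```

The apostrophe is not a word character, so \b also sees a boundary in front of the "t". Whether that is acceptable depends on the input.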

muteW
+7  A: 

Depending on your exact problem, this could be more complicated than you think and a simple regex might not work. Have you thought about capitalization inside the word? What if the word starts with punctuation like '...Word'? Are there any exceptions? What about international characters?

It might be better to use a CPAN module like Text::Autoformat or Text::Capitalize where these problems have already been solved.

use Text::Capitalize 0.2;
print capitalize_title($t), "\n";

use Text::Autoformat;
print autoformat{case => "highlight", right=>length($t)}, $t;

It sounds like Text::Autoformat might be more "standard" and I would try that first. It's written by Damian Conway. But Text::Capitalize does a few things that Text::Autoformat doesn't. Here is a comparison.

You can also check out the Perl Cookbook for recipe 1.14 (page 31) on how to use regexps to properly capitalize a title or headline.

Eric Johnson
This is a good point about punctuation being a potential issue.
Chris Lutz
thanks, this is quite useful
Kip
A: 

You want to match letters behind whitespace, or at the start of a string.

Perl can't do variable length lookbehind. If it did, you could have used this:

s/(?<=\s|^)(\w)/\u$1/g;    # this does not work!

Perl complains:

Variable length lookbehind not implemented in regex;

You can use double negative lookbehind to get around that: the thing on the left of it must not be anything that is not whitespace. That means it'll match at the start of the string, but if there is anything in front of it, it must be whitespace.

s/(?<!\S)(\w)/\u$1/g;

The simpler approach in this exact case is probably to just match the whitespace and include it in the replacement; the variable-length restriction then falls away.

s/(\s|^)(\w)/$1\u$2/g;

Occasionally you can't use this approach in repeated substitutions, because whatever precedes the actual match has already been consumed by the regex, so it's good to have a way around that.
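
To illustrate that last point, here is a sketch with a made-up task: upper-case every single-character word that has whitespace on both sides. The helper names and the input are hypothetical. The consuming version loses a match because the space after "a" is used up before "b" is examined; the lookaround version consumes nothing but the letter.

```perl
use strict;
use warnings;

# Consumes the whitespace on both sides of the letter.
sub cap_consuming {
    my ($text) = @_;
    $text =~ s/(\s)(\w)(\s)/$1\u$2$3/g;
    return $text;
}

# Only asserts the whitespace; both lookarounds are fixed-length.
sub cap_lookaround {
    my ($text) = @_;
    $text =~ s/(?<=\s)(\w)(?=\s)/\u$1/g;
    return $text;
}

print cap_consuming("x a b c y"), "\n";    # x A b C y
print cap_lookaround("x a b c y"), "\n";   # x A B C y
```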

bart
A: 

This isn't something I'd normally use a regex for, but my solution isn't exactly what you would call "beautiful":

$string = join("", map(ucfirst, split(/(\s+)/, $string)));

That split()s the string by whitespace and captures all the whitespace, then goes through each element of the list and does ucfirst on them (making the first character uppercase), then join()s them back together as a single string. Not awful, but perhaps you'll like a regex more. I personally just don't like \Q or \U or other semi-awkward regex constructs.

EDIT: Someone else mentioned that punctuation might be a potential issue. If, say, you want this:

...string

changed to this:

...String

i.e. you want words capitalized even if there is punctuation before them, try something more like this:

$string = join("", map(ucfirst, split(/(\w+)/, $string)));

Same thing, but it split()s on words (\w+) so that the captured elements of the list are word-only. Same overall effect, but it will also capitalize words that are preceded by non-word characters such as punctuation. Change \w to [a-zA-Z] to avoid trying to capitalize numbers. And just generally tweak it however you like.
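
A runnable sketch of both variants side by side, using a made-up input that starts with punctuation. The helper names are hypothetical; the split/map/join lines are the ones from this answer.

```perl
use strict;
use warnings;

# Split on runs of whitespace, keeping the whitespace as list elements.
sub cap_split_space {
    my ($string) = @_;
    return join("", map { ucfirst $_ } split(/(\s+)/, $string));
}

# Split on words instead, so every captured element is a bare word.
sub cap_split_word {
    my ($string) = @_;
    return join("", map { ucfirst $_ } split(/(\w+)/, $string));
}

print cap_split_space("...string here"), "\n";   # ...string Here
print cap_split_word("...string here"), "\n";    # ...String Here
```

In the first version the element "...string" starts with a dot, so ucfirst has nothing to change; in the second the dots end up in their own element.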

Chris Lutz
@Volomike - What version of Perl are you using?
Chris Lutz
Oh shoot. That was Perl? My bad! :) I'll delete my comment. I thought it was PHP.
Volomike
+1  A: 

If you mean the character after a space, use a regular expression with \s. If you really mean the first character of a word, you should use \b instead of all the above attempts with \s, which is error-prone.

s/\b(\w)/\U$1/g;
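
A quick sketch of the difference on a hyphenated input (a lower-cased version of the example raised in the comments on this answer). The helper names are hypothetical, and the \s variant is one of the whitespace-based regexes from the other answers.

```perl
use strict;
use warnings;

# Whitespace-based variant: only start of string or whitespace counts.
sub cap_ws {
    my ($text) = @_;
    $text =~ s/(^|\s)(\w)/$1\U$2/g;
    return $text;
}

# Word-boundary variant from this answer.
sub cap_wb {
    my ($text) = @_;
    $text =~ s/\b(\w)/\U$1/g;
    return $text;
}

print cap_ws("levenberg--marquardt algorithm"), "\n";   # Levenberg--marquardt Algorithm
print cap_wb("levenberg--marquardt algorithm"), "\n";   # Levenberg--Marquardt Algorithm
```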
Hynek -Pichi- Vychodil
Wait... \s is error prone? Explain what you mean and why you mean it.
Chris Lutz
\s isn't error-prone if the requirements say to uppercase things that are after whitespace. The \b is also a problem. See the perlfaq on making things title case for an example.
brian d foy
Words don't always start after a space. For example: 'Levenberg--Marquardt algorithm'. There are two words that don't start after a space: neither Levenberg nor Marquardt.
Hynek -Pichi- Vychodil
Err. But the OP is specifically asking for a "character which is preceded by whitespace". So you might argue that his requirements are "error prone". But \s?
innaM
Yes, you are right. The specification may be wrong.
Hynek -Pichi- Vychodil
@Manni - He said "which is preceded by whitespace (or the first char in the string)", so Hynek is right here.
Chris Lutz
actually \b was one of the tricks i was trying to remember, although the way i asked the question didn't make this clear.
Kip
A: 

Capitalize ANY character preceded by whitespace or at beginning of string:

s/((?:^|\s).)/\U$1/g

Maybe a very sloppy way of doing it because it's also uppercasing the whitespace now. :P The advantage is that it works with letters with all possible accents (and also with special Danish/Swedish/Norwegian letters), which are problematic when you use \w and \b in your regex. Can I expect that all non-letters are untouched by the uppercase modifier?
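
On the closing question, a minimal sketch that can be used to check it. It captures the whitespace together with the following character and upper-cases the whole capture. The helper name and the sample string are hypothetical, and the utf8::upgrade call is an assumption made here so that Unicode case rules apply to the accented letters regardless of Perl version; it is not part of the regex above.

```perl
use strict;
use warnings;

# Hypothetical helper: upper-cases whatever character follows whitespace
# or the start of the string, letter or not.
sub capitalize_any {
    my ($text) = @_;
    # Assumption: force the internal UTF-8 representation so that \U uses
    # Unicode case rules for characters such as \x{e5} (a with ring).
    utf8::upgrade($text);
    $text =~ s/((?:^|\s).)/\U$1/g;
    return $text;
}

my $out = capitalize_any("\x{e5}rhus \x{f8}st 4ever -dash");

binmode STDOUT, ':encoding(UTF-8)';
print "$out\n";
```

Characters that have no upper-case form (the spaces, the "4" and the "-") come back unchanged, while the two accented letters are upper-cased.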