If I feed a speech synthesizer (festival, in this case, but it applies to all) the following bit of text:

"At the USPGA championship in the US, the BBC reporter went MIA", it reads "At the uspga championship in the us, the BBC reporter went mia".

In other words, I guess that because it's a cluster of consonants, it reads "BBC" properly but makes "words" out of the others.

The simplest thing to do, I suppose, would be to run it through a PHP script that looks for words of two or more capital letters and simply "explodes" each one into space-separated letters, like U S P G A.

I realise it would cause weirdness with things like "I told him N O T to do that", but in news reports that tends to happen less.

Here's the thing: I can "explode" a word OK. The problem is that I'm one of those people who, despite months of trying, just can't get their head round certain aspects of regex. In this case, I'm looking for two or more capital letters next to each other.

The reason I gave all the preamble above is in case there's a better way of doing this that I hadn't found or thought of, perhaps a db of acronyms to words or something.
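For what it's worth, the acronym-table idea could be sketched roughly as below. The table contents, the names $spoken and $expand, and the fallback of spelling out unknown acronyms letter by letter are all assumptions made for illustration; none of this is something festival provides. It assumes PHP 7 or later for the ?? operator.

```php
<?php
// Hypothetical lookup table: acronym => words to speak instead.
// The single entry is illustrative only.
$spoken = ['MIA' => 'missing in action'];

// Callback: use the table entry if there is one,
// otherwise fall back to space-separated letters.
$expand = function (array $m) use ($spoken) {
    $word = $m[0];
    return $spoken[$word] ?? implode(' ', str_split($word));
};

$input  = "At the USPGA championship in the US, the BBC reporter went MIA";
$output = preg_replace_callback('/\b[A-Z]{2,}\b/', $expand, $input);

echo $output;
```

With this one-entry table, only "MIA" is replaced by words; every other acronym is still spelled out letter by letter.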

+5  A: 

A pattern to match acronyms:

/\b([A-Z]{2,})\b/

That matches any 'word' with two or more capitals.
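A quick way to check what the pattern picks up is to list its matches with preg_match_all(). This is only a sketch using the sentence from the question; the variable names are arbitrary.

```php
<?php
// Sample sentence taken from the question.
$input = "At the USPGA championship in the US, the BBC reporter went MIA";

// Collect every whole word made of two or more capital letters.
preg_match_all('/\b([A-Z]{2,})\b/', $input, $matches);

// $matches[0] holds the full matches: USPGA, US, BBC, MIA.
print_r($matches[0]);
```

"At" is not matched because it contains only one capital, and the \b anchors stop the pattern from matching part of a longer mixed-case word.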

Delan Azabani
A: 

"[A-Z][A-Z]" will match any instance of two capital letters next to each other.

teukkam
+2  A: 

Using Delan's regular expression with preg_replace_callback() makes it very easy to put a single space between all the letters of the identified acronyms:

$input = "At the USPGA championship in the US, the BBC reporter went MIA";

// Callback: split the matched acronym into letters and rejoin them with spaces.
function cb_separateCapitals($matches) {
    return implode(' ', str_split($matches[0]));
}


echo $input,'<br />';

$output = preg_replace_callback('/\b([A-Z]{2,})\b/','cb_separateCapitals',$input);

echo $output;

giving

At the USPGA championship in the US, the BBC reporter went MIA

At the U S P G A championship in the U S, the B B C reporter went M I A

Mark Baker
Very nice! I like your very useful adaptation (which actually fully answers the question now)
Delan Azabani
Wow - I'm in awe. That's exactly what I was looking for. Thanks to both you and Delan.
talkingnews
No problem, happy to help ;)
Delan Azabani
@Delan - I did give you an upvote before using your expression :)
Mark Baker
+4  A: 

You can greatly simplify your code by using a lookahead assertion:

$input = "At the USPGA championship in the US, the BBC reporter went MIA";
echo preg_replace('~[A-Z](?=[A-Z])~', '$0 ', $input);

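[A-Z](?=[A-Z]) says "every capital followed by a capital". The lookahead does not consume the second capital, so the replacement '$0 ' simply adds a space after each matched letter.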
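One difference worth knowing about: the lookahead has no word-boundary check, so it also splits runs of capitals inside mixed-case words. The sketch below compares it with the whole-word pattern on a made-up sentence; the input and variable names are assumptions for illustration.

```php
<?php
// Made-up input containing a mixed-case word ("OKed").
$input = "The BBC OKed it";

// Lookahead version: a space after every capital that precedes another capital.
$lookahead = preg_replace('~[A-Z](?=[A-Z])~', '$0 ', $input);

// Whole-word version: only words consisting entirely of 2+ capitals.
$wholeWord = preg_replace_callback(
    '/\b[A-Z]{2,}\b/',
    function (array $m) { return implode(' ', str_split($m[0])); },
    $input
);

echo $lookahead, "\n"; // The B B C O Ked it
echo $wholeWord, "\n"; // The B B C OKed it
```

Whether that matters depends on the text being read; for plain news copy the two will usually agree.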

stereofrog