Hello,

I have a series of text fields that I need to clean of all full stops. The input text is company names, which sometimes contain abbreviations and sometimes contain full stops for other reasons.

I would like to remove the full stops when the text is an abbreviation; otherwise, I would like to replace each one with a space. I define an abbreviation as a series of pairs, each a single alphabetical character followed by a full stop.

Example inputs and desired outputs:
input --> Desired Output

U.K. --> UK

E.U. --> EU

bank.of --> bank of

help.co.uk --> help co uk

Would anybody know of a regex or other method which could help me to identify the full stops I wish to remove rather than replace?

Thanks!!!

A: 

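Try

(?<=[^a-zA-Z][a-zA-Z])\.(?=[a-zA-Z][^a-zA-Z]| )

for matching the full stops in abbreviations. Note that the lookarounds need a character on either side of the abbreviation, so the full stops in a bare U.K. (nothing before the U, nothing after the final full stop) are not matched unless you pad the input with a space on each side.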
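
As a rough illustration of how this pattern could be applied in JavaScript (a sketch, not the answerer's own code): the hypothetical helper `cleanWithLookarounds` pads the input with spaces so the lookarounds also work at the start and end of the string, removes the matched full stops, and then turns the remaining ones into spaces. It assumes a runtime with lookbehind support (ES2018 or later).

```javascript
// Hypothetical helper; the name and the padding step are assumptions.
function cleanWithLookarounds(input) {
  // Full stops that sit next to the single letters of an abbreviation.
  const abbreviationDot = /(?<=[^a-zA-Z][a-zA-Z])\.(?=[a-zA-Z][^a-zA-Z]| )/g;
  // Pad so an abbreviation at either end still has a non-letter neighbour.
  const padded = " " + input + " ";
  return padded
    .replace(abbreviationDot, "") // drop abbreviation full stops
    .replace(/\./g, " ")          // every other full stop becomes a space
    .trim();
}

console.log(cleanWithLookarounds("U.K."));       // UK
console.log(cleanWithLookarounds("bank.of"));    // bank of
console.log(cleanWithLookarounds("help.co.uk")); // help co uk
```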

Jens
A: 

You could try matching against something like

^(\w\.)+$

If the string matches (assuming it's just one input), then it is an abbreviation; if not, it is a set of words separated by full stops. Be sure to strip away whitespace first, or incorporate that into the regex with

^\s*(\w\.)+\s*$

This basically says: find as many pairs of a character and a period as possible. If the whole string matches (that's what the anchors ^ and $ are for), it's an abbreviation. Note that \w also accepts digits and underscores; use [a-zA-Z] instead if only letters should count.

This regex will match U.K. but will not match bank.co.uk or even ba.u.k (because of the two letters together, ba). You can then handle each case depending on whether the string matches: if it's an abbreviation, replace "." with ""; if not, replace "." with " ".
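
The match-then-branch idea could be sketched in JavaScript roughly as follows. The function name `cleanName` is made up for illustration, and the pattern assumes an abbreviation is nothing but letter-plus-full-stop pairs, per the definition in the question (so it uses [A-Za-z] rather than \w).

```javascript
// Hypothetical helper: treats the whole input as one company-name field.
function cleanName(input) {
  const trimmed = input.trim();
  // Whole string is one or more "single letter + full stop" pairs.
  const isAbbreviation = /^(?:[A-Za-z]\.)+$/.test(trimmed);
  // Abbreviation: delete the full stops. Otherwise: turn them into spaces.
  return trimmed.replace(/\./g, isAbbreviation ? "" : " ");
}

console.log(cleanName("U.K."));       // UK
console.log(cleanName("help.co.uk")); // help co uk
console.log(cleanName("ba.u.k"));     // ba u k
```

Because the check applies to the whole string, this only works when each field holds a single name or abbreviation, not a mix of both.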

ameer
+1  A: 

Do it in two steps:

var s = "U.K. bank.of help.co.uk E.U";

//replace periods in abbreviations
var r1 = new RegExp("\\b([A-Z])\\.", 'g');
s = s.replace(r1, "$1");
console.log(s);    //UK bank.of help.co.uk EU

//replace remaining periods with spaces:
s = s.replace(/\./g, " ");
console.log(s); //UK bank of help co uk EU

The given regexes are in JavaScript; leave a comment if you need help translating them to Java.

Amarghosh
Thanks for the help! I used a slight variation on this to identify the abbreviations: "(\\b([A-Z])\\.)+"
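
For reference, that variation might look like this in JavaScript. This is only a sketch: `stripAbbreviations` is a hypothetical name, and the pattern is the one from the comment without the Java string escaping. A replacer function strips the full stops from each matched run, and a second pass converts whatever full stops are left into spaces.

```javascript
// Sketch only: stripAbbreviations is a hypothetical name.
function stripAbbreviations(input) {
  // One or more "capital letter + full stop" pairs in a row, e.g. "U.K."
  const abbreviation = /(?:\b[A-Z]\.)+/g;
  return input
    .replace(abbreviation, (match) => match.replace(/\./g, "")) // U.K. -> UK
    .replace(/\./g, " ");                                        // rest -> spaces
}

console.log(stripAbbreviations("U.K. bank.of help.co.uk E.U"));
// UK bank of help co uk EU
```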