ansaurus

Question

What's the regex for removing dots in acronyms but not in domain names?

Answer 1

+1 A:

I suggest you split the string at '@' (or whatever character makes sense), do the substitution on the first part, then put the string back together. I think that will show the intent of the code better than a complex regexp. Something like this, perhaps:

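string = 'a.b.c. [email protected] http://www.test.com'
# Split once at the first '@'; the unpacking raises ValueError if there is no '@'.
left, rest = string.split("@", 1)
# Remove the dots only from the part before the '@'.
left = left.replace(".", "")
result = "%s@%s" % (left, rest)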

Bryan Oakley 2009-08-14 17:37:46

Answer 2

+2 A:

You could simply remove dots that are not immediately followed by two (or more) ASCII letters:

\.(?![a-zA-Z]{2})

But that will of course also remove the first DOT from the following address:

[email protected]

You could fix that by doing:

\.(?![a-zA-Z]{2}|[^\s@]*+@)

but I'm sure there will be many more such corner cases.
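To see both patterns at work, here is a minimal Python sketch. The sample string and the address in it are hypothetical placeholders, not the ones from the question. One assumption to flag: the possessive `*+` is only accepted by Python's `re` module from version 3.11 on, so the sketch uses a plain `*` instead. Because `[^\s@]*` can never consume an `@`, backtracking cannot change what matches here, so the result should be the same.

```python
import re

# Hypothetical sample; the address is a placeholder, not the one from the question.
SAMPLE = "a.b.c. john.q.public@example.com http://www.test.com"

# Remove a dot unless two ASCII letters follow it.
LETTERS_ONLY = re.compile(r"\.(?![a-zA-Z]{2})")

# Also keep a dot when an '@' appears later in the same whitespace-free token.
# Plain `*` stands in for the possessive `*+` (an assumption for portability).
WITH_ADDRESS_GUARD = re.compile(r"\.(?![a-zA-Z]{2}|[^\s@]*@)")


def strip_dots(pattern, text):
    """Delete every dot that the given pattern matches."""
    return pattern.sub("", text)


print(strip_dots(LETTERS_ONLY, SAMPLE))
# abc johnq.public@example.com http://www.test.com
print(strip_dots(WITH_ADDRESS_GUARD, SAMPLE))
# abc john.q.public@example.com http://www.test.com
```

The first pattern damages the local part of the address (the dot before the single letter "q" is removed); the second leaves it alone.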

Bart Kiers 2009-08-14 17:39:49

Thanks for this suggestion. This was the basis of my answer. It had occurred to me earlier, but I made the mistake of not pursuing it.

ianalis 2009-08-14 21:39:45

Answer 3

+5 A:

If your data is always formatted like this, why not split it into three parts on the spaces?

Then it's pretty trivial to remove the periods from the first element and use join to remerge the parts.
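A minimal sketch of that split-and-join idea, assuming the acronym is always the first space-separated field. The function name and the sample address are hypothetical.

```python
def strip_dots_from_first_field(line):
    """Remove dots from the first space-separated field only."""
    parts = line.split(" ")
    # Assumption: the acronym is always the first field.
    parts[0] = parts[0].replace(".", "")
    return " ".join(parts)


# Hypothetical sample; the address is a placeholder.
print(strip_dots_from_first_field("a.b.c. user@example.com http://www.test.com"))
# abc user@example.com http://www.test.com
```

Splitting on a literal " " rather than on any whitespace keeps the original spacing intact when the parts are joined again.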

chollida 2009-08-14 17:40:06

It's not always formatted like this. I will be using it on informal human conversations as mentioned in my newly-added comment.

ianalis 2009-08-14 21:29:58

Answer 4

A:

Not as elegant as a simple re.sub(), but try this:

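import re

s = 'a.b.c. [email protected] http://www.test.com'
# Group 2 captures the first run of two or more letter-dot pairs (the acronym).
m = re.search(r'(.*?)(([a-zA-Z]\.){2,})(.*)', s)

if m:
    # Drop the dots from the acronym and reassemble the string.
    replacement = ''.join(m.group(2).split('.'))
    s = m.group(1) + replacement + m.group(4)

print(s)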

It assumes that there's no more than one acronym per string, but you could always run it repeatedly.
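Instead of running the search repeatedly, one alternative is to let `re.sub` call a function for each acronym it finds. This is a different technique from the code above, sketched here under the same definition of an acronym (two or more letter-dot pairs); the function name and sample text are hypothetical.

```python
import re

# Same notion of an acronym as above: two or more letter-dot pairs in a row.
ACRONYM = re.compile(r"(?:[a-zA-Z]\.){2,}")


def strip_acronym_dots(text):
    """Remove the dots from every acronym-like run in a single pass."""
    return ACRONYM.sub(lambda m: m.group(0).replace(".", ""), text)


# Hypothetical sample; the address is a placeholder.
print(strip_acronym_dots("U.S.A. and e.g. user@example.com"))
# USA and eg user@example.com
```

Like the original pattern, this would also rewrite a host name that starts with single-letter labels (for example "a.b.example.com"), so it is only safe when such names do not occur in the input.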

Head Geek 2009-08-14 18:30:24

Answer 5

A:

The following worked for me (with thanks to Bart for his answer):

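re.sub(r'\.(?!(\S[^. ])|\d)', '', s)

This removes a dot unless it is followed either by a digit or by a non-whitespace character and then a character that is neither a dot nor a space. As a result, it will not remove a dot if it is the first character in a word or acronym.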
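A few hedged checks of how the pattern above behaves, using hypothetical inputs. Note the last two cases: as far as I can tell, a period at the end of a sentence or of the string is removed as well, which may or may not be what you want for conversational text.

```python
import re

# The pattern from the answer above, written as a raw string.
DOT = re.compile(r"\.(?!(\S[^. ])|\d)")


def strip_dots(text):
    """Apply the substitution to one string."""
    return DOT.sub("", text)


# Hypothetical inputs; the address is a placeholder.
print(strip_dots("a.b.c. user@example.com http://www.test.com"))
# abc user@example.com http://www.test.com
print(strip_dots("pi is 3.14"))     # pi is 3.14   (dot before a digit is kept)
print(strip_dots("version 2.5."))   # version 2.5  (trailing dot is removed)
print(strip_dots("See you. Bye."))  # See you Bye  (sentence periods are removed)
```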

ianalis 2009-08-14 21:49:47

Answer 6

+1 A:

A non-regex way:

>>> S = 'a.b.c. [email protected] http://www.test.com'
>>> ' '.join(w if '@' in w or ':' in w else w.replace('.', '') for w in S.split())
'abc [email protected] http://www.test.com'

(This relies on whitespace between tokens, though. If an acronym is joined to an address or URL by something like a comma with no space, the whole token is skipped and the acronym's dots are left in place.)

Anon 2009-08-15 00:00:57
