tags:

views:

355

answers:

6

I want to remove dots in acronyms but not in domain names in a python string. For example, I want the string

'a.b.c. [email protected] http://www.test.com'

to become

'abc [email protected] http://www.test.com'

The closest regex I made so far is

re.sub('(?:\s|\A).{1}\.',lambda s: s.group()[0:2], s)

which results to

'ab.c. [email protected] http://www.test.com'

It seems that for the above regex to work, I need to change the regex to

(?:\s|\A|\G).{1}\.

but there is no end of match marker (\G) in python.

EDIT: As I have mentioned in my comment, the strings have no specific formatting. These strings contain informal human conversations and so may contain zero, one or several acronyms or domain names. A few errors is fine by me if it would save me from coding a "real" parser.

+1  A: 

I suggest you split the string at '@' (or whatever character makes sense), do the substitution on the first part, then put the string back together. I think that will show the intent of the code better than a complex regexp. Something like this, perhaps:

string='a.b.c. [email protected] http://www.test.com'
left, rest = string.split("@",1)
left = left.replace(".","")
result="%s@%s" % (left, rest)
Bryan Oakley
+2  A: 

You could simply remove DOTS that don't have two [a-z] letters (or more) ahead of them:

\.(?![a-zA-Z]{2})

But that will of course also remove the first DOT from the following address:

[email protected]

You could fix that by doing:

\.(?![a-zA-Z]{2}|[^\s@]*+@)

but I'm sure there will be many more such corner cases.

Bart Kiers
Thanks for this suggestion. This was the basis of my answer. It did come to my mind before but I'm mistaken in not pursuing it.
ianalis
+5  A: 

If your data is always formatted like this then why not split your data into 3 parts by splitting on the space.

Then it's pretty trivial to remove the periods from the first element and use join to remerge the parts.

chollida
It's not always formatted like this. I will be using it on informal human conversations as mentioned in my newly-added comment.
ianalis
A: 

Not as elegant as a simple re.sub(), but try this:

import re

s='a.b.c. [email protected] http://www.test.com'
m=re.search('(.*?)(([a-zA-Z]\.){2,})(.*)', s)

if m:
    replacement=''.join(m.group(2).split('.'))
    s=m.group(1)+replacement+m.group(4)

print s

It assumes that there's no more than one acronym per string, but you could always run it repeatedly.

Head Geek
A: 

The following worked for me (with thanks to Bart for his answer):

re.sub('\.(?!(\S[^. ])|\d)', '', s)

This will not remove a dot if it is the first character in a word or acronym.

ianalis
+1  A: 

A non-regex way:

>>> S = 'a.b.c. [email protected] http://www.test.com'
>>> ' '.join(w if '@' in w or ':' in w else w.replace('.', '') for w in S.split())
'abc [email protected] http://www.test.com'

(Requires spaces to split on, though - so if you had something like commas with no spaces it could miss some.)

Anon