views:

960

answers:

5

I have a string like: Today, 3:30pm - Group Meeting to discuss "big idea"

How do you construct a regex such that after parsing it would return: Today 3:30pm Group Meeting to discuss big idea

I would like it to remove all non-alphanumeric characters except those that appear in a 12- or 24-hour timestamp.

All help is appreciated. Thanks

+1  A: 

I assume you'd like to keep spaces as well. This implementation is in Python, but it uses only basic Perl-style regex syntax, so it should be portable.

import re
x = u'Today, 3:30pm - Group Meeting to discuss "big idea"'
re.sub(r'[^a-zA-Z0-9: ]', '', x)

Output: 'Today 3:30pm  Group Meeting to discuss big idea'

For a slightly cleaner answer (no double spaces):

import re
x = u'Today, 3:30pm - Group Meeting to discuss "big idea"'
tmp = re.sub(r'[^a-zA-Z0-9: ]', '', x)
re.sub(r'[ ]+', ' ', tmp)

Output: 'Today 3:30pm Group Meeting to discuss big idea'
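
The character class above keeps every colon, wherever it appears. As a rough sketch of one way to keep only colons that look like part of a time, the pattern below also deletes any colon that is not preceded by a digit or not followed by two digits. That definition of a "time-like" colon is an assumption, and the function name is made up for illustration:

```python
import re

def strip_punctuation_keep_times(text):
    """Remove non-alphanumerics, keeping colons that look like part of a time."""
    # Delete any of:
    #   a colon not preceded by a digit,
    #   a colon not followed by two digits,
    #   any character that is not a letter, digit, colon or space.
    pattern = r'(?<!\d):|:(?!\d\d)|[^A-Za-z0-9: ]'
    cleaned = re.sub(pattern, '', text)
    # Collapse the runs of spaces left behind by removed punctuation.
    return re.sub(r' +', ' ', cleaned).strip()

print(strip_punctuation_keep_times('Today, 3:30pm - Group Meeting: discuss "big idea"'))
# Today 3:30pm Group Meeting discuss big idea
```

Both lookarounds are fixed width, so this should also work in regex flavors that reject variable-width lookbehind. It does not validate the time itself: something like 99:99 would keep its colon.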

Bryan McLemore
What about "Today, 3:30pm - Group meeting: discuss big idea" - the colon after "meeting" won't be removed.
Greg Hewgill
@Cadwag, this solution keeps colons even when they are outside of timestamps. Surely you don't want this?
J-P
Yea, in my excitement I seem to have acted prematurely. But it seems to act as Greg Hewgill says - leaving colons that are outside of timestamps
cadwag
Not sure about Python, but my C# solution below might solve this problem with negative lookahead / lookbehind. Also check Rubens Farias's solution; that should work with Python too.
Abel
+4  A: 
Abel
@Cadwag, you said you got an error about neg look forward/behind must be fixed width only. That's a restriction of many regex flavors (not .NET though). I'll update my answer with that in mind.
Abel
Thanks a lot for your help. I am trying your solution in plain Python on Unix, using the Python example given by 'Bryan McLemore'. With your solution, the 3rd line looks like `re.sub(r'(<![012]?\d):(>!\d\d(?:[ap]m)?)|[^A-Za-z\d: ]', '', x)`. But it doesn't seem to do anything when I run it. I'm sorry for all the hassle. I'm just starting out with Python and have never been very good with regex. Thanks again.
cadwag
Nice work Abel, just testing it though, it doesn't seem to match the colon in "3:3".
J-P
@J-P: is that testing in Python or in .NET? And did you use my original design, because then: indeed, it would consider 3:3 as a time.
Abel
@Cadwag: if you've trouble with regexes, check this list of online regex testers: http://www.undermyhat.org/blog/2009/09/overview-of-online-regular-expression-testers/. PCRE is what I believe Python uses internally.
Abel
Hmm, I thought PCRE lookbehinds were constructed like `(?<!` ... not `(<!`
J-P
Thanks, J-P, that was exactly my mistake! (and a few others, updating now to correct them)
Abel
Abel, hats off to you. Really went above and beyond the call of duty there. Your latest solution seems to work perfectly! Seriously, thanks so much for all your help.
cadwag
You're welcome, glad to be of help. Make sure to check the explanation, it may help ;-) This visualizer proves very helpful when dealing with this kind of stuff: http://regex.powertoy.org/ (turn it into Perl mode)
Abel
+1  A: 

You can try, in Javascript:

var re = /(\W+(?!\d{2}[ap]m))/gi;
var input = 'Today, 3:30pm - Group Meeting to discuss "big idea"';
alert(input.replace(re, " "))
Rubens Farias
Interesting how many solutions are given. You replace any non-word character with a space, that means `discuss "big idea"` becomes `discuss big idea ` (i.e., extra spaces). Use something like `/(( )|\W)(?!\d{2})/g;` and `.replace(re, "$2")` (or was it `\1` in JS?). This will leave the spaces and remove the rest. I call this "conditional replacement".
Abel
hmm, yet another Markdown in comments bug: the extra space in `discuss big` got lost...
Abel
interesting approach, Abel, ty
Rubens Farias
+1  A: 

Python.

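import string
import time

punct = string.punctuation
s = 'Today, 3:30pm - Group Meeting:am to discuss "big idea" by our madam'
for item in s.split():
    try:
        # Leave the token untouched if it parses as a time such as "3:30pm"
        time.strptime(item, "%H:%M%p")
    except ValueError:
        # Not a time: drop every punctuation character from the token
        item = ''.join(c for c in item if c not in punct)
    print(item, end=' ')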

output

$ ./python.py
Today 3:30pm  Group Meetingam to discuss big idea by our madam

# change to s='Today, 15:30pm - Group 1,2,3 Meeting to di4sc::uss3: 2:3:4 "big idea" on 03:33pm or 16:47 is also good'

$ ./python.py
Today 15:30pm  Group 123 Meeting to di4scuss3 234 big idea on 03:33pm or 1647 is also good

NB: The method should be improved to check for a valid time only when necessary (by imposing conditions), but I will leave it as that for now.
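
One possible refinement, sketched under assumptions: try more than one time layout, so that a bare 24-hour stamp such as 16:47 also survives. The list of formats and the helper names `is_time` and `clean` are hypothetical, chosen only for this illustration.

```python
import string
import time

# Assumed set of accepted layouts: 12-hour with am/pm, and bare 24-hour.
TIME_FORMATS = ("%I:%M%p", "%H:%M")

def is_time(token):
    """Return True if the token parses under any of the assumed formats."""
    for fmt in TIME_FORMATS:
        try:
            time.strptime(token, fmt)
            return True
        except ValueError:
            continue
    return False

def clean(text):
    words = []
    for token in text.split():
        # Strip punctuation from the edges first, so "3:30pm," still parses.
        candidate = token.strip(string.punctuation)
        if not is_time(candidate):
            # Not a time: drop the remaining punctuation inside the token.
            candidate = ''.join(ch for ch in candidate if ch not in string.punctuation)
        if candidate:
            words.append(candidate)
    return ' '.join(words)

print(clean('Today, 3:30pm - Group Meeting to discuss "big idea" at 16:47'))
# Today 3:30pm Group Meeting to discuss big idea at 16:47
```

With these formats a token like 15:30pm is no longer treated as a time, since it matches neither layout. Joining the surviving tokens with a single space also avoids the double space left by the removed dash.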

ghostdog74
Nice approach, but you need a few more tweaks to handle the 24-hour time stamp requirement ("15:30" instead of "3:30pm")
Ned Deily
What do you mean? It doesn't matter, right? %H runs from 00 to 23 inclusive.
ghostdog74
`16:47` becomes `1647` in your example, I think that's what Ned means.
Abel
Btw, though it isn't specified in the q., my solution allows time in text, yours splits on word boundaries prior to that: "This12:40 is late" is silly of course, not sure how the OP would want to deal with that (my solution leaves the colon, yours will delete it).
Abel
@Abel, I see. Anyway, there is much to take care of, since we are only working on limited data in this case. I will just leave it as that.
ghostdog74
+1 for the alternative non-regex approach anyway! Nice example.
Abel
A: 

import re

s = "Call me, my dear, at 3:30"

re.sub(r'[^\w :]', '', s)

'Call me my dear at 3:30'
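
One thing to keep in mind with `\w`: it also matches the underscore and, for Python 3 string patterns, non-ASCII letters and digits. A small comparison against an explicit ASCII class, using a made-up input string:

```python
import re

s = 'Caf\u00e9_menu, ready at 3:30!'   # hypothetical input; \u00e9 is "é"

# \w keeps the underscore and the accented letter.
with_word_class = re.sub(r'[^\w :]', '', s)

# An explicit ASCII class removes both.
ascii_only = re.sub(r'[^A-Za-z0-9 :]', '', s)

print(with_word_class)
print(ascii_only)
```

Which behavior is preferable depends on whether accented letters should count as alphanumeric for the data at hand.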

Jyotirmoy Bhattacharya