How can I remove special characters and letters from a line read from a text file while preserving the whitespace? Let's say we have the following contents in a file:

16 ` C38# 26535 2010 4 14 2 7 7 3 8^@1 2 15 100 140 30 $ 14^] (2003 2 ! -6 �021 0 � 14 ! 2 3! 1 0 35454 0$ ^@0 0 0 "0 "63 194 (56 188 26 27" 24 0 0 10� 994! 8 58 0 0 " � 0 0 32�47 32767 32767 ! 1

The output basically should be:

16 38 26535 2010 4 14 2 7 7 3 8 1 2 15 100 140 30 14 2003 2 -6 021 0 14 2 3 1 0 35454 0 0 0 0 0 63 194 56 188 26 27 24 0 0 10 994 8 58 0 0 0 0 32 47 32767 32767 1

What's the most straightforward way to do this?

+2  A: 
import re

output_string = re.sub(r'[^\d\s-]', '', input_string)

The pattern [^\d\s-] will match anything that's not a digit, dash, or whitespace - thus, replacing any match with an empty string will remove everything except the numbers (including minus signs) and whitespace.

Amber
+1  A: 

If you want to keep just digits, plus and minus signs, and all whitespace, simplest might be

import re
   ...
line = re.sub(r'[^\d\s+-]+', '', line)

which reads "replace each sequence of one or more characters that are not digits, whitespace, or plus/minus signs with nothing".

Faster would be the translate method of strings, but it is quite a bit less simple to set up, so, since you ask for "straightforward", I suggest the re approach (now brace for the sure-to-come screeches of the re-haters...;-).
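For reference, here is a minimal sketch of what the translate approach might look like in Python 3. It assumes the same set of kept characters as the regex above (digits, whitespace, plus and minus); the names KEEP, KeepOnly, TABLE and strip_line are made up for this illustration. The table is a dict subclass whose __missing__ returns None, which str.translate treats as "delete this character".

```python
import string

# Characters to keep: digits, all whitespace, and the plus and minus signs.
KEEP = string.digits + string.whitespace + "+-"


class KeepOnly(dict):
    """Translation table that deletes every character it does not list."""

    def __missing__(self, codepoint):
        # str.translate removes a character when its table entry is None.
        return None


# Map each kept character's code point to itself; anything else is dropped.
TABLE = KeepOnly((ord(ch), ch) for ch in KEEP)


def strip_line(line):
    # Hypothetical helper name, used only for this illustration.
    return line.translate(TABLE)


print(strip_line("16 ` C38# 26535 ! -6"))
```

Like the regex, this only deletes characters, so two numbers separated solely by junk characters end up glued together.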

Alex Martelli
A: 
import string
''.join([x for x in s if x in string.digits+string.whitespace])

or if what you really want is a list of the numbers:

import re
re.findall(r'\d+', s)
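If negative numbers matter, one possible variation is to allow an optional sign in front of the digits. This is only a sketch: extract_numbers is a hypothetical name, and it assumes a sign counts only when it sits directly before a digit.

```python
import re


def extract_numbers(line):
    # An optional '+' or '-' immediately followed by one or more digits.
    # Tokens are returned as strings so leading zeros (e.g. '021') survive.
    return re.findall(r'[+-]?\d+', line)


tokens = extract_numbers("16 ` C38# -6 (2003 8^@1")
print(tokens)

# Convert to integers if numeric values are wanted rather than text.
values = [int(tok) for tok in tokens]
print(values)
```

Note that with this pattern an input such as "3-4" would yield '3' and '-4', which may or may not be what is intended.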
Matt Curtis
A: 

LOL @Alex's regex comment... hopefully there aren't too many haters. That said, although they're faster because they're executed in C, regexes aren't my first choice... perhaps I've been biased by the famous jwz quote: '''Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.'''

I will say that solving this homework exercise is tricky because solutions are fraught with errors, as seen in the existing solutions so far. Perhaps this is serendipity because it requires the OP to debug and correct those suggestions instead of just cutting-and-pasting them verbatim into their assignment solution.

As far as the problems go, they include but are not limited to:

  • leaving successive spaces
  • removing negative signs, and
  • merging multiple numbers together
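To make the merging and successive-space problems concrete, here is a small sketch. It assumes the desired result is the one shown in the question, where junk between two numbers acts as a separator and runs of whitespace collapse to a single space; clean_line is a hypothetical name.

```python
import re

sample = "8^@1 $ 14^] (2003 2 ! -6"

# Deleting unwanted characters outright glues '8' and '1' into '81'
# and leaves runs of spaces behind.
stripped = re.sub(r'[^\d\s+-]+', '', sample)
print(repr(stripped))


def clean_line(line):
    # Replace each run of unwanted characters with a space so neighbouring
    # numbers stay apart, then collapse all whitespace to single spaces.
    spaced = re.sub(r'[^\d\s+-]+', ' ', line)
    return ' '.join(spaced.split())


print(clean_line(sample))
```

The split/join step also drops leading and trailing whitespace, including the newline at the end of a line read from a file, so this trades exact whitespace preservation for output that matches the question's example.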

Bottom line... which solutions do I like best? I would start with one of the following and debug from there:

For regex, I'll pick:

@Alex's solution or @Matt's if I want just the data instead of the "golden" string

For string processing, I'll modify @Matt's solution to:

import string
keep = set(string.whitespace+string.digits+'+-')
line = ''.join(x for x in line if x in keep)

Finally, @Greg has a good point. Without a clear spec, these are just partial solutions.

wescpy