What am I doing wrong/what can I do?

import sys
import string

def remove(file):
    punctuation = string.punctuation
    for ch in file:
        if len(ch) > 1:
            print('error - ch is larger than 1 --| {0} |--'.format(ch))
        if ch in punctuation:
            ch = ' '
            return ch
        else:
            return ch

ref = (open("ref.txt","r"))
test_file = (open("test.txt", "r"))

dictionary = ref.read().split()
file = test_file.read().lower()
file = remove(file)
print(file)

This is in Python 3.1.2

A: 

Check out the re (regular expression) module. It has a "sub" function to replace strings that match regular expressions.
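
A minimal sketch of that approach, assuming the goal is the one in the question (replace each punctuation character with a space). The names PUNCT_RE and remove_punctuation are illustrative and not from the original post.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
# re.escape protects characters such as ']' and '\' that are special in a class.
PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + ']')

def remove_punctuation(text):
    """Return a copy of text with each punctuation character replaced by a space."""
    return PUNCT_RE.sub(' ', text)
```

Passing '' instead of ' ' to sub would delete the punctuation rather than replace it.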

Andrew E. Falcon
+1  A: 

In Python, strings are immutable, so you need to create a new string with your changes.

There are a few ways to do this:

One is using a generator expression to inspect the characters and keep only those that are not punctuation.

def remove(file):
  return ''.join(ch for ch in file if ch not in string.punctuation)

You could also call functions to test or translate each character, which lets you raise a "weird character" exception or do some other processing (TranslateCh and CheckCh below are placeholders you would write yourself):

def remove(file):
  return ''.join(TranslateCh(ch) for ch in file if CheckCh(ch))

Another alternative is the string methods replace or translate. translate provides a nice mechanism for this (and is more efficient than building a list); see Alex's answer.

Or... you could collect the characters in a list inside a for loop and join them at the end, but that's a little "unpythonic".
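
For comparison, the explicit loop might look like the sketch below. It drops punctuation rather than replacing it, matching the join-based version in this answer; the name remove_with_loop is made up for illustration.

```python
import string

def remove_with_loop(text):
    """Drop punctuation characters by collecting the remaining ones in a list."""
    kept = []
    for ch in text:
        if ch not in string.punctuation:
            kept.append(ch)
    # Strings are immutable, so the result is built once at the end.
    return ''.join(kept)
```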

Stephen
+1. No need for the brackets.
Adam Bernier
thank you so much for that, greatly appreciated
Ajay Hopkins
@Adam : true, thanks.
Stephen
`string.maketrans` is for byte strings, and deprecated in Python 3 in favor of `bytes.maketrans` -- definitely not what the OP needs in Python 3.
Alex Martelli
@Alex : hm, interesting, thanks. removed the suggestion.
Stephen
+2  A: 

In this code...:

for ch in file:
    if len(ch) > 1:

the weirdly-named file (besides breaking the best practice of not hiding builtin names with your own identifier) is not a file, it's a string -- which means unicode, in Python 3. That makes no difference to the fact that the loop yields single characters (unicode characters, not bytes, in Python 3), so len(ch) == 1 is absolutely guaranteed by the rules of the Python language. Not sure what you're trying to accomplish with that test (rule out some subset of unicode characters?), but whatever it is you think you're achieving, I assure you that you're not achieving it and should recode that part.

Apart from this, you're returning -- and therefore exiting the function -- on the very first pass through the loop, so the function returns just one character (the first one in the string, or a space if that first one was a punctuation character).
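
A stripped-down sketch of the same control flow shows the effect. The function name remove_first_only is invented here; the logic mirrors the question's loop without the len check.

```python
import string

def remove_first_only(text):
    # Both branches return inside the loop, so only the first
    # character of text is ever examined.
    for ch in text:
        if ch in string.punctuation:
            return ' '
        return ch

print(repr(remove_first_only("hello, world")))  # 'h'
print(repr(remove_first_only("!hello")))        # ' '
```

With an empty string the loop body never runs and the function falls off the end, returning None.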

The suggestion to use the translate method, which I saw in another answer, is the right one, but that answer used the wrong version of translate (one applying to byte strings, not to unicode strings as you need for Python 3). The proper unicode version is simpler, and transforms the whole body of your function into just two statements:

trans = dict.fromkeys(map(ord, string.punctuation), ' ')
return file.translate(trans)
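
Put together as a complete function, those two statements might look like the sketch below. The parameter is renamed to text to avoid the misleading name file, and the table is built once at module level; both choices are assumptions made for this example.

```python
import string

# str.translate looks characters up by code point, so the keys are ord() values.
# Every ASCII punctuation character maps to a single space.
PUNCT_TO_SPACE = dict.fromkeys(map(ord, string.punctuation), ' ')

def remove(text):
    """Return text with each punctuation character replaced by a space."""
    return text.translate(PUNCT_TO_SPACE)
```

If the punctuation should be deleted instead of replaced, mapping each code point to None, or building the table with str.maketrans('', '', string.punctuation), has that effect.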
Alex Martelli