ansaurus

Question

Simple python/Regex problem: Removing all new lines from a file

Answer 1

A:

import re
cleaned = re.sub("\n", "", file_contents)  # file_contents holds the file's text as a string

Alix Axel 2009-08-08 19:21:59

So I'm going to have to manually open the file, read it character by character into a string, do a sub, and write it back to the file character by character?

Chris 2009-08-08 19:24:36

Or better: re.sub("[\n\r]+", "", file_contents)

Diaa Sami 2009-08-08 19:26:37

@Chris: `open(fname).read()` gives you a string; after filtering, you can write it out with `open(fname2, 'w').write(output_string)`. What exactly do you mean by character by character?

SilentGhost 2009-08-08 19:26:40

@Chris: I guess so, but I'm not an expert at Python.

Alix Axel 2009-08-08 19:32:30
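
The read, filter, write round trip discussed in the comments above can be sketched as follows. This is only an illustration, not code posted by anyone in the thread: the helper names `strip_newlines_regex` and `strip_file` are hypothetical, Python 3 is assumed, and the whole file is assumed to fit in memory.

```python
import re


def strip_newlines_regex(text):
    """Return text with every carriage return and line feed removed."""
    # [\r\n]+ matches runs of CR and/or LF, so Windows (\r\n) endings go too.
    return re.sub(r"[\r\n]+", "", text)


def strip_file(src_path, dst_path):
    """Read src_path in one go, strip line breaks, write the result to dst_path."""
    # newline="" turns off newline translation, so \r\n reaches the regex intact.
    with open(src_path, newline="") as src:
        contents = src.read()
    with open(dst_path, "w", newline="") as dst:
        dst.write(strip_newlines_regex(contents))
```

No character-by-character loop is needed: `read()` returns the whole file as one string and `write()` accepts one.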

Answer 2

+3 A:

I wouldn't use a regex for simply replacing newlines - I'd use the string's replace() method. Here's a complete script:

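# Read the whole input file; the with block closes it automatically.
with open('input.txt') as f:
    contents = f.read()
new_contents = contents.replace('\n', '')
# Write the newline-free text to a separate output file.
with open('output.txt', 'w') as f:
    f.write(new_contents)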

RichieHindle 2009-08-08 19:33:14

Nice, the new line is inside single quotes. Does that matter in Python?

Alix Axel 2009-08-08 19:34:32

Nope.

SilentGhost 2009-08-08 19:40:28

Strings can use single or double quotes in Python - they're equivalent.

RichieHindle 2009-08-08 19:49:10
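
A quick way to check the point about quotes is to compare the literals directly. The sketch below is an illustration only; the variable names are made up. It also shows the one case where the spelling does matter, which is the raw-string prefix rather than the quote style.

```python
# Single- and double-quoted literals produce equal strings.
single = 'line one\n'
double = "line one\n"
same = single == double  # both end in a real newline character

# A raw string is what differs: r'\n' is a backslash followed by 'n'.
raw = r'\n'
raw_length = len(raw)

# replace('\n', '') removes real newlines, not the two-character raw form.
cleaned = 'a\nb'.replace('\n', '')
untouched = r'a\nb'.replace('\n', '')

print(same, raw_length, repr(cleaned), repr(untouched))
```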

Answer 3

+4 A:

The two main alternatives: read everything in as a single string and remove newlines:

clean = open('thefile.txt').read().replace('\n', '')

or, read line by line, removing the newline that ends each line, and join it up again:

clean = ''.join(l[:-1] for l in open('thefile.txt'))

The former alternative is probably faster, but, as always, I strongly recommend you MEASURE speed (e.g., use python -mtimeit) in cases of your specific interest, rather than just assuming you know how it will perform. REs are probably slower, but, again: don't guess, MEASURE!

So here are some numbers for a specific text file on my laptop:

$ python -mtimeit -s"import re" "re.sub('\n','',open('AV1611Bible.txt').read())"
10 loops, best of 3: 53.9 msec per loop
$ python -mtimeit "''.join(l[:-1] for l in open('AV1611Bible.txt'))"
10 loops, best of 3: 51.3 msec per loop
$ python -mtimeit "open('AV1611Bible.txt').read().replace('\n', '')"
10 loops, best of 3: 35.1 msec per loop

The file is a version of the King James Bible, downloaded and unzipped from here (I do think it's important to run such measurements on one easily fetched file, so others can easily reproduce them!).

Of course, a few milliseconds more or less on a file of 4.3 MB, 34,000 lines, may not matter much to you one way or another; but as the fastest approach is also the simplest one (far from an unusual occurrence, especially in Python;-), I think that's a pretty good recommendation.

Alex Martelli 2009-08-08 19:54:28
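
The same three-way comparison can also be run from inside Python with the timeit module. The sketch below is an assumption-laden stand-in, not the measurement from the answer: it times a small generated sample file instead of AV1611Bible.txt, the function names are made up, and the numbers it prints will differ from those quoted above.

```python
import os
import re
import tempfile
import timeit

# Build a throwaway sample file; a stand-in for the real text file.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("In the beginning\n" * 2000)


def read_all(p):
    with open(p) as f:
        return f.read()


def with_regex():
    return re.sub("\n", "", read_all(path))


def with_join():
    with open(path) as f:
        return "".join(line[:-1] for line in f)


def with_replace():
    return read_all(path).replace("\n", "")


# The three approaches must agree before their timings mean anything.
assert with_regex() == with_join() == with_replace()

# Best of 3 repeats of 10 runs each, stored as seconds per run.
timings = {}
for func in (with_regex, with_join, with_replace):
    best = min(timeit.repeat(func, number=10, repeat=3))
    timings[func.__name__] = best / 10

for name, seconds in timings.items():
    print("%-13s %.3f msec per run" % (name, seconds * 1000))

os.remove(path)
```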

How about string.strip()? i.e. python -mtimeit "''.join(l.strip() for l in open('AV1611Bible.txt'))"

hughdbrown 2009-08-08 20:14:03

That has different semantics, since it would remove leading and trailing spaces, which is NOT part of the specs (even rstrip would still remove trailing spaces, again outside the specs). Anyway, both are very marginally slower than using l[:-1], by about 3%, repeatably.

Alex Martelli 2009-08-08 22:06:41
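
The semantic differences mentioned above are easy to see on a couple of sample lines. This is an illustrative sketch with made-up data. It also shows one case the thread does not discuss: when the final line has no trailing newline, l[:-1] removes a real character, whereas rstrip('\n') removes newlines only.

```python
lines = ["  indented \n", "last line, no newline"]

sliced = [l[:-1] for l in lines]          # drops the last character unconditionally
stripped = [l.strip() for l in lines]     # drops leading and trailing whitespace
rstripped = [l.rstrip() for l in lines]   # drops all trailing whitespace
newline_only = [l.rstrip("\n") for l in lines]  # drops trailing newlines only

for name, result in [("l[:-1]", sliced), ("strip()", stripped),
                     ("rstrip()", rstripped), ("rstrip('\\n')", newline_only)]:
    print("%-13s %r" % (name, result))
```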

Answer 4

+1 A:

I know this is a python learning problem, but if you're ever trying to do this from the command-line, there's no need to write a python script. Here are a couple of other ways:

cat $FILE | tr -d '\n'

awk '{printf("%s", $0)}' $FILE

Neither of these has to read the entire file into memory, so if you've got an enormous file to process, they might be better than the python solutions provided.

Jefromi 2009-08-08 19:58:10

Not python, but +1 for mentioning the large file problem, which is always good to keep in mind.

Pinochle 2009-08-08 20:13:31

No need for cat with tr: tr -d '\n' < file

ghostdog74 2009-08-09 01:01:27
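
For completeness, the large-file concern can be handled in Python too by processing the file in fixed-size chunks rather than reading it all at once. This is a minimal sketch under stated assumptions: the function name `strip_newlines_streaming` and the 64 KB default chunk size are arbitrary choices, and the file is handled as bytes so that, like tr -d '\n', only line feed bytes are removed.

```python
def strip_newlines_streaming(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path without line feeds, one chunk at a time."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            dst.write(chunk.replace(b"\n", b""))
```

Reading by chunk rather than by line keeps memory bounded even when the input has very few newlines, since a single "line" could otherwise be as large as the file.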
