I have a CSV-like text file that has about 1000 lines. Between each record in the file is a long series of dashes. The records generally end with a \n, but sometimes there is an extra \n before the end of the record. Simplified example:

"1x", "1y", "Hi there"
-------------------------------
"2x", "2y", "Hello - I'm lost"
-------------------------------
"3x", "3y", "How ya
doing?"
-------------------------------

I want to replace the extra \n's with spaces, i.e. concatenate the lines between the dashes. I thought I would be able to do this (Python 2.5):

text = open("thefile.txt", "r").read()    
better_text = re.sub(r'\n(?!\-)', ' ', text)

but that seems to replace every \n, not just the ones that are not followed by a dash. What am I doing wrong?

I am asking this question in an attempt to improve my own regex skills and understand the mistakes that I made. The end goal is to generate a text file in a format that is usable by a specific VBA for Word macro that generates a styled Word document which will then be digested by a Word-friendly CMS.

+5  A: 

You need to exclude the line breaks at the end of the separating lines. Try this:

\n(?<!-\n)(?!-)

This regular expression uses a negative look-behind assertion to exclude any \n that’s preceded by a -.
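To see why the original pattern replaced too much, here is a small sketch run against the sample data from the question. It is written for Python 3 (the `re` calls are the same ones the question uses), and the shortened dash lines and the variable names `broken` and `fixed` are only for illustration.

```python
import re

# Sample from the question, with shorter dash lines for readability.
text = (
    '"1x", "1y", "Hi there"\n'
    '-----\n'
    '"2x", "2y", "Hello - I\'m lost"\n'
    '-----\n'
    '"3x", "3y", "How ya\n'
    'doing?"\n'
    '-----\n'
)

# Original attempt: keeps the \n BEFORE each dash line, but still
# replaces the \n AFTER each dash line (it is followed by a quote).
broken = re.sub(r'\n(?!\-)', ' ', text)

# With the look-behind added, a \n that directly follows a dash is
# left alone too, so only the stray \n inside record 3 is replaced.
fixed = re.sub(r'\n(?<!-\n)(?!-)', ' ', text)

print(repr(broken))
print(repr(fixed))
```

In this sample the original pattern replaces four of the seven line breaks, while the corrected one replaces only the break inside "How ya / doing?".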

Gumbo
Thanks, I see now. I failed to define the problem thoroughly before attempting a solution, then confused things further by presuming I was replacing all \n's when actually replacing only half.
fwkb
+1  A: 
re.sub(r'(?<!-)\n(?!-)', ' ', text)

(Hyphen doesn't need escaping outside of a character class.)
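As a quick check of both points, this sketch (Python 3, with a made-up two-record sample) runs the pattern with and without the backslash and compares the results. The variable names are illustrative only.

```python
import re

# Shortened sample: one clean record, one record with a stray line break.
text = '"1x", "1y"\n-----\n"3x", "How ya\ndoing?"\n-----\n'

# Escaped and unescaped hyphens behave identically outside a character class.
escaped = re.sub(r'(?<!\-)\n(?!\-)', ' ', text)
unescaped = re.sub(r'(?<!-)\n(?!-)', ' ', text)

print(escaped == unescaped)
print(repr(unescaped))

# Caveat: a line break right after a hyphen in the data is NOT joined,
# because the look-behind cannot tell it apart from a separator line.
hyphen_case = re.sub(r'(?<!-)\n(?!-)', ' ', 'a-\nb')
print(repr(hyphen_case))
```

One thing to watch for: a record whose text happens to end a line with a hyphen would be left unjoined by this pattern, as the last call shows.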

chaos
… and outside of a character range declaration and at the start or end of a character class. `[a-z-0-9]`, `[-a-z]` and `[a-z-]` are all valid character class declarations.
Gumbo
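The three character classes from the comment above can be checked with a short Python 3 sketch. The dictionary name `matches_hyphen` is just for illustration.

```python
import re

# Each class contains a hyphen that is literal because of where it sits:
# between two ranges, at the start, or at the end of the class.
patterns = [r'[a-z-0-9]', r'[-a-z]', r'[a-z-]']

# Does each class match a single literal hyphen?
matches_hyphen = {p: bool(re.match(p + r'$', '-')) for p in patterns}

# For contrast: inside a plain range the hyphen is an operator, not a literal.
plain_range = bool(re.match(r'[a-z]$', '-'))

print(matches_hyphen)
print(plain_range)
```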
+7  A: 

This is a good place to use a generator function to skip the lines of ----'s and yield something that the csv module can read.

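import csv

def readCleanLines( someFile ):
    # Skip separator lines (nothing but dashes) and blank lines; pass the rest through.
    for line in someFile:
        if line.strip() == len(line.strip())*'-':
            continue
        yield line

someFile = open( "thefile.txt", "r" )
reader= csv.reader( readCleanLines( someFile ) )
for row in reader:
    print row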

This should handle the line breaks inside quotes seamlessly and silently.
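One caveat, shown as a Python 3 sketch: the sample in the question has a space after each comma, and with the csv module's default settings a quote that follows a space is not treated as opening a quoted field, so the multi-line record gets split. Passing `skipinitialspace=True` appears to be enough to fix that for data shaped like the sample. The snake_case names, the `io.StringIO` stand-in for the file, and the shortened dash lines are assumptions made for illustration.

```python
import csv
import io

def read_clean_lines(some_file):
    """Yield every line except blank lines and lines made only of dashes."""
    for line in some_file:
        stripped = line.strip()
        if stripped == '-' * len(stripped):
            continue
        yield line

SAMPLE = (
    '"1x", "1y", "Hi there"\n'
    '-----\n'
    '"2x", "2y", "Hello - I\'m lost"\n'
    '-----\n'
    '"3x", "3y", "How ya\n'
    'doing?"\n'
    '-----\n'
)

# Default settings: the space before each quote defeats the quoting,
# so the third record is split across two rows.
default_rows = list(csv.reader(read_clean_lines(io.StringIO(SAMPLE))))

# skipinitialspace=True ignores the space after each comma, so the
# quotes are honoured and the embedded line break stays inside the field.
rows = list(csv.reader(read_clean_lines(io.StringIO(SAMPLE)),
                       skipinitialspace=True))

# Replace the embedded line break with a space, as the question asks.
rows = [[field.replace('\n', ' ') for field in row] for row in rows]

for row in rows:
    print(row)
```

If the real file has no space after the commas, the default reader settings should already behave as described.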


If you want to do other things with this file, for example, save a copy with the ---- lines removed, you can do this.

from __future__ import with_statement  # needed on Python 2.5 only

with open( "source", "r" ) as someFile:
    with open( "destination", "w" ) as anotherFile:
        for line in readCleanLines( someFile ):
            anotherFile.write( line )

That will make a copy with the ---- lines removed. This isn't really worth the effort, since reading and skipping the lines is very, very fast and doesn't require any additional storage.

S.Lott
awesome idea to strip lines with a generator!
orip
BTW - don't you need len(line.strip()) instead of len(line)?
orip
@orip: That would be a bug, thank you.
S.Lott
@S.Lott: Comment using the non-word "resave" deleted. Use case added.
fwkb
Thanks! I will definitely put that to use!
fwkb
@fwkb: Stack Overflow maintains its own change history, saving you from having to track changes via extra comments. You can simply make changes and not worry about leaving some kind of audit trail. It's already tracked.
S.Lott
A: 

A RegEx isn't always the best tool for the job. How about running it through something like "Split" or "Tokenize" first? (I'm sure Python has an equivalent.) Then you have your records and can assume newlines are just continuations.
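A minimal sketch of that idea in Python 3, without any regex: treat each dash-only line as a record boundary and join whatever lines fall between two boundaries. The function name `join_records` and the shortened dash lines are made up for this example.

```python
def join_records(lines):
    """Group lines between dash-only separators into one string per record."""
    records = []
    current = []
    for line in lines:
        stripped = line.strip()
        if stripped and set(stripped) == {'-'}:
            # Separator line: close the record collected so far.
            if current:
                records.append(' '.join(current))
                current = []
        elif stripped:
            # Part of a record; continuation lines are joined with a space.
            current.append(stripped)
    if current:
        # Last record when the input does not end with a separator.
        records.append(' '.join(current))
    return records

SAMPLE = (
    '"1x", "1y", "Hi there"\n'
    '-----\n'
    '"2x", "2y", "Hello - I\'m lost"\n'
    '-----\n'
    '"3x", "3y", "How ya\n'
    'doing?"\n'
    '-----\n'
)

records = join_records(SAMPLE.splitlines())
for record in records:
    print(record)
```

Because the separator test looks at the whole line, a hyphen inside a record (as in "Hello - I'm lost") does not confuse it.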

Eric Nicholson