Hi everyone,

I have to read a file and, for each field between the delimiters, remove the surrounding white space. I have written the following program in Jython.

When I try to write the cleaned lines back, they are written at the end of the source file.

filesrc = open('c:/FILE/split_doc.txt','r+')
for list in filesrc.readlines():
    #split the records by the delimiter
    fields = list.split(',')
    list = ",".join([s.strip() for s in fields])
    filesrc.writelines(list+"\n")

filesrc.close()

So I modified the program and added a seek() call so that I overwrite the source lines. It worked to some extent, except that it adds two extra lines at the end, which suggests a problem with the seek part.

The modified program is:

filesrc = open('c:/ODI_FILE/split_doc.txt','r+')
lines=0
for list in filesrc.readlines():
    #split the records by the delimiter
    fields = list.split(',')
    list = ",".join([s.strip() for s in fields])
    filesrc.seek(lines)
    filesrc.writelines(list+"\n")
    lines += len(list+"\n")

filesrc.close()

Please help me with the correct logic.

The source file, with the extra white space:

52       ,William   ,Kudo       ,28/03/199300:00:00
11,Andrew,      Andersen,22/02/199900:00:00
12,John        ,Galagers,20/04/200000:00:00
13,Jeffrey        ,Jeferson,10/06/198800:00:00
20,Jennie,Daumesnil,28/02/198800:00:00
21,Steve,Barrot,24/09/199200:00:00
22,Mary,Carlin,14/03/199500:00:00
30,Paul,Moore,11/03/199900:00:00

This is my incorrect output:

52,William,Kudo,28/03/199300:00:00
11,Andrew,Andersen,22/02/199900:00:00
12,John,Galagers,20/04/200000:00:00
13,Jeffrey,Jeferson,10/06/198800:00:00
20,Jennie,Daumesnil,28/02/198800:00:00
21,Steve,Barrot,24/09/199200:00:00
22,Mary,Carlin,14/03/199500:00:00
30,Paul,Moore,11/03/199900:00:00
9500:00:00
30,Paul,Moore,11/03/199900:00:00

Here, the last two lines should not be there.

Please suggest the correct and fastest approach. This is only a sample file, and the program will have to work for millions of rows.

Is there a way to make this logic work with a while loop too?

A: 

You are overwriting as you go, but your final results are shorter than the original, so you are getting the last X characters of the original bleeding through, where X is the difference in size from the original to the new version. The extra .seek() and truncate() calls in this version will seek to the end of your new output and cut off the rest of the file.

filesrc = open('c:/ODI_FILE/split_doc.txt','r+')
lines=0
for list in filesrc.readlines():
    #split the records by the delimiter
    fields = list.split(',')
    list = ",".join([s.strip() for s in fields])
    filesrc.seek(lines)
    filesrc.writelines(list+"\n")
    lines += len(list+"\n")
filesrc.seek(lines)
filesrc.truncate()
filesrc.close()
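
The bleed-through is easy to reproduce on a throwaway file. The following is only a minimal sketch and not part of the code above: the helper names overwrite_start, read_all and demo are made up for illustration, and it assumes Python 3 or a Jython release with a compatible file API. It writes a short string over a longer one, with and without truncate().

```python
import os
import tempfile

def overwrite_start(path, new_text, truncate):
    """Write new_text at offset 0 of an existing file (hypothetical helper)."""
    f = open(path, 'r+')
    try:
        f.seek(0)
        f.write(new_text)
        if truncate:
            # Cut the file off at the current position, dropping the old tail.
            f.truncate()
    finally:
        f.close()

def read_all(path):
    """Return the whole file as one string (hypothetical helper)."""
    f = open(path, 'r')
    try:
        return f.read()
    finally:
        f.close()

def demo(truncate):
    """Create a temp file holding 'abcdefgh', overwrite it with 'xyz', return the result."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        f = open(path, 'w')
        f.write('abcdefgh')
        f.close()
        overwrite_start(path, 'xyz', truncate)
        return read_all(path)
    finally:
        os.remove(path)
```

Without the truncate() call, demo(False) returns 'xyzdefgh': the last five characters of the old content survive, just like the two stray lines in the question. With it, demo(True) returns 'xyz'.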
teepark
Thanks so much, it worked. I am facing another issue now: I get a Java out-of-memory error when dealing with 500,000 rows. I have changed the setting to 512mx, but it still fails. I had the same issue with a for loop in another program, and it worked once I used a while loop instead. Is it possible to change this program to use a while loop? Thanks again for your prompt help.
kdev
readlines() will read the entire contents into a list in memory. The problem with using an iterator instead is that you are seek()ing in the same file which I suspect will cause problems with an iterator. To use a while loop, you will need two pointers into your file and seek between them. Is it feasible to read from one file and write to another? That would simplify your task.
Mark Peters
Initially I did write to another file, but I later realized that my requirement is to keep the name of the incoming file. I cannot change the file name, so I have to read from and write to the same file.
kdev
You DON'T need to read and write into the same file. As others have pointed out, it is perilous. Consider what happens if power fails. DON'T read the whole file into memory. Rename old file to have a name that includes a timestamp, read old file, write/flush/close new file. Delete the backup file much later when you are sure you don't need it any more.
John Machin
There might be perils overwriting the same file when you aren't reading the whole file into memory (reading the next chunk gives you some of the new version from the last iteration), so either one of those improvements could be perilous, but doing them *both* is certainly a much better approach. Operate on one chunk at a time, outputting to a new file, then copy the new file to the old one's location.
teepark
+1  A: 

You don't want to write to the same file while you're reading it. It's technically possible, but that path is fraught with trouble and misery.

Here's the plain and simple process you should follow:

  • read the whole file into a string then close the file
  • split the string on newlines into a list
  • process each line to remove extra spacing
  • rejoin the list into a string
  • overwrite the source file with the new cleaned data
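
The five steps above could be sketched as follows. This is an illustration under assumptions, not a tested solution for the asker's data: the function name clean_file_in_memory is hypothetical, the delimiter is assumed to be a comma as in the question, and the whole file is assumed to fit in memory.

```python
def clean_file_in_memory(path):
    """Read the whole file, strip each comma-separated field, write it back."""
    # Step 1: read everything into one string, then close the file.
    f = open(path, 'r')
    try:
        text = f.read()
    finally:
        f.close()

    # Step 2: split the string on newlines into a list of lines.
    lines = text.split('\n')

    # Step 3: strip the white space around every field of every line.
    cleaned = [','.join([s.strip() for s in line.split(',')]) for line in lines]

    # Step 4: rejoin the list into a single string.
    new_text = '\n'.join(cleaned)

    # Step 5: overwrite the source file; mode 'w' empties it first,
    # so no leftover characters from the old, longer content remain.
    f = open(path, 'w')
    try:
        f.write(new_text)
    finally:
        f.close()
```

Because the file is reopened in 'w' mode, no seek() or truncate() bookkeeping is needed.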

If you don't want to load the whole file into memory at once, then try this process:

  • open the file for reading
  • read line by line
  • write cleaned lines to a new temp output file
  • when all lines are written, delete the original file
  • rename temp file to original name
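
A minimal sketch of this second process, again under assumptions: clean_file_via_temp is a hypothetical name, the delimiter is a comma, and the temp file is created in the same directory as the original so that the rename does not cross volumes. Only one line is held in memory at a time.

```python
import os
import tempfile

def clean_file_via_temp(path):
    """Stream path line by line into a temp file, then swap the temp file in."""
    src = open(path, 'r')
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    dst = os.fdopen(fd, 'w')
    try:
        for line in src:  # the file object yields one line at a time
            fields = line.split(',')
            # strip() also removes the trailing newline, so add it back.
            dst.write(','.join([s.strip() for s in fields]) + '\n')
    finally:
        src.close()
        dst.close()
    # Reached only if every line was written without an error.
    os.remove(path)            # delete the original file
    os.rename(tmp_path, path)  # give the temp file the original name
```

If an error occurs while copying, the original file is left untouched and the partial temp file stays behind for inspection. Between the delete and the rename there is a short window in which the original name does not exist, which is the price of following the steps literally.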

My recommendation is to write it both ways and see what works or doesn't work and which way is faster, rather than assume you can't read it all into memory just because it is millions of lines. Maybe it will work just fine.

Also, you can certainly make this work with a while loop as well. To do so, you will want to read the Python docs on the form of a while loop and do some experiments. How you write that loop will depend on how you loaded the file: all at once into a string and then split into a list, or line by line directly from the file. For either case, how do you know how much work the while loop will have to do, how will you advance from one piece of work to the next, and how will you know when it's done? If you can answer these, you can write your loop.
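
For the line-by-line case, one possible shape of such a loop is sketched below. It is an assumption-laden illustration rather than the only answer to the three questions: clean_with_while is a hypothetical name, the output goes to a separate file, and the loop relies on readline() returning an empty string at end of file.

```python
def clean_with_while(src_path, dst_path):
    """Copy src_path to dst_path, stripping each comma-separated field.

    Returns the number of lines processed.
    """
    src = open(src_path, 'r')
    dst = open(dst_path, 'w')
    count = 0
    try:
        line = src.readline()
        # readline() returns '' only at end of file; a blank line is '\n'.
        while line != '':
            fields = line.split(',')
            dst.write(','.join([s.strip() for s in fields]) + '\n')
            count += 1
            line = src.readline()  # advance to the next piece of work
    finally:
        src.close()
        dst.close()
    return count
```

Here the amount of work is not known up front; the loop advances by calling readline() again and stops when it gets an empty string back.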

Todd
Thanks for the suggestion, I will try to work it out that way.
kdev
A: 

This does not answer your question, but have you considered not doing this with Jython?

Have you tried sed?

Peter Lang