views:

308

answers:

3

I have a file which has about 12 million lines; each line looks like this:

0701648016480002020000002030300000200907242058CRLF

What I'm trying to accomplish is adding a row number before the data on each line; the numbers should have a fixed length.

The idea behind this is to be able to do a bulk insert of this file into a SQL Server table, and then perform certain operations with it that require each line to have a unique identifier. I've tried doing this on the database side, but I haven't been able to get good performance (under 4 minutes at least; under 1 minute would be ideal).

Right now I'm trying a solution in Python that looks something like this:

file = open('file.cas', 'r')
lines = file.readlines()  # reads all 12 million lines into memory at once
file.close()
text = ['%d %s' % (i, line) for i, line in enumerate(lines)]
output = open('output.cas', 'w')
output.writelines(text)  # writelines takes the list directly; no join needed
output.close()

I don't know if this will work, but it'll give me an idea of how it will perform and what the side effects are before I keep trying new things. I also thought about doing it in C so I'd have better control over memory.

Would it help to do it in a low-level language? Does anyone know a better way to do this? I'm pretty sure it has been done before, but I haven't been able to find anything.

Thanks.

+4  A: 

Oh god no, don't read all 12 million lines in at once! If you're going to use Python, at least do it this way:

file = open('file.cas', 'r')
try:
    output = open('output.cas', 'w')
    try:
        output.writelines('%d %s' % tpl for tpl in enumerate(file))
    finally:
        output.close()
finally:
    file.close()

That uses a generator expression which runs through the file, processing one line at a time.
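
For comparison, here's a minimal sketch of the same streaming approach using with blocks (assuming Python 2.6+, where with closes the files automatically):

# Same one-line-at-a-time idea; `with` handles the closing for us
with open('file.cas', 'r') as infile:
    with open('output.cas', 'w') as outfile:
        for i, line in enumerate(infile):
            outfile.write('%d %s' % (i, line))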

David Zaslavsky
Haha, thank you. Is there a way to read a file in Python in parts, or to work on one in a more low-level manner? This is why I was thinking about doing it in C. Thanks!
Alan FL
David's method does read the file by parts - it reads and writes one line at a time.
Andy Balaam
Yeah, Andy's right - though the fact that it reads one line at a time is deeply disguised in Python voodoo ;-) Or at least it looks like voodoo if you're not used to it.
David Zaslavsky
I feel a great disturbance in the force, as if a million RAM chips wept in horror, and were suddenly silenced.
Stefano Borini
Haha... wow, I'm always amazed by Python. I think I sort of get it. Any ideas how I could change it so the numbers have a fixed length? Thanks a lot, by the way.
Alan FL
I think for now I'll use your solution; it doesn't take too long, and now that I sort of understand how it works it might be more maintainable. Thanks a lot!
Alan FL
For a fixed length, I think it's `%12d` for 12 places (or `%9d` for 9 places, etc.).
David Zaslavsky
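
A quick illustration of the difference (the zero-padded `%012d` form is an assumption about what "fixed length" should mean here; `%12d` pads with spaces instead):

>>> '%12d' % 42
'          42'
>>> '%012d' % 42
'000000000042'

So, if leading zeros are what you want, the line in the answer above would become output.writelines('%012d %s' % tpl for tpl in enumerate(file)).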
+2  A: 

Why don't you try `cat -n`?

Stefano Borini
I will definitely try it, but for now I can't use cat.
Alan FL
OK, then the solution David proposes is good, although, as plafayette points out, it could be slower than cat.
Stefano Borini
I just finished running a few tests with a smaller sample (2M records), and both solutions (yours and David's) take exactly the same time to run. Tomorrow I'll test what happens with 12M, but I'm not expecting much of a difference. Thank you!
Alan FL
This is interesting... it means Python is as fast as C for this (it doesn't surprise me, since the task is I/O-bound, but it's nice to see it in practice).
Stefano Borini
+2  A: 

Stefano is right:

$ time cat -n file.cas > output.cas

Use time just so you can see how fast it is. It'll be faster than Python, since cat is pure C code.
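
One detail worth checking first (this is how GNU cat behaves, at least): cat -n right-aligns the line number in a six-character field followed by a tab, which may not match the fixed-length, zero-padded numbering the question seems to want:

$ printf 'first\nsecond\n' | cat -n
     1	first
     2	second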

Pierre-Antoine LaFayette