I have a 1GB binary file on another system.

Requirement: FTP/download the file and convert it from binary to CSV on the main system.

The converted file will be several times larger, ~8GB.

What is the most common way of doing something similar to this?
Should this be a two-step independent process: download, then convert?
Should I download small chunks at a time and convert while downloading?

I don't know the most efficient way to do this. Also, what should I be cautious of with files this size?

Any advice is appreciated.

Thank You.

(Visual Studio C++)

+2  A: 

It depends on your data and your requirements. What performance requirements do you have? Do you need to finish such a task in X amount of time (where speed is critical), or is this something that will just be done periodically (in which case speed is not essential)?

That said, you will certainly get a cleaner implementation if you separate the work into two tasks: a downloader and a converter. That way each component can be simple and just focus on the task at hand. All things being equal, I recommend this approach.

Otherwise, if you try to download and convert at the same time, you may get into situations where your downloader has data ready but the converter needs more data before it can proceed. There is no reason why your code cannot handle this, but it will make the implementation more complicated and that much more difficult to debug, test, and validate.
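
To give a feel for that coordination, here is a rough sketch (all names invented, not from anyone's actual code) of the kind of bounded buffer a combined downloader/converter would need: the converter blocks until the downloader has produced data, and the downloader blocks when the buffer is full.

```cpp
// bounded_buffer.hpp - sketch of the coordination a combined
// downloader/converter needs: the converter thread blocks until the
// downloader has produced a chunk; the downloader blocks when full.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

class BoundedBuffer {
public:
    explicit BoundedBuffer(std::size_t maxChunks) : max_(maxChunks) {}

    // Called by the downloader thread. An empty chunk can signal end-of-stream.
    void push(std::vector<char> chunk) {
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [&] { return q_.size() < max_; });
        q_.push_back(std::move(chunk));
        notEmpty_.notify_one();
    }

    // Called by the converter thread; blocks until data (or EOF) arrives.
    std::vector<char> pop() {
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [&] { return !q_.empty(); });
        std::vector<char> chunk = std::move(q_.front());
        q_.pop_front();
        notFull_.notify_one();
        return chunk;
    }

private:
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    std::deque<std::vector<char>> q_;
    std::size_t max_;
};
```

All of that machinery disappears in the two-separate-programs design, which is exactly the point.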

Justin Ethier
@Justin: Is having two separate tasks, but having the conversion start after enough has been downloaded, considered "download/convert at the same time"? Thanks.
Tommy
+4  A: 

Without knowing any specifics, I would go with a binary FTP download and then post-process with a separate conversion program. This breaks the process into two distinct and unrelated parts, which would aid in building and debugging the overall system. There's no need to reinvent an FTP system, and lots of potential to optimize the post-processing.

Peter M
+1  A: 

It's usually better to do it as separate processes with no interdependency. If your requirements change in the future you can reuse the pieces, or use them for other projects.

Jay
+3  A: 

To reduce traffic, I would compress the file first and then transfer it. If something goes wrong in the conversion, or you want a different output, it can be redone locally without re-fetching the data.

The main precaution is not to load the whole thing into memory and then convert, but to do it chunk-wise like you said. You can also prevent some unpleasant effects for users of your program by creating/pre-allocating a file of the maximum expected size; this avoids running out of disk space partway through the conversion phase. Also, some filesystems cannot handle files bigger than 2GB or 4GB (FAT32 caps files at 4GB, for example); that would also be caught up front by the pre-allocation trick.
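
A minimal sketch of the chunk-wise conversion with the pre-allocation trick, in standard C++17. The `Record` layout, file names, and 8GB estimate are placeholders for illustration, not the asker's actual format:

```cpp
// chunk_convert.cpp - sketch: chunk-wise binary -> CSV conversion with
// disk pre-allocation. The Record layout below is hypothetical; adapt it
// to the real binary format.
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <vector>

#pragma pack(push, 1)
struct Record {            // hypothetical fixed-size binary record
    std::uint32_t id;
    double        value;
};
#pragma pack(pop)

int main() {
    const char* inPath  = "input.bin";   // the downloaded binary file
    const char* outPath = "output.csv";

    // Pre-allocate the worst-case output size so we fail early if the
    // disk is too small (the "pre-allocation trick").
    const std::uintmax_t maxExpected = 8ull * 1024 * 1024 * 1024; // ~8 GB
    { std::ofstream create(outPath, std::ios::binary); }          // ensure it exists
    std::filesystem::resize_file(outPath, maxExpected);

    std::ifstream in(inPath, std::ios::binary);
    // Open without truncation so the pre-allocated extent is kept.
    std::fstream out(outPath, std::ios::in | std::ios::out | std::ios::binary);

    std::vector<Record> buf(4096);       // convert ~4096 records per chunk
    std::uintmax_t written = 0;
    char line[64];

    while (in.read(reinterpret_cast<char*>(buf.data()),
                   buf.size() * sizeof(Record)) || in.gcount() > 0) {
        std::size_t n = static_cast<std::size_t>(in.gcount()) / sizeof(Record);
        for (std::size_t i = 0; i < n; ++i) {
            int len = std::snprintf(line, sizeof line, "%u,%.17g\n",
                                    static_cast<unsigned>(buf[i].id),
                                    buf[i].value);
            out.write(line, len);
            written += len;
        }
    }
    out.close();
    // Trim the file back down to what was actually written.
    std::filesystem::resize_file(outPath, written);
}
```

Only one chunk's worth of records is ever in memory, regardless of how large the input grows.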

jdehaan
+5  A: 

I would write a program that converts the binary format and outputs to CSV format. This program would read from stdin and write to stdout.
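
A minimal sketch of such a filter; the `Record` layout is invented for illustration, and on Windows the standard input must be switched to binary mode (e.g. with `_setmode`) or binary data will be mangled:

```cpp
// converter.cpp - sketch of a stdin -> stdout binary-to-CSV filter.
// The Record layout is hypothetical; substitute the real format.
#include <cstdint>
#include <cstdio>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

#pragma pack(push, 1)
struct Record {
    std::uint32_t id;
    double        value;
};
#pragma pack(pop)

int main() {
#ifdef _WIN32
    // Stop Windows from translating 0x0A / treating 0x1A as EOF in the input.
    _setmode(_fileno(stdin), _O_BINARY);
#endif
    Record r;
    while (std::fread(&r, sizeof r, 1, stdin) == 1)
        std::printf("%u,%.17g\n", static_cast<unsigned>(r.id), r.value);
    return 0;
}
```

Built this way, the same binary works at the end of a pipe or standalone with redirection (`my_converter_program < data.bin > out.csv`).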

Then I would call

wget URL_of_remote_binary_file --output-document=- | my_converter_program > output_file.csv

That way you can start converting immediately, without first downloading the entire file, and your program doesn't have to deal with networking. You can also run the program on the remote side, assuming it's portable enough.

Nico
@Nico - I am not sure I completely understand how to start converting immediately without downloading the entire file. I like this idea; can you please elaborate?
Tommy
@Tommy, `wget` won't read the entire file before it starts writing; it writes output as soon as it receives a reasonable chunk of the file. The pipe mechanism passes that along to your converter program as soon as it becomes available. This is very typical *nix thinking.
Mark Ransom
@Mark- Is there a way to emulate this in Windows without downloading any new libraries?
Tommy
@Tommy, piping is available in Windows in a command window. You can Google for a Windows wget, or, if downloading a program isn't an option, you can write your own file transfer program using FTP.
Mark Ransom
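
A bare-bones sketch of Mark's "write your own" option using the WinINet API that ships with Windows. The host, credentials, and paths are placeholders, and real code needs fuller error handling:

```cpp
// ftpget.cpp - minimal WinINet FTP download sketch. Link with wininet.lib.
// Host, credentials, and file names below are placeholders.
#include <windows.h>
#include <wininet.h>
#include <cstdio>
#pragma comment(lib, "wininet.lib")

int main() {
    HINTERNET hNet = InternetOpenA("ftpget", INTERNET_OPEN_TYPE_PRECONFIG,
                                   nullptr, nullptr, 0);
    if (!hNet) { std::fprintf(stderr, "InternetOpen failed\n"); return 1; }

    HINTERNET hFtp = InternetConnectA(hNet, "ftp.example.com",
                                      INTERNET_DEFAULT_FTP_PORT,
                                      "user", "password",
                                      INTERNET_SERVICE_FTP, 0, 0);
    if (!hFtp) { InternetCloseHandle(hNet); return 1; }

    // FTP_TRANSFER_TYPE_BINARY is essential: ASCII mode would corrupt the data.
    BOOL ok = FtpGetFileA(hFtp, "/remote/data.bin", "data.bin",
                          FALSE, FILE_ATTRIBUTE_NORMAL,
                          FTP_TRANSFER_TYPE_BINARY, 0);
    if (!ok) std::fprintf(stderr, "FtpGetFile failed: %lu\n", GetLastError());

    InternetCloseHandle(hFtp);
    InternetCloseHandle(hNet);
    return ok ? 0 : 1;
}
```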
+1  A: 

Here are even more guesses about your requirements and possible solutions:

  • Concerned about file integrity? Implement something that includes integrity checks such as sequence numbers, size fields, and checksums/hashes, plus just enough transaction semantics that the system knows whether a transfer completed or didn't (a sketch of a chunked file hash follows this list).
  • Are uploads happening on slow/congested links, where they may be interrupted? Implement a protocol that allows the transfer to resume after interruption.
  • Are uploads recurring, with much of the data unchanged? Implement something amenable to incremental update, so you upload only the differences.
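As one concrete illustration of the checksum idea, here is a sketch that hashes a file in fixed-size chunks with 64-bit FNV-1a, so even an 8GB file never sits in memory; compute it on both ends and compare. FNV-1a only catches accidental corruption; if tampering is a concern, use a cryptographic hash such as SHA-256 instead:

```cpp
// filehash.cpp - sketch: chunked 64-bit FNV-1a hash of a file. Good for
// detecting accidental corruption after a transfer; not cryptographic.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

std::uint64_t fnv1a_file(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1 << 20);              // 1 MB chunks
    std::uint64_t h = 14695981039346656037ull;   // FNV offset basis
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        for (std::streamsize i = 0; i < in.gcount(); ++i) {
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 1099511628211ull;               // FNV prime
        }
    }
    return h;
}

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: filehash <file>\n"); return 1; }
    std::printf("%016llx\n",
                static_cast<unsigned long long>(fnv1a_file(argv[1])));
    return 0;
}
```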
Liudvikas Bukys
Good questions. I have not gotten that far yet, but yes on integrity.
Tommy