views:

2089

answers:

8

I have a problem that requires me to parse several log files from a remote machine. There are a few complications: 1) the files may be in use, 2) the files can be quite large (100 MB+), and 3) each entry may be multi-line.

To solve the in-use issue, I need to copy each file first. I'm currently copying it directly from the remote machine to the local machine and parsing it there. That leads to issue 2: since the files are quite large, copying them locally can take quite a while.

To reduce parsing time, I'd like to make the parser multi-threaded, but that makes dealing with multi-line entries a bit trickier.

The two main issues are: 1) How do I speed up the file transfer? (Compression? Is transferring it locally even necessary? Can I read an in-use file some other way?) 2) How do I deal with multi-line entries when splitting up the lines among threads?

UPDATE: The reason I didn't do the obvious thing and parse on the server is that I want to have as little CPU impact as possible. I don't want to affect the performance of the system I'm testing.

+2  A: 

If you are reading a sequential file, you want to read it line by line over the network. You need a transfer method capable of streaming, so you'll need to review your I/O streaming options to figure this out.

Large I/O operations like this won't benefit much from multithreading, since you can probably process the items as fast as you can read them over the network.

Your other great option is to put the log parser on the server, and download the results.

Wesley Tarle
If copying a 100 MB text file directly over the network takes x seconds, and having a remote client compress and send the file and then deflating/reading it takes x/4 seconds, isn't that worth it? (Note: I don't actually know how long it would take to compress/send/decompress/read.)
midas06
By all means you can (and should) use some compression over the network. Like I said, review your I/O streaming options -- some people have suggested zip libraries. OTOH, if you can put a program on the remote end, do the processing there!
Wesley Tarle
+1  A: 

The easiest way, considering you are already copying the file, would be to compress it before copying and decompress it once copying is complete. You will get huge gains compressing text files because zip algorithms generally work very well on them. Also, your existing parsing logic could be kept intact rather than having to hook it up to a remote network text reader.
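For example, a minimal sketch with the framework's System.IO.Compression.GZipStream (the paths are placeholders, and the compress step assumes something small can run on or near the remote machine):

    using System.IO;
    using System.IO.Compression;

    class LogCompression
    {
        // Run on (or near) the remote machine: compress before the copy.
        public static void Compress(string sourcePath, string gzPath)
        {
            using (var input = new FileStream(sourcePath, FileMode.Open,
                                              FileAccess.Read, FileShare.ReadWrite))
            using (var output = File.Create(gzPath))
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                input.CopyTo(gzip); // text logs typically shrink dramatically
            }
        }

        // Run locally after the copy: decompress, then parse as before.
        public static void Decompress(string gzPath, string destPath)
        {
            using (var input = File.OpenRead(gzPath))
            using (var gzip = new GZipStream(input, CompressionMode.Decompress))
            using (var output = File.Create(destPath))
            {
                gzip.CopyTo(output);
            }
        }
    }

Note the FileShare.ReadWrite on the source stream, which lets you read the log even while it is being written.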

The disadvantage of this method is that you won't be able to get line-by-line updates very efficiently, which are a useful thing to have in a log parser.

Luke
I would love to compress it, but if my code is running on the local machine, it would be compressed after being transferred, which defeats the purpose. I'm thinking I'll end up having to write a client that does nothing but compress and send.
midas06
A: 

I've used SharpZipLib to compress large files before transferring them over the Internet. So that's one option.

Another idea for 1) would be to create an assembly that runs on the remote machine and does the parsing there. You could access the assembly from the local machine using .NET Remoting. The remote assembly would need to be hosted in a Windows service or in IIS. That would allow you to keep your copies of the log files on the same machine, and in theory it would take less time to process them.
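A rough sketch of what that could look like with classic .NET Remoting (the ILogParser contract, port number, and paths are hypothetical):

    using System;
    using System.Runtime.Remoting;
    using System.Runtime.Remoting.Channels;
    using System.Runtime.Remoting.Channels.Tcp;

    // Shared contract, deployed to both machines (hypothetical interface).
    public interface ILogParser
    {
        string[] ParseLog(string path); // return parsed entries, not raw bytes
    }

    // Runs on the remote machine, e.g. inside a Windows service.
    public class LogParser : MarshalByRefObject, ILogParser
    {
        public string[] ParseLog(string path)
        {
            // ... parse the local log file here and return only the results ...
            return new string[0];
        }
    }

    class Server
    {
        static void Main()
        {
            ChannelServices.RegisterChannel(new TcpChannel(9000), false);
            RemotingConfiguration.RegisterWellKnownServiceType(
                typeof(LogParser), "LogParser", WellKnownObjectMode.Singleton);
            Console.ReadLine(); // keep the host alive
        }
    }

    class Client
    {
        // On the local machine: only the parsed results cross the wire.
        static string[] FetchEntries()
        {
            var parser = (ILogParser)Activator.GetObject(
                typeof(ILogParser), "tcp://remotehost:9000/LogParser");
            return parser.ParseLog(@"C:\logs\app.log"); // path is local to the server
        }
    }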

Chris Tybur
A: 

I think using compression (deflate/gzip) would help.

CiNN
+1  A: 

I guess it depends on how "remote" it is. 100 MB on a 100 Mb LAN would be about 8 seconds (100 MB × 8 bits/byte ÷ 100 Mbit/s); up it to gigabit and you'd have it in around 1 second. $50 * 2 for the cards and $100 for a switch would be a very cheap upgrade you could do.

But, assuming it's further away than that, you should be able to open it in read-only mode (since you're already reading it when you copy it). SMB/CIFS supports file block reading, so you should be able to stream the file at that point (of course, you didn't actually say how you were accessing the file; I'm just assuming SMB).

Multithreading won't help, as you'll be disk- or network-bound anyway.

Mark Brackett
+1  A: 

Use compression for transfer.

If your parsing is really slowing you down and you have multiple processors, you can break the parsing job up; you just have to do it in a smart way: have a deterministic algorithm for deciding which workers are responsible for dealing with incomplete records. Assuming you can determine that a line is part of the middle of a record, for example, you could break the file of N lines into N/M segments, each worker responsible for M lines; when one of the workers determines that its record is not finished, it just has to read on until it reaches the end of the record. When one of the workers determines that it's reading a record for which it doesn't have a beginning, it should skip the record.
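A minimal sketch of that scheme (the timestamp-based record-start test and the PLINQ fan-out are my assumptions; substitute your log format's own rule for "this line begins a new record"):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;

    class SegmentedParser
    {
        // Hypothetical rule for "this line begins a new record".
        static readonly Regex RecordStart = new Regex(@"^\d{4}-\d{2}-\d{2} ");

        // Each worker owns exactly the records that *begin* inside its segment.
        static List<string> ParseSegment(string[] lines, int start, int count)
        {
            var records = new List<string>();
            int i = start;

            // Skip a partial record whose beginning belongs to the previous
            // segment (the first segment keeps any leading lines as-is).
            if (start > 0)
                while (i < lines.Length && !RecordStart.IsMatch(lines[i]))
                    i++;

            while (i < lines.Length && i < start + count)
            {
                // Read past the segment boundary until the record ends.
                int end = i + 1;
                while (end < lines.Length && !RecordStart.IsMatch(lines[end]))
                    end++;
                records.Add(string.Join(Environment.NewLine, lines, i, end - i));
                i = end;
            }
            return records;
        }

        static void Main(string[] args)
        {
            // For illustration only; a 100 MB+ file would want chunked reads
            // rather than loading every line into memory at once.
            string[] lines = File.ReadAllLines(args[0]);
            const int segmentSize = 10000; // the "M lines" per worker

            var records = Enumerable
                .Range(0, (lines.Length + segmentSize - 1) / segmentSize)
                .AsParallel().AsOrdered()
                .SelectMany(s => ParseSegment(lines, s * segmentSize, segmentSize))
                .ToList();
        }
    }

Because every record is handled by the one worker whose segment contains its first line, no record is parsed twice and none is dropped at a boundary.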

SquareCog
+1  A: 

The better option, from a performance perspective, is to perform your parsing at the remote server. Apart from exceptional circumstances, the speed of your network is always going to be the bottleneck, so limiting the amount of data that you send over the network will greatly improve performance.

This is one of the reasons that so many databases use stored procedures that are run at the server end.

Improvements in parsing speed (if any) through the use of multithreading are going to be swamped by the time spent on the network transfer.

If you're committed to transferring your files before parsing them, an option you could consider is on-the-fly compression during the file transfer. There are, for example, SFTP servers available that will compress on the fly. At the local end you could use something like libcurl to do the client side of the transfer, which also supports on-the-fly decompression.

Andrew Edgecombe
+1  A: 

If you can copy the file, you can read it. So there's no need to copy it in the first place.

EDIT: use the FileStream class to have more control over the access and sharing modes.

new FileStream("logfile", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)

should do the trick.
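For instance, a small sketch that streams an in-use log line by line (the UNC path is a placeholder):

    using System.IO;

    class InUseReader
    {
        static void ReadLog()
        {
            // FileShare.ReadWrite lets the logger keep appending while we read.
            using (var stream = new FileStream(@"\\server\share\app.log",
                                               FileMode.Open, FileAccess.Read,
                                               FileShare.ReadWrite))
            using (var reader = new StreamReader(stream))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // feed each line to the parser as it streams in
                }
            }
        }
    }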

VVS
I beg to differ there. It's been my experience that copying an in-use file will work when attempting to parse through it in a stream will not. My theory is that copy uses some other Windows API that allows it.
midas06
Your theory is wrong, imho. Windows Explorer uses the same API that .NET (and FileStream) uses. Did you try it?
VVS