views:

1162

answers:

7

What's the difference, performance-wise, between reading from a socket 1 byte at a time vs. reading in large chunks?

I have a C++ application that needs to pull pages from a web server and parse the received page line by line. Currently, I'm reading 1 byte at a time until I encounter a CRLF or the max of 1024 bytes is reached.

If reading in large chunks (e.g. 1024 bytes at a time) is a lot better performance-wise, any idea how to keep the behavior I currently have (i.e. being able to store and process one HTML line at a time, up to the CRLF, without consuming the bytes that follow it)?

EDIT:

I can't afford buffers that are too big. I'm on a very tight code budget, as the application runs on an embedded device. I prefer keeping only one fixed-size buffer, preferably one that holds a single HTML line at a time. This keeps my parsing and other processing easy: any time I access the buffer for parsing, I can assume that I'm processing one complete HTML line.

Thanks.

+1  A: 

First and simplest:

cin.getline(buffer,1024);

Second, usually all IO is buffered, so you don't need to worry too much.

Third, CGI process start usually costs much more than input processing (unless it is a huge file), so you may not need to think about it at all.

Artyom
+1  A: 

G'day,

One of the big performance hits of doing it one byte at a time is that execution keeps switching from user mode into kernel mode, over and over. And over. Not efficient at all.

Grabbing one big chunk, typically up to an MTU size, is measurably more efficient.

Why not scan the content into a vector and iterate over that looking out for \n's to separate your input into lines of web input?

HTH

cheers,

Rob Wells
Yes, depending on the number of calls, the relative overhead caused by function calls may actually become significant at some point.
none
+4  A: 

I can't comment on C++, but from other platforms: yes, this can make a big difference, particularly in the number of context switches the code needs to make, and the number of times it needs to worry about the async nature of streams etc.

But the real test is, of course, to profile it. Why not write a basic app that churns through an arbitrary file using both approaches, and test it for some typical files... the effect is usually startling, if the code is IO bound. If the files are small and most of your app runtime is spent processing the data once it is in memory, you aren't likely to notice any difference.
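For a C++ version of such a test, a rough sketch might look like the following. It assumes a POSIX-style read() on a file descriptor; DrainStats and drain are names made up for this example, not part of any library. It reads a descriptor to EOF with a given chunk size and reports how many read() calls were needed and how long it took.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <sys/types.h>  // ssize_t
#include <unistd.h>     // read(); assumes a POSIX-like platform

// Hypothetical result record for one measurement run.
struct DrainStats {
    std::size_t bytes = 0;   // total bytes received
    std::size_t calls = 0;   // number of read() calls, including the final EOF one
    long long micros = 0;    // elapsed wall-clock time in microseconds
};

// Reads fd until EOF (or error) using read() calls of at most `chunk` bytes.
DrainStats drain(int fd, std::size_t chunk) {
    char buf[1024];
    assert(chunk > 0 && chunk <= sizeof buf);

    DrainStats s;
    auto t0 = std::chrono::steady_clock::now();
    for (;;) {
        ssize_t n = ::read(fd, buf, chunk);
        ++s.calls;
        if (n <= 0) break;   // 0 = EOF, <0 = error (not retried in this sketch)
        s.bytes += static_cast<std::size_t>(n);
    }
    auto t1 = std::chrono::steady_clock::now();
    s.micros = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    return s;
}
```

Run it once with chunk set to 1 and once with chunk set to 1024 on the same input and compare calls and micros. The timings depend entirely on the platform, so measure on the target device rather than trusting numbers from a desktop.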

Marc Gravell
+1  A: 

You are not reading one byte at a time from a socket; you are reading one byte at a time from the C/C++ I/O system, which, if you are using CGI, will have already buffered up all the input from the socket. The whole point of buffered I/O is to make the data available to the programmer in a way that is convenient for them to process, so if you want to process one byte at a time, go ahead.

Edit: On reflection, it is not clear from your question whether you are implementing CGI or just using it. You could clarify this by posting a code snippet that shows how you currently read that single byte.

If you are reading the socket directly, then you should simply read the entire response to the GET into a buffer and then process it. This has numerous advantages, including performance and ease of coding.

If you are limited to a small buffer, then use a classic buffering algorithm like:

getbyte:
   if buffer is empty
      fill buffer
      set buffer pointer to start of buffer
   end
   get byte at buffer pointer
   increment pointer
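
A minimal C++ sketch of that pseudocode, assuming a POSIX-style read() on the socket descriptor. The names ByteSource and read_line, and the 512-byte buffer size, are made up for illustration. getbyte() refills the buffer only when it is empty, and read_line() builds on it to hand back one line at a time without consuming the bytes after the line ending.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/types.h>  // ssize_t
#include <unistd.h>     // read(); assumes a POSIX-like socket API

// Hypothetical helper: a small fixed-size read buffer over a descriptor.
struct ByteSource {
    int fd;
    char buf[512];         // size is an arbitrary choice for this sketch
    std::size_t pos = 0;   // next unread byte in buf
    std::size_t len = 0;   // number of valid bytes in buf

    explicit ByteSource(int fd_) : fd(fd_) {}

    // Returns the next byte (0..255), or -1 on EOF or error.
    int getbyte() {
        if (pos == len) {                         // buffer is empty: fill it
            ssize_t n = ::read(fd, buf, sizeof buf);
            if (n <= 0) return -1;
            len = static_cast<std::size_t>(n);
            pos = 0;                              // pointer back to the start
        }
        return static_cast<unsigned char>(buf[pos++]);
    }
};

// Copies one line into `out` without its CRLF or LF and NUL-terminates it.
// Returns the line length, or -1 if EOF is hit before any byte is read.
// A line longer than cap-1 bytes is returned in pieces.
int read_line(ByteSource& src, char* out, std::size_t cap) {
    assert(cap >= 2);
    std::size_t n = 0;
    int c = 0;
    while (n + 1 < cap && (c = src.getbyte()) != -1) {
        if (c == '\n') {
            if (n > 0 && out[n - 1] == '\r') --n;  // drop the CR of a CRLF
            out[n] = '\0';
            return static_cast<int>(n);
        }
        out[n++] = static_cast<char>(c);
    }
    out[n] = '\0';
    return (n == 0 && c == -1) ? -1 : static_cast<int>(n);
}
```

With this, the parsing code keeps seeing one complete line per call, while the socket is only hit once per 512 bytes instead of once per byte. The sketch does not retry interrupted reads; a real version would probably want to.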
anon
Nope. I'm reading from a socket. I'm making an HTTP GET request to the web server and reading the response from a socket. I do this because I need the completely rendered and parsed dynamic content.
teriz
I think I could settle on this algorithm with a little modification. I can have two fixed-size buffers: one to read an entire chunk (say 512 bytes) from the socket, and another to hold a single complete HTML line copied out of the first while scanning it, which my other parsing methods can access easily. That gives me a more efficient socket reading routine while keeping the ease of processing I have right now (i.e. my other methods can assume one complete HTML line). Thanks. =)
teriz
+3  A: 

If you are reading directly from the socket, and not from an intermediate higher-level representation that can be buffered, then without any doubt it is better to read the full 1024 bytes in one go, put them in a buffer in RAM, and then parse the data from RAM.

Why? Reading on a socket is a system call, and it causes a context switch on each read, which is expensive. Read more about it: IBM Tech Lib: Boost socket performances

NicDumZ
+1 - I like your argument on why reading in large chunks is better performance-wise. I think I can settle for Neil Butterworth's answer for resolving my second concern. =)
teriz
A: 

There is no difference at the operating system level; the data are buffered either way. Your application, however, must execute more code to "read" bytes one at a time.

+1  A: 

You can wrap the socket file descriptor in a stdio stream with the fdopen() function. Then you have buffered IO, so you can call fgets() or similar on that stream.
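
A small sketch of that approach, assuming a POSIX platform where fdopen() is available. The helper names open_buffered and next_line are invented for this example. fgets() reads up to and including the newline, so the trailing CRLF is stripped by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdio.h>    // fdopen() is POSIX and declared in <stdio.h>
#include <unistd.h>   // close()

// Hypothetical helper: wraps an already connected descriptor in a
// read-only stdio stream. Returns nullptr (and closes fd) on failure.
FILE* open_buffered(int fd) {
    FILE* in = fdopen(fd, "r");
    if (in == nullptr) close(fd);  // fdopen failed, so fd is still ours to close
    return in;
}

// Reads one line into `line` (at most cap-1 chars) and strips CR/LF.
// Returns false at EOF or on a read error.
bool next_line(FILE* in, char* line, int cap) {
    assert(cap > 1);
    if (fgets(line, cap, in) == nullptr) return false;
    std::size_t n = std::strlen(line);
    while (n > 0 && (line[n - 1] == '\n' || line[n - 1] == '\r')) {
        line[--n] = '\0';
    }
    return true;
}
```

Note that fclose() on the returned stream also closes the underlying descriptor. Whether stdio's internal buffer fits a tight memory budget is something to check on the target; setvbuf() can be used to supply a buffer of your own size.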

codymanix
-1 for suggesting gets().
bk1e
sorry, I meant fgets(), edited my answer now :-(
codymanix
How could you!!
LukeN