Hello all!

This question is the result of two other questions I've asked in the last few days.
I'm creating a new question because I think it relates to the "next step" in my understanding of how to control the flow of my sends and receives, something I haven't gotten a full answer to yet.
The other related questions are:
http://stackoverflow.com/questions/3028376/an-iocp-documentation-interpretation-question-buffer-ownership-ambiguity
http://stackoverflow.com/questions/3028998/non-blocking-tcp-buffer-issues

In summary, I'm using Windows I/O Completion Ports.
I have several threads that process notifications from the completion port.
I believe the question is platform-independent and would have the same answer for doing the same thing on a *nix, *BSD, or Solaris system.

So, I need to have my own flow control system. Fine.
So I send and send and send, a lot. How do I know when to start queueing the sends, given that the receiver side can only accept a limited amount?

Let's take an example (closest thing to my question): FTP protocol.
I have two servers; One is on a 100Mb link and the other is on a 10Mb link.
I order the 100Mb one to send a 1GB file to the other one (on the 10Mb link). It finishes with an average transfer rate of 1.25MB/s.
How did the sender (the one on the 100Mb link) know when to hold back its sending, so the slower one wouldn't be flooded? (In this case the "to-be-sent" queue is the actual file on the hard disk.)

Another way to ask this:
Can I get a "hold-your-sendings" notification from the remote side? Is it built into TCP, or does the so-called "reliable network protocol" require me to do it myself?

I could of course limit my sends to a fixed number of bytes, but that simply doesn't sound right to me.

Again, I have a loop with many sends to a remote server, and at some point within that loop I'll have to determine whether I should queue that send or pass it on to the transport layer (TCP).
How do I do that? What would you do? Of course, when I get a completion notification from IOCP that a send is done, I'll issue other pending sends; that's clear.

Another design question related to this:
Since I am using custom buffers with a send queue, and these buffers are freed for reuse (rather than released with the "delete" keyword) when a "send-done" notification arrives, I'll have to use mutual exclusion on that buffer pool.
Using a mutex slows things down, so I've been thinking: why not have each thread keep its own buffer pool? Accessing it, at least when getting the buffers required for a send operation, would then require no mutex, because the pool belongs to that thread only.
The buffer pool would live in thread-local storage (TLS).
No shared pool implies no lock needed, which implies faster operations, BUT it also implies more memory used by the app: even if one thread has already allocated 1000 buffers, another thread that is sending right now and needs 1000 buffers will have to allocate its own.
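To make the trade-off concrete, here is a minimal sketch of the shared, mutex-protected variant I'm describing. All names are illustrative, not from my actual code, and the buffers are plain std::vector<char> rather than OVERLAPPED-aware structures:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// A shared pool of fixed-size send buffers, guarded by one mutex.
// acquire() is called before posting a send; release() is called
// from the "send-done" completion handler so the buffer is recycled
// instead of being deleted.
class BufferPool {
public:
    explicit BufferPool(std::size_t bufferSize) : bufferSize_(bufferSize) {}

    // Hand out a recycled buffer, or allocate a fresh one if the pool is empty.
    std::vector<char> acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!free_.empty()) {
            std::vector<char> buf = std::move(free_.back());
            free_.pop_back();
            return buf;
        }
        return std::vector<char>(bufferSize_);
    }

    // Return a buffer to the pool once its send has completed.
    void release(std::vector<char> buf) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(std::move(buf));
    }

    std::size_t freeCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    std::size_t bufferSize_;
    mutable std::mutex mutex_;
    std::vector<std::vector<char>> free_;
};
```

The per-thread variant would be the same class minus the mutex, held in a thread_local instance; that is exactly where the duplicate-allocation cost described above comes from.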

Another issue:
Say I have buffers A, B, C in the "to-be-sent" queue.
Then I get a completion notification telling me that the receiver got 10 out of 15 bytes. Should I re-send from the relative offset within the buffer, or will TCP handle it for me, i.e., complete the sending? And if I should, can I be sure that this buffer is the "next-to-be-sent" one in the queue, or could it be buffer B, for example?

This is a long question and I hope none got hurt (:

I'd love to see someone take the time to answer here. I promise I'll double-vote for them! (:
Thank you all!

+1  A: 

Q1. Most APIs will give you a "write is possible" event after your last write, once writing becomes available again (this can happen immediately if you failed to fill the major part of the send buffer with your last send).

With a completion port, it will arrive just like a "new data" event. Think of "new data" as "read OK"; there's also a "write OK" event. The names differ between APIs.

Q2. If a kernel-mode transition for mutex acquisition per chunk of data hurts you, I recommend rethinking what you are doing. It takes 3 microseconds at most, while your thread scheduler's slice may be as long as 60 milliseconds on Windows.

It may hurt in extreme cases. If you think you are programming extreme communications, please ask again, and I promise to tell you all about it.

Pavel Radzivilovsky
Haha - yes, I'm programming extreme communications here. Very extreme. OK, let's leave the "pool-per-thread" thing aside for a moment. I still don't get the first part of your answer, though. Speaking Windows-IOCP-specific, I get notifications of the "write done" and "read done" type. It's called the "proactor" pattern. So, since I don't get a "ready-to-write" notification, how should I know when I've pushed the limits of the remote side's capacity to receive data?
Poni
Berkeley sockets, for instance, give you "buffer ready". IOCP, as far as I remember (and I don't remember that well), would send you 'write done' whenever the socket's send buffer had been loaded, so you can schedule another transfer to the send buffer, which will not start until the buffer reaches some low watermark. The copy will always finish soon after it is started. So, you should wait for the event and send practically as much as you like. Sizes between 2 and 8 KB sound about optimal in terms of the number of events per amount of data.
Pavel Radzivilovsky
About extreme communications. A few questions. First and foremost, why multithreaded design? Second - what are you exactly doing? What are your bandwidth and latency goals, and latency probability distribution you are trying to achieve and why? How many users do you have and what are the assumptions on router capabilities? (Like, identification of TCP sockets for proper balancing, etc). How is the data generated, and whether it is on the same thread?
Pavel Radzivilovsky
Regarding your first comment: Let's see if I understood you correctly; you're saying I should pass one packet at a time to the transport layer (TCP), and send the next one only after getting a "write done" notification?
Poni
As for your second comment: Multi-threaded design because I need to handle hundreds/thousands of clients. Each might ask for a different thing. Other threads might send stuff (regardless of whether the clients asked for it or not; see the following text). I have a game server broken into two types of servers: a terminal server, where players actually connect, and a "game server", which the terminal connects to and which passes actions from/to players (it sits between the player at home and the game server that holds the game state).
Poni
What really concerns me is the communication between the terminal and the game servers. They're connected through a switch, with a 100Mb link each. Now, the game is event-driven, both by the sockets and by a thread pool that might change the game state. So, for example, a player performs action X. The game's state changes, so the game server will notify everyone else in the game about that event. This means that the game server will send many packets to the terminal server, which will send these packets on to the relevant players.
Poni
Now imagine I have 5 game servers and 30 terminal servers. This is a lot of traffic, thus I'd say extreme.
Poni
IOCP-based async I/O on Windows doesn't have a 'you can write' notification, just a 'your write has been dealt with and I don't need your data buffers anymore' notification; hence the reason for the question.
Len Holgate
@Poni: Instead of threading, the best way to handle "extreme" networking is to use non-blocking I/O exclusively, and then use multiple processes to get the benefits of extra processors.
kyoryu
[multithreading] It's not convincing. Your application is either IO-bound or CPU-bound. If it is indeed IO-bound, you will not gain from multithreading. On the contrary, the thread scheduler is bound to do a worse job than what you would have done manually. You'd rather design a heartbeat cycle in one thread, which would see which sockets are writable and carry out the respective actions. No sync, no locks, better performance.
Pavel Radzivilovsky
[event processing] Yes, though 'packet' is not the right word. The TCP API does not talk in terms of packets. Perhaps I should clarify, due to k's input. The "end of write" event has a meaning. It's not intuitive for TCP communications, which underneath have windowing, acknowledgements and slow start. In the case you described, it was meant for one thing: the socket is ready for more, and "now it's the optimal timing". It does not indicate acknowledgement of the data by the other side (unfortunately you cannot obtain this info at all), or a TCP window empty/freeze. It is for the exact applicative purpose
Pavel Radzivilovsky
A) I think you meant that if my app is IO-bound I WILL gain from multi-threading, right? B) I know the term "packet" is not the right word here, yet I think you got my idea. C) Can you explain the "No sync, no locks, better performance" line? I can't really get this one. Thank you Pavel, I really appreciate your support!
Poni
A) No. Multi-threading is only useful if you are limited by CPU load. Then, you use multi-threading to split the load between different CPUs. In other cases, such as when you are limited by IO, it is considered harmful to performance, mainly because of the synchronization tools - but not only.
Pavel Radzivilovsky
+1  A: 

To address your question about how it knew to slow down: you seem to lack an understanding of TCP's congestion mechanisms. "Slow start" is what you're talking about, but it's not quite how you've worded it. Slow start is exactly that -- it starts off slow and gets faster, up to as fast as the other end is willing to go, or wire-line speed, whichever comes first.

With respect to the rest of your question, Pavel's answer should suffice.

jer
Hi jer, see my comment to Pavel's answer. How would you tell what is fast, and when you can push more data to the remote side?
Poni
Whilst you don't need to worry about these issues with blocking socket calls, as they simply block when the stack doesn't want to accept more data, you DO need to be aware of the issues when using async calls, since the OS lets you post "any number" of async writes... The writes will take longer and longer to complete as the TCP stack deals with congestion and flow control, and whilst the writes are pending you're using up resources.
Len Holgate
+1  A: 

Firstly: I'd ask this as separate questions. You're more likely to get answers that way.

I've spoken about most of this on my blog: http://www.lenholgate.com but then since you've already emailed me to say that you read my blog you know that...

The TCP flow control issue is that, since you are posting asynchronous writes, each of these uses resources until it completes. During the time that a write is pending there are various resource usage issues to be aware of, and the use of your data buffer is the least important of them; you'll also use up some non-paged pool, which is a finite resource (though there is much more available in Vista and later than in previous operating systems), and you'll also be locking pages in memory for the duration of the write, and there's a limit to the total number of pages that the OS can lock. Note that neither the non-paged pool usage nor the page locking issue is documented very well anywhere, but you'll start seeing writes fail with ENOBUFS once you hit them.

Due to these issues it's not wise to have an uncontrolled number of writes pending. If you are sending a large amount of data and you have no application-level flow control, then you need to be aware that if you send data faster than it can be processed by the other end of the connection, or faster than the link speed, then you will begin to use up lots and lots of the above resources, as your writes take longer to complete due to TCP flow control and windowing issues. You don't get these problems with blocking socket code, as the write calls simply block when the TCP stack can't write any more due to flow control; with async writes, the writes complete and are then left pending. With blocking code the blocking deals with your flow control for you; with async writes you could continue to loop and post more and more data, all of which is just waiting to be sent by the TCP stack...

Anyway, because of this, with async I/O on Windows you should ALWAYS have some form of explicit flow control. So, either you add application-level flow control to your protocol, using an ACK perhaps, so that you know when the data has reached the other side and only allow a certain amount to be outstanding at any one time, OR, if you can't add to the application-level protocol, you can drive things by using your write completions. The trick is to allow a certain number of outstanding write completions per connection and to queue the data (or just not generate it) once you have reached your limit. Then, as each write completes, you can generate a new write...
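That completion-driven scheme can be sketched as a small per-connection class: at most maxPending writes in flight, everything else queued, and each completion either frees a slot or promotes the next queued buffer. This is a minimal illustration with hypothetical names, separated from any actual WSASend/IOCP plumbing:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Per-connection send throttle: bounds the number of outstanding
// async writes and queues any excess until a completion frees a slot.
class SendQueue {
public:
    explicit SendQueue(std::size_t maxPending) : maxPending_(maxPending) {}

    // Returns true if the caller may post the async write immediately;
    // otherwise the data is queued until a completion frees a slot.
    bool trySend(std::string data) {
        if (pending_ < maxPending_) {
            ++pending_;
            return true;
        }
        queued_.push_back(std::move(data));
        return false;
    }

    // Called from the write-completion handler. If a queued buffer is
    // promoted, it is returned through 'next' and must be posted now.
    bool onWriteComplete(std::string& next) {
        --pending_;
        if (queued_.empty())
            return false;
        next = std::move(queued_.front());
        queued_.pop_front();
        ++pending_;
        return true;
    }

    std::size_t pending() const { return pending_; }
    std::size_t queuedCount() const { return queued_.size(); }

private:
    std::size_t maxPending_;
    std::size_t pending_ = 0;
    std::deque<std::string> queued_;
};
```

In a real server the queueing would need to be made thread-safe and tied to the socket's lifetime; the point here is only the shape of the flow: sends enter through trySend, and completions pump the queue.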

Your question about pooling the data buffers is, IMHO, premature optimisation on your part right now. Get to the point where your system is working properly and you have profiled it and found that contention on your buffer pool is the most important hot spot, and THEN address it. I found that per-thread buffer pools didn't work so well, as the distribution of allocations and frees across threads tends not to be as balanced as you'd need for that to work. I've spoken about this more on my blog: http://www.lenholgate.com/archives/000903.html

Your question about partial write completions (you send 100 bytes and the completion comes back and says that you have only sent 95) isn't really a problem in practice, IMHO. If you get into this position and have more than the one outstanding write then there's nothing you can do; the subsequent writes may well work and you'll have bytes missing from what you expected to send. BUT a) I've never seen this happen unless you have already hit the resource problems that I detail above, and b) there's nothing you can do if you have already posted more writes on that connection, so simply abort the connection. Note that this is why I always profile my networking systems on the hardware that they will run on, and I tend to place limits in MY code to prevent the OS resource limits ever being reached (bad drivers on pre-Vista operating systems would often blue-screen the box if they couldn't get non-paged pool, so you can bring a box down if you don't pay careful attention to these details).
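For the special case where you keep at most one write outstanding per connection (the answer above rightly says there is no safe recovery once several writes are pending), a partial completion can be resumed by tracking an offset and reposting the remainder. A minimal sketch, with hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Tracks one in-flight send so a partial completion can be resumed.
// Only valid if no other write is outstanding on the same connection;
// otherwise the byte stream is already interleaved incorrectly.
struct PendingSend {
    std::string data;
    std::size_t offset = 0;   // bytes the completions have confirmed so far
};

// Called from the completion handler with the reported byte count.
// Returns true when the whole buffer has gone out; returns false when
// a follow-up write must be posted from data.data() + offset for the
// remaining data.size() - offset bytes.
bool onPartialCompletion(PendingSend& send, std::size_t bytesTransferred) {
    send.offset += bytesTransferred;
    return send.offset >= send.data.size();
}
```

This is only a model of the bookkeeping; the actual repost would be another async write over the remaining slice of the same buffer.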

Separate questions next time, please.

Len Holgate
Thank you Len for the feedback - it means a lot to me! I'm having a hard time getting the whole idea, as this seems like a real Pandora's box (: . I think you meant http://www.lenholgate.com/archives/000903.html, regarding the per-thread pool idea, right? Now let's see if I've got you right; since I have a server that will generate a lot of traffic yet is bound by the link, I basically have two options: 1 - Use the blocking send() call (I'm deterred from this, and the test I've just made shows I'd better stay away from it).
Poni
2 - Better, use a "to-send queue". When a write is done for one socket I'll post the next buffer to it to be sent. The question here is, how much should I post at a time? A fixed number of buffers doesn't sound right.
Poni
In other words, just to clarify, you're basically saying that there's no generic configuration. OK. Then again it brings me back to the question: how do I know I've maxed the pipe with data? Send the buffers one by one? That's one thing I don't get (silly me!). And of course, as for my specific application, which generates many packets - are you saying I'll have to throttle the generation of "to-be-sent" buffers, because it's not healthy to lock so much memory for them - will I have to block my sending loop (the one that queues) somehow? How do you suggest I do that, in case you agree with me so far?
Poni
Simply spin around the m_buffers_pool.get() call until I get a buffer, and only then post it to the queue? ....... yup, many questions arise, yet once this is solved the sky is the limit.. And of course - as for "Separate questions next time, please" - note taken! (:
Poni
Correction to my first comment: you meant this page: http://www.lenholgate.com/archives/000898.html .
Poni
You know that you have maxed out the connection to the peer when your write completions start to take longer to occur. You decide how many writes you want outstanding and never post more until some of the outstanding ones complete. The pipe then stays full. You drive your sending from the completions of the writes.
Len Holgate