Hmm, an interesting problem. I hope you like reading... :-)
I'd be interested to know how the monitoring tool would be used. At
work, the sysadmins just have a couple of large screens in the room,
showing a webpage full of network stats that constantly updates.
The rest of my description assumes the network monitoring tool would be
used as described above. If you just want to be able to do an ad-hoc
test between two random hosts on your network, I'd just use rsync to
transfer a reasonably large file (about 1-2 MB). I'm sure there are
other file transfer tools that calculate the transfer speed too.
When implementing this (especially within a large network), you must
minimise the risk that the test floods the network, hampering the people
(or programs) actually using it. You don't want to be blamed for a
massive slowdown (or worse, an outage) just because you were conducting
a test. Your sysadmins won't thank you...
I'd architect the tool in the following way:
Bob is a server which participates in an individual 'test' by doing
the following:
- Bob receives a request from a client. The request states how much data the client is about to send.
- If the amount of data proposed to be sent is not too large, wait for the data. Otherwise Bob rejects the request immediately and ends the communication.
- Once the required number of bytes has been received, reply with the amount of time it took to receive it all. Bob terminates the communication.
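To make Bob's side concrete, here's a minimal sketch in Python. The wire format (an ASCII byte count on one line, an 'OK'/'TOO_BIG' reply, then the elapsed seconds), the port number and the 2 MB limit are all my own assumptions, not requirements:

```python
import socket
import time

MAX_BYTES = 2 * 1024 * 1024  # reject anything larger than 2 MB

def handle_client(conn):
    # Read the request line: the number of bytes the client wants to send.
    request = b""
    while not request.endswith(b"\n"):
        chunk = conn.recv(1)
        if not chunk:
            return
        request += chunk
    size = int(request.strip())

    # Reject over-sized requests immediately, as described above.
    if size > MAX_BYTES:
        conn.sendall(b"TOO_BIG\n")
        return
    conn.sendall(b"OK\n")

    # Receive exactly `size` bytes, timing the transfer.
    start = time.monotonic()
    remaining = size
    while remaining > 0:
        chunk = conn.recv(min(65536, remaining))
        if not chunk:
            return  # client went away mid-transfer
        remaining -= len(chunk)
    elapsed = time.monotonic() - start

    # Reply with how long the transfer took, then we're done.
    conn.sendall(f"{elapsed:.6f}\n".encode())

def serve(host="0.0.0.0", port=9999):
    with socket.create_server((host, port)) as srv:
        while True:
            conn, _addr = srv.accept()
            with conn:
                handle_client(conn)
```

This handles one client at a time, which is arguably a feature here: it stops several Alices from hammering the same Bob simultaneously.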
Alice is the component that displays the result of the measurements
taken (via a webpage or otherwise). Alice is a long lived process
(maybe a web server), configured to periodically connect to a list of
Bob servers. For each configured Bob:
- Send Bob a request with the amount of data Alice is about to
send.
- Send Bob the specified amount of data, as fast as possible.
- Await the reply from Bob, and compute the network speed.
- 'Display' the result for this instance of Bob. You may choose
to display an aggregate result instead; for example, the average over
the last 20 tests, to iron out any anomalies...
When conducting a given test, Alice should report any failures. E.g.
'a TCP connection could not be established with Bob', or 'Bob
prematurely terminated the transfer' or whatever else...
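A single test on Alice's side might look like this in Python, assuming the same hypothetical line protocol (send the byte count, expect 'OK', send the payload, read the elapsed seconds back); the function names are made up:

```python
import socket

def _read_line(conn):
    # Read up to and including a newline; Bob's replies are one line each.
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = conn.recv(1)
        if not chunk:
            raise RuntimeError("Bob prematurely terminated the transfer")
        buf += chunk
    return buf.strip()

def run_test(host, port, size, timeout=10.0):
    """Run one test against a Bob; return the measured speed in bytes/sec."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(f"{size}\n".encode())
            status = _read_line(conn)
            if status != b"OK":
                raise RuntimeError(f"Bob rejected the request: {status!r}")
            conn.sendall(b"\x00" * size)  # the payload contents don't matter
            elapsed = float(_read_line(conn))
            return size / elapsed
    except OSError as exc:
        # Covers 'a TCP connection could not be established' and friends.
        raise RuntimeError(f"test against {host}:{port} failed: {exc}") from exc
```

The caller (Alice's scheduling loop) would catch RuntimeError and report the failure rather than a speed.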
Scatter Bob servers to strategic locations in your (possibly large)
network, and configure Alice to talk to them. For each instance of Bob,
you should configure:
- The time interval between tests.
- The 'leeway' (I'll explain this in a bit).
- The amount of data to send to Bob for each test.
- Bob's address (duh).
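As a sketch, Alice's per-Bob configuration could be as simple as a list of records like the following (the field names and hostnames are all made up):

```python
# One entry per Bob that this Alice should test against.
BOBS = [
    {
        "address": ("bob1.dc1.example.com", 9999),
        "interval_seconds": 600,   # run a test every 10 minutes...
        "leeway_seconds": 60,      # ...give or take 1 minute
        "test_bytes": 500 * 1024,  # 500 KB per test
    },
    {
        "address": ("bob2.dc2.example.com", 9999),
        "interval_seconds": 300,
        "leeway_seconds": 30,
        "test_bytes": 500 * 1024,
    },
]
```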
You want to 'stagger' the tests that a given Alice will attempt. You
don't want Alice to trigger the test to all Bob servers at once, thereby
flooding your network, possibly giving skewed results and so forth.
Allow the test to occur at a randomised time in the future. For
example, if the test interval is every 10 minutes, configure a 'leeway'
of 1 minute, meaning the next test might occur anywhere between 9 and 11
minutes' time.
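The staggering described above boils down to a one-liner; with a 10-minute interval and 1 minute of leeway, it yields a delay anywhere between 9 and 11 minutes:

```python
import random

def next_test_delay(interval_seconds, leeway_seconds):
    # The next test fires at interval +/- a uniformly random leeway,
    # so tests against different Bobs drift apart instead of piling up.
    return interval_seconds + random.uniform(-leeway_seconds, leeway_seconds)
```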
If there is to be more than one Alice running at a time, the total
number of instances should be small. The more Alices you have, the more
you interfere with the network. Again, you don't want to be responsible
for an outage.
The amount of data Alice should send in an individual test should be
small. 500KB? You probably want a given test to run for no more than
10 seconds. Maybe get Bob to time out if the test takes too long.
I've deliberately omitted the transport to use (TCP, UDP, whatever)
because you'll get issues depending on the transport, and I don't know
how you want to handle those issues. For example, you'd have to
consider how to handle dropped datagrams with UDP. What result would
you compute? You don't get this issue with TCP, because it
automatically retransmits dropped packets. With TCP, though, your
measured throughput will be artificially low if the two endpoints are
far away from each other, because the window size caps how much
unacknowledged data can be in flight at once; look up 'bandwidth-delay
product' if you want more info on that.
If you had the patience to read this far, I hope it helped!