I have two Debian boxes connected by a CX4 cable between two 10 GbE cards. One will be generating data very quickly (between 4 Gbit/s and 16 Gbit/s), and the other needs to grab all of it and store it in RAM for later parsing. I'm new to this kind of low-level coding and would happily accept any ideas about what broad approach to use (do I need DMA? RDMA?), or tips and tricks that might apply. Thanks!
The only NICs I've heard of for ordinary PCs that can pull a saturated 10GbE link up to userspace for any kind of post-processing are the ones made by Napatech. You'll have to use their custom API.
And you'd better put such a card in a pretty grown-up server with the bus plumbing to support that kind of speed. (I'd certainly steer away from any kind of NVIDIA chipset for such a box.)
If you want to continuously process 1 GB of traffic a second, you need a very wide bus and a very fast processing rate. My experience comes from NIDS: you need specialized hardware to consistently perform NIDS processing on 100 MB/s (1 Gigabit Ethernet) of data, and 10 Gb is another universe. RAM will not help you, because even at 1 GbE rates you can fill a GB in 5-10 seconds, and 1 GB holds a lot of requests.
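The fill-time estimate above is simple arithmetic, and it is worth redoing for the rates in the question. A minimal sketch, assuming decimal gigabytes and a perfectly sustained line rate (the function name `seconds_to_fill` is made up for illustration):

```python
def seconds_to_fill(ram_bytes, rate_bits_per_s):
    """Time to fill `ram_bytes` of memory at a sustained line rate."""
    return ram_bytes * 8 / rate_bits_per_s

GB = 10**9  # decimal gigabyte, matching how link rates are usually quoted

for label, rate in [("1 GbE", 1e9), ("4 Gbit/s", 4e9), ("10 GbE", 10e9)]:
    print(f"{label:>9}: 1 GB fills in {seconds_to_fill(GB, rate):.1f} s")
```

Multiply by the amount of RAM you can actually dedicate to the capture buffer to see how long a burst the receiver can absorb before something has to drain it.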
If you are trying to do any form of business or web processing at 10 Gb, you probably need to put a load distributor at the front that can keep up with 10 Gb/s of traffic.
P.S. I must clarify that NIDS is 1:1, traffic processed on the machine that sees the traffic -- i.e., worst case, you process every byte on the same machine; whereas business/web processing is 1:many: many machines and an order of magnitude more bytes to process.
-- edit --
Now that you have mentioned that there is a gap between data delivery (no standard 10 Gb NIC can keep up with 10 Gb anyway), we need to know what the processing consists of before we can make suggestions.
-- edit 2 --
Berkeley DB (a database with a simple data model) behaves like an enterprise database (in terms of transaction rate) when you use multiple threads. If you want to write to disk at high rates, you should probably explore this solution. You probably want a RAID setup to boost throughput -- RAID 0+1 is best in terms of I/O throughput and protection.
Well, you're going to need money. One way might be to buy a load-sharing switch to split the incoming data between two computers, then post-process the results into a single database.
Because some aspects of your situation simplify things (a steady point-to-point stream between only two machines, no processing), I would actually try the trivial, obvious method first: a single TCP stream between the systems, writing the data to disk using write(). Then measure the performance, and profile to determine where any bottlenecks are.
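To make the "try the obvious thing and measure it" idea concrete, here is a rough sketch of a single-stream receiver with a throughput measurement. It is not a tuned implementation: it runs sender and receiver on one machine over loopback, stores into a preallocated RAM buffer (as the question asks) rather than calling write() on a file, and the names `receive_into_ram`, `send_mock_data` and `run_demo` as well as the chunk and transfer sizes are assumptions chosen for the demo.

```python
import socket
import threading
import time

CHUNK = 1 << 20      # 1 MiB per recv_into call (assumed; tune by measurement)
TOTAL = 64 * CHUNK   # 64 MiB demo transfer (assumed; a real run would be far larger)

def receive_into_ram(conn, buf):
    """Fill `buf` from `conn` without intermediate copies; return bytes received."""
    view = memoryview(buf)
    got = 0
    while got < len(buf):
        n = conn.recv_into(view[got:got + CHUNK])
        if n == 0:   # sender closed the connection early
            break
        got += n
    return got

def send_mock_data(port, total):
    """Stand-in for the producer box: stream `total` zero bytes over TCP."""
    payload = bytes(CHUNK)
    with socket.create_connection(("127.0.0.1", port)) as s:
        remaining = total
        while remaining > 0:
            n = min(CHUNK, remaining)
            s.sendall(payload[:n])
            remaining -= n

def run_demo(total=TOTAL):
    """Run one loopback transfer; return (bytes received, elapsed seconds)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        srv.listen(1)
        port = srv.getsockname()[1]
        sender = threading.Thread(target=send_mock_data, args=(port, total))
        sender.start()
        conn, _ = srv.accept()
        with conn:
            buf = bytearray(total)   # preallocated destination in RAM
            start = time.perf_counter()
            got = receive_into_ram(conn, buf)
            elapsed = time.perf_counter() - start
        sender.join()
    return got, elapsed

if __name__ == "__main__":
    got, elapsed = run_demo()
    print(f"received {got} bytes at {got * 8 / elapsed / 1e9:.2f} Gbit/s")
```

The loopback number says nothing about the NIC or the bus. The point is the shape of the test: run the sender on the producer box, point it at the consumer's address, and compare the measured rate with what you need before reaching for anything more exotic.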
As a starting point, read about the C10K (10,000 simultaneous connections) problem, which is what most high-performance servers are designed around. It should give you a strong background in high-performance server issues. Of course, you don't need to worry about select / poll / epoll for establishing new connections, which is a major simplification.
Before you plan on any special programming, you should do some testing to see how much you can process with a vanilla system. Set up a mock data file and sending process on the producer machine and a simple accepter/parser on the consumer machine and do a bunch of profiling - where are you going to run into data problems? Can you throw better hardware at it, or can you tweak your processing to be faster?
Be sure you are starting with a hardware platform that can support the data rates you are expecting. If you're working with something like Intel's 82598EB NIC, make sure you've got it plugged into a PCIe 2.0 slot, preferably an x16 slot, in order to get full bandwidth from the NIC to the chipset.
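A quick way to sanity-check a slot against the link rate is to work out its raw bandwidth. The sketch below assumes only the published per-lane signalling rates (2.5 GT/s for PCIe 1.x, 5 GT/s for PCIe 2.0) and their 8b/10b encoding. It ignores packet and protocol overhead, so usable throughput will be lower than these figures. The helper name `slot_gbit_per_s` is made up for illustration.

```python
# Per-lane signalling rates in GT/s. Both generations use 8b/10b encoding,
# so only 8 of every 10 bits on the wire carry data.
GT_PER_LANE = {"PCIe 1.x": 2.5, "PCIe 2.0": 5.0}
ENCODING_EFFICIENCY = 8 / 10

def slot_gbit_per_s(generation, lanes):
    """Raw one-direction bandwidth of a slot, before packet/protocol overhead."""
    return GT_PER_LANE[generation] * ENCODING_EFFICIENCY * lanes

for gen in GT_PER_LANE:
    for lanes in (4, 8, 16):
        print(f"{gen} x{lanes}: {slot_gbit_per_s(gen, lanes):.0f} Gbit/s per direction")
```

Note that a slot's physical size and its electrical lane count can differ, so it is worth checking what the board actually wires up and what the card negotiated.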
There are ways to tune the NIC driver's parameters to your datastream to get the most out of your setup. For example, be sure you are using jumbo frames on the link in order to minimize the TCP overhead. Also, you might play with the driver's interrupt throttle rates to speed the low level handling.
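To see why jumbo frames matter, it helps to put numbers on the per-frame overhead. This is a back-of-the-envelope sketch, assuming IPv4 and TCP headers without options and the standard Ethernet framing costs (preamble, header, FCS, inter-frame gap). The function names are made up for illustration.

```python
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, Ethernet header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20         # IPv4 + TCP headers without options (assumed)

def payload_efficiency(mtu):
    """Fraction of wire time that carries application payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)

def frames_per_second(mtu, line_rate_bits=10e9):
    """Full-size frames per second needed to saturate the link."""
    return line_rate_bits / 8 / (mtu + ETH_OVERHEAD)

for mtu in (1500, 9000):
    print(f"MTU {mtu}: {payload_efficiency(mtu):.1%} payload, "
          f"{frames_per_second(mtu):,.0f} frames/s at 10 Gbit/s")
```

The payload gain is a few percent. The bigger win is usually the frame rate: several times fewer frames per second means fewer interrupts and less per-packet work for the driver and the TCP stack.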
Is the processing for your dataset parallelizable? If you have one task dumping the data into memory, can you set up several more tasks to process chunks of the data simultaneously? This would make good use of multi-core CPUs.
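If the data does split cleanly, the structure could look roughly like this. It is only a sketch: `parse_chunk` is a hypothetical stand-in (it just checksums the chunk), the worker count is an assumption, and a thread pool is used to keep the example short. In CPython, threads only run in parallel when the per-chunk work releases the interpreter lock, so pure-Python parsing would more likely need a process pool or a lower-level language.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

def split_into_chunks(buf, n_chunks):
    """Return zero-copy views covering `buf` in `n_chunks` nearly equal pieces."""
    view = memoryview(buf)
    size = max(1, -(-len(view) // n_chunks))   # ceiling division, never zero
    return [view[i:i + size] for i in range(0, len(view), size)]

def parse_chunk(chunk):
    """Hypothetical stand-in for real parsing: checksum the chunk."""
    return zlib.crc32(chunk)

def parse_parallel(buf, workers=4):
    """Process the chunks concurrently; results come back in chunk order."""
    chunks = split_into_chunks(buf, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_chunk, chunks))

if __name__ == "__main__":
    data = bytes(range(256)) * 4096   # 1 MiB of mock captured data
    print(parse_parallel(data))
```

Real records may straddle chunk boundaries, so the split points usually have to follow the data format rather than fixed byte offsets.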
Lastly, if none of this is enough, use the profiling/timing data that you've gathered to find the parts of the system that you can adjust for better performance. Don't just assume you know where you need to tweak: back it up with real data - you may be surprised.
I think recent Linux kernels can handle 10 Gb traffic from the NIC into the kernel, but I doubt there is an efficient way to copy the data to user space, even on an i7/Xeon 5500 platform.