views:

22

answers:

1

I'm trying to split packet traces into individual flows. These are the approaches I've tried so far:

1) Make a hash with the source IP/port and destination IP/port as the key. Each entry in the hash is a list of packets. The hash is then saved to a file, with each flow separated by some special characters/line (see the sketch after this list). Problem: Not enough memory for large traces.

2) Make a hash with the same key as above, but only keep the file handles in memory. Each packet is then written through hash[key], which points to the right file. Problems: Too many flows/files (~200k), and it might run out of memory as well.

3) Hash the source IP/port and destination IP/port, then write the info to a file. The difference between 2 and 3 is that here the files are opened and closed for each operation, so I don't have to worry about running out of memory from having too many open at once. Problems: WAY too slow, and the same number of files as in 2, so also impractical.

4) Make a hash of the source IP/port pairs and then iterate over the whole trace once per flow, taking the packets that belong to that flow and writing them to the output file. Problem: with a 60 MB trace containing 200k flows, I would process the 60 MB file 200k times. Removing packets as I go might make it less painful, but so far I'm not sure that would be a good solution.

5) Split the packets by source/destination IP and create a single file for each pair, separating the flows with special characters. Still too many files (50k+).
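For reference, a stripped-down sketch of approach 1, assuming a hypothetical input format where each line is one packet and the first four whitespace-separated fields are source IP, source port, destination IP and destination port (the real format may differ):

    # Group packets by flow key in memory, then dump everything to one file.
    flows = Hash.new { |h, k| h[k] = [] }

    File.foreach("filtered_trace.txt") do |line|
      src_ip, src_port, dst_ip, dst_port = line.split
      flows[[src_ip, src_port, dst_ip, dst_port]] << line
    end

    File.open("flows.txt", "w") do |out|
      flows.each do |key, packets|
        out.puts "### FLOW #{key.join(' ')}"   # separator line between flows
        packets.each { |p| out.puts p }
      end
    end

The flows hash is what blows up on large traces, since every packet of every flow stays in memory until the very end.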

Right now I'm using Ruby to do this, which may have been a bad idea. I've already filtered the traces with tshark so that they only contain the relevant info, so I can't really make them any smaller.
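The tshark filtering I mean is along these lines (the fields shown are only an example of what "relevant info" could be, not necessarily the exact ones I keep):

    tshark -r trace.pcap -T fields -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e frame.len > filtered_trace.txt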

I thought about loading everything into memory as described in 1) using C#/Java/C++, but I was wondering whether there is a better approach, especially since even with a more efficient language I might run out of memory later if I have to use larger traces.

In summary, the problem I'm facing is that I either end up with too many files or run out of memory.

I've also searched for a tool to do this filtering, but I don't think one exists. The ones I've found only report statistics and don't extract every flow the way I need.

+1  A: 

Given your scenario, I might write the traces to files, but use an LRU (least-recently-used) caching mechanism to keep a limited number of files open at one time. If you need to access a file that isn't currently open, close the file that has gone the longest without activity, and open the one you need.

You may need to tune the number of files in your LRU cache in order to get the best performance. This technique will work especially well if you have a large number of short-lived connections.
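In Ruby that could look something like the sketch below (a minimal illustration, not a drop-in solution: the FlowWriter name, the key format, and the cache size of 256 open handles are placeholders, and it assumes Ruby 1.9+, where hashes preserve insertion order, which gives LRU ordering cheaply):

    class FlowWriter
      def initialize(dir, max_open = 256)
        @dir = dir
        @max_open = max_open
        @handles = {}                        # flow key => open File, in LRU order
      end

      def write(key, packet_line)
        handle_for(key).puts(packet_line)
      end

      def close_all
        @handles.each_value(&:close)
        @handles.clear
      end

      private

      def handle_for(key)
        if (h = @handles.delete(key))        # hit: re-insert to mark as most recent
          return @handles[key] = h
        end
        if @handles.size >= @max_open        # miss with a full cache: evict the LRU handle
          _key, oldest = @handles.shift
          oldest.close
        end
        # append mode, so a flow's file can be reopened after eviction
        @handles[key] = File.open(File.join(@dir, "#{key}.txt"), "a")
      end
    end

    # Usage sketch: the key could be e.g. "srcip_srcport_dstip_dstport"
    # writer = FlowWriter.new("flows", 256)
    # writer.write("10.0.0.1_1234_10.0.0.2_80", line)
    # writer.close_all

Since evicted files are reopened in append mode, a flow whose handle was closed keeps accumulating packets in the same file, so no data is lost as the cache cycles.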

Greg Hewgill