tags:
views: 249
answers: 2

Tcpdump logs are binary files. Which Hadoop FileInputFormat should I use to split the input data into chunks? Please help!

+2  A: 

There was a thread on the user list about this: http://hadoop.markmail.org/search/list:org%2Eapache%2Ehadoop%2Ecore-user+pcap+order:date-forward

Basically, the format is not splittable because you can't locate the start of a record at an arbitrary offset in the file. So you have to do some preprocessing, inserting sync points or something similar. Maybe convert the smaller files into SequenceFiles, and then merge the small SequenceFiles?
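One way to do that preprocessing is to pack each small pcap file whole into a SequenceFile, keyed by file name, so map tasks can later parse each capture in memory. A minimal sketch, assuming a local input directory and an HDFS output path passed as arguments (the class name and layout are made up for illustration):

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Packs a directory of small pcap files into a single SequenceFile:
 * key = original file name, value = the raw pcap bytes.
 */
public class PcapToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[1]);

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (File pcap : new File(args[0]).listFiles()) {
                // Each record holds one complete pcap file, so no record
                // ever starts at an arbitrary offset.
                byte[] data = Files.readAllBytes(pcap.toPath());
                writer.append(new Text(pcap.getName()), new BytesWritable(data));
            }
        } finally {
            writer.close();
        }
    }
}
```

SequenceFiles are splittable on their own sync markers, so once the data is in this form the framework can parallelize over it normally.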

If you wind up writing something reusable, please consider contributing back to the project.

SquareCog
+1  A: 

Write an InputFormat that reads PCAP files, returning something like a LongWritable key (the index of the packet within the file) and a PacketWritable value (containing the packet data). For the InputSplit you can use FileSplit, or MultiFileSplit for better performance, since an individual PCAP file can be read surprisingly quickly.

Unless your HDFS block size is larger than your pcap files, you will experience a lot of network IO, because an unsplittable file that spans several blocks has to be pulled from other nodes...
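A minimal sketch of such an InputFormat using the old mapred API, with BytesWritable standing in for a custom PacketWritable. The 24-byte global header and 16-byte per-packet header are standard pcap; a real reader should check the file's magic number for byte order instead of assuming a little-endian capture as this sketch does:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/** Reads whole pcap files, one record per packet; files are never split. */
public class PcapInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;   // pcap has no sync markers, so never split a file
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new PcapRecordReader((FileSplit) split, job);
    }

    static class PcapRecordReader implements RecordReader<LongWritable, BytesWritable> {
        private final FSDataInputStream in;
        private final long end;
        private long packetIndex = 0;

        PcapRecordReader(FileSplit split, JobConf job) throws IOException {
            Path path = split.getPath();
            in = path.getFileSystem(job).open(path);
            end = split.getLength();
            in.skipBytes(24);                     // skip the pcap global header
        }

        public boolean next(LongWritable key, BytesWritable value) throws IOException {
            if (in.getPos() >= end) return false;
            in.skipBytes(8);                      // ts_sec + ts_usec
            int capLen = Integer.reverseBytes(in.readInt()); // assumes little-endian file
            in.skipBytes(4);                      // orig_len
            byte[] buf = new byte[capLen];
            in.readFully(buf);
            key.set(packetIndex++);
            value.set(buf, 0, capLen);
            return true;
        }

        public LongWritable createKey() { return new LongWritable(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() throws IOException { return in.getPos(); }
        public float getProgress() throws IOException {
            return end == 0 ? 1.0f : in.getPos() / (float) end;
        }
        public void close() throws IOException { in.close(); }
    }
}
```

Because isSplitable returns false, each map task reads one complete file from its first byte, which is why keeping pcap files smaller than a block (or packing them as described above) matters for data locality.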

jonathan-stafford