tags: 
views: 260
answers: 2

I'm using Hadoop for data processing with Python; what file format should I use?

I have a project with a substantial number of text pages.

Each text file has some header information that I need to preserve during the processing; however, I don't want the headers to interfere with the clustering algorithms.

I'm using Python on Hadoop (or is there a sub-package better suited?)

How should I format my text files and store them in Hadoop for processing?

+2  A: 

1) Files

If you use Hadoop Streaming, you have to use line-based text files; the data up to the first tab is passed to your mapper as the key.

Just look at the documentation for streaming.
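
For illustration, a minimal Python streaming mapper could look like the following sketch; it just reads raw lines from sys.stdin and writes tab-separated key/value pairs to stdout (the choice of key here is made up):

    #!/usr/bin/env python
    # Minimal Hadoop Streaming mapper sketch: read raw lines from stdin and
    # emit "key<TAB>value" pairs; everything up to the first tab of the
    # output is treated as the key by the streaming framework.
    import sys

    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        key = line.split()[0]          # hypothetical: first token as the key
        print("%s\t%s" % (key, line))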

You can also put your input files into HDFS, which is recommended for big files. Just look at the "Large Files" section in the above link.

2) Metadata-preservation

The problem I see is that your header information (metadata) will just be treated as ordinary data, so you have to filter it out yourself (first step). Passing it along is more difficult, as the data of all input files is simply joined together after the map step.

You will have to add the metadata somewhere to the data itself (second step) to be able to relate it later. You could emit (key, data+metadata) for each data line of a file and thus preserve the metadata for each data line. That might be a huge overhead, but we are talking MapReduce, which means: pfffrrrr ;)
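
A rough Python streaming sketch of that second step (it assumes, purely for illustration, that header lines start with "#" and appear before the data they describe; how you really recognize metadata depends on your format):

    #!/usr/bin/env python
    # Sketch: filter the header (metadata) out of the data stream, but carry
    # it along with every data line so it survives the shuffle into reduce().
    # Made-up convention: header lines start with "#" and precede the data.
    import sys

    header_parts = []
    for line in sys.stdin:
        line = line.rstrip("\n")
        if line.startswith("#"):
            header_parts.append(line.lstrip("# "))   # collect metadata, emit nothing
            continue
        if not line.strip():
            continue
        metadata = " ".join(header_parts)
        key = line.split()[0]                        # hypothetical key choice
        # Emit key<TAB>metadata|data; "|" is an arbitrary separator that must
        # not occur in the data itself.
        print("%s\t%s|%s" % (key, metadata, line))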

Now comes the part where I don't know how much streaming really differs from a Java-implemented job. IF streaming invokes one mapper per file, you can spare yourself the following trouble: just take the first input of map() as the metadata and add it (or a placeholder) to all following data emits. If not, the next part is about Java jobs:

At least with a JAR-mapper you can relate the data to its input file (see here). But you would have to extract the metadata first, as the map function might be invoked only on a partition of the file that does not contain the metadata. I'd propose something like this (a rough sketch of the idea follows the list):

  • create a metadata-file beforehand, containing a placeholder-index: keyx: filex, metadatax
  • put this metadata-index into HDFS
  • use a JAR-mapper and load the metadata-index-file during setup()
    • see org.apache.hadoop.hdfs.DFSClient
  • match filex and set keyx for this mapper
  • add the used keyx to each data-line emitted in map()
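
A rough streaming-side analog of this recipe (not the JAR approach itself) might look as follows. It assumes the index file was shipped to the tasks, e.g. with -file metadata_index.txt, and that the current input file name is exposed to streaming tasks as the map_input_file environment variable (mapreduce_map_input_file on newer Hadoop versions); both points depend on your setup.

    #!/usr/bin/env python
    # Streaming analog of the setup()/metadata-index idea sketched above.
    # Assumptions: "metadata_index.txt" contains lines "keyx:filex,metadatax"
    # and sits in the task's working directory; the input file name is in the
    # map_input_file (or mapreduce_map_input_file) environment variable.
    import os
    import sys

    index = {}
    with open("metadata_index.txt") as f:
        for entry in f:
            keyx, rest = entry.rstrip("\n").split(":", 1)
            filex, metadatax = rest.split(",", 1)
            index[os.path.basename(filex.strip())] = keyx

    current_file = os.environ.get("map_input_file",
                                  os.environ.get("mapreduce_map_input_file", ""))
    keyx = index.get(os.path.basename(current_file), "unknown")

    for line in sys.stdin:
        line = line.rstrip("\n")
        if line:
            print("%s\t%s" % (keyx, line))   # add the matched keyx to each data-line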
Leonidas
OK, I think I can write things using XML for the files and split the text files into something like this: <meta_data></meta_data><actual_info></actual_info> for each text file. Your explanation is much appreciated -- but I'm definitely lost :~(
lw2010
That will help you to identify the meta_data in map() more easily. Nevertheless, you still have to preserve it during the whole map() to be able to add it to every datum if you have a combining reduce() phase later. See: map() is called successively with the pairs (key1, value1), (key2, value2), ..., (key_maximumline, value_maximumline) for each file/metadata-container, but reduce() is just called with (map_emitted_key, (list of all map_emitted_values for this key)) ... so when you want to know the metadata during reduce(), you have to add it to the map_emitted_value somehow (or use the key ...)
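
A small reducer sketch of what that looks like on the streaming side, assuming the mapper packed the metadata into the value as metadata|data (the "|" separator is the same made-up convention as in the mapper sketch above):

    #!/usr/bin/env python
    # Streaming reducer sketch: lines arrive on stdin sorted by key; the
    # metadata is only available because the mapper put it into the value.
    import sys
    from itertools import groupby

    def parse(line):
        key, value = line.rstrip("\n").split("\t", 1)
        metadata, data = value.split("|", 1)
        return key, metadata, data

    # Grouping by key is up to the reducer script itself.
    for key, group in groupby((parse(l) for l in sys.stdin), key=lambda t: t[0]):
        for _, metadata, data in group:
            # ... per-key work goes here, with the metadata at hand ...
            print("%s\t%s" % (key, data))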
Leonidas
+1  A: 

If you're using Hadoop Streaming, your input can be in any line-based format; your mapper and reducer input comes from sys.stdin, which you can read any way you want. You don't need to use the default tab-delimited fields (although in my experience, one format should be used across all tasks for consistency when possible).
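
For example, a mapper could parse comma-separated input lines from sys.stdin and emit whatever layout the rest of the pipeline expects (the column meanings here are made up):

    #!/usr/bin/env python
    # The mapper only sees raw lines on stdin, so the input can be any
    # line-based format; here it is read as CSV (hypothetical columns).
    import csv
    import sys

    for row in csv.reader(sys.stdin):
        if len(row) < 2:
            continue
        doc_id, text = row[0], row[1]      # made-up column layout
        print("%s\t%s" % (doc_id, text))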

However, with the default splitter and partitioner, you cannot control how your input and output are partitioned or sorted, so your mappers and reducers must decide whether any particular line is a header line or a data line using only that line - they won't know the original file boundaries.

You may be able to specify a partitioner which lets a mapper assume that the first input line is the first line in a file, or even move away from a line-based format. This was hard to do the last time I tried with Streaming, and in my opinion mapper and reducer tasks should be input agnostic for efficiency and reusability - it's best to think of a stream of input records, rather than keeping track of file boundaries.

Another option with Streaming is to ship the header information in a separate file, which is included with your data. It will be available to your mappers and reducers in their working directories. One idea would be to associate each line with the appropriate header information in an initial task, perhaps by using three fields per line instead of two, rather than associating them by file.
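
A sketch of such an initial task, assuming the header information was shipped alongside the job as headers.txt (e.g. with the -file option) and that three tab-separated fields are acceptable downstream:

    #!/usr/bin/env python
    # Initial annotation task: attach the shipped header information to every
    # line as a third field, so later tasks never need to know about files.
    # Assumption: "headers.txt" is in the task's working directory.
    import sys

    with open("headers.txt") as f:
        header = " ".join(f.read().split())    # collapse whitespace so it fits in one field

    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        key = line.split()[0]                  # hypothetical key choice
        print("%s\t%s\t%s" % (key, line, header))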

In general, try to treat the input as a stream, and don't rely on file boundaries, input size, or order. All of these restrictions can be implemented, but at the cost of complexity. If you do need to implement them, do so at the beginning or end of your task chain.

If you're using Jython or SWIG, you may have other options, but I found those harder to work with than Streaming.

Karl Anderson