Converting word docs to pdf using Hadoop

tags:

hadoop

views:

162

Say I want to convert thousands of Word files to PDF. Does it make sense to use Hadoop for this? Would Hadoop have any advantage over simply using multiple EC2 instances with job queues?

Also, if there were 1 file and 10 free nodes, would Hadoop split the file and send it to all 10 nodes, or would the file go to just 1 node while the other 9 sit idle?

+2 A:

There isn't much advantage in using Hadoop for this use case. Having competing consumers read from a queue and produce output is going to be a lot easier to set up and will probably be more efficient.
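To make the competing-consumers idea concrete, here is a minimal sketch using threads and an in-process queue. It assumes a hypothetical `convert_to_pdf` helper that only derives the output file name; a real worker would call an actual Word-to-PDF converter there, and a real deployment would likely read from a shared queue service across machines rather than from `queue.Queue`.

```python
import queue
import threading


def convert_to_pdf(doc_name):
    # Hypothetical placeholder: a real worker would run an actual
    # Word-to-PDF converter here. This stub only derives the output name.
    return doc_name.rsplit(".", 1)[0] + ".pdf"


def worker(jobs, results):
    # Each worker competes for jobs on the shared queue until it
    # receives a None sentinel.
    while True:
        doc_name = jobs.get()
        if doc_name is None:
            break
        results.put(convert_to_pdf(doc_name))


def run_pool(doc_names, num_workers=4):
    jobs = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for name in doc_names:
        jobs.put(name)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so every thread exits
    for t in threads:
        t.join()
    # All workers have finished, so exactly len(doc_names) results are queued.
    return sorted(results.get() for _ in range(len(doc_names)))
```

Whichever worker is free picks up the next document, so no scheduling logic is needed beyond the queue itself.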

Hadoop would not automatically split a document and process sections on different nodes. If you had a really big document (many thousands of pages long), the Hadoop use case would make sense, but only when the time to produce a PDF on a single machine is significant.

The map tasks could each print a few thousand pages and the reduce task could merge the PDFs into a single document, although the resulting file may be difficult to read if it is very large.
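The map-and-merge idea for one very large document can be sketched in plain Python, without Hadoop. The function names are made up for illustration, and `map_render` is a stand-in that returns page labels instead of rendering real PDF pages.

```python
def page_ranges(total_pages, pages_per_task):
    # Split pages [0, total_pages) into contiguous half-open ranges,
    # one range per map task.
    return [
        (start, min(start + pages_per_task, total_pages))
        for start in range(0, total_pages, pages_per_task)
    ]


def map_render(page_range):
    # Hypothetical stand-in for rendering: returns the starting page
    # (used later as the sort key) and the "rendered" pages.
    start, end = page_range
    return start, [f"page-{n}" for n in range(start, end)]


def reduce_merge(parts):
    # Merge by starting page so the output keeps document order even
    # if map tasks finish out of order.
    merged = []
    for _, pages in sorted(parts):
        merged.extend(pages)
    return merged
```

The starting page acts as the key, which is what lets the reduce step reassemble the parts in order regardless of which node finished first.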

Robert Christie 2009-12-29 12:23:21

Hey Som,

Say I want to convert thousands of Word files to PDF. Does it make sense to use Hadoop for this? Would Hadoop have any advantage over simply using multiple EC2 instances with job queues?

I think either tool could accomplish this task, so it depends on what you plan to do with the documents after conversion. Derek Gottfrid at the New York Times famously found Hadoop to be a useful tool for large-scale document conversion, so it's certainly within the realm of tasks at which Hadoop performs well.

Also, if there were 1 file and 10 free nodes, would Hadoop split the file and send it to all 10 nodes, or would the file go to just 1 node while the other 9 sit idle?

It depends on the InputFormat you use. As you can see in the documentation, you can specify how to compute the "InputSplits", which might include splitting a large document into chunks.
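As a rough illustration of what computing splits means, here is a simplified model in Python of how a file-based input format might carve a file into byte ranges. This is not Hadoop's actual code: `compute_splits` and its parameters are made up for the sketch, and it assumes a positive file length. The `splittable` flag mirrors the idea behind the `isSplitable` hook on Hadoop's `FileInputFormat`, which can be overridden to keep a file in one piece.

```python
def compute_splits(file_length, split_size, splittable=True):
    # Returns a list of (offset, length) byte ranges covering the file.
    # Assumes file_length > 0 and split_size > 0.
    if not splittable:
        # The whole file goes to a single task; other nodes get nothing.
        return [(0, file_length)]
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits
```

For example, `compute_splits(250, 100)` gives three ranges that could go to three nodes, while `splittable=False` gives one range for one node. Keep in mind that an arbitrary byte range of a Word file is unlikely to be convertible on its own, so a real splitter for this job would need to cut on document boundaries such as pages or sections.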

Good luck with whatever tool you choose for this problem!

Regards, Jeff

Jeff Hammerbacher 2010-01-01 14:47:38
