Hey Som,
Say I want to convert thousands of Word
files to PDF. Would using Hadoop
to approach this problem make sense?
Would using Hadoop have any advantage
over simply using multiple EC2
instances with job queues?
I think either approach could accomplish this task, so the choice depends on what you plan to do with the documents after conversion. Derek Gottfrid at the New York Times famously found Hadoop to be a useful tool for large-scale document conversion, so it's certainly within the range of tasks Hadoop performs well.
Also, if there were 1 file and 10 free
nodes, would Hadoop split the file
and send it to the 10 nodes, or would
the file be sent to just 1 node while
9 sit idle?
It depends on the InputFormat you use. As you can see in the documentation, the InputFormat specifies how to compute the "InputSplits", and each split is handed to its own map task, so a large document might be divided into chunks that are processed on different nodes.
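To make the one-file, ten-node case concrete, here is a rough sketch of the idea in plain Python. It is only a model, not Hadoop's API: the function name compute_splits and its parameters are made up for illustration, and real split computation also takes block sizes and block locations into account. The "splittable" flag stands in for the kind of decision an InputFormat makes (in Hadoop's FileInputFormat, I believe the relevant hook is isSplitable).

```python
from typing import List, Tuple


def compute_splits(file_length: int, split_size: int,
                   splittable: bool) -> List[Tuple[int, int]]:
    """Model how one input file becomes a list of (offset, length) splits.

    Hypothetical helper for illustration only; each returned split
    corresponds to one map task in this simplified model.
    """
    if file_length < 0 or split_size <= 0:
        raise ValueError("file_length must be >= 0 and split_size > 0")
    if file_length == 0:
        return []
    if not splittable:
        # The whole file becomes a single split, so only one task gets work.
        return [(0, file_length)]

    splits = []
    offset = 0
    while offset < file_length:
        # The last split may be shorter than split_size.
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


# One 1000-byte file, 100-byte splits: ten tasks if the format is splittable.
print(len(compute_splits(1000, 100, splittable=True)))
# The same file treated as unsplittable: one task, the other nodes stay idle.
print(compute_splits(1000, 100, splittable=False))
```

For your use case, note that a Word file usually cannot be converted from an arbitrary byte range, so you would most likely want each file treated as unsplittable and get your parallelism from having thousands of files rather than from splitting any one of them.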
Good luck with whatever tool you choose for this problem!
Regards,
Jeff