Is there a difference between having, say, n files with one line each in the input folder and having one file with n lines in the input folder when running Hadoop?

If there are n files, does the "InputFormat" just see it all as 1 continuous file?
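No — by default each file is handled separately. A minimal sketch (plain Python, not Hadoop code, with an assumed `count_splits` helper) of how FileInputFormat-style splitting behaves: every non-empty file produces at least one split, and splits never span file boundaries, so n tiny files mean n map tasks while one large file is cut into roughly size/block-size tasks.

```python
import math

def count_splits(file_sizes, block_size):
    """Return the number of input splits for the given file sizes.

    Models the default behavior: one split per file minimum,
    large files chopped into block-sized pieces, and no split
    ever crossing a file boundary.
    """
    splits = 0
    for size in file_sizes:
        splits += max(1, math.ceil(size / block_size))
    return splits

BLOCK = 64 * 1024 * 1024  # 64 MB, the classic HDFS default block size

# 10,000 files of 100 bytes each -> 10,000 map tasks
print(count_splits([100] * 10_000, BLOCK))

# One file holding the same 1,000,000 bytes -> a single map task
print(count_splits([100 * 10_000], BLOCK))
```

The same total data volume can therefore launch four orders of magnitude more tasks, and each task carries fixed JVM startup and scheduling overhead.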

+2  A: 

There's a big difference. It's frequently referred to as "the small files problem", and stems from the fact that Hadoop is designed to split giant inputs into smaller tasks, not to collect small inputs into larger tasks.

Take a look at this blog post from Cloudera: http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/

If you can avoid creating lots of files, do so. Concatenate when possible. Large splittable files are MUCH better for Hadoop.
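A minimal sketch of the concatenation step, with hypothetical file names (`input/movie*.txt`, `merged/all_ratings.txt` are made up for illustration):

```shell
# Hypothetical layout: many tiny per-movie files under ./input/.
mkdir -p input merged
printf 'movie1,5\n' > input/movie1.txt
printf 'movie2,3\n' > input/movie2.txt

# Concatenate them into one splittable file before loading into HDFS,
# so Hadoop sees one large input instead of thousands of tiny ones.
cat input/*.txt > merged/all_ratings.txt

wc -l merged/all_ratings.txt
```

If the small files already live in HDFS, `hadoop fs -getmerge` can pull them down into a single local file as a similar workaround.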

I once ran Pig on the Netflix dataset. It took hours to process just a few gigs. I then concatenated the input files (I think it was a file per movie, or a file per user) into a single file, and had my result in minutes.

SquareCog