After learning about MapReduce while solving a computer vision problem during my recent internship at Google, I felt enlightened. I had already been using R for text mining, and I wanted to use R for large-scale text processing and for experiments with topic modeling. I started reading tutorials and working through some of them. Here is my understanding of each of the tools:

1) R text mining toolbox (the tm package): meant for local (client-side) text processing; it uses the XML package. A minimal sketch of this local workflow is shown after this list.

2) Hive: Hadoop interactive; it provides the framework to call map/reduce and also provides the DFS interface for storing files on the DFS.

3) RHIPE: the R and Hadoop Integrated Programming Environment (a sketch of a RHIPE job follows this list).

4) Elastic MapReduce with R: Amazon's hosted MapReduce framework, for those who do not have their own clusters.

5) Distributed Text Mining with R: an attempt to make the move from local to server-side processing seamless, i.e. from R-tm to R-distributed-tm.
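
For reference, here is a minimal sketch of the local tm workflow from point 1. The two sample documents are made up for illustration:

    library(tm)

    # Build a small in-memory corpus from made-up sample documents
    docs <- c("MapReduce scales text processing across many machines",
              "R handles text mining locally with the tm package")
    corpus <- VCorpus(VectorSource(docs))

    # Standard client-side preprocessing
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Term-document matrix, the usual input for topic modeling
    tdm <- TermDocumentMatrix(corpus)
    inspect(tdm)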
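
And here is a rough sketch of what a RHIPE job from point 3 might look like (a distributed word count). The HDFS paths are hypothetical placeholders, and the exact calls differ across RHIPE versions, so treat this as an outline rather than a definitive example:

    library(Rhipe)
    rhinit()  # initialize the R/Hadoop bridge

    # Map: emit (word, 1) for every word in each input document
    map <- expression({
      lapply(map.values, function(doc) {
        words <- unlist(strsplit(tolower(doc), "[^a-z]+"))
        words <- words[nchar(words) > 0]
        for (w in words) rhcollect(w, 1L)
      })
    })

    # Reduce: sum the counts collected for each word
    reduce <- expression(
      pre    = { total <- 0L },
      reduce = { total <- total + sum(unlist(reduce.values)) },
      post   = { rhcollect(reduce.key, total) }
    )

    # HDFS input/output paths are hypothetical
    job <- rhwatch(map = map, reduce = reduce,
                   input = "/tmp/docs", output = "/tmp/wordcount")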

I have the following questions and points of confusion about the above packages:

1) Hive, RHIPE, and the distributed text mining toolbox require you to have your own cluster. Right?

2) If I have just one computer, how would the DFS work in the case of Hive?

3) Are we facing a problem of duplication of effort with the above packages?

I am hoping to get insights on the above questions in the next few days.

A: 

I'm not familiar with the Distributed Text Mining with R application, but Hive can run on a local cluster or on a single-node cluster. This can be done for experimenting, or in practice, but it does defeat the purpose of having a distributed file system for serious work. As for duplication of effort: Hive is meant to be a complete SQL implementation on top of Hadoop, so there is duplication insofar as both SQL and R can work with text data, but not insofar as each is a specific tool with different strengths.

Jakob Homan