questions about hdfs | ansaurus

hdfs

CloudStore vs. HDFS

Does anyone have experience working with both CloudStore and HDFS? I am interested to see how far CloudStore has been scaled and how heavily it has been used in production. CloudStore seems to be more full-featured than HDFS. When thinking about these two filesystems, what practical trade-offs are there? ...

Writing data to Hadoop

I need to write data into Hadoop (HDFS) from external sources like a Windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and there is an easy way to ...

Scalable Image Storage

Hi, I'm currently designing an architecture for a web-based application that should also provide some kind of image storage. Users will be able to upload photos as one of the key features of the service. Viewing these images (via the web) will also be one of the primary uses. However, I'm not sure how to realize such a scalable image sto...

Where does HDFS store files locally by default?

Hi, I am running Hadoop with the default configuration on a one-node cluster, and would like to find where HDFS stores files locally. Any ideas? Thanks. ...

Are there any existing batch log file aggregation solutions?

I wish to export log files (in my case Apache access and error logs) from multiple nodes and aggregate that data in batch, as a scheduled job. I have seen multiple solutions that work with streaming data (e.g. Scribe). I would like a tool that gives me the flexibility to define the destination. This requirement comes from the fa...

Hadoop dfs -ls returns list of files in my hadoop/ dir

I've set up a single-node Hadoop configuration running via Cygwin under Win7. After starting Hadoop by bin/start-all.sh I run bin/hadoop dfs -ls, which returns a list of files in my hadoop directory. Then I run bin/hadoop datanode -format and bin/hadoop namenode -format, but -ls still returns the contents of my hadoop directory. As far as I...

CouchDB, HDFS, or HBase: which is right for my situation?

Hello all, This question is about data storage systems such as CouchDB, HDFS and HBase, and specifically which one is right for my case. I am looking at making a simple, customized Document Management System for my organization. Basically, we need the ability to store some Word documents, PDFs and other similar files. I also want to store metada...

Opening a Lucene index stored in HDFS

How can I read a Lucene index directory stored on HDFS, i.e. how do I get an IndexReader for an index stored on HDFS? The IndexReader is to be opened in a map task. Something like: IndexReader reader = IndexReader.open("hdfs/path/to/index/directory"); Thanks, Akhil ...

Hadoop safemode recovery - taking a lot of time

Hi, We are running our cluster on Amazon EC2. We are using Cloudera scripts to set up Hadoop. On the master node, we start the services below. $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start namenode' $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start secondarynamenode' $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon...

FileInputStream for a generic file system

I have a file that contains Java serialized objects like "Vector". I have stored this file on the Hadoop Distributed File System (HDFS). Now I intend to read this file (using the readObject method) in one of the map tasks. I suppose FileInputStream in = new FileInputStream("hdfs/path/to/file"); won't work, as the file is stored on HDFS. So ...

Problem with copying local data onto HDFS on a Hadoop cluster using Amazon EC2/S3

Hi, I have set up a Hadoop cluster containing 5 nodes on Amazon EC2. Now, when I log in to the master node and submit the following command bin/hadoop jar <program>.jar <arg1> <arg2> <path/to/input/file/on/S3> it throws the following errors (not at the same time). The first error is thrown when I don't replace the slashes with '%2F' ...

What is the best component stack for building a distributed log aggregator (like Splunk)?

I'm trying to find the best components I could use to build something similar to Splunk in order to aggregate logs from a large number of servers in a computing grid. It should also be distributed, because I have gigabytes of logs every day and no single machine will be able to store them. I'm particularly interested in something that will work wi...

Is there any distributed file system which runs on Windows other than Hadoop?

I'm desperate to find any DFS which supports Windows. The only such DFS is Hadoop HDFS, but it's very hard to deploy it over a large number of Windows machines because it requires Cygwin + SSH. Almost all DFS systems work only on Linux and only one (HDFS) runs on Windows. I would be very grateful if somebody could point me to oth...

How to write and read files in/from Hadoop HDFS using Ruby?

Is there a way to work with the HDFS API using Ruby? As far as I understand, there is no multi-language file API and the only way is to use the native Java API. I tried using JRuby, but this solution is too unstable and not very native. I also looked at the HDFS Thrift API, but it's not complete and lacks many features (like writing to indexed files). ...

Is it possible to use Avro with Hadoop 0.20?

I'm interested in using Avro to save and read files from Hadoop HDFS. I saw some JIRA issues in the Hadoop issue tracker about implementing support for Avro, but there were no examples of how to enable Avro support in Hadoop. Also, I'm not completely sure that the current 0.20 has support for Avro, because some JIRA issues were closed for 0.21. Is it pos...

Is it possible to run Hadoop in Pseudo-Distributed operation without HDFS?

I'm exploring the options for running a hadoop application on a local system. As with many applications the first few releases should be able to run on a single node, as long as we can use all the available CPU cores (Yes, this is related to this question). The current limitation is that on our production systems we have Java 1.5 and as...

How to copy files from HDFS to S3 effectively and programmatically

Hi, My Hadoop job generates a large number of files on HDFS, and I want to write a separate thread which will copy these files from HDFS to S3. Could anyone point me to a Java API that handles this? Thanks ...
