hadoop

Hadoop DFS Error

Can someone tell me what I am doing wrong?

2009/08/10 11:33:07 [INFO] - Copying local:/X/Y/Z.txt to DFS:/X/Y/Z.txt
2009/08/10 11:33:07 [INFO] - put: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=superman, access=WRITE, inode="":big-build:supergroup:rwxr-xr-x
2009/08/10 11:33:08 [FATAL] - DFS error...

Hadoop DFS Permission Error

2009/08/11 13:25:39 [INFO] - put: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=yskhoo, access=WRITE, inode="":bad-boy:supergroup:rwxr-xr-x

Why do I keep getting this error? Also, is it bad that I am writing to a blank inode? ...

Distributing Video on a LAN to alternate Locations - Can the browser detect this?

I'm the administrator for a company intranet and I'd like to start producing videos. However, we have a very small bandwidth tunnel between our locations, and I'd like to avoid multiple users hogging it by streaming video across it. I'd like to synchronize the files to servers at each of the locations. Then I'd like the browser (or the intran...

Using the Apache Mahout machine learning libraries

I've been working with the Apache Mahout machine learning libraries in my free time a bit over the past few weeks. I'm curious to hear about how others are using these libraries. ...

Splitting input into substrings in PIG (Hadoop)

Assume I have the following input in Pig: some And I would like to convert that into: s so som some I've not (yet) found a way to iterate over a chararray in Pig Latin. I have found the TOKENIZE function, but that splits on word boundaries. So can Pig Latin do this, or is this something that requires a Java class? ...
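Whatever wraps it (a Java UDF or a streaming script), the substring logic itself is trivial; here is a minimal Python sketch — the function name `prefixes` is just for illustration, not a Pig built-in:

```python
def prefixes(word):
    # Return every leading substring of the input, shortest first:
    # "some" -> ["s", "so", "som", "some"].
    return [word[:i] for i in range(1, len(word) + 1)]

print(prefixes("some"))  # -> ['s', 'so', 'som', 'some']
```

The same one-liner ports directly to a Java UDF that emits one tuple per prefix.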

Wiping out DFS in Hadoop

How do I wipe out the DFS in Hadoop? ...

Hadoop Distribution Differences

Can somebody outline the differences between the various Hadoop distributions available: Cloudera - http://www.cloudera.com/hadoop Yahoo - http://developer.yahoo.net/blogs/hadoop/ using the Apache Hadoop distro as a baseline. Is there a good reason to use one of these distributions over the standard Apache Hadoop distro?...

Can OLAP be done in BigTable?

In the past I used to build WebAnalytics using OLAP cubes running on MySQL. Now an OLAP cube the way I used it is simply a large table (ok, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (i.e. which pagename, useragent, ip,...

Look up values in a BDB for several files in parallel

What is the most efficient way to look up values in a BDB for several files in parallel? If I had a Perl script which did this for one file at a time, would forking/running the process in background with the ampersand in Linux work? How might Hadoop be used to solve this problem? Would threading be another solution? ...
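As a sketch of the fork-per-file approach (not BDB-specific — a plain dict stands in for the database handle here, since the actual bindings vary), Python's multiprocessing module gives the same effect as backgrounding one process per file with `&`:

```python
from multiprocessing import Pool

# Hypothetical stand-in for a BDB handle; in practice each worker
# would open its own database file via the BDB bindings.
STORE = {"alpha": 1, "beta": 2}

def lookup_keys(keys):
    # One worker handles the keys for one file, independently of
    # the others -- the same effect as forking per file.
    return [STORE.get(k) for k in keys]

if __name__ == "__main__":
    per_file_keys = [["alpha"], ["beta", "alpha"]]
    with Pool(processes=2) as pool:
        results = pool.map(lookup_keys, per_file_keys)
```

Threads would work too, but for CPU-bound lookups separate processes avoid contention; Hadoop only starts to pay off once the files no longer fit on one machine.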

Advanced queries in HBase

Given the following HBase schema scenario (from the official FAQ)... How would you design an HBase table for a many-to-many association between two entities, for example Student and Course? I would define two tables: Student: student id student data (name, address, ...) courses (use course ids as column qualifiers h...

Java vs Python on Hadoop

I am working on a project using Hadoop, and it seems to natively incorporate Java and provide streaming support for Python. Is there a significant performance impact to choosing one over the other? I am early enough in the process that I can go either way if there is a significant performance difference one way or the other. ...
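For context, a streaming mapper is just a script that reads lines on stdin and writes tab-separated key/value pairs to stdout; the extra process per task and the text round-trip are where the Python overhead comes from relative to a native Java mapper. A minimal word-count mapper sketch:

```python
def map_lines(lines):
    # Emit a <word, 1> pair per token as tab-separated text.
    # This serialization to and from text on every record is the
    # main cost a native Java mapper avoids.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# In a real streaming job this would be driven by stdin:
#   import sys
#   for pair in map_lines(sys.stdin): print(pair)
```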

CloudStore vs. HDFS

Does anyone have familiarity with working with both CloudStore and HDFS? I am interested to see how far CloudStore has been scaled and how heavily it has been used in production. CloudStore seems to be more full-featured than HDFS. When thinking about these two filesystems, what practical trade-offs are there? ...

Get the task attempt ID for the currently running Hadoop task

The Task Side-Effect Files section of the Hadoop tutorial mentions using the "attemptid" of the task as a unique name. How do I get this attempt ID in my mapper or reducer? ...
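For the streaming case (as opposed to the Java API's `Reporter`/context), Hadoop exports job configuration values into the task's environment with dots replaced by underscores, so the attempt ID can be read from an environment variable. A hedged Python sketch, assuming the classic `mapred.task.id` property name:

```python
import os

def task_attempt_id(default="unknown"):
    # Hadoop Streaming maps the "mapred.task.id" configuration
    # property to the environment variable "mapred_task_id"
    # inside the task process.
    return os.environ.get("mapred_task_id", default)
```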

Setting up a (Linux) Hadoop cluster

Do you need to set up a Linux cluster first in order to set up a Hadoop cluster? ...

Sorting the values before they are sent to the reducer

I'm thinking about building a small testing application in Hadoop to get the hang of the system. The application I have in mind will be in the realm of doing statistics. I want to have "The 10 worst values for each key" from my reducer function (where I must assume the possibility of a huge number of values for some keys). What I have pla...
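Taking "worst" to mean largest (an assumption — flip to `nsmallest` otherwise), Python's heapq keeps a bounded top-N without holding all values for a key in memory; a minimal sketch of the reducer-side logic:

```python
import heapq

def worst_n(values, n=10):
    # heapq.nlargest streams over the iterable and keeps only the
    # n largest values seen so far, so memory stays O(n) even for
    # keys with a huge number of values.
    return heapq.nlargest(n, values)

print(worst_n([5, 1, 9, 3, 7], n=3))  # -> [9, 7, 5]
```

The alternative the question hints at — having Hadoop pre-sort the values via a secondary sort on a composite key — pushes the sorting into the shuffle instead of the reducer.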

Life without JOINs... understanding, and common practices

Lots of "BAW"s (big ass-websites) are using data storage and retrieval techniques that rely on huge tables with indexes, and using queries that won't/can't use JOINs in their queries (BigTable, HQL, etc) to deal with scalability and sharding databases. How does that work when you have lots and lots of data that is very related? I can on...

Writing data to Hadoop

I need to write data into Hadoop (HDFS) from external sources like a Windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and there is an easy way to ...

What is Hadoop?

Hi, I want to know what Hadoop is. I have gone through Google and Wikipedia, but I am still not clear on what Hadoop actually is or what its goal is. Any useful information would be highly appreciated. Note: please do not provide a link to the wiki, as I have read it but am looking for a detailed explanation. Thanks. ...

Is Hadoop right for running my simulations?

I have written a stochastic simulation in Java, which loads data from a few CSV files on disk (totaling about 100MB) and writes results to another output file (not much data, just a boolean and a few numbers). There is also a parameters file, and for different parameters the distribution of simulation outputs would be expected to change. T...

Can Hadoop be restricted to spare CPU cycles?

Is it possible to run Hadoop so that it only uses spare CPU cycles? I.e. would it be feasible to install Hadoop on people's work machines so that number crunching can be done when they are not using their PCs, and they wouldn't experience an obvious performance drain (whirring fans aside!). Perhaps it's just a case of setting the JVM...