Hi

My background: I'm 4 weeks into the Hadoop world. I've dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM, and have read Google's papers on MapReduce and GFS.

I understand that:

  • Pig's language, Pig Latin, is a shift away from the SQL-like declarative style of programming (it suits the way programmers think), whereas Hive's query language closely resembles SQL.

  • Pig sits on top of Hadoop and in principle could also sit on top of Dryad. I might be wrong, but Hive seems closely coupled to Hadoop.

  • Both Pig Latin and Hive commands compile to Map and Reduce jobs.

My question: what is the goal of having both when one (say Pig) could serve the purpose? Is it just because Pig is evangelized by Yahoo! and Hive by Facebook?

cheers

+3  A: 

I believe that the real answer to your question is that they are/were independent projects and there was no centrally coordinated goal. They were in different spaces early on and have grown to overlap with time as both projects expand.

Paraphrased from the Hadoop O'Reilly book:

Pig: a dataflow language and environment for exploring very large datasets.

Hive: a distributed data warehouse

Hive is supposed to be closer to a traditional RDBMS than Pig.

  • This starts with the scripting language (HiveQL vs Pig Latin)
  • Hive is designed to store data in tables, with a managed schema. Pig primarily processes large flat files.
Greg Harman
Hive is nothing like an RDBMS. It processes flat files just like Pig. They both basically do the same thing. Look at the optimizers they use when compiling the job, as that is the largest real difference.
Steve
+3  A: 

You can achieve similar results with Pig and Hive queries. The main difference lies in the approach to understanding and writing queries.

  • Pig tends to create a flow of data: small steps, each of which does some processing.
  • Hive gives you a SQL-like language to operate on your data, so the transition from an RDBMS is much easier (Pig can be easier for someone with no earlier SQL experience).
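
To make the difference in approach concrete, here is a rough analogy in plain Python rather than actual Pig Latin or HiveQL. The sample rows, the `logs` table, and both function names are made up for illustration; `sqlite3` merely stands in for a SQL engine and has nothing to do with how Hive executes queries.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (username, nbytes) rows.
ROWS = [("ann", 120), ("bob", 40), ("ann", 300), ("cat", 150), ("bob", 500)]


def dataflow_style(rows):
    """Pig-like approach: a chain of small named steps, each feeding the next."""
    big = [r for r in rows if r[1] > 100]              # step 1: filter
    grouped = defaultdict(list)                        # step 2: group by user
    for username, nbytes in big:
        grouped[username].append(nbytes)
    totals = {u: sum(v) for u, v in grouped.items()}   # step 3: aggregate
    return sorted(totals.items())


def sql_style(rows):
    """Hive-like approach: one declarative statement; the engine plans the steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (username TEXT, nbytes INTEGER)")
    conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT username, SUM(nbytes) FROM logs "
        "WHERE nbytes > 100 GROUP BY username ORDER BY username"
    ).fetchall()
    conn.close()
    return result
```

Both functions return the same per-user totals; the only difference is whether you spell out the intermediate steps yourself or describe the result and let the engine work them out.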

It is also worth noting that with Hive you get a nice interface for working with your data (Beeswax for HUE, or the Hive web interface), and it also gives you a metastore holding information about your data (schema, etc.), which is useful as a central catalog.

I use both Hive and Pig for different queries, mostly ad hoc ones: I pick whichever lets me write the query faster or more easily. They can use the same data as input. Currently, though, I'm doing much of my work through Beeswax.

Wojtek
+2  A: 

Check out this post from Alan Gates, Pig architect at Yahoo!, that compares when you would use SQL (as in Hive) rather than Pig: http://developer.yahoo.net/blogs/hadoop/2010/01/comparing_pig_latin_and_sql_fo.html He makes a very convincing case for the usefulness of a procedural language like Pig Latin (as opposed to declarative SQL) and its utility to dataflow designers.

Jakob Homan
+1  A: 

Hive was designed to appeal to a community comfortable with SQL. Its philosophy was that we don't need yet another scripting language. Hive supports map and reduce transform scripts in the language of the user's choice (which can be embedded within SQL clauses). It is widely used at Facebook by analysts comfortable with SQL as well as by data miners programming in Python. SQL compatibility efforts in Pig have been abandoned AFAIK, so the difference between the two projects is very clear.
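
As a rough sketch of what such a transform script can look like in Python: Hive streams rows to the script on stdin as tab-separated text and reads tab-separated rows back from stdout. The column layout (a user name and a byte count) and the function names here are assumptions made up for the example.

```python
import sys


def transform_line(line):
    """Convert one tab-separated (username, nbytes) row to (username, kilobytes)."""
    username, nbytes = line.rstrip("\n").split("\t")
    return f"{username}\t{int(nbytes) // 1024}"


def transform_lines(lines):
    """Apply transform_line to every non-blank input line."""
    for line in lines:
        if line.strip():
            yield transform_line(line)


def main():
    # Rows arrive on stdin and leave on stdout, one row per line.
    for out in transform_lines(sys.stdin):
        print(out)

# When saved as a standalone script (say, a hypothetical to_kb.py),
# finish the file with a call to main().
```

From the Hive side this would be wired up with something along the lines of `ADD FILE to_kb.py;` followed by `SELECT TRANSFORM(username, nbytes) USING 'python to_kb.py' AS (username, kb) FROM logs;`, where the file, table and column names are again hypothetical.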

Supporting SQL syntax also means that it's possible to integrate with existing BI tools like MicroStrategy. Hive has an ODBC/JDBC driver (a work in progress) that should allow this to happen in the near future. It's also beginning to add support for indices, which should enable the drill-down queries common in such environments.

Finally, though this is not directly pertinent to the question: Hive is a framework for performing analytic queries. While its dominant use is to query flat files, there's no reason why it cannot query other stores. Currently Hive can be used to query data stored in HBase (which is a key-value store like those found in the guts of most RDBMSs), and the HadoopDB project has used Hive to query a federated RDBMS tier.

Joydeep Sen Sarma