Hadoop Balancer

Whenever nodes are added to the cluster or a lot of data is deleted, we need to run the Hadoop balancer to rebalance the data across the datanodes. Otherwise, the over-utilized data nodes will become the bottl

Adding or Removing Hadoop Nodes

Here I give the complete steps. I have seen other versions of these steps before, but many of them are incomplete, confusing, or wrong; for example, someone even stopped the cluster to do it! Adding Nodes In the

Useful Hadoop ToolRunner

Developers are quite often pissed off by the following things: When you write job configuration in the map and reduce code, you need to repackage everything whenever the parameters change. You need to

Hive vs. Pig

Both are top-level Apache projects for processing data in Hadoop. Here, I try to compare their differences. Below is a picture I found (I cannot find the original link, but there is a mirror here). In addition,

When to Disable Speculative Execution

Background: This is the link from WikiMedia about what speculative execution is. In Hadoop, the following parameters control this setting, and they are true by default. mapred.map.tasks.specul

MRUnit for Now

Cloudera MRUnit helps with unit testing MapReduce programs. Below is what it supports so far. The MapDriver and ReduceDriver support only a single key as input, which can make it more cumbersom

Hive Sorting and Ordering

Hive uses the following keywords to sort data, with the following differences: ORDER BY (ASC|DESC): This is similar to the traditional SQL operator. Sorted order is maintained across all of the

Sorting of Big Data

If you want to sort a big data set by keys, there are the following ways to do it. Global sort (sorting on single keys) by MapReduce scripts: get a sample of the data to find out the data ranges by keys; sort data
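
To make the sampling idea concrete, here is a minimal sketch in plain Python (not actual MapReduce code). It assumes the usual total-order approach: sample the keys to pick range boundaries, send each key to the partition for its range, sort each partition independently, and concatenate the results. The function names `pick_split_points`, `range_partition_sort`, and `global_sort` are hypothetical and chosen only for this illustration.

```python
import random
from bisect import bisect_right


def pick_split_points(keys, num_partitions, sample_size=100, seed=0):
    """Sample the keys and pick num_partitions - 1 range boundaries."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # Evenly spaced positions in the sorted sample approximate the key ranges.
    return [sample[len(sample) * i // num_partitions]
            for i in range(1, num_partitions)]


def range_partition_sort(keys, num_partitions):
    """Route each key to its range, then sort every range on its own."""
    splits = pick_split_points(keys, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        # bisect_right finds which range the key falls into.
        partitions[bisect_right(splits, key)].append(key)
    # In a real job, each reducer would sort its own partition.
    return [sorted(part) for part in partitions]


def global_sort(keys, num_partitions=4):
    """Concatenate the sorted partitions; ranges are ordered, so the result is too."""
    result = []
    for part in range_partition_sort(keys, num_partitions):
        result.extend(part)
    return result
```

Because every key in one partition is no larger than any key in the next, the concatenated output is globally sorted even though each partition was sorted separately. A poor sample only unbalances the partition sizes; it does not affect correctness.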

Commonly Used Hive Setting

I list only the most commonly used ones. Enable dynamic partitioning (a column name): SET hive.exec.dynamic.partition=true; Make inserted rows be saved in sampled (bucketed) format: SET hive.enforce.bucketing = true; Reduce

Git Cheat Sheet

Here I am sharing two good Git cheat sheets, as follows. http://ndpsoftware.com/git-cheatsheet.html