tag: hadoop

Hadoop Multiple Input and Output

The following is an example of using multiple inputs (org.apache.hadoop.mapreduce.lib.input.MultipleInputs) with different input formats and different mapper implementations. MultipleInputs.addInputPath …
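
Since the excerpt is cut off, here is a minimal sketch of the pattern, assuming the Hadoop 2.x mapreduce API; the driver class, mapper classes, and tag values are illustrative placeholders, not from the original post:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MultiInputDriver {

        // Mapper for plain text input: tag each line with its source.
        public static class TextLogMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(new Text("text"), value);
            }
        }

        // Mapper for sequence-file input with Text keys and values.
        public static class SeqLogMapper
                extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(new Text("seq"), value);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "multi-input");
            job.setJarByClass(MultiInputDriver.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // Each input path is bound to its own InputFormat and its own Mapper.
            MultipleInputs.addInputPath(job, new Path(args[0]),
                    TextInputFormat.class, TextLogMapper.class);
            MultipleInputs.addInputPath(job, new Path(args[1]),
                    SequenceFileInputFormat.class, SeqLogMapper.class);
            FileOutputFormat.setOutputPath(job, new Path(args[2]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }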

Hadoop Customize Data Type

Customize Data Type - As Value: To create a customized data type used as a value, the data type must implement the org.apache.hadoop.io.Writable interface, which consists of two methods: readFields(DataInput) and write(DataOutput). …
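
A minimal sketch of such a value type (the Point class and its fields are an illustrative example, not from the post):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class Point implements Writable {
        private double x;
        private double y;

        public Point() {}                 // required: Hadoop instantiates via reflection

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeDouble(x);           // serialize fields in a fixed order
            out.writeDouble(y);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            x = in.readDouble();          // deserialize in the same order
            y = in.readDouble();
        }
    }

If the type were also used as a key, it would need to implement WritableComparable instead, since keys must be sortable.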

Hadoop Balancer

Whenever nodes are added to the cluster or a large amount of data is deleted, we need to run the Hadoop balancer to rebalance the data across the datanodes. Otherwise, the over-utilized datanodes will become the bottleneck. …
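
For reference, the balancer is run from the command line; the threshold (the allowed deviation, in percent, of a datanode's disk usage from the cluster average) shown here is just an example value:

    hadoop balancer -threshold 10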

Adding or Removing Hadoop Nodes

I am here to give the steps completely. I have seen some versions of these steps before, but many of them are incomplete, confusing, or wrong, e.g. someone even stopped the cluster to do it! Adding Nodes: In the …
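
The steps themselves are cut off above; purely as a hedged sketch, on a Hadoop 1.x cluster the daemons on a freshly added node are typically brought up with the commands below (your version and the author's exact procedure may differ):

    hadoop-daemon.sh start datanode
    hadoop-daemon.sh start tasktracker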

Useful Hadoop ToolRunner

Developers are pissed off by the following things quite often: when you write job configuration inside the map and reduce code, you need to repackage everything whenever a parameter changes; you need to …
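
A minimal sketch of the standard ToolRunner pattern (the class name and paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // getConf() already contains any -D key=value options that
            // ToolRunner parsed from the command line.
            Job job = Job.getInstance(getConf(), "my-job");
            job.setJarByClass(MyJob.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner strips the generic options (-D, -files, -libjars, ...)
            // before passing the remaining args to run().
            System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
        }
    }

With this pattern a parameter can be changed at submit time without repackaging the jar, e.g. hadoop jar myjob.jar MyJob -D mapred.reduce.tasks=4 in out.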

Sorting of Big Data

If you want to sort a big data set by keys, there are the following ways to do it. Global sort (sorting on a single key) by MapReduce: get a sample of the data to find out the key ranges; sort the data …
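
This sample-then-partition approach is the classic sampling-based global sort; below is a minimal driver sketch assuming SequenceFile input with Text keys and values (the paths, reducer count, and sampling parameters are example values):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class GlobalSortDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "global-sort");
            job.setJarByClass(GlobalSortDriver.class);
            // Keys and values pass through the identity map and reduce;
            // the shuffle itself does the sorting by key.
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
            SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setNumReduceTasks(4);
            // Step 1: sample the input to compute the key-range boundaries.
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                    new Path("/tmp/partitions"));
            job.setPartitionerClass(TotalOrderPartitioner.class);
            InputSampler.Sampler<Text, Text> sampler =
                    new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 100);
            InputSampler.writePartitionFile(job, sampler);
            // Step 2: each reducer receives a contiguous, sorted key range, so
            // concatenating the reducer outputs yields a globally sorted set.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }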

Hadoop distcp Utility

Hadoop distcp creates a map-only job to copy data across clusters. Copy the weblogs folder from cluster A to cluster B: hadoop distcp hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs. Copy the …
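
distcp also takes standard options; for example, -update (a standard distcp flag, though not necessarily the one in the truncated text) copies only the files that are missing or changed at the destination:

    hadoop distcp -update hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs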

Hadoop Job Patterns

In order to handle complex Hadoop jobs, such as a DAG of jobs, you can use Oozie. Here, I am talking about the native support for job control in the Hadoop MapReduce framework. The limitation is that it only supports …
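
A minimal sketch of that native support, assuming it refers to org.apache.hadoop.mapreduce.lib.jobcontrol (job1 and job2 stand for pre-configured Job instances):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class JobChain {
        public static void runChain(Job job1, Job job2) throws Exception {
            ControlledJob step1 = new ControlledJob(job1.getConfiguration());
            step1.setJob(job1);
            ControlledJob step2 = new ControlledJob(job2.getConfiguration());
            step2.setJob(job2);
            step2.addDependingJob(step1);        // step2 starts only after step1 succeeds
            JobControl control = new JobControl("job-chain");
            control.addJob(step1);
            control.addJob(step2);
            Thread runner = new Thread(control); // JobControl implements Runnable
            runner.start();
            while (!control.allFinished()) {     // poll until the whole graph is done
                Thread.sleep(1000);
            }
            control.stop();
        }
    }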

Hadoop Customize Input Output Format

To customize the input format, we need to do the following: create a customized InputFormat class by extending a Hadoop InputFormat with your own record types, such as public class LogFileInputFormat extends FileInputFormat …
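
Since the class declaration is cut off, here is a minimal sketch of that shape; delegating to LineRecordReader is my placeholder, and a real log format would supply its own RecordReader:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class LogFileInputFormat extends FileInputFormat<LongWritable, Text> {
        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException {
            // A real log format would return a custom RecordReader that parses
            // one log record per call; LineRecordReader keeps the sketch runnable.
            return new LineRecordReader();
        }
    }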

Hadoop Split and Block

There are two confusing concepts in Hadoop: block and split. Block - A Physical Division: HDFS was designed to hold and manage large amounts of data; the default block size is 64 MB. That means if a 128-MB file is stored in HDFS, it is divided into two 64-MB blocks. …
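
Where the block is the physical HDFS storage unit, the split is the logical chunk handed to a single mapper, and the two can be tuned independently; a minimal sketch (the 32-MB cap is an arbitrary example value):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void configure(Job job) {
            // Capping the split size changes how many mappers run,
            // without changing how the file is physically stored in HDFS.
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        }
    }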