archive: 2012

Hadoop distcp Utility

Hadoop distcp creates a map-only job to copy data across clusters. Copy the weblogs folder from cluster A to cluster B: hadoop distcp hdfs://namenodeA/data/weblogs hdfs://namenodeB/data/weblogs Copy t

Hadoop Job Patterns

To handle complex Hadoop jobs, such as a DAG of jobs, you can use Oozie. Here, I am talking about the native support of job handlers in the Hadoop MapReduce framework. The limitation is that it only s

A Tiny Scrum Overview

Three roles: Product Owner, Scrum Master, Team. The Product Owner: owns and prioritizes the Product Backlog; defines and prioritizes features; owns the gathering of requirements; agrees to iteration groun

Hadoop Customize Input Output Format

To customize the InputFormat, we need to do the following: Create a customized InputFormat class by extending a Hadoop InputFormat with your own class, such as public class LogFileInputFormat extends Fil

Hadoop Split and Block

There are two easily confused concepts in Hadoop: block and split. Block - A Physical Division. HDFS was designed to hold and manage large amounts of data; the default block size is 64 MB. That means if a 128-MB

Hadoop Serialization Framework

Below are some points mentioned in Hadoop in Practice and the MapReduce Cookbook. Where to use serialization: In order to be used as a value data type of a MapReduce computation, a data type must implement the or

Rsync for HBase/Hadoop Cluster Deployment

Create a simple rsync script for HBase/Hadoop deployment. Create a cluster-deploy.sh script as follows: $ vi cluster-deploy.sh #!/bin/bash # Sync HBASE_HOME across the cluster. Must run on ma

Hive Performance Tuning - No. of MapReduce

1. Set a proper number of map tasks. Most of the time, a job generates one or more map tasks depending on its input directories. There are factors, such as the number of input files, the size of input files

Hive Regular Expression SerDe

Unless there is no way to use the built-in parsers, I do not recommend writing a user-defined SerDe. Do not forget that Hive comes with a contrib RegexSerDe class, which can tokenize your logs/files to resol

Little About MapReduce Combiner

A combiner is used to reduce the amount of map output shuffled to the reducers, which can noticeably improve overall performance. There are two points to pay attention to when using it. Your map and reduc