Apache Hadoop Deployment: A Blueprint for Reliable Distributed Computing
This Refcard presents a basic blueprint for deploying Apache Hadoop HDFS and MapReduce in development and production environments. See DZone Refcard #117, Getting Started with Apache Hadoop, for basic terminology and an overview of the tools available in the Hadoop project.
Deploying Hadoop Live Code
All the bash shell commands in this Refcard are listed below, ready to copy and paste directly into your console session.
- Hadoop SSH Prerequisites
- Ubuntu Pre-Install Setup
- Red Hat Pre-Install Setup
- Set the Hadoop Runtime Environment
- Pseudo-Distributed Operation Config
- Testing the Hadoop Installation
- Job Completion and Daemon Termination
- Installing CDH
- Adding Optional Components
- Starting the CDH Daemons
- Testing the CDH Installation
- Fetching Daemon Metrics
- Set the Production Configuration
- Create a New File System
- Stop All Daemons
- File System Setup
- Starting a Node Example
- Set Up the MapReduce Directory
- Minimal HDFS Config Update
- Minimal MapReduce Config Update
Here are the links to the companion Refcardz: Getting Started with Apache Hadoop, Scalability and High Availability Systems, and NoSQL and Data Scalability - enjoy!
Scalable Systems Newsletter
Subscribe to the newsletter to get every issue delivered free, with the latest system scalability, high availability, and performance news.
All Refcard Listings Ready to Paste
Listing 1 - Hadoop SSH Prerequisites
# Generate an RSA key pair if needed and authorize passwordless
# SSH logins to localhost, as required by the Hadoop start scripts.
keyFile=$HOME/.ssh/id_rsa.pub
pKeyFile=$HOME/.ssh/id_rsa
authKeys=$HOME/.ssh/authorized_keys
if ! ssh -o BatchMode=yes localhost true ; then \
if [ ! -e "$keyFile" ]; then \
ssh-keygen -t rsa -b 2048 -P '' \
-f "$pKeyFile"; \
fi; \
cat "$keyFile" >> "$authKeys"; \
chmod 0640 "$authKeys"; \
echo "Hadoop SSH configured"; \
else echo "Hadoop SSH OK"; fi
Listing 2 - Ubuntu Pre-install Setup
DISTRO=$(lsb_release -c | cut -f 2)
REPO=/etc/apt/sources.list.d/cloudera.list
echo "deb \
http://archive.cloudera.com/debian \
$DISTRO-cdh3 contrib" > "$REPO"
echo "deb-src \
http://archive.cloudera.com/debian \
$DISTRO-cdh3 contrib" >> "$REPO"
apt-get update
Listing 3 - Red Hat Pre-install Setup
curl -sL http://is.gd/3ynKY7 | tee \
/etc/yum.repos.d/cloudera-cdh3.repo | \
awk '/^name/'
yum update yum
Listing 4 - Set the Hadoop Runtime Environment
version=0.20.2 # change if needed
identity="hadoop-dev"
runtimeEnv="runtime/conf/hadoop-env.sh"
ln -s hadoop-"$version" runtime
ln -s runtime/logs .
export HADOOP_HOME="$HOME"
cp "$runtimeEnv" "$runtimeEnv".org
echo "export \
HADOOP_SLAVES=$HADOOP_HOME/slaves" \
>> "$runtimeEnv"
# HADOOP_SLAVES must name a file listing slave hosts, not a directory:
echo "localhost" > "$HADOOP_HOME"/slaves
echo \
"export HADOOP_IDENT_STRING=$identity" >> \
"$runtimeEnv"
echo \
"export JAVA_HOME=$JAVA_HOME" \
>> "$runtimeEnv"
export \
PATH=$PATH:"$HADOOP_HOME"/runtime/bin
unset version; unset identity; unset runtimeEnv
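A quick way to confirm the runtime is wired up correctly (a minimal check, assuming the PATH export above is in effect in the current shell):
hadoop version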
Listing 5 - Pseudo-Distributed Operation Config
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
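On a fresh installation, HDFS typically needs to be formatted before the daemons are started for the first time; a minimal sketch, assuming the pseudo-distributed configuration above and the runtime/bin directory from Listing 4 on the PATH:
hadoop namenode -format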
Listing 6 - Testing the Hadoop Installation
start-all.sh ; sleep 5
hadoop fs -put runtime/conf input
hadoop jar runtime/hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'
Listing 7 - Job Completion and Daemon Termination
hadoop fs -cat output/*
stop-all.sh
Listing 8 - Installing CDH
ver="0.20"
command="/usr/bin/aptitude"
if [ ! -e "$command" ];
then command="/usr/bin/yum"; fi
"$command" install \
hadoop-"$ver"-conf-pseudo
unset command ; unset ver
Listing 9 - Adding Optional Components
apt-get install hadoop-pig
apt-get install flume
apt-get install sqoop
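On Red Hat-derived systems, the same components are typically available through yum once the Cloudera repository from Listing 3 is configured; a sketch, assuming the CDH3 package names match the apt packages above:
yum install hadoop-pig
yum install flume
yum install sqoop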
Listing 10 - Starting the CDH Daemons
for s in /etc/init.d/hadoop* ; do \
"$s" start; done
Listing 11 - Testing the CDH Installation
hadoop fs -ls /
# run a job:
pushd /usr/lib/hadoop
hadoop fs -put /etc/hadoop/conf input
hadoop fs -ls input
hadoop jar hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'
# Validate it ran OK:
hadoop fs -cat output/*
Listing 12 - Fetching Daemon Metrics
http://localhost:50070/metrics?format=json
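The metrics servlet can also be queried from the command line; a sketch, assuming the daemons run on localhost with default ports (the JobTracker exposes the same servlet on port 50030):
curl -s 'http://localhost:50070/metrics?format=json'
curl -s 'http://localhost:50030/metrics?format=json'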
Listing 13 - Set the Production Configuration
ver="0.20"
prodConf="/etc/hadoop-$ver/conf.prod"
cp -Rfv /etc/hadoop-"$ver"/conf.empty \
"$prodConf"
chown hadoop:hadoop "$prodConf"
# activate the new configuration:
alt="/usr/sbin/update-alternatives"
if [ ! -e "$alt" ]; then alt="/usr/sbin/alternatives"; fi
"$alt" --install /etc/hadoop-"$ver"/conf \
hadoop-"$ver"-conf "$prodConf" 50
for h in /etc/init.d/hadoop-"$ver"-*; do \
"$h" restart; done
Listing 14 - Create a New File System
sudo -u hdfs hadoop namenode -format
Listing 15 - Stop All Daemons
# Run this on every node
ver=0.20
for h in /etc/init.d/hadoop-"$ver"-*; do \
"$h" stop ;\
# Optional: register the service for auto-start
# (Debian/Ubuntu; use chkconfig on Red Hat):
update-rc.d "$(basename "$h")" defaults; \
done
Listing 16 - File System Setup
mkdir -p /data/1/dfs/nn /data/2/dfs/nn
mkdir -p /data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
mkdir -p /data/1/mapred/local \
/data/2/mapred/local
chown -R hdfs:hadoop /data/1/dfs/nn \
/data/2/dfs/nn /data/1/dfs/dn \
/data/2/dfs/dn /data/3/dfs/dn \
/data/4/dfs/dn
chown -R mapred:hadoop \
/data/1/mapred/local \
/data/2/mapred/local
chmod -R 755 /data/1/dfs/nn \
/data/2/dfs/nn \
/data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
chmod -R 755 /data/1/mapred/local \
/data/2/mapred/local
Listing 17 - Starting a Node Example
ver="0.20"
/etc/init.d/hadoop-"$ver"-namenode start
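The remaining HDFS daemons follow the same pattern; a sketch, assuming the standard CDH3 init script names (the datanode command is run on each worker node):
ver="0.20"
/etc/init.d/hadoop-"$ver"-datanode start
/etc/init.d/hadoop-"$ver"-secondarynamenode start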
Listing 18 - Set Up the MapReduce Directory
sudo -u hdfs hadoop fs -mkdir \
/mapred/system
sudo -u hdfs hadoop fs -chown mapred \
/mapred/system
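Once the system directory exists, the MapReduce daemons can be started the same way; a sketch, again assuming the CDH3 init script names (the tasktracker command is run on each worker node):
ver="0.20"
/etc/init.d/hadoop-"$ver"-jobtracker start
/etc/init.d/hadoop-"$ver"-tasktracker start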
Listing 19 - Minimal HDFS Config Update
<property>
<name>dfs.name.dir</name>
<value>/data/1/dfs/nn,/data/2/dfs/nn
</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>
/data/1/dfs/dn,/data/2/dfs/dn,
/data/3/dfs/dn,/data/4/dfs/dn
</value>
<final>true</final>
</property>
Listing 20 - Minimal MapReduce Config Update
<!-- mapred-site.xml -->
<property>
<name>mapred.local.dir</name>
<value>/data/1/mapred/local,/data/2/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/mapred/system</value>
<final>true</final>
</property>
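After restarting the daemons with the updated configuration, the cluster can be sanity-checked from the command line; a sketch, assuming HDFS is up and the hdfs user exists:
sudo -u hdfs hadoop dfsadmin -report
sudo -u hdfs hadoop fsck /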
Acknowledgements
Thanks to Pavel Dovbush for his help in setting up the live code examples on this web page. Pavel is a Rich Internet Application design and implementation guru.