
Apache Hadoop Deployment: A Blueprint for Reliable Distributed Computing

This Refcard presents a basic blueprint for deploying Apache Hadoop HDFS and MapReduce in development and production environments. Check out DZone Refcard #117, Getting Started with Apache Hadoop, for basic terminology and an overview of the tools available in the Hadoop Project.

Deploying Hadoop Live Code

All the bash shell commands in the Refcard are listed below, ready to copy and paste directly into your console session.

  1. Hadoop SSH Prerequisites
  2. Ubuntu Pre-Install Setup
  3. Red Hat Pre-Install Setup
  4. Set the Hadoop Run-Time Environment
  5. Pseudo-Distributed Operation Config
  6. Testing the Hadoop Installation
  7. Job Completion and Daemon Termination
  8. Installing CDH
  9. Adding Optional Components
  10. Starting the CDH Daemons
  11. Testing the CDH Installation
  12. Fetching Daemon Metrics
  13. Set the Production Configuration
  14. Create a New File System
  15. Stop All Daemons
  16. File System Setup
  17. Starting a Node Example
  18. Set the MapReduce Directory Up
  19. Minimal HDFS Config Update
  20. Minimal MapReduce Config Update

This handy Hadoop reference card is available for download from DZone.

Here are the links to the companion Getting Started with Apache Hadoop, Scalability and High Availability Systems, and NoSQL and Data Scalability Refcardz - enjoy!


All Refcard Listings Ready to Paste

Listing 1 - Hadoop SSH Prerequisites

keyFile=$HOME/.ssh/id_rsa.pub
pKeyFile=$HOME/.ssh/id_rsa
authKeys=$HOME/.ssh/authorized_keys
if ! ssh localhost -C true ; then \
  if [ ! -e "$keyFile" ]; then \
    ssh-keygen -t rsa -b 2048 -P '' \
       -f "$pKeyFile"; \
  fi; \
  cat "$keyFile" >> "$authKeys"; \
  chmod 0640 "$authKeys"; \
  echo "Hadoop SSH configured"; \
else echo "Hadoop SSH OK"; fi
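
Once the key is in place, a quick way to confirm that passwordless SSH actually works - a simple check, not part of the original Refcard listing - is to run a command over the loopback connection:

# Should print the hostname without asking for a password
ssh localhost hostname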

Listing 2 - Ubuntu Pre-install Setup

DISTRO=$(lsb_release -c | cut -f 2)
REPO=/etc/apt/sources.list.d/cloudera.list
echo "deb \
http://archive.cloudera.com/debian \
    $DISTRO-cdh3 contrib" > "$REPO"
echo "deb-src \
http://archive.cloudera.com/debian \
    $DISTRO-cdh3 contrib" >> "$REPO"
apt-get update

Listing 3 - Red Hat Pre-install Setup

curl -sL http://is.gd/3ynKY7 | tee \
    /etc/yum.repos.d/cloudera-cdh3.repo | \
    awk '/^name/'
yum update yum
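
With either repository registered, it can help to confirm that the CDH3 packages are visible before installing anything. A minimal check, not part of the original Refcard, for each distribution:

# Ubuntu / Debian
apt-cache search hadoop | head

# Red Hat / CentOS
yum list available 'hadoop*' | head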

Listing 4 - Set the Hadoop Runtime Environment

version=0.20.2  # change if needed
identity="hadoop-dev"
runtimeEnv="runtime/conf/hadoop-env.sh"
ln -s hadoop-"$version" runtime
ln -s runtime/logs .
export HADOOP_HOME="$HOME"
cp "$runtimeEnv" "$runtimeEnv".org
echo "export \
HADOOP_SLAVES=$HADOOP_HOME/slaves" \
>> "$runtimeEnv"
mkdir "$HADOOP_HOME"/slaves
echo \
"export HADOOP_IDENT_STRING=$identity" >> \
"$runtimeEnv"
echo \
"export JAVA_HOME=$JAVA_HOME" \
>> "$runtimeEnv"
export \
PATH=$PATH:"$HADOOP_HOME"/runtime/bin
unset version; unset identity; unset runtimeEnv
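
With the runtime symlinked and the PATH extended, a quick sanity check - an extra step, not in the original listing - is to ask the newly installed binaries for their version:

# Should report Hadoop 0.20.2 if the symlink and PATH are set correctly
hadoop version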

Listing 5 - Pseudo-Distributed Operation Config

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
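
These three snippets go into the runtime/conf directory prepared in Listing 4. On a fresh install the HDFS name node also has to be formatted once before the daemons in Listing 6 are started; a minimal sketch, assuming the layout from Listing 4:

# One-time step: initialize the HDFS metadata for the pseudo-cluster
hadoop namenode -format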

Listing 6 - Testing the Hadoop Installation

start-all.sh ; sleep 5
hadoop fs -put runtime/conf input
hadoop jar runtime/hadoop-*-examples.jar \
  grep input output 'dfs[a-z.]+'
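
While the example job runs, the daemons launched by start-all.sh also expose web consoles; in Hadoop 0.20 the NameNode UI listens on port 50070 and the JobTracker UI on port 50030 by default (assuming no port overrides beyond the config in Listing 5):

# Open in a browser, or probe from the shell:
curl -s http://localhost:50070/   # NameNode status page
curl -s http://localhost:50030/   # JobTracker status page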

Listing 7 - Job Completion and Daemon Termination

hadoop fs -cat output/*
stop-all.sh

Listing 8 - Installing CDH

ver="0.20"
command="/usr/bin/aptitude"
if [ ! -e "$command" ];
then command="/usr/bin/yum"; fi
"$command" install \
hadoop-"$ver"-conf-pseudo
unset command ; unset ver

Listing 9 - Adding Optional Components

apt-get install hadoop-pig
apt-get install flume
apt-get install sqoop

Listing 10 - Starting the CDH Daemons

for s in /etc/init.d/hadoop* ; do \
"$s" start; done
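
To confirm that the daemons actually came up, one option - a verification step, not part of the original Refcard - is to list the running Hadoop JVMs with the JDK's jps utility (run it as root, or as the hdfs and mapred users that own the daemons):

# Expect to see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker
jps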

Listing 11 - Testing the CDH Installation

hadoop fs -ls /
# run a job:
pushd /usr/lib/hadoop
hadoop fs -put /etc/hadoop/conf input
hadoop fs -ls input
hadoop jar hadoop-*-examples.jar \
  grep input output 'dfs[a-z.]+'
# Validate it ran OK:
hadoop fs -cat output/*

Listing 12 - Fetching Daemon Metrics

http://localhost:50070/metrics?format=json
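
The URL above can be opened in a browser; to pull the same metrics from a script, a minimal sketch (note the quotes, which keep the shell from interpreting the ? in the URL):

# Dump the NameNode metrics as JSON to the console
curl -s 'http://localhost:50070/metrics?format=json'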

Listing 13 - Set the Production Configuration

ver="0.20"
prodConf="/etc/hadoop-$ver/conf.prod"
cp -Rfv /etc/hadoop-"$ver"/conf.empty \
"$prodConf"
chown hadoop:hadoop "$prodConf"
# activate the new configuration:
alt="/usr/sbin/update-alternatives"
if [ ! -e "$alt" ]; then alt="/usr/sbin/alternatives"; fi
"$alt" --install /etc/hadoop-"$ver"/conf \
hadoop-"$ver"-conf "$prodConf" 50
for h in /etc/init.d/hadoop-"$ver"-*; do \
"$h" restart; done
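
To double-check which configuration directory the alternatives system now points at - a verification step, not part of the original listing - query it by the same name used in the --install call (use /usr/sbin/alternatives instead on Red Hat, as in the listing):

# Show the registered alternatives and the currently active conf directory
/usr/sbin/update-alternatives --display hadoop-0.20-conf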

Listing 14 - Create a New File System

sudo -u hdfs hadoop namenode -format

Listing 15 - Stop All Daemons

# Run this on every node
ver=0.20
for h in /etc/init.d/hadoop-"$ver"-*; do
  "$h" stop
  # Optional command for auto-start:
  update-rc.d "$(basename "$h")" defaults
done

Listing 16 - File System Setup

mkdir -p /data/1/dfs/nn /data/2/dfs/nn
mkdir -p /data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
mkdir -p /data/1/mapred/local \
/data/2/mapred/local
chown -R hdfs:hadoop /data/1/dfs/nn \
/data/2/dfs/nn /data/1/dfs/dn \
/data/2/dfs/dn /data/3/dfs/dn \
/data/4/dfs/dn
chown -R mapred:hadoop \
/data/1/mapred/local \
/data/2/mapred/local
chmod -R 755 /data/1/dfs/nn \
/data/2/dfs/nn \
/data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
chmod -R 755 /data/1/mapred/local \
/data/2/mapred/local

Listing 17 - Starting a Node Example

ver="0.20"
/etc/init.d/hadoop-"$ver"-namenode start

Listing 18 - Set the MapReduce Directory Up

sudo -u hdfs hadoop fs -mkdir \
/mapred/system
sudo -u hdfs hadoop fs -chown mapred \
/mapred/system
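
A quick way to confirm the system directory exists with the right owner - a check, not part of the original listing - is:

# Should list /mapred/system owned by the mapred user
sudo -u hdfs hadoop fs -ls /mapred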

Listing 19 - Minimal HDFS Config Update

<!-- hdfs-site.xml -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
  <final>true</final>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
  <final>true</final>
</property>

Listing 20 - Minimal MapReduce Config Update

<!-- mapred-site.xml -->
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local</value>
  <final>true</final>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/mapred/system</value>
  <final>true</final>
</property>

Acknowledgements

Thanks to Pavel Dovbush for his assistance in setting up the live code examples on this web page. Pavel is a Rich Internet Application design and implementation guru.