Apache Hadoop Deployment: A Blueprint for Reliable Distributed Computing
This Refcard presents a basic blueprint for deploying Apache Hadoop HDFS and MapReduce in development and production environments. See DZone Refcard #117, Getting Started with Apache Hadoop, for basic terminology and an overview of the tools available in the Hadoop project.
Deploying Hadoop Live Code
All the bash shell commands in this Refcard are listed below, ready to copy and paste directly into your console session.
- Hadoop SSH Prerequisites
- Ubuntu Pre-Install Setup
- Red Hat Pre-Install Setup
- Set the Hadoop Runtime Environment
- Pseudo-Distributed Operation Config
- Testing the Hadoop Installation
- Job Completion and Daemon Termination
- Installing CDH
- Adding Optional Components
- Starting the CDH Daemons
- Testing the CDH Installation
- Fetching Daemon Metrics
- Set the Production Configuration
- Create a New File System
- Stop All Daemons
- File System Setup
- Starting a Node Example
- Set Up the MapReduce Directory
- Minimal HDFS Config Update
- Minimal MapReduce Config Update
Here are the links to the companion Refcardz: Getting Started with Apache Hadoop, Scalability and High Availability Systems, and NoSQL and Data Scalability - enjoy!
Scalable Systems Newsletter
Subscribe to the newsletter to get every issue delivered free, with the latest system scalability, high availability, and performance news.
All Refcard Listings Ready to Paste
Listing 1 - Hadoop SSH Prerequisites
# Generate an RSA key pair if needed and authorize passwordless
# SSH logins to localhost, as required by the Hadoop start scripts.
keyFile=$HOME/.ssh/id_rsa.pub
pKeyFile=$HOME/.ssh/id_rsa
authKeys=$HOME/.ssh/authorized_keys
if ! ssh -o BatchMode=yes localhost true ; then \
if [ ! -e "$keyFile" ]; then \
ssh-keygen -t rsa -b 2048 -P '' \
-f "$pKeyFile"; \
fi; \
cat "$keyFile" >> "$authKeys"; \
chmod 0640 "$authKeys"; \
echo "Hadoop SSH configured"; \
else echo "Hadoop SSH OK"; fi
Listing 2 - Ubuntu Pre-install Setup
DISTRO=$(lsb_release -c | cut -f 2)
REPO=/etc/apt/sources.list.d/cloudera.list
echo "deb \
http://archive.cloudera.com/debian \
$DISTRO-cdh3 contrib" > "$REPO"
echo "deb-src \
http://archive.cloudera.com/debian \
$DISTRO-cdh3 contrib" >> "$REPO"
apt-get update
Listing 3 - Red Hat Pre-install Setup
curl -sL http://is.gd/3ynKY7 | tee \
/etc/yum.repos.d/cloudera-cdh3.repo | \
awk '/^name/'
yum update yum
Listing 4 - Set the Hadoop Runtime Environment
version=0.20.2 # change if needed
identity="hadoop-dev"
runtimeEnv="runtime/conf/hadoop-env.sh"
ln -s hadoop-"$version" runtime
ln -s runtime/logs .
export HADOOP_HOME="$HOME"
cp "$runtimeEnv" "$runtimeEnv".org
echo "export \
HADOOP_SLAVES=$HADOOP_HOME/slaves" \
>> "$runtimeEnv"
# HADOOP_SLAVES must name a file listing slave hosts, not a directory:
echo "localhost" > "$HADOOP_HOME"/slaves
echo \
"export HADOOP_IDENT_STRING=$identity" >> \
"$runtimeEnv"
echo \
"export JAVA_HOME=$JAVA_HOME" \
>> "$runtimeEnv"
export \
PATH=$PATH:"$HADOOP_HOME"/runtime/bin
unset version; unset identity; unset runtimeEnv
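A quick way to confirm the runtime is wired up correctly (a minimal check, assuming the PATH export above is in effect in the current shell):
hadoop version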
Listing 5 - Pseudo-Distributed Operation Config
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
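On a fresh installation, HDFS typically needs to be formatted before the daemons are started for the first time; a minimal sketch, assuming the pseudo-distributed configuration above and the runtime/bin directory from Listing 4 on the PATH:
hadoop namenode -format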
Listing 6 - Testing the Hadoop Installation
start-all.sh ; sleep 5
hadoop fs -put runtime/conf input
hadoop jar runtime/hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'
Listing 7 - Job Completion and Daemon Termination
hadoop fs -cat output/*
stop-all.sh
Listing 8 - Installing CDH
ver="0.20"
command="/usr/bin/aptitude"
if [ ! -e "$command" ];
then command="/usr/bin/yum"; fi
"$command" install \
hadoop-"$ver"-conf-pseudo
unset command ; unset ver
Listing 9 - Adding Optional Components
apt-get install hadoop-pig
apt-get install flume
apt-get install sqoop
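On Red Hat-derived systems, the same components are typically available through yum once the Cloudera repository from Listing 3 is configured; a sketch, assuming the CDH3 package names match the apt packages above:
yum install hadoop-pig
yum install flume
yum install sqoop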
Listing 10 - Starting the CDH Daemons
for s in /etc/init.d/hadoop* ; do \
"$s" start; done
Listing 11 - Testing the CDH Installation
hadoop fs -ls /
# run a job:
pushd /usr/lib/hadoop
hadoop fs -put /etc/hadoop/conf input
hadoop fs -ls input
hadoop jar hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'
# Validate it ran OK:
hadoop fs -cat output/*
Listing 12 - Fetching Daemon Metrics
http://localhost:50070/metrics?format=json
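The metrics servlet can also be queried from the command line; a sketch, assuming the daemons run on localhost with default ports (the JobTracker exposes the same servlet on port 50030):
curl -s 'http://localhost:50070/metrics?format=json'
curl -s 'http://localhost:50030/metrics?format=json'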
Listing 13 - Set the Production Configuration
ver="0.20"
prodConf="/etc/hadoop-$ver/conf.prod"
cp -Rfv /etc/hadoop-"$ver"/conf.empty \
"$prodConf"
chown hadoop:hadoop "$prodConf"
# activate the new configuration:
alt="/usr/sbin/update-alternatives"
if [ ! -e "$alt" ]; then alt="/usr/sbin/alternatives"; fi
"$alt" --install /etc/hadoop-"$ver"/conf \
hadoop-"$ver"-conf "$prodConf" 50
for h in /etc/init.d/hadoop-"$ver"-*; do \
"$h" restart; done
Listing 14 - Create a New File System
sudo -u hdfs hadoop namenode -format
Listing 15 - Stop All Daemons
# Run this on every node
ver=0.20
for h in /etc/init.d/hadoop-"$ver"-*; do \
"$h" stop ;\
# Optional: register the service for auto-start
# (Debian/Ubuntu; use chkconfig on Red Hat):
update-rc.d "$(basename "$h")" defaults; \
done
Listing 16 - File System Setup
mkdir -p /data/1/dfs/nn /data/2/dfs/nn
mkdir -p /data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
mkdir -p /data/1/mapred/local \
/data/2/mapred/local
chown -R hdfs:hadoop /data/1/dfs/nn \
/data/2/dfs/nn /data/1/dfs/dn \
/data/2/dfs/dn /data/3/dfs/dn \
/data/4/dfs/dn
chown -R mapred:hadoop \
/data/1/mapred/local \
/data/2/mapred/local
chmod -R 755 /data/1/dfs/nn \
/data/2/dfs/nn \
/data/1/dfs/dn /data/2/dfs/dn \
/data/3/dfs/dn /data/4/dfs/dn
chmod -R 755 /data/1/mapred/local \
/data/2/mapred/local
Listing 17 - Starting a Node Example
ver="0.20"
/etc/init.d/hadoop-"$ver"-namenode start
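The remaining HDFS daemons follow the same pattern; a sketch, assuming the standard CDH3 init script names (the datanode command is run on each worker node):
ver="0.20"
/etc/init.d/hadoop-"$ver"-datanode start
/etc/init.d/hadoop-"$ver"-secondarynamenode start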
Listing 18 - Set Up the MapReduce Directory
sudo -u hdfs hadoop fs -mkdir \
/mapred/system
sudo -u hdfs hadoop fs -chown mapred \
/mapred/system
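Once the system directory exists, the MapReduce daemons can be started the same way; a sketch, again assuming the CDH3 init script names (the tasktracker command is run on each worker node):
ver="0.20"
/etc/init.d/hadoop-"$ver"-jobtracker start
/etc/init.d/hadoop-"$ver"-tasktracker start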
Listing 19 - Minimal HDFS Config Update
<property>
<name>dfs.name.dir</name>
<value>/data/1/dfs/nn,/data/2/dfs/nn
</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>
/data/1/dfs/dn,/data/2/dfs/dn,
/data/3/dfs/dn,/data/4/dfs/dn
</value>
<final>true</final>
</property>
Listing 20 - Minimal MapReduce Config Update
<!-- mapred-site.xml -->
<property>
<name>mapred.local.dir</name>
<value>/data/1/mapred/local,/data/2/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/mapred/system</value>
<final>true</final>
</property>
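After restarting the daemons with the updated configuration, the cluster can be sanity-checked from the command line; a sketch, assuming HDFS is up and the hdfs user exists:
sudo -u hdfs hadoop dfsadmin -report
sudo -u hdfs hadoop fsck /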
Acknowledgements
Thanks to Pavel Dovbush for his help in setting up the live code examples on this web page. Pavel is a Rich Internet Application design and implementation guru.