<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Xi Group Ltd. Company Blog &#187; Big Data</title>
	<atom:link href="http://blog.xi-group.com/category/big-data/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.xi-group.com</link>
	<description>High-quality DevOps Services</description>
	<lastBuildDate>Tue, 09 Jun 2015 11:38:46 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=4.2.2</generator>
	<item>
		<title>How to deploy single-node Hadoop setup in AWS</title>
		<link>http://blog.xi-group.com/2015/02/how-to-deploy-single-node-hadoop-setup-in-aws/</link>
		<comments>http://blog.xi-group.com/2015/02/how-to-deploy-single-node-hadoop-setup-in-aws/#comments</comments>
		<pubDate>Wed, 04 Feb 2015 08:19:25 +0000</pubDate>
		<dc:creator><![CDATA[Ivo Vachkov]]></dc:creator>
				<category><![CDATA[AWS]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Development]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Operations]]></category>
		<category><![CDATA[AWS CLI]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[map reduce]]></category>
		<category><![CDATA[single-node hadoop]]></category>

		<guid isPermaLink="false">http://blog.xi-group.com/?p=14</guid>
		<description><![CDATA[A common issue in the Software Development Lifecycle is the need to quickly bootstrap a vanilla environment, deploy some code onto it, run it and then scrap it. This is a core concept in Continuous Integration / Continuous Delivery (CI/CD) and a stepping stone towards immutable infrastructure. A properly automated implementation can also save time (no need [&#8230;]]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">A common issue in the Software Development Lifecycle is the need to quickly bootstrap a vanilla environment, deploy some code onto it, run it and then scrap it. This is a core concept in Continuous Integration / Continuous Delivery (CI/CD) and a stepping stone towards immutable infrastructure. A properly automated implementation also saves <strong>time</strong> (no need to configure environments manually) and <strong>money</strong> (fewer environment-related regression issues to chase during development).</p>
<p style="text-align: justify;">Over the course of several years, we have found this approach to be extremely useful in BigData projects that use <a href="http://hadoop.apache.org" target="_blank">Hadoop</a>. Installing Hadoop is not always straightforward: it depends on various internal and external components (JDK, the MapReduce framework, HDFS, etc.), different components communicate over various ports and protocols, and HDFS uses somewhat clumsy semantics to deal with files and directories. In short, it can be messy. For these and similar reasons <a href="http://www.xi-group.com/" target="_blank">we</a> decided to present our take on installing Hadoop on a single node for development purposes.</p>
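<p style="text-align: justify;">To illustrate the HDFS semantics mentioned above, here is a minimal sketch of everyday HDFS file operations, wrapped in a shell function so it can be pointed at a real cluster; the paths and file names are purely illustrative:</p>

```shell
# A minimal sketch of everyday HDFS file operations (paths are illustrative).
# Unlike a POSIX filesystem, HDFS is driven through 'hadoop fs' subcommands.
hdfs_roundtrip() {
	hadoop fs -mkdir -p /user/fedora/input             # directories are created explicitly
	hadoop fs -put local.txt /user/fedora/input/       # upload a local file into HDFS
	hadoop fs -ls /user/fedora/input                   # list the directory
	hadoop fs -get /user/fedora/output/part-r-00000 result.txt  # fetch job output
}
```

<p style="text-align: justify;">Note that directories must be created explicitly and job output is read back with <strong>-get</strong>; HDFS is not mounted into the local filesystem in this setup.</p>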
<p style="text-align: justify;">The following shell script is a simplified yet fully functional skeleton implementation. It installs Hadoop on a <strong>c3.xlarge</strong> <a href="https://getfedora.org" target="_blank">Fedora</a> 20 node in AWS and runs a test job on it:</p>
<p></p><pre class="crayon-plain-tag">#!/bin/bash

# Key file to be generated and its filesystem location
KEY_NAME="test-hadoop-key"
KEY_FILE="/tmp/$KEY_NAME"

# Security group name and description
SG_NAME="test-hadoop-sg"
SG_DESC="Test Hadoop Security Group"

# Temporary files; General Log and Instance User data
LOG_FILE="/tmp/test-hadoop-setup.log"
USR_DATA="/tmp/test-hadoop-userdata.sh"

# Instance details
AWS_PROFILE="$$profile$$"
AWS_REGION="us-east-1"
AMI_ID="ami-21362b48"
INST_TAG="test-hadoop-single"
INST_TYPE="c3.xlarge"
DISK_SIZE="20"

# Default return codes
RET_CODE_OK=0
RET_CODE_ERROR=1

# Check for various utilities that will be used

# Check for supported operating system
OS=`uname`
if [ "$OS" != "Linux" ]; then
	echo "$0: Unsupported OS!";
	exit $RET_CODE_ERROR;
fi

# Check that every external tool the script relies on is available
for TOOL in aws awk grep sed ssh ssh-keygen; do
	if ! command -v $TOOL > /dev/null 2>&1; then
		echo "$0: No '$TOOL' available in the system!";
		exit $RET_CODE_ERROR;
	fi
done

# Userdata code to bootstrap Hadoop 2.X on Fedora 20 instance
cat > $USR_DATA << "EOF"
#!/bin/bash

# Mark execution start
echo "START" > /root/userdata.state

# Install Hadoop
yum --assumeyes install hadoop-common hadoop-common-native hadoop-hdfs hadoop-mapreduce hadoop-mapreduce-examples hadoop-yarn

# Configure HDFS
hdfs-create-dirs

# Bootstrap Hadoop services
systemctl start hadoop-namenode && sleep 2
systemctl start hadoop-datanode && sleep 2
systemctl start hadoop-nodemanager && sleep 2
systemctl start hadoop-resourcemanager && sleep 2

# Make Hadoop services start after reboot
systemctl enable hadoop-namenode hadoop-datanode hadoop-nodemanager hadoop-resourcemanager

# Configure Hadoop user
runuser -s /bin/bash hdfs -c "hadoop fs -mkdir /user/fedora"
runuser -s /bin/bash hdfs -c "hadoop fs -chown fedora /user/fedora"

# Deploy additional software dependencies
# ... 

# Deploy main application 
# ... 

# Mark execution end
echo "DONE" > /root/userdata.state
EOF

# Create Security Group
echo -n "Creating '$SG_NAME' security group ... "
aws ec2 create-security-group --group-name $SG_NAME --description "$SG_DESC" --region $AWS_REGION --profile $AWS_PROFILE > $LOG_FILE
echo "Done."

# Add open SSH access
echo -n "Adding access rules to '$SG_NAME' security group ... "
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 22 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE

# Add open Hadoop ports access
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 8088 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50010 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50020 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50030 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50070 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50075 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50090 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Generate New Key Pair and Import it
echo -n "Generating key pair '$KEY_NAME' for general access ... "
rm -rf $KEY_FILE $KEY_FILE.pub
ssh-keygen -t rsa -f $KEY_FILE -N '' >> $LOG_FILE
aws ec2 import-key-pair --key-name $KEY_NAME --public-key-material "`cat $KEY_FILE.pub`" --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Build the Hadoop box
echo -n "Starting Hadoop instance ... "
RI_OUT=`aws ec2 run-instances --image-id $AMI_ID --count 1 --instance-type $INST_TYPE --key-name $KEY_NAME --security-groups $SG_NAME --user-data file://$USR_DATA --block-device-mapping "[{\"DeviceName\":\"/dev/sda1\", \"Ebs\":{\"VolumeSize\":$DISK_SIZE, \"DeleteOnTermination\": true} } ]" --region $AWS_REGION --profile $AWS_PROFILE`
I_ID=`echo $RI_OUT | grep -o '"InstanceId": "[^"]*"' | head -1 | cut -d'"' -f4`
echo $RI_OUT >> $LOG_FILE
echo "Done."

# Tag the Hadoop box
echo -n "Tagging Hadoop instance '$I_ID' ... "
aws ec2 create-tags --resources $I_ID --tags Key=Name,Value=$INST_TAG --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Obtain instance public IP address
echo -n "Obtaining instance '$I_ID' public hostname ... "

# Delays in AWS fabric, reiterate until public hostname is assigned ...
while true; do
	sleep 3

	HOST=`aws ec2 describe-instances --instance-ids $I_ID --query "Reservations[0].Instances[0].PublicDnsName" --output text --region $AWS_REGION --profile $AWS_PROFILE`;
	if [[ $HOST == ec2* ]]; then
		break;
	fi
done
echo "Done."

# Poll until system is ready
echo -n "Waiting for instance '$I_ID' to configure itself (will take approx. 5 minutes) ... "
while true; do
	sleep 5;

	TEMP_OUT=`ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE -t fedora@$HOST "sudo cat /root/userdata.state"`;

	# Strip the trailing carriage return added by the remote pseudo-terminal
	STATE=`echo $TEMP_OUT | cut -c1-4`;

	if [ "$STATE" = "DONE" ]; then
		break;
	fi
done
echo "Done."

# Test Hadoop setup
echo "========== Testing Single-node Hadoop =========="
ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE fedora@$HOST "hadoop jar /usr/share/java/hadoop/hadoop-mapreduce-examples.jar pi 10 1000000"
echo "========== Done =========="

# Run main Application here
# echo "========== Testing Main Application Single-node Hadoop =========="
# ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE fedora@$HOST "hadoop jar ..."
# echo "========== Done =========="

# Terminate instance
echo -n "Terminating Hadoop instance '$I_ID' ... "
aws ec2 terminate-instances --instance-ids $I_ID --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE

# Poll until instance is terminated
while true; do
	sleep 5;

	TERMINATED=`aws ec2 describe-instances --instance-ids $I_ID --region $AWS_REGION --profile $AWS_PROFILE | grep terminated`;
	if [ ! -z "$TERMINATED" ]; then
		break;
	fi
done
echo "Done."

# Remove SSH Keypair
echo -n "Removing key pair '$KEY_NAME' ... "
aws ec2 delete-key-pair --key-name $KEY_NAME --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Remove Security Group
echo -n "Removing '$SG_NAME' security group ... "
aws ec2 delete-security-group --group-name $SG_NAME --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Remove local resources
rm -rf $USR_DATA
rm -rf $KEY_FILE $KEY_FILE.pub
rm -rf $LOG_FILE

# Normal termination
exit $RET_CODE_OK</pre><p></p>
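<p style="text-align: justify;">Note that the script above cleans up only on the happy path; if a step fails midway, the key pair, security group or instance may be left behind. One way to harden it (a sketch using the same resource names, not part of the original script; the <strong>--region</strong>/<strong>--profile</strong> flags are omitted for brevity) is a <strong>trap</strong> handler that performs cleanup on any exit:</p>

```shell
# Sketch: guarantee cleanup even if the script aborts midway.
# Resource names mirror those used in the main script.
KEY_NAME="test-hadoop-key"
SG_NAME="test-hadoop-sg"
I_ID=""

cleanup() {
	# Terminate the instance first (a security group cannot be deleted
	# while a running instance still references it), then the rest.
	[ -n "$I_ID" ] && aws ec2 terminate-instances --instance-ids $I_ID
	aws ec2 delete-key-pair --key-name $KEY_NAME
	aws ec2 delete-security-group --group-name $SG_NAME
}

# Run cleanup on any exit: normal end, error, or Ctrl-C
trap cleanup EXIT
```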
<p style="text-align: justify;">Additional notes:</p>
<ul>
<li>Please edit the <strong>AWS_PROFILE</strong> variable; the AWS CLI commands depend on it!</li>
<li>The activity log is written to <strong>/tmp/test-hadoop-setup.log</strong> during each run and is removed, together with the other local resources, on successful completion.</li>
<li>On normal execution, all allocated AWS resources are cleaned up upon termination.</li>
<li>The script is ready to be used as a Jenkins build-and-deploy job.</li>
<li>Since the single-node Hadoop/HDFS instance is terminated at the end, any output data written to HDFS should be transferred out of the instance before termination!</li>
</ul>
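<p style="text-align: justify;">As a sketch of the last point, job output can be staged out of HDFS and pulled down over SSH before the instance is terminated. The helper name, HDFS path and local target below are hypothetical; the key file and host arguments correspond to <strong>KEY_FILE</strong> and <strong>HOST</strong> from the main script:</p>

```shell
# Sketch: stage job output out of HDFS and copy it off the instance
# before termination. The helper name, HDFS path and local target are
# hypothetical; key file and host come from the main script's variables.
fetch_hdfs_output() {
	KEY_FILE="$1"; HOST="$2"; HDFS_DIR="$3"; LOCAL_DIR="$4"
	# Copy the HDFS directory onto the instance's local filesystem ...
	ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE fedora@$HOST \
		"hadoop fs -get $HDFS_DIR /tmp/hdfs-out"
	# ... then pull it down to the machine running the script
	scp -q -o "StrictHostKeyChecking=no" -i $KEY_FILE -r \
		fedora@$HOST:/tmp/hdfs-out $LOCAL_DIR
}
```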
<p style="text-align: justify;">An example run looks like this:</p>
<p></p><pre class="crayon-plain-tag">:~> ./aws-hadoop-single.sh
Creating 'test-hadoop-sg' security group ... Done.
Adding access rules to 'test-hadoop-sg' security group ... Done.
Generating key pair 'test-hadoop-key' for general access ... Done.
Starting Hadoop instance ... Done.
Tagging Hadoop instance 'i-b3b27f5c' ... Done.
Obtaining instance 'i-b3b27f5c' public hostname ... Done.
Waiting for instance 'i-b3b27f5c' to configure itself (will take approx. 5 minutes) ... Done.
========== Testing Single-node Hadoop ==========
Number of Maps  = 10
Samples per Map = 1000000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
15/02/04 07:27:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/04 07:27:05 INFO input.FileInputFormat: Total input paths to process : 10
15/02/04 07:27:05 INFO mapreduce.JobSubmitter: number of splits:10
15/02/04 07:27:05 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/02/04 07:27:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423034805647_0001
15/02/04 07:27:05 INFO impl.YarnClientImpl: Submitted application application_1423034805647_0001 to ResourceManager at /0.0.0.0:8032
15/02/04 07:27:05 INFO mapreduce.Job: The url to track the job: http://ip-10-63-188-40:8088/proxy/application_1423034805647_0001/
15/02/04 07:27:05 INFO mapreduce.Job: Running job: job_1423034805647_0001
15/02/04 07:27:11 INFO mapreduce.Job: Job job_1423034805647_0001 running in uber mode : false
15/02/04 07:27:11 INFO mapreduce.Job:  map 0% reduce 0%
15/02/04 07:27:24 INFO mapreduce.Job:  map 60% reduce 0%
15/02/04 07:27:33 INFO mapreduce.Job:  map 100% reduce 0%
15/02/04 07:27:34 INFO mapreduce.Job:  map 100% reduce 100%
15/02/04 07:27:34 INFO mapreduce.Job: Job job_1423034805647_0001 completed successfully
Job Finished in 29.302 seconds
15/02/04 07:27:34 INFO mapreduce.Job: Counters: 43
        File System Counters
                FILE: Number of bytes read=226
                FILE: Number of bytes written=882378
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2660
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=43
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=10
                Launched reduce tasks=1
                Data-local map tasks=10
                Total time spent by all maps in occupied slots (ms)=93289
                Total time spent by all reduces in occupied slots (ms)=7055
        Map-Reduce Framework
                Map input records=10
                Map output records=20
                Map output bytes=180
                Map output materialized bytes=280
                Input split bytes=1480
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=280
                Reduce input records=20
                Reduce output records=0
                Spilled Records=40
                Shuffled Maps =10
                Failed Shuffles=0
                Merged Map outputs=10
                GC time elapsed (ms)=1561
                CPU time spent (ms)=7210
                Physical memory (bytes) snapshot=2750681088
                Virtual memory (bytes) snapshot=11076927488
                Total committed heap usage (bytes)=2197291008
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1180
        File Output Format Counters
                Bytes Written=97
Estimated value of Pi is 3.14158440000000000000
========== Done ==========
Terminating Hadoop instance 'i-b3b27f5c' ... Done.
Removing key pair 'test-hadoop-key' ... Done.
Removing 'test-hadoop-sg' security group ... Done.
:~></pre><p></p>
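<p style="text-align: justify;">A side note on parsing: the script extracts fields from AWS CLI JSON responses with shell pipelines. The CLI's built-in <strong>--query</strong> parameter (a JMESPath expression) combined with <strong>--output text</strong> is usually sturdier; a minimal sketch with a hypothetical helper name:</p>

```shell
# Sketch: let the AWS CLI itself extract a field via --query (JMESPath)
# and --output text, instead of piping JSON through grep/awk/sed.
# The helper name is hypothetical.
get_public_dns() {
	aws ec2 describe-instances --instance-ids "$1" \
		--query "Reservations[0].Instances[0].PublicDnsName" \
		--output text
}
```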
<p style="text-align: justify;">Hopefully, this short introduction will advance your efforts to automate development tasks in BigData projects!</p>
<p style="text-align: justify;">If you want to discuss more complex scenarios including automated deployments over multi-node Hadoop clusters, <a href="http://aws.amazon.com/elasticmapreduce/">AWS Elastic MapReduce</a>, <a href="http://aws.amazon.com/datapipeline/">AWS DataPipeline</a> or other components of the <a href="http://www.slideshare.net/ivachkov/big-data-ecosystem-39249871">BigData ecosystem</a>, do not hesitate to <a href="http://blog.xi-group.com/contact-us/">Contact Us</a>!</p>
<p>References</p>
<ul>
<li><a href="http://hadoop.apache.org/">Apache Hadoop</a></li>
<li><a href="http://aws.amazon.com/cli/">AWS Command Line Interface</a></li>
<li><a href="http://www.slideshare.net/ivachkov/big-data-ecosystem-39249871">BigData ecosystem</a></li>
</ul>
<div class="rpbt_shortcode">
<h3>Related Posts</h3>
<ul>
					
			<li><a href="http://blog.xi-group.com/2015/01/userdata-teplate-for-ubuntu-14-04-ec2-instances-in-aws/">UserData Template for Ubuntu 14.04 EC2 Instances in AWS</a></li>
					
			<li><a href="http://blog.xi-group.com/2014/11/small-tip-how-to-use-block-device-mappings-to-manage-instance-volumes-with-aws-cli/">Small Tip: How to use &#8211;block-device-mappings to manage instance volumes with AWS CLI</a></li>
					
			<li><a href="http://blog.xi-group.com/2014/07/how-to-implement-multi-cloud-deployment-for-scalability-and-reliability/">How to implement multi-cloud deployment for scalability and reliability</a></li>
					
			<li><a href="http://blog.xi-group.com/2015/01/small-tip-how-to-use-aws-cli-filter-parameter/">Small Tip: How to use AWS CLI &#8216;&#8211;filter&#8217; parameter</a></li>
					
			<li><a href="http://blog.xi-group.com/2014/07/small-tip-how-to-use-aws-cli-to-start-spot-instances-with-userdata/">Small Tip: How to use AWS CLI to start Spot instances with UserData</a></li>
			</ul>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.xi-group.com/2015/02/how-to-deploy-single-node-hadoop-setup-in-aws/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
