A common issue in the Software Development Lifecycle is the need to quickly bootstrap a vanilla environment, deploy some code onto it, run it, and then scrap it. This is a core concept in Continuous Integration / Continuous Delivery (CI/CD) and a stepping stone towards immutable infrastructure. A properly automated implementation also saves time (no manual configuration) and money (fewer regression issues to track down during development).
Over the course of several years, we have found this approach extremely useful in BigData projects built on Hadoop. Installing Hadoop is not always straightforward: it depends on various internal and external components (JDK, MapReduce framework, HDFS, etc.), its components communicate over a variety of ports and protocols, and HDFS uses somewhat clumsy semantics for dealing with files and directories (illustrated briefly below). For those and similar reasons, we decided to present our take on a single-node Hadoop installation for development purposes.
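As a rough illustration of what we mean by "clumsy semantics": even simple file operations on HDFS go through the hadoop fs client rather than regular filesystem calls, and per-user home directories under /user typically have to be created and chowned by the hdfs superuser. This is only a sketch; the paths and the 'fedora' user name are examples, not part of the script below.

# Create an HDFS home directory for the 'fedora' user (must be done as the 'hdfs' superuser)
sudo -u hdfs hadoop fs -mkdir -p /user/fedora
sudo -u hdfs hadoop fs -chown fedora /user/fedora

# Ordinary file operations are mediated by the HDFS client, not the local filesystem
hadoop fs -put results.txt /user/fedora/results.txt
hadoop fs -ls /user/fedora
hadoop fs -cat /user/fedora/results.txt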
The following shell script is a simplified but fully functional skeleton implementation that installs Hadoop on a c3.xlarge Fedora 20 node in AWS and runs a test job on it:
#!/bin/bash

# Key file to be generated and its filesystem location
KEY_NAME="test-hadoop-key"
KEY_FILE="/tmp/$KEY_NAME"

# Security group name and description
SG_NAME="test-hadoop-sg"
SG_DESC="Test Hadoop Security Group"

# Temporary files; General Log and Instance User data
LOG_FILE="/tmp/test-hadoop-setup.log"
USR_DATA="/tmp/test-hadoop-userdata.sh"

# Instance details
AWS_PROFILE="$$profile$$"
AWS_REGION="us-east-1"
AMI_ID="ami-21362b48"
INST_TAG="test-hadoop-single"
INST_TYPE="c3.xlarge"
DISK_SIZE="20"

# Default return codes
RET_CODE_OK=0
RET_CODE_ERROR=1

# Check for various utilities that will be used

# Check for supported operating system
P_UNAME=`whereis uname | cut -d' ' -f2`
if [ ! -x "$P_UNAME" ]; then
  echo "$0: No UNAME available in the system"
  exit $RET_CODE_ERROR;
fi
OS=`$P_UNAME`
if [ "$OS" != "Linux" ]; then
  echo "$0: Unsupported OS!";
  exit $RET_CODE_ERROR;
fi

# Check if awscli is available in the system
P_AWS=`whereis aws | cut -d' ' -f2`
if [ ! -x "$P_AWS" ]; then
  echo "$0: No 'aws' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Check if awk is available in the system
P_AWK=`whereis awk | cut -d' ' -f2`
if [ ! -x "$P_AWK" ]; then
  echo "$0: No 'awk' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Check if grep is available in the system
P_GREP=`whereis grep | cut -d' ' -f2`
if [ ! -x "$P_GREP" ]; then
  echo "$0: No 'grep' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Check if sed is available in the system
P_SED=`whereis sed | cut -d' ' -f2`
if [ ! -x "$P_SED" ]; then
  echo "$0: No 'sed' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Check if ssh is available in the system
P_SSH=`whereis ssh | cut -d' ' -f2`
if [ ! -x "$P_SSH" ]; then
  echo "$0: No 'ssh' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Check if ssh-keygen is available in the system
P_SSH_KEYGEN=`whereis ssh-keygen | cut -d' ' -f2`
if [ ! -x "$P_SSH_KEYGEN" ]; then
  echo "$0: No 'ssh-keygen' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Userdata code to bootstrap Hadoop 2.X on Fedora 20 instance
cat > $USR_DATA << "EOF"
#!/bin/bash

# Mark execution start
echo "START" > /root/userdata.state

# Install Hadoop
yum --assumeyes install hadoop-common hadoop-common-native hadoop-hdfs hadoop-mapreduce hadoop-mapreduce-examples hadoop-yarn

# Configure HDFS
hdfs-create-dirs

# Bootstrap Hadoop services
systemctl start hadoop-namenode && sleep 2
systemctl start hadoop-datanode && sleep 2
systemctl start hadoop-nodemanager && sleep 2
systemctl start hadoop-resourcemanager && sleep 2

# Make Hadoop services start after reboot
systemctl enable hadoop-namenode hadoop-datanode hadoop-nodemanager hadoop-resourcemanager

# Configure Hadoop user
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -mkdir /user/fedora"
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -chown fedora /user/fedora"

# Deploy additional software dependencies
# ...

# Deploy main application
# ...

# Mark execution end
echo "DONE" > /root/userdata.state
EOF

# Create Security Group
echo -n "Creating '$SG_NAME' security group ... "
aws ec2 create-security-group --group-name $SG_NAME --description "$SG_DESC" --region $AWS_REGION --profile $AWS_PROFILE > $LOG_FILE
echo "Done."

# Add open SSH access
echo -n "Adding access rules to '$SG_NAME' security group ... "
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 22 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
# Add open Hadoop ports access
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 8088 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50010 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50020 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50030 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50070 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50075 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50090 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Generate New Key Pair and Import it
echo -n "Generating key pair '$KEY_NAME' for general access ... "
rm -rf $KEY_FILE $KEY_FILE.pub
ssh-keygen -t rsa -f $KEY_FILE -N '' >> $LOG_FILE
aws ec2 import-key-pair --key-name $KEY_NAME --public-key-material "`cat $KEY_FILE.pub`" --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Build the Hadoop box
echo -n "Starting Hadoop instance ... "
RI_OUT=`aws ec2 run-instances --image-id $AMI_ID --count 1 --instance-type $INST_TYPE --key-name $KEY_NAME --security-groups $SG_NAME --user-data file:///tmp/test-hadoop-userdata.sh --block-device-mapping "[{\"DeviceName\":\"/dev/sda1\", \"Ebs\":{\"VolumeSize\":$DISK_SIZE, \"DeleteOnTermination\": true} } ]" --region $AWS_REGION --profile $AWS_PROFILE`
I_ID=`echo $RI_OUT | grep "InstanceId" | awk '{print $43}' | sed 's/,$//' | sed -e 's/^"//' -e 's/"$//'`
echo $RI_OUT >> $LOG_FILE
echo "Done."

# Tag the Hadoop box
echo -n "Tagging Hadoop instance '$I_ID' ... "
aws ec2 create-tags --resources $I_ID --tags Key=Name,Value=$INST_TAG --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Obtain instance public IP address
echo -n "Obtaining instance '$I_ID' public hostname ... "
# Delays in AWS fabric, reiterate until public hostname is assigned ...
while true; do
  sleep 3
  HOST=`aws ec2 describe-instances --instance-ids $I_ID --region $AWS_REGION --profile $AWS_PROFILE | grep PublicDnsName | awk -F":" '{print $2}' | awk '{print $1}' | sed 's/,$//' | sed -e 's/^"//' -e 's/"$//'`;
  if [[ $HOST == ec2* ]]; then
    break;
  fi
done
echo "Done."

# Poll until system is ready
echo -n "Waiting for instance '$I_ID' to configure itself (will take approx. 5 minutes) ... "
while true; do
  sleep 5;
  TEMP_OUT=`ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE -t fedora@$HOST "sudo cat /root/userdata.state"`;
  # Clear some strange symbols
  STATE=`echo $TEMP_OUT | cut -c1-4`;
  if [ "$STATE" = "DONE" ]; then
    break;
  fi
done
echo "Done."

# Test Hadoop setup
echo "========== Testing Single-node Hadoop =========="
ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE fedora@$HOST "hadoop jar /usr/share/java/hadoop/hadoop-mapreduce-examples.jar pi 10 1000000"
echo "========== Done =========="

# Run main Application here
# echo "========== Testing Main Application Single-node Hadoop =========="
# ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE fedora@$HOST "hadoop jar ..."
# echo "========== Done =========="

# Terminate instance
echo -n "Terminating Hadoop instance '$I_ID' ... "
aws ec2 terminate-instances --instance-ids $I_ID --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
# Poll until instance is terminated
while true; do
  sleep 5;
  TERMINATED=`aws ec2 describe-instances --instance-ids $I_ID --region $AWS_REGION --profile $AWS_PROFILE | grep terminated`;
  if [ ! -z "$TERMINATED" ]; then
    break;
  fi
done
echo "Done."

# Remove SSH Keypair
echo -n "Removing key pair '$KEY_NAME' ... "
aws ec2 delete-key-pair --key-name $KEY_NAME --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Remove Security Group
echo -n "Removing '$SG_NAME' security group ... "
aws ec2 delete-security-group --group-name $SG_NAME --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Remove local resources
rm -rf $USR_DATA
rm -rf $KEY_FILE $KEY_FILE.pub
rm -rf $LOG_FILE

# Normal termination
exit $RET_CODE_OK
Additional notes:
- Please edit the AWS_PROFILE variable; all AWS CLI commands depend on it!
- The activity log is kept in /tmp/test-hadoop-setup.log and is recreated with every run of the script.
- On normal execution, all allocated resources are cleaned up upon termination.
- The script is ready to be used as a Jenkins build-and-deploy job.
- Since the single-node Hadoop/HDFS instance is terminated at the end, any output data written to HDFS should be transferred out of the instance before termination (see the sketch after this list)!
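For example, results could be pulled down right after the test job runs and before the terminate-instances call. A minimal sketch, reusing the $KEY_FILE and $HOST variables from the script; the HDFS path /user/fedora/output and the local destination /tmp/test-hadoop-output are hypothetical and depend on what your job actually writes:

# Copy job output from HDFS to the instance's local filesystem (HDFS path is an example)
ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE fedora@$HOST "hadoop fs -get /user/fedora/output /home/fedora/output"

# Pull the results down to the machine running this script (destination path is an example)
scp -q -o "StrictHostKeyChecking=no" -i $KEY_FILE -r fedora@$HOST:/home/fedora/output /tmp/test-hadoop-output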
An example run should look like this:
:~> ./aws-hadoop-single.sh
Creating 'test-hadoop-sg' security group ... Done.
Adding access rules to 'test-hadoop-sg' security group ... Done.
Generating key pair 'test-hadoop-key' for general access ... Done.
Starting Hadoop instance ... Done.
Tagging Hadoop instance 'i-b3b27f5c' ... Done.
Obtaining instance 'i-b3b27f5c' public hostname ... Done.
Waiting for instance 'i-b3b27f5c' to configure itself (will take approx. 5 minutes) ... Done.
========== Testing Single-node Hadoop ==========
Number of Maps = 10
Samples per Map = 1000000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
15/02/04 07:27:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/04 07:27:05 INFO input.FileInputFormat: Total input paths to process : 10
15/02/04 07:27:05 INFO mapreduce.JobSubmitter: number of splits:10
15/02/04 07:27:05 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/02/04 07:27:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423034805647_0001
15/02/04 07:27:05 INFO impl.YarnClientImpl: Submitted application application_1423034805647_0001 to ResourceManager at /0.0.0.0:8032
15/02/04 07:27:05 INFO mapreduce.Job: The url to track the job: http://ip-10-63-188-40:8088/proxy/application_1423034805647_0001/
15/02/04 07:27:05 INFO mapreduce.Job: Running job: job_1423034805647_0001
15/02/04 07:27:11 INFO mapreduce.Job: Job job_1423034805647_0001 running in uber mode : false
15/02/04 07:27:11 INFO mapreduce.Job: map 0% reduce 0%
15/02/04 07:27:24 INFO mapreduce.Job: map 60% reduce 0%
15/02/04 07:27:33 INFO mapreduce.Job: map 100% reduce 0%
15/02/04 07:27:34 INFO mapreduce.Job: map 100% reduce 100%
15/02/04 07:27:34 INFO mapreduce.Job: Job job_1423034805647_0001 completed successfully
Job Finished in 29.302 seconds
15/02/04 07:27:34 INFO mapreduce.Job: Counters: 43
        File System Counters
                FILE: Number of bytes read=226
                FILE: Number of bytes written=882378
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2660
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=43
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=10
                Launched reduce tasks=1
                Data-local map tasks=10
                Total time spent by all maps in occupied slots (ms)=93289
                Total time spent by all reduces in occupied slots (ms)=7055
        Map-Reduce Framework
                Map input records=10
                Map output records=20
                Map output bytes=180
                Map output materialized bytes=280
                Input split bytes=1480
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=280
                Reduce input records=20
                Reduce output records=0
                Spilled Records=40
                Shuffled Maps =10
                Failed Shuffles=0
                Merged Map outputs=10
                GC time elapsed (ms)=1561
                CPU time spent (ms)=7210
                Physical memory (bytes) snapshot=2750681088
                Virtual memory (bytes) snapshot=11076927488
                Total committed heap usage (bytes)=2197291008
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1180
        File Output Format Counters
                Bytes Written=97
Estimated value of Pi is 3.14158440000000000000
========== Done ==========
Terminating Hadoop instance 'i-b3b27f5c' ... Done.
Removing key pair 'test-hadoop-key' ... Done.
Removing 'test-hadoop-sg' security group ... Done.
:~>
Hopefully, this short introduction will advance your efforts to automate development tasks in BigData projects!
If you want to discuss more complex scenarios including automated deployments over multi-node Hadoop clusters, AWS Elastic MapReduce, AWS DataPipeline or other components of the BigData ecosystem, do not hesitate to Contact Us!
Related Posts
- UserData Template for Ubuntu 14.04 EC2 Instances in AWS
- Small Tip: How to use --block-device-mappings to manage instance volumes with AWS CLI
- How to implement multi-cloud deployment for scalability and reliability
- Small Tip: How to use AWS CLI '--filter' parameter
- Small Tip: How to use AWS CLI to start Spot instances with UserData