A common issue in the Software Development Lifecycle is the need to quickly bootstrap a vanilla environment, deploy some code onto it, run it, and then scrap it. This is a core concept in Continuous Integration / Continuous Delivery (CI/CD) and a stepping stone towards immutable infrastructure. A properly automated implementation also saves time (no manual configuration) and money (fewer regression issues to track down during development).
Over the course of several years, we have found this approach extremely useful in BigData projects built on Hadoop. Installing Hadoop is not always straightforward: it depends on various internal and external components (JDK, MapReduce framework, HDFS, etc.), its components communicate over a variety of ports and protocols, and HDFS uses somewhat clumsy semantics for dealing with files and directories (illustrated briefly below). For those and similar reasons, we decided to present our take on a single-node Hadoop installation for development purposes.
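As a rough illustration of what we mean by "clumsy semantics": even simple file operations on HDFS go through the hadoop fs client rather than regular filesystem calls, and per-user home directories under /user typically have to be created and chowned by the hdfs superuser. This is only a sketch; the paths and the 'fedora' user name are examples, not part of the script below.

# Create an HDFS home directory for the 'fedora' user (must be done as the 'hdfs' superuser)
sudo -u hdfs hadoop fs -mkdir -p /user/fedora
sudo -u hdfs hadoop fs -chown fedora /user/fedora

# Ordinary file operations are mediated by the HDFS client, not the local filesystem
hadoop fs -put results.txt /user/fedora/results.txt
hadoop fs -ls /user/fedora
hadoop fs -cat /user/fedora/results.txt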
The following shell script is a simplified but fully functional skeleton implementation that installs Hadoop on a c3.xlarge Fedora 20 node in AWS and runs a test job on it:
#!/bin/bash

# Key file to be generated and its filesystem location
KEY_NAME="test-hadoop-key"
KEY_FILE="/tmp/$KEY_NAME"

# Security group name and description
SG_NAME="test-hadoop-sg"
SG_DESC="Test Hadoop Security Group"

# Temporary files; General Log and Instance User data
LOG_FILE="/tmp/test-hadoop-setup.log"
USR_DATA="/tmp/test-hadoop-userdata.sh"

# Instance details
AWS_PROFILE="$$profile$$"
AWS_REGION="us-east-1"
AMI_ID="ami-21362b48"
INST_TAG="test-hadoop-single"
INST_TYPE="c3.xlarge"
DISK_SIZE="20"

# Default return codes
RET_CODE_OK=0
RET_CODE_ERROR=1

# Check for various utilities that will be used

# Check for supported operating system
P_UNAME=`whereis uname | cut -d' ' -f2`
if [ ! -x "$P_UNAME" ]; then
  echo "$0: No UNAME available in the system"
  exit $RET_CODE_ERROR;
fi
OS=`$P_UNAME`
if [ "$OS" != "Linux" ]; then
  echo "$0: Unsupported OS!";
  exit $RET_CODE_ERROR;
fi

# Check if awscli is available in the system
P_AWS=`whereis aws | cut -d' ' -f2`
if [ ! -x "$P_AWS" ]; then
  echo "$0: No 'aws' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Check if awk is available in the system
P_AWK=`whereis awk | cut -d' ' -f2`
if [ ! -x "$P_AWK" ]; then
  echo "$0: No 'awk' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Check if grep is available in the system
P_GREP=`whereis grep | cut -d' ' -f2`
if [ ! -x "$P_GREP" ]; then
  echo "$0: No 'grep' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Check if sed is available in the system
P_SED=`whereis sed | cut -d' ' -f2`
if [ ! -x "$P_SED" ]; then
  echo "$0: No 'sed' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Check if ssh is available in the system
P_SSH=`whereis ssh | cut -d' ' -f2`
if [ ! -x "$P_SSH" ]; then
  echo "$0: No 'ssh' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Check if ssh-keygen is available in the system
P_SSH_KEYGEN=`whereis ssh-keygen | cut -d' ' -f2`
if [ ! -x "$P_SSH_KEYGEN" ]; then
  echo "$0: No 'ssh-keygen' available in the system!";
  exit $RET_CODE_ERROR;
fi

# Userdata code to bootstrap Hadoop 2.X on Fedora 20 instance
cat > $USR_DATA << "EOF"
#!/bin/bash

# Mark execution start
echo "START" > /root/userdata.state

# Install Hadoop
yum --assumeyes install hadoop-common hadoop-common-native hadoop-hdfs hadoop-mapreduce hadoop-mapreduce-examples hadoop-yarn

# Configure HDFS
hdfs-create-dirs

# Bootstrap Hadoop services
systemctl start hadoop-namenode && sleep 2
systemctl start hadoop-datanode && sleep 2
systemctl start hadoop-nodemanager && sleep 2
systemctl start hadoop-resourcemanager && sleep 2

# Make Hadoop services start after reboot
systemctl enable hadoop-namenode hadoop-datanode hadoop-nodemanager hadoop-resourcemanager

# Configure Hadoop user
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -mkdir /user/fedora"
runuser hdfs -s /bin/bash /bin/bash -c "hadoop fs -chown fedora /user/fedora"

# Deploy additional software dependencies
# ...

# Deploy main application
# ...

# Mark execution end
echo "DONE" > /root/userdata.state
EOF

# Create Security Group
echo -n "Creating '$SG_NAME' security group ... "
aws ec2 create-security-group --group-name $SG_NAME --description "$SG_DESC" --region $AWS_REGION --profile $AWS_PROFILE > $LOG_FILE
echo "Done."

# Add open SSH access
echo -n "Adding access rules to '$SG_NAME' security group ... "
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 22 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
# Add open Hadoop ports access
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 8088 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50010 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50020 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50030 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50070 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50075 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
aws ec2 authorize-security-group-ingress --group-name $SG_NAME --protocol tcp --port 50090 --cidr 0.0.0.0/0 --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Generate New Key Pair and Import it
echo -n "Generating key pair '$KEY_NAME' for general access ... "
rm -rf $KEY_FILE $KEY_FILE.pub
ssh-keygen -t rsa -f $KEY_FILE -N '' >> $LOG_FILE
aws ec2 import-key-pair --key-name $KEY_NAME --public-key-material "`cat $KEY_FILE.pub`" --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Build the Hadoop box
echo -n "Starting Hadoop instance ... "
RI_OUT=`aws ec2 run-instances --image-id $AMI_ID --count 1 --instance-type $INST_TYPE --key-name $KEY_NAME --security-groups $SG_NAME --user-data file:///tmp/test-hadoop-userdata.sh --block-device-mapping "[{\"DeviceName\":\"/dev/sda1\", \"Ebs\":{\"VolumeSize\":$DISK_SIZE, \"DeleteOnTermination\": true} } ]" --region $AWS_REGION --profile $AWS_PROFILE`
I_ID=`echo $RI_OUT | grep "InstanceId" | awk '{print $43}' | sed 's/,$//' | sed -e 's/^"//' -e 's/"$//'`
echo $RI_OUT >> $LOG_FILE
echo "Done."

# Tag the Hadoop box
echo -n "Tagging Hadoop instance '$I_ID' ... "
aws ec2 create-tags --resources $I_ID --tags Key=Name,Value=$INST_TAG --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Obtain instance public IP address
echo -n "Obtaining instance '$I_ID' public hostname ... "
# Delays in AWS fabric, reiterate until public hostname is assigned ...
while true; do
  sleep 3
  HOST=`aws ec2 describe-instances --instance-ids $I_ID --region $AWS_REGION --profile $AWS_PROFILE | grep PublicDnsName | awk -F":" '{print $2}' | awk '{print $1}' | sed 's/,$//' | sed -e 's/^"//' -e 's/"$//'`;
  if [[ $HOST == ec2* ]]; then
    break;
  fi
done
echo "Done."

# Poll until system is ready
echo -n "Waiting for instance '$I_ID' to configure itself (will take approx. 5 minutes) ... "
while true; do
  sleep 5;
  TEMP_OUT=`ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE -t fedora@$HOST "sudo cat /root/userdata.state"`;
  # Clear some strange symbols
  STATE=`echo $TEMP_OUT | cut -c1-4`;
  if [ "$STATE" = "DONE" ]; then
    break;
  fi
done
echo "Done."

# Test Hadoop setup
echo "========== Testing Single-node Hadoop =========="
ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE fedora@$HOST "hadoop jar /usr/share/java/hadoop/hadoop-mapreduce-examples.jar pi 10 1000000"
echo "========== Done =========="

# Run main Application here
# echo "========== Testing Main Application Single-node Hadoop =========="
# ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE fedora@$HOST "hadoop jar ..."
# echo "========== Done =========="

# Terminate instance
echo -n "Terminating Hadoop instance '$I_ID' ... "
aws ec2 terminate-instances --instance-ids $I_ID --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
# Poll until instance is terminated
while true; do
  sleep 5;
  TERMINATED=`aws ec2 describe-instances --instance-ids $I_ID --region $AWS_REGION --profile $AWS_PROFILE | grep terminated`;
  if [ ! -z "$TERMINATED" ]; then
    break;
  fi
done
echo "Done."

# Remove SSH Keypair
echo -n "Removing key pair '$KEY_NAME' ... "
aws ec2 delete-key-pair --key-name $KEY_NAME --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Remove Security Group
echo -n "Removing '$SG_NAME' security group ... "
aws ec2 delete-security-group --group-name $SG_NAME --region $AWS_REGION --profile $AWS_PROFILE >> $LOG_FILE
echo "Done."

# Remove local resources
rm -rf $USR_DATA
rm -rf $KEY_FILE $KEY_FILE.pub
rm -rf $LOG_FILE

# Normal termination
exit $RET_CODE_OK
Additional notes:
- Please edit the AWS_PROFILE variable; all AWS CLI commands depend on it!
- The activity log is kept in /tmp/test-hadoop-setup.log and is recreated with every run of the script.
- On normal execution, all allocated resources are cleaned up upon termination.
- The script is ready to be used as a Jenkins build-and-deploy job.
- Since the single-node Hadoop/HDFS instance is terminated at the end, any output data written to HDFS should be transferred out of the instance before termination (see the sketch after this list)!
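For example, results could be pulled down right after the test job runs and before the terminate-instances call. A minimal sketch, reusing the $KEY_FILE and $HOST variables from the script; the HDFS path /user/fedora/output and the local destination /tmp/test-hadoop-output are hypothetical and depend on what your job actually writes:

# Copy job output from HDFS to the instance's local filesystem (HDFS path is an example)
ssh -q -o "StrictHostKeyChecking=no" -i $KEY_FILE fedora@$HOST "hadoop fs -get /user/fedora/output /home/fedora/output"

# Pull the results down to the machine running this script (destination path is an example)
scp -q -o "StrictHostKeyChecking=no" -i $KEY_FILE -r fedora@$HOST:/home/fedora/output /tmp/test-hadoop-output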
An example run should look like this:
:~> ./aws-hadoop-single.sh
Creating 'test-hadoop-sg' security group ... Done.
Adding access rules to 'test-hadoop-sg' security group ... Done.
Generating key pair 'test-hadoop-key' for general access ... Done.
Starting Hadoop instance ... Done.
Tagging Hadoop instance 'i-b3b27f5c' ... Done.
Obtaining instance 'i-b3b27f5c' public hostname ... Done.
Waiting for instance 'i-b3b27f5c' to configure itself (will take approx. 5 minutes) ... Done.
========== Testing Single-node Hadoop ==========
Number of Maps = 10
Samples per Map = 1000000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
15/02/04 07:27:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/04 07:27:05 INFO input.FileInputFormat: Total input paths to process : 10
15/02/04 07:27:05 INFO mapreduce.JobSubmitter: number of splits:10
15/02/04 07:27:05 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/02/04 07:27:05 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/02/04 07:27:05 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/02/04 07:27:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423034805647_0001
15/02/04 07:27:05 INFO impl.YarnClientImpl: Submitted application application_1423034805647_0001 to ResourceManager at /0.0.0.0:8032
15/02/04 07:27:05 INFO mapreduce.Job: The url to track the job: http://ip-10-63-188-40:8088/proxy/application_1423034805647_0001/
15/02/04 07:27:05 INFO mapreduce.Job: Running job: job_1423034805647_0001
15/02/04 07:27:11 INFO mapreduce.Job: Job job_1423034805647_0001 running in uber mode : false
15/02/04 07:27:11 INFO mapreduce.Job: map 0% reduce 0%
15/02/04 07:27:24 INFO mapreduce.Job: map 60% reduce 0%
15/02/04 07:27:33 INFO mapreduce.Job: map 100% reduce 0%
15/02/04 07:27:34 INFO mapreduce.Job: map 100% reduce 100%
15/02/04 07:27:34 INFO mapreduce.Job: Job job_1423034805647_0001 completed successfully
Job Finished in 29.302 seconds
15/02/04 07:27:34 INFO mapreduce.Job: Counters: 43
        File System Counters
                FILE: Number of bytes read=226
                FILE: Number of bytes written=882378
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2660
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=43
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=10
                Launched reduce tasks=1
                Data-local map tasks=10
                Total time spent by all maps in occupied slots (ms)=93289
                Total time spent by all reduces in occupied slots (ms)=7055
        Map-Reduce Framework
                Map input records=10
                Map output records=20
                Map output bytes=180
                Map output materialized bytes=280
                Input split bytes=1480
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=280
                Reduce input records=20
                Reduce output records=0
                Spilled Records=40
                Shuffled Maps =10
                Failed Shuffles=0
                Merged Map outputs=10
                GC time elapsed (ms)=1561
                CPU time spent (ms)=7210
                Physical memory (bytes) snapshot=2750681088
                Virtual memory (bytes) snapshot=11076927488
                Total committed heap usage (bytes)=2197291008
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1180
        File Output Format Counters
                Bytes Written=97
Estimated value of Pi is 3.14158440000000000000
========== Done ==========
Terminating Hadoop instance 'i-b3b27f5c' ... Done.
Removing key pair 'test-hadoop-key' ... Done.
Removing 'test-hadoop-sg' security group ... Done.
:~>
Hopefully, this short introduction will advance your efforts to automate development tasks in BigData projects!
If you want to discuss more complex scenarios including automated deployments over multi-node Hadoop clusters, AWS Elastic MapReduce, AWS DataPipeline or other components of the BigData ecosystem, do not hesitate to Contact Us!
Related Posts
- UserData Template for Ubuntu 14.04 EC2 Instances in AWS
- Small Tip: How to use --block-device-mappings to manage instance volumes with AWS CLI
- How to implement multi-cloud deployment for scalability and reliability
- Small Tip: How to use AWS CLI '--filter' parameter
- Small Tip: How to use AWS CLI to start Spot instances with UserData