How to deploy single-node Hadoop setup in AWS

2015/02/04 AWS, Big Data, Development, DevOps, Operations

A common issue in the Software Development Lifecycle is the need to quickly bootstrap a vanilla environment, deploy some code onto it, run it and then scrap it. This is a core concept in Continuous Integration / Continuous Delivery (CI/CD) and a stepping stone towards immutable infrastructure. A properly automated implementation can also save time (no need to configure the environment manually) and money (no need to track potential regression issues in the development process).

Over the course of several years, we have found this to be extremely useful in BigData projects that use Hadoop. Installation of Hadoop is not always straightforward. It depends on various internal and external components (JDK, Map-Reduce framework, HDFS, etc.) and it can be messy. Different components communicate over various ports and protocols. HDFS uses somewhat clumsy semantics to deal with files and directories. For those and similar reasons we decided to present our take on Hadoop installation on a single node for development purposes.

The following shell script is a simplified, fully functional skeleton implementation that will install Hadoop on a c3.xlarge Fedora 20 node in AWS and run a test job on it:

Additional notes:

  • Please edit the AWS_PROFILE variable. AWS CLI commands depend on this!
  • The activity log is kept in /tmp/test-hadoop-setup.log and will be recreated with every new run of the script.
  • In case of normal execution, all allocated resources will be cleaned up upon termination.
  • This script is ready to be used as a Jenkins build-and-deploy job.
  • Since the single-node Hadoop/HDFS instance is terminated, output data that goes to HDFS should be transferred out of the instance before termination!
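The log handling and cleanup behaviour described in these notes can be sketched in a few lines. This is only an illustration of the pattern, not the script itself: the function names, the `(name, allocate, release)` tuple layout and the use of Python are assumptions made for the example. Unlike the note above, this sketch also releases resources when a step fails.

```python
import logging
import os
import tempfile
from contextlib import ExitStack

# Hypothetical log location, mirroring /tmp/test-hadoop-setup.log from the notes.
LOG_PATH = os.path.join(tempfile.gettempdir(), "test-hadoop-setup.log")


def open_fresh_log(path=LOG_PATH):
    """Return a logger whose file is truncated on every new run (mode='w')."""
    logger = logging.getLogger("test-hadoop-setup")
    logger.setLevel(logging.INFO)
    for old in list(logger.handlers):  # avoid duplicate handlers on re-entry
        logger.removeHandler(old)
        old.close()
    handler = logging.FileHandler(path, mode="w")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


def run_with_cleanup(steps, logger):
    """Run each allocate() and guarantee the matching release() on exit.

    `steps` is a list of (name, allocate, release) tuples. Releases run in
    reverse order of allocation, even if a later step raises.
    """
    with ExitStack() as stack:
        for name, allocate, release in steps:
            logger.info("allocating %s", name)
            allocate()
            # Callbacks run last-in, first-out: the log line fires first,
            # then the release itself.
            stack.callback(release)
            stack.callback(logger.info, "releasing %s", name)
```

In a real job, `allocate` and `release` would wrap calls such as creating and deleting a security group or starting and terminating the instance.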

An example run should look like:

Hopefully, this short introduction will advance your efforts to automate development tasks in BigData projects!

If you want to discuss more complex scenarios including automated deployments over multi-node Hadoop clusters, AWS Elastic MapReduce, AWS DataPipeline or other components of the BigData ecosystem, do not hesitate to Contact Us!


UserData Template for Ubuntu 14.04 EC2 Instances in AWS

2015/01/27 AWS, Development, DevOps, Operations

In any elastic environment there is a recurring issue: how to quickly spin up new boxes? Over time multiple options emerge. Many environments rely on pre-baked machine images. In Amazon AWS those are called Amazon Machine Images (AMIs), in Joyent’s SDC – images, but no matter the name they present a pre-built, (mostly) pre-configured digital artifact that the underlying cloud layer will bootstrap and execute. They are fast to bootstrap, but limited: it is hard to manage different versions, hard to switch virtualization technologies (PV vs. HVM, AWS vs. Joyent, etc.) and hard to deal with software versioning. Managing an elastic environment with pre-baked images is probably the fastest way to start, but probably the most expensive way in the long run.

Another option is to use some sort of configuration management system. Chef, Puppet, Salt, Ansible … there are a lot of choices. Those are flexible, but depending on the usage scenario they can be slow and may require additional “interventions” to work properly. There are two additional “gotchas” that are not commonly discussed. First, those tools will force some sort of in-house configuration/pseudo-programming language and terminology on you. Second, security is a tricky concept to implement within such a system. Managing elastic environments with configuration management systems is definitely possible, but comes with some dependencies and prerequisites you should account for in the design phase.

The third option, an AWS UserData / Joyent script, is a reasonable compromise. This is effectively a script that executes once upon virtual machine creation. It allows you to configure the instance, attach/configure storage, install software, etc. There are obvious benefits to that approach:

  • Treat that script like any other coding artifact, use version control, code reviews, etc;
  • It is easily modifiable upon need or request;
  • It can be used with virtually any instance type;
  • It is a single source of truth for the instance configuration;
  • It integrates nicely with the whole Control Plane concept.

Here is a basic template for Ubuntu 14.04, used with reasonable success to cover a wide variety of deployment needs:

Trivial, yet it incorporates a lot in just ~200 lines of code:

  1. Disk layout management;
  2. Package repositories configuration;
  3. Basic tool set and third party software installation;
  4. Service reconfiguration (NTP, Automatic security updates);
  5. System reconfiguration (limits, sysctl, users, directories, crontab);
  6. Post-reboot startup configuration;
  7. Identity discovery and self-tagging.

As an added bonus, the cloud-init package will properly log all output during the script execution in /var/log/cloud-init-output.log for failure investigations. The current script uses the -ex bash parameters, which means it will explicitly echo all executed commands (-x) and exit at the first sign of unsuccessful command execution (-e).

NOTE: One important component, log file management, is purposefully omitted from the template UserData. We plan on discussing that in a separate article.


Small Tip: How to use AWS CLI ‘--filter’ parameter

2015/01/20 AWS, DevOps, Operations, Small Tip

This post will present another useful feature of the AWS CLI tool set, the --filter parameter (documented by AWS as --filters). This command line parameter is available and extremely helpful in the EC2 namespace (aws ec2 describe-*). There are various ways to use the --filter parameter.

1. The --filter parameter can get filtering properties directly from the command line:

2. The --filter parameter can also use a JSON-encoded filter file:

The filters.json file uses the following structure:
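As a minimal sketch of that structure: a filter file is a JSON list of objects, each with a `Name` and a list of `Values`. The helper names below (`make_filter`, `write_filters_file`) and the choice of Python to generate the file are assumptions for illustration only.

```python
import json


def make_filter(name, *values):
    """Build one filter object in the Name/Values form the EC2 API expects."""
    return {"Name": name, "Values": list(values)}


def write_filters_file(path, filters):
    """Write a list of filter objects as a JSON-encoded filter file."""
    with open(path, "w") as fh:
        json.dump(filters, fh, indent=2)


if __name__ == "__main__":
    # Example: only running instances.
    filters = [make_filter("instance-state-name", "running")]
    write_filters_file("filters.json", filters)
    print(json.dumps(filters, indent=2))
```

The resulting file can then be passed to the CLI with the `file://` prefix, for example `aws ec2 describe-instances --filters file://filters.json`.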

There are various AWS CLI components that provide --filter parameters. For additional information check the References section.

To demonstrate the way this functionality can be used in various scenarios, there are several examples:

1. Filter by availability zone:

2. Filter by security group (EC2-Classic):

3. Filter by security group (EC2-VPC):

4. Filter only spot instances:

5. Filter only running EC2 instances:

6. Filter only stopped EC2 instances:

7. Filter by SSH Key name:

8. Filter by Tag:

9. Filter by Tag with a wildcard (‘*’):

10. Filter by multiple criteria (all running instances with string ‘email’ in the value of the Name tag):

11. Filter by multiple criteria (all running instances with empty Name tag):
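Most of the scenarios listed above map onto documented `describe-instances` filter names. The sketch below renders a few of them in the CLI shorthand form (`Name=...,Values=...`). The filter names are the documented ones. The values (zone, group, key and tag names) are hypothetical placeholders, and the helper functions are assumptions for this example.

```python
import shlex

# Filter names are documented EC2 filters; all values are hypothetical.
EXAMPLES = {
    "availability zone": [("availability-zone", ["us-east-1a"])],
    "security group (EC2-Classic)": [("group-name", ["web-servers"])],
    "security group (EC2-VPC)": [("instance.group-id", ["sg-12345678"])],
    "spot instances": [("instance-lifecycle", ["spot"])],
    "running instances": [("instance-state-name", ["running"])],
    "stopped instances": [("instance-state-name", ["stopped"])],
    "SSH key name": [("key-name", ["test-key"])],
    "tag": [("tag:Name", ["web-01"])],
    "tag with wildcard": [("tag:Name", ["web-*"])],
    "running, 'email' in Name tag": [
        ("instance-state-name", ["running"]),
        ("tag:Name", ["*email*"]),
    ],
}


def shorthand(name, values):
    """Render one filter in the CLI shorthand syntax."""
    return "Name={},Values={}".format(name, ",".join(values))


def describe_instances_command(filters):
    """Return a shell-quoted `aws ec2 describe-instances` command line."""
    args = ["aws", "ec2", "describe-instances", "--filters"]
    args += [shorthand(name, values) for name, values in filters]
    return shlex.join(args)  # quotes wildcards so the shell leaves them alone


if __name__ == "__main__":
    for label, filters in EXAMPLES.items():
        print("# " + label)
        print(describe_instances_command(filters))
```

Multiple filters are combined with a logical AND, while multiple values inside one filter are alternatives.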

Those examples are very close to production ones used in several large AWS deployments. They are used to:

  • Monitor changes in instance populations;
  • Monitor successful configuration of resources;
  • Track deployment / rollout of new software version;
  • Track stopped instances to prevent unnecessary resource usage;
  • Ensure desired service distributions over availability zones and regions;
  • Ensure service distribution over instances with different lifecycle.

Be sure to utilize this functionality in your monitoring infrastructure. It has been a powerful source of operational insights and a great source of raw data for our intelligent control planes!

If you want to talk more on this subject or just share your experience, do not hesitate to Contact Us!


AWS, DevOps, Outsourcing …

2014/12/08 AWS, DevOps, Operations, Xi Group Ltd.

Is it possible to outsource DevOps?!

We asked ourselves that exact question before Xi Group Ltd. ventured into offering DevOps services. And the answer did not come easy. DevOps is a cultural phenomenon. DevOps relies on close communication and by extension is location-dependent. DevOps is also a technological phenomenon. Software has to be created to implement it. So it seemed that outsourcing was not really viable for DevOps …

Several projects and many months later, we know that this is not true. DevOps can be outsourced. An external company can be an integral part of the DevOps strategy and of day-to-day operational activities. Such cooperation can be beneficial from both cultural and technological perspectives.

From a cultural perspective, choosing a specialized DevOps partner can help you in several ways. They will help you break the “enterprise silo” model and mindset. This one is especially hard, even with some form of internal governance and support. Developers under pressure will keep writing code and let “ops” deal with it later. Any operational shortcomings will be attributed to the “ops guys” because once you ship it “it’s their problem”. A knowledgeable third party, not invested in any of the teams, can facilitate the communication flow between those teams by speaking their language, explaining core architectural and operational principles, demanding proper implementation and keeping an eye on the final goal: easy deployments of functional components. Communication is key! A successful outsourcing partner will know this and will be vocal and active in all phases of the Software Development Lifecycle. It is the experience that comes from complex deployments of large-scale distributed systems that you should look for in your DevOps partners.

From a technology perspective, choosing a specialized DevOps partner can also be a beneficial endeavor. Chances are they already have several projects under their belt. They have the basic tooling already developed and can shorten your implementation lifecycle with components and know-how you would otherwise have to develop on your own. A proper DevOps partner will supply you with proper technology choices for the components you develop/operate. And NO, Docker is not always the answer! There are other ways to achieve immutable infrastructure. Build process automation, deployment automation, monitoring and log processing are already part of our daily arsenal of tools. We developed those, validated their usability in production environments … and now you can benefit from them too! Those are the basics you should expect from a DevOps partner. You should expect active input on technology, operational requirements, non-functional requirements, predictability requirements, monitoring and scalability. Anything less … is not DevOps.

So … Is it possible to outsource DevOps?! Yes, we believe it is!

What should you look for in a partner? Experience with complex deployments and proper tooling to shorten your implementation cycle … as a start.

Do you want to know more?! … Contact us!

Small Tip: How to use --block-device-mappings to manage instance volumes with AWS CLI

2014/11/26 AWS, Development, DevOps, Operations, Small Tip

This post will present one of the less popular features of the AWS CLI tool set: how to deal with EC2 instance volumes through the use of the --block-device-mappings parameter. A previous post, Small Tip: Use AWS CLI to create instances with bigger root partitions, already presents one of the common use cases, modifying the instance root partition size. However, use of ‘--block-device-mappings’ can go far beyond this simple feature.

The default documentation, although a good start, is somewhat limited. Several tips and tricks will be presented here.

The location of the JSON block device mapping specification can be quite flexible. The mappings can be supplied:

1. Using the command line directly:

2. Using a file as a source:

3. Using a URL as a source:


Other common scenarios:

1. To reorder default ephemeral volumes to ensure stability of the environment:

NOTE: Useful for additional UserData processing or deployments with hardcoded settings.

2. To allocate an additional EBS Volume with a specific size (100GB), to be associated with the EC2 instance:

NOTE: Useful for cases where cheaper instance types are outfitted with big volumes (disk-intensive tasks run on low-CPU/MEM instance types).

3. To allocate a new volume from a Snapshot ID:

NOTE: Useful for pre-loading newly created instances with specific disk data while still retaining the ability to modify the local copy.

4. To omit mapping of a particular Device Name:

NOTE: Useful to overwrite default AWS behavior.

5. To allocate a new EBS Volume with explicit termination behavior (keep after instance termination):

NOTE: Useful to keep instance data after termination; the additional cost may be significant if those volumes are not released after examination.

6. To allocate a new, encrypted EBS Volume with Provisioned IOPS:

NOTE: Useful to set minimum required performance levels (I/O Operations Per Second) for the specified volume.
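The scenarios above all come down to small JSON documents passed to --block-device-mappings. The sketch below builds one mapping per scenario. The keys (`DeviceName`, `VirtualName`, `NoDevice` and the `Ebs` members) are the documented ones. The device names, snapshot ID and IOPS figure are hypothetical, and the helper functions are assumptions for this example.

```python
import json

# Ebs keys used by the scenarios in this post.
EBS_KEYS = {"VolumeSize", "SnapshotId", "DeleteOnTermination",
            "VolumeType", "Iops", "Encrypted"}


def ephemeral(device, index):
    """Pin an instance-store (ephemeral) volume to a fixed device name."""
    return {"DeviceName": device, "VirtualName": "ephemeral{}".format(index)}


def ebs(device, **settings):
    """Describe an EBS volume; `settings` become members of the Ebs object."""
    unknown = set(settings) - EBS_KEYS
    if unknown:
        raise ValueError("unsupported Ebs keys: {}".format(sorted(unknown)))
    return {"DeviceName": device, "Ebs": dict(settings)}


def suppress(device):
    """Omit the mapping of a device that would otherwise be attached."""
    return {"DeviceName": device, "NoDevice": ""}


# One mapping list per scenario; all concrete values are hypothetical.
SCENARIOS = {
    "reorder ephemeral volumes": [ephemeral("/dev/sdb", 0), ephemeral("/dev/sdc", 1)],
    "additional 100GB volume": [ebs("/dev/sdf", VolumeSize=100)],
    "volume from snapshot": [ebs("/dev/sdf", SnapshotId="snap-12345678")],
    "omit a device": [suppress("/dev/sdc")],
    "keep after termination": [ebs("/dev/sdf", VolumeSize=100, DeleteOnTermination=False)],
    "encrypted, provisioned IOPS": [
        ebs("/dev/sdf", VolumeSize=100, VolumeType="io1", Iops=1000, Encrypted=True)
    ],
}

if __name__ == "__main__":
    for label, mappings in SCENARIOS.items():
        print("# " + label)
        print(json.dumps(mappings))
```

Each printed line can be used as the value of --block-device-mappings on `aws ec2 run-instances`, either inline or saved to a file and referenced with `file://`.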

The outlined functionality should cover a wide range of potential use cases for DevOps engineers who want to use automation to customize their infrastructure. Flexible instance volume management is a key ingredient for successful implementation of the ‘Infrastructure-as-Code’ paradigm!


How to implement multi-cloud deployment for scalability and reliability

2014/07/18 AWS, Development, DevOps, Operations, theCloud


This post will present an interesting approach to scalability and reliability:

How to implement multi-cloud application deployment?!

There are many reasons why this is an interesting topic. Avoiding provider lock-in, reducing cloud provider outage impact, increasing world-wide coverage and disaster recovery / preparedness are only some of them. The obvious benefits of multi-cloud deployment are increased reliability and outage impact minimization. However, there are drawbacks too: supporting different sets of code to accommodate similar, but different services, increased cost, increased infrastructure complexity, different tools … Yet, despite the drawbacks, the possible benefits far outweigh the negatives!

In the following article a simple service will be deployed in automated fashion over two different Cloud Service Providers: Amazon AWS and Joyent. A third provider, CloudFlare, will be used to service DNS requests. The choice of providers is not random. They are chosen because of particular similarities and because of their ease of use. All of those providers have consistent, comprehensive APIs that allow automation through programming in parallel to the command line tools.

Preliminary information

The service setup, described here, although synthetic, is representative of multiple usage scenarios. More complex scenarios are also possible. Special care should be taken to address use of common resources or non-replicable resources/states. Understand the dependencies of your application architecture before using multi-cloud setup. Or contact Xi Group Ltd. to aid you in this process!

The following Cloud Service Providers will be used to deploy executable code on:

DNS requests will be served by CloudFlare. The test domain is:

Required tools are:

Additional information can be found in AWS CLI, Joyent CloudAPI Documentation and CloudFlare ClientAPI.

Implementation Details

A service, the website for the test domain, has to be deployed over multiple clouds. For simplicity, it is assumed that this is a static web site, served by NginX. It will run on Ubuntu 14.04 LTS. The instance types chosen in both AWS and Joyent are pretty limited, but should provide enough computing power to run NginX and serve static content. CloudFlare must be configured with basic settings for the DNS zone it will serve (in this case, the free CloudFlare account is used).

Each computing instance, when bootstrapped or restarted, will start NginX and register itself in CloudFlare. At that point it should be able to receive client traffic. Upon termination or shutdown, each instance should remove its own entries from CloudFlare, thus preventing DNS zone pollution with dead entries. In a previous article, How to implement Service Discovery in the Cloud, it was discussed how DNS-SD can be implemented for a similar setup with increased client complexity. In a multi-tier architecture this is a proper solution. However, lack of control over the client browser may prove that a simplistic solution, like the one described here, is a better choice.


The CloudFlare setup uses the free account, and one domain is configured:

[Image: Screen Shot 2014-07-18 at 1.18.32 PM]

Basic configuration includes only one entry for the zone name:

[Image: Screen Shot 2014-07-18 at 1.19.03 PM]

As seen by the orange cloud icon, the requests for this record will be routed through CloudFlare’s network for inspection and analysis!

AWS UserData / Joyent Script

To automate the process of configuring instances, the following UserData script will be used:

This UserData script contains three components:

  1. Lines 0 – 62: Boilerplate, OS update, installation and configuration of NginX;

  2. Lines 64 – 215: The main script, which will be called on startup and shutdown of the instance. It will register the instance’s public IP address with CloudFlare and set the required protection. By default, protection and acceleration are off. Additional configuration is required to make this script work for your setup: account details must be configured in the specified variables!

  3. Lines 217 – 228: Setting proper script permissions, configuring automatic start of the script and executing it to register with CloudFlare.

The code is reasonably straightforward. The init.d startup script is divided into multiple functions and output is redirected to a log file for debugging purposes. External dependencies are kept to a minimum. Distinguishing between AWS EC2 and Joyent instances is done by analyzing the instance ID. In AWS, all EC2 instances have instance IDs starting with ‘i-‘, while Joyent uses (by the looks of it) some sort of UUID. This part of the logic is particularly important if the code should be extended to support other cloud providers!

Both AWS and Joyent offer Ubuntu 14.04 support, so the same code can be used to configure the instances in automated fashion. This is particularly handy when it comes to data driven instance management and the DRY principle. Command line tools for both cloud providers also offer similar syntax, which makes it easier to utilize this functionality.
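The provider check described above can be sketched as follows. This is an illustration only: the function name, the returned labels and the use of Python are assumptions, and the UUID pattern simply encodes the "some sort of UUID" observation rather than any documented Joyent guarantee.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID layout.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def detect_provider(instance_id):
    """Guess the cloud provider from the shape of the instance ID.

    Returns "aws" for IDs starting with "i-", "joyent" for UUID-shaped IDs
    and "unknown" otherwise, so new providers can be added explicitly.
    """
    if instance_id.startswith("i-"):
        return "aws"
    if UUID_PATTERN.fullmatch(instance_id.lower()):
        return "joyent"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical sample IDs.
    for sample in ("i-0a1b2c3d", "286b0dc0-d09e-43f2-976a-bb1880ebdb6c", "vm-42"):
        print(sample, detect_provider(sample))
```

Returning an explicit "unknown" instead of defaulting to one provider keeps the logic safe to extend when a third provider is added.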

Amazon AWS

Starting new instances within Amazon AWS is straightforward, assuming awscli is properly configured:


Starting new instances within Joyent is somewhat more complex, but there is comprehensive documentation:

This particular example will start a new SmartMachine instance using the 4dad8aa6-2c7c-e20a-be26-c7f4f1925a9a package (g3-devtier-0.25-kvm, 3rd generation, virtual machine (KVM) with 256MB RAM) and the 286b0dc0-d09e-43f2-976a-bb1880ebdb6c (ubuntu-certified-14.04) image. SSH key details are supplied through the specific combination of Web-interface settings and SSH key signature. For the list of available packages (instance types) and images (software stacks) consult the API: ListPackages, ListImages.

NOTE: Joyent offers rich Metadata support, which can be quite a flexible tool when managing a large number of instances!

Successful service configuration

Successful service configuration will result in proper DNS entries being added to the DNS zone in CloudFlare:

[Image: Screen Shot 2014-07-18 at 4.12.43 PM]

After the configured TTL expires, those should be visible world-wide:

As seen, both the AWS and the Joyent IP addresses are returned, i.e. DNS Round-Robin. The service can simply be tested with:

Resulting calls can be seen in the NginX log files on both instances:

[Image: Screen Shot 2014-07-18 at 5.30.50 PM]

NOTE: CloudFlare protection and acceleration features are explicitly disabled in this example! It is strongly suggested to enable them for production purposes!


It should be clear now that whenever the software architecture follows certain design principles and the application is properly decoupled in multiple tiers, the whole system can be deployed within multiple cloud providers. DevOps principles for automated deployment can be implemented in this environment as well. The overall system gains improved scalability, reliability and, in the case of data driven elastic deployments, even cost! Proper design is key, but the technology provided by companies like Amazon and Joyent makes it easier to turn whiteboard drawings into actual systems with hundreds of nodes!


Small Tip: How to use AWS CLI to start Spot instances with UserData

2014/07/12 AWS, DevOps, Operations, Small Tip

A common occurrence in the list of daily DevOps tasks is dealing with AWS EC2 Spot Instances. They offer the same performance as their OnDemand counterparts and they are cheap, to the extent that the user can specify the hourly price. The drawback is that AWS can reclaim them if the market price goes above the user’s price. Still, those are a key component, a basic building block, in every modern elastic system. As such, DevOps engineers must regularly interact with them.

AWS provides a proper command line interface: aws ec2 request-spot-instances exposes multiple options to the user. However, some of the common use cases are not comprehensively covered in the documentation. For example, creating Spot Instances with UserData using the command line tools is somewhat obscure and convoluted, although it is a common need in the lives of DevOps engineers and developers. The tricky part: it must be BASE64 encoded!

Assume the following simple UserData script must be deployed on numerous EC2 Spot Instances:

Make sure the base64 command is available on your system, or use an equivalent, to encode the sample file before passing it to the launch specification:

In this example two spot instance requests will be created for m3.medium instances, using the ami-a6926dce AMI and the test-key SSH key, running in the test-sg Security Group. The BASE64-encoded contents of the UserData script will be attached to the request, so upon fulfillment the UserData will be passed to the newly created instances and executed after boot-up.
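The encoding step and the launch specification can be sketched as below. The `ImageId`, `InstanceType`, `KeyName`, `SecurityGroups` and `UserData` keys are the documented launch specification members, and the AMI, key, security group, count and price are the values mentioned in this post. The sample UserData text and the helper names are hypothetical, and Python stands in here for the base64 command.

```python
import base64
import json
import shlex

# Hypothetical stand-in for the UserData script discussed in the post.
SAMPLE_USER_DATA = "#!/bin/bash -ex\necho 'hello from UserData'\n"


def encode_user_data(script_text):
    """BASE64-encode the UserData script, as the Spot API requires."""
    return base64.b64encode(script_text.encode("utf-8")).decode("ascii")


def launch_specification(image_id, instance_type, key_name,
                         security_groups, user_data_script):
    """Build the JSON document passed to --launch-specification."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SecurityGroups": list(security_groups),
        "UserData": encode_user_data(user_data_script),
    }


def request_spot_command(spec, price, count):
    """Return a shell-quoted `aws ec2 request-spot-instances` command line."""
    return shlex.join([
        "aws", "ec2", "request-spot-instances",
        "--spot-price", price,
        "--instance-count", str(count),
        "--launch-specification", json.dumps(spec),
    ])


if __name__ == "__main__":
    spec = launch_specification("ami-a6926dce", "m3.medium", "test-key",
                                ["test-sg"], SAMPLE_USER_DATA)
    print(request_spot_command(spec, "0.01", 2))
```

Building the specification as a dictionary and serializing it once avoids hand-editing JSON with an embedded BASE64 string.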

Spot instance requests will be created in the AWS EC2 Dashboard:

[Image: Screen Shot 2014-07-12 at 9.11.20 PM]

Once the Spot Instance Requests (SIRs) are fulfilled, an InstanceID will be associated with each SIR:

[Image: Screen Shot 2014-07-12 at 9.18.24 PM]

EC2 Instances dashboard will show newly created Spot Instances (notice the “Lifecycle: spot” in Instance details):

[Image: Screen Shot 2014-07-12 at 9.20.30 PM]

Using the proper credentials, one can verify successful execution of the UserData script on each instance:

… and more importantly, if the configured service works as expected:

Newly created Spot Instances are serving traffic, running at 0.01 USD/hr and will happily do so until the market price for this instance type goes above the specified price!


Small Tip: AWS announces T2 instance types

2014/07/04 AWS, Development, DevOps, Operations, Small Tip

One of the oldest and probably one of the most popular instance types, the t1.micro, was recently upgraded by AWS. Three new instance types were introduced to fill the gap between t1.micro and the next current-generation type, m3.medium. The new generation is called T2, uses only HVM-based virtualization and comes with EBS-only storage. There are three new instance types:

  1. t2.micro
  2. t2.small
  3. t2.medium

Those instance types are all “Burstable Performance Instances”, which means they are suitable for bursty, non-sustained loads. This is also supported by the EBS-only storage, which effectively means that high-volume I/O is out of the question. The fact that those instances all use HVM-based virtualization, however, supports a quick SCALE-UP to more potent instance types if the need arises. One notable remark here is that T2 instances are VPC-only, which is a strong indication of the will to move everything into VPCs nowadays. AWS wants you to start using VPCs from the start!

The instance resource matrix now looks like this:

Instance Type   Virtualization Type   CPU Cores   Memory     Storage
t1.micro        PV                    1           0.613 GB   EBS Only
t2.micro        HVM                   1           1 GB       EBS Only
m1.small        PV                    1           1.7 GB     EBS Only
t2.small        HVM                   1           2 GB       EBS Only
m3.medium       HVM                   1           3.75 GB    EBS + SSD
t2.medium       HVM                   2           4 GB       EBS Only

As stated by AWS, the target uses for the new T2 instance type family include:

  • Development environments;
  • Private experimentation;
  • Educational use;
  • Build servers / Code repositories;
  • Low-traffic web applications;
  • Small databases.

To evaluate the meaning of “Burstable Performance Instances“, here are CPU benchmark results on several instance types:

Instance Type   DES crypts/s   MD5 crypts/s   Blowfish crypts/s   Generic crypts/s
t1.micro        ~ 2 407 000    ~ 6 869        ~ 442               ~ 187 257
t2.micro        ~ 4 757 000    ~ 14 164       ~ 851               ~ 344 928
m1.small        ~ 1 218 000    ~ 3 480        ~ 222               ~ 92 870
t2.small        ~ 4 993 000    ~ 14 245       ~ 854               ~ 347 961
m3.medium       ~ 2 272 000    ~ 6 429        ~ 386               ~ 158 342
t2.medium       ~ 5 045 000    ~ 14 592       ~ 878               ~ 356 544

All instances use default settings for storage, Amazon Linux AMI 2014.03.2 and John The Ripper 1.8.0, measuring real crypts with many salts! The test is fairly synthetic, but answers the key question: what difference does it make to have a Burstable instance type? And the answer: as long as the CPU load is not sustained, a T2 instance is roughly twice as fast as t1.micro and m3.medium, and about four times as fast as m1.small!
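To make the comparison concrete, the ratios can be computed directly from the benchmark table. The sketch below uses only the figures published above. The function name and the choice of Python are assumptions for the example.

```python
# Benchmark figures from the table above, in crypts per second:
# (DES, MD5, Blowfish, Generic)
RESULTS = {
    "t1.micro": (2407000, 6869, 442, 187257),
    "t2.micro": (4757000, 14164, 851, 344928),
    "m1.small": (1218000, 3480, 222, 92870),
    "t2.small": (4993000, 14245, 854, 347961),
    "m3.medium": (2272000, 6429, 386, 158342),
    "t2.medium": (5045000, 14592, 878, 356544),
}


def speedup(new, old):
    """Per-benchmark ratio of `new` over `old`, rounded to two decimals."""
    return [round(n / o, 2) for n, o in zip(RESULTS[new], RESULTS[old])]


if __name__ == "__main__":
    for new, old in (("t2.micro", "t1.micro"),
                     ("t2.small", "m1.small"),
                     ("t2.medium", "m3.medium")):
        print("{} vs {}: {}".format(new, old, speedup(new, old)))
```

The ratios only describe short, bursty runs like this benchmark; under sustained load a T2 instance is throttled once its CPU credits are used up.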

Price-wise the new instance types are also better. A reduction of On Demand prices of more than 35% allows you to run t2.micro for less than 10 USD/m! Watch out, DigitalOcean! Obviously, Amazon wants to change the already established “AWS for business, DigitalOcean for home” mantra into “AWS Everywhere”.

In conclusion, the new T2 instance type family closes the gap between the unacceptably low-performance instance type (t1.micro) and the too expensive instance types (m1.small, m3.medium), which creates the sweet spot for entry-level users, cloud enthusiasts and home users. As someone said: “Now you have an instance type to run WordPress on!”