How to implement multi-cloud deployment for scalability and reliability

2014/07/18 AWS, Development, DevOps, Operations, theCloud , , , , , , , , ,


This post will present interesting approach to scalability and reliability:

How to implement multi-cloud application deployment ?!

There are many reasons why this is interesting topic. Avoiding provider lockdown, reducing cloud provider outage impact, increasing world-wide coverage, disaster recovery / preparedness are only some of them. The obvious benefits of multi-cloud deployment are increased reliability and outage impact minimization. However, there are drawbacks too: supporting different sets of code to accommodate similar, but different services, increased cost, increased infrastructure complexity, different tools … Yet, despite the drawbacks, the possible benefits far outweigh the negatives!

In the following article a simple service will be deployed in automated fashion over two different Cloud Service Providers: Amazon AWS and Joyent. Third provider, CloudFlare, will be used to service DNS requests. The choice of providers is not random. They are chosen because of particular similarities and because the ease of use. All of those providers have consistent, comprehensive APIs that allow automation through programming in parallel to the command line tools.

Preliminary information

The service setup, described here, although synthetic, is representative of multiple usage scenarios. More complex scenarios are also possible. Special care should be taken to address use of common resources or non-replicable resources/states. Understand the dependencies of your application architecture before using multi-cloud setup. Or contact Xi Group Ltd. to aid you in this process!

The following Cloud Service Providers will be used to deploy executable code on:

DNS requests will be served by CloudFlare. The test domain is:

Required tools are:

Additional information can be found in AWS CLI, Joyent CloudAPI Documentation and CloudFlare ClientAPI.

Implementation Details

A service, website for, has to be deployed over multiple clouds. For simplicity, it is assumed that this is a static web site, served by NginX. It will run on Ubuntu 14.04 LTS. Instance types chosen in both AWS and Joyent are pretty limited, but should provide enough computing power to run NginX and serve static content. CloudFlare must be configured with basic settings for the DNS zone it will serve (in this case, the free CloudFlare account is used).

Each computing instance, when bootstrapped or restarted, will start the NginX and register itself in CloudFlare. At that point it should be able to receive client traffic. Upon termination or shutdown, each instance should remove its own entries from CloudFlare thus preventing DNS zone pollution with dead entries. In a previous article, How to implement Service Discovery in the Cloud, it was discussed how DNS-SD can be implemented for similar setup with increased client complexity. In a multi-tier architecture this a proper solution. However, lack of control over the client browser may prove that a simplistic solution, like the one described here, is a better choice.


CloudFlare setup uses the free account and one domain,, is configured:

Screen Shot 2014-07-18 at 1.18.32 PM

Basic configuration includes only one entry for the zone name:

Screen Shot 2014-07-18 at 1.19.03 PM

As seen by the orange cloud icon, the requests for this record will be routed through CloudFlare’s network for inspection and analysis!

AWS UserData / Joyent Script

To automate the process of configuring instances, the following UserData script will be used:

This UserData script contains three components:

  1. Lines 0 – 62: Boilerplate, OS update, installation and configuration of NginX;

  2. Lines 64 – 215:, main script that will be called on startup and shutdown of the instance. will register the instance’s public IP address with CloudFlare and set required protection. By default, protection and acceleration is off. Additional configuration is required to make this script work for your setup, account details must be configured in the specified variables!

  3. Lines 217 – 228: Setting proper script permissions, configuring automatic start of and executing it to register with CloudFlare.

Code is reasonably straight-forward. init.d startup script is divided to multiple functions and output is redirected to a log file for debugging purposes. External dependencies are kept to a minimum. Distinguishing between AWS EC2 and Joyent instances is done by analyzing the instance ID. In AWS, all EC2 instances have instance IDs starting with ‘i-‘, while Joyent uses (by the looks of it) some sort of UUID. This part of the logic is particularly important if the code should be extended to support other cloud providers!

Both AWS and Joyent offer Ubuntu 14.04 support, so the same code can be use to configure the instances in automated fashion. This is particularly handy when it comes to data driven instance management and the DRY principle. Command line tools for both cloud providers also offer similar syntax, which makes it easier to utilize this functionality.

Amazon AWS

Staring new instances within Amazon AWS is straight-forward, assuming awscli is properly configured:


Starting news instances within Joyent is somewhat more complex, but there is comprehensive documentation:

This particular example will start new SmartMachine instance using the 4dad8aa6-2c7c-e20a-be26-c7f4f1925a9a package (g3-devtier-0.25-kvm, 3rd generation, virtual machine (KVM) with 256MB RAM) and 286b0dc0-d09e-43f2-976a-bb1880ebdb6c (ubuntu-certified-14.04) image. SSH key details are supplied through the specific combinations of Web-interface settings and SSH key signature. For the list of available packages (instance types) and images (software stacks) consult the API: ListPackages, ListImages.

NOTE: Joyent offers rich Metadata support, which can be quite flexible tool when managing large number of instances!

Successful service configuration

Successful service configuration will result in proper DNS entries to be added to the DNS zone in CloudFlare:

Screen Shot 2014-07-18 at 4.12.43 PM

After configured TTL, those should be visible world-wide:

As seen, both AWS ( and Joyent ( IP addresses are returned, i.e. DNS Round-Robin. Service can simply be tested with:

Resulting calls can be seen in the NginX log files on both instances:

Screen Shot 2014-07-18 at 5.30.50 PM

NOTE: CloudFlare protection and acceleration features are explicitly disabled in this example! It is strongly suggested to enabled them for production purposes!


It should be clear now, that whenever software architecture follows certain design principles and application is properly decoupled in multiple tiers, the whole system can be deployed within multiple cloud providers. DevOps principles for automated deployment can be implemented in this environment as well. The overall system is with improved scalability, reliability and in case of data driven elastic deployments, even cost! Proper design is key, but the technology provided by companies like Amazon and Joyent make it easier to turn whiteboard drawings into actual systems with hundreds of nodes!


How to implement Service Discovery in the Cloud

2014/06/17 AWS, Development, DevOps, theCloud No comments , , , , , , , ,


Service Discovery is not new technology. Unfortunately, it is barely understood and rarely implemented. It is a problem that many system architects face and it is key to multiple desirable qualities of a modern, cloud enabled, elastic distributed system such as reliability, availability, maintainability. There are multiple ways to approach service discovery:

  • Hardcode service locations;
  • Develop proprietary solution;
  • Use existing technology.

Hardcoding is still the common case. How often do you encounter hardcoded URLs in configuration files?! Developing proprietary solution becomes popular too. Multiple companies decided to address Service Discovery by implementing some sort of distributed key-value store. Amongst the popular ones: Etsy’s etcd, Heroku’s Doozer, Apache ZooKeeper, Google’s Chubby. Even Redis can used for such purposes. But for many cases additional software layers and programming complexity is not needed. There is already existing solution based on DNS. It is called DNS-SD and is defined in RFC6763.

DNS-SD utilizes PTR, SRV and TXT DNS records to provide flexible service discovery. All major DNS implementations support it. All major cloud providers support it. DNS is well established technology, well understood by both Operations and Development personnel with strong support in programming languages and libraries. It is highly-available by replication.

How does DNS-SD work?

DNS-SD uses three DNS records types: PTR, SRV, TXT:

  • PTR record is defined in RFC1035 as “domain name pointer”. Unlike CNAME records no processing of the contents is performed, data is returned directly.
  • SRV record is defined in RFC2782 as “service locator”. It should provide protocol agnostic way to locate services, in contrast to the MX records. It contains four components: priority, weight, port and target.
  • TXT record is defined in RFC1035 as “text string”.

There are multiple specifics around protocol and service naming conventions that are beyond the scope of this post. For more information please refer to RFC6763. For the purposes of this article, it is assumed that a proprietary TCP-based service, called theService that has different reincarnations runs on TCP port 4218 on multiple hosts. The basic idea is:

  1. Create a pointer record for _theSerivce that contains all available incarnations of the service;
  2. For each incarnation create SRV record (where the service is located) and TXT record (any additional information for the client) that specify the service details.

This is what sample configuration looks like in AWS Route53 for the domain:
Screen Shot 2014-06-14 at 10.41.42 PM

Using nslookup results can be verified:

Now a client that wants to use incarnation1 of theService has means to access it (Host:, Port: 4218).

Load-balaing can be implementing by adding another entry in the service locator record with the same priority and weight:
Screen Shot 2014-06-14 at 10.42.28 PM

Resulting DNS lookup:

In a similar way, fail-over can be implemented by using different priority (or load distribution using different weights):
Screen Shot 2014-06-14 at 10.54.41 PM

Resulting DNS lookup:

NOTE: With DNS the client is the one to implement the load-balacing or the fail-over (although there are exceptions to this rule)!

Benefits of using DNS-SD for Service Discovery

This technology can be used to support multiple version of a service. Using the built-in support for different reincarnations of the same service, versioning can be implemented in clean granular way. Common problem in REST system, usually solved by nasty URL schemes or rewriting URLs. With DNS-SD required metadata can be passed through the TXT records and multiple versions of the communication protocol can be supported, each in contained environment … No name space pollution, no clumsy URL schemes, no URL rewriting …

This technology can be utilized to reduce complexity while building distributed systems. The clients will most certainly go through the process of name resolution anyway, so why not incorporate service discover in it?! Instead of dealing with external system (installation, operation, maintenance) and all the possible issues (hard to configure, hard to maintain, immature, fault-intollerant, requires additional libraries in the codebase, etc), incorporate this with the name resolution. DNS is well supported on virtually all operating systems and with all programming languages that provide network programming abilities. System architecture complexity is reduced because subsystem that already exists is providing additional services, instead of introducing new systems.

This technology can be utilized to increase reliability / fault-tolerance. Reliability / fault-tolerance can be easily increased by serving multiple entries with the service locator records. Priority can be used by the client to go through the list of entries in controlled manner and weight to balance the load between the service providers on each priority level. The combination of backend support (control plane updating DNS-SD records) and reasonably intelligent clients (implementing service discovery and priority/weight parsing) should give granular control over the fail-over and load-balancing processes in the communication between multiple entities.

This technology supports system elasticity. Modern cloud service providers have APIs to control DNS zones. In this article, AWS Route53 will be used to demonstrate how elastic service can be introduced through DNS-SD to clients. Backend service scaling logic can modify service locator records to reflect current service state as far as DNS zone modification API is available. This is just part of the control plane for the service …

Bonus point: DNS also gives you simple, replicated key-value store through TXT records!

Implementation of Service Discovery with DNS-SD, AWS Route53, AWS IAM and AWS EC2 UserData

Following is a set of steps and sample code to implement Service Discovery in AWS, using Route53, IAM and EC2.

Manual configuration

1. Create PTR and TXT Records for theService in Route53:
Screen Shot 2014-06-16 at 2.07.22 PM

This is a simple example for one service with one incarnation (v1).

NOTE: There is no SRV since the service is currently not running anywhere! Active service providers will create/update/delete SRV entries.

2. Create IAM role for EC2 instances to be able to modify DNS records in desired Zone:
Screen Shot 2014-06-16 at 2.24.34 PM

Use the following policy:

… where XXXXYYYYZZZZ is your hosted zone ID!

Automated JOIN/LEAVE in service group

Manual settings, outlined in the previous section give the basic framework of the DNS-SD setup. There is no SRV record since there are no active instances providing the service. Ideally, each active service provider will register/de-register with the service when available. This is key here: DNS-SD can be integrated cleanly with the elastic nature of the cloud. Once this integration is at place, all clients will only need to resolve DNS records in order to obtain list of active service providers. For demonstration purposes the following script was created:

Copy of the code can be downloaded from

This code, given DNS zone, service name and service port, will update necessary DNS records to join or leave the service group.

Starting with initial state:
Screen Shot 2014-06-16 at 4.59.40 PM

Executing JOIN:

Screen Shot 2014-06-16 at 4.59.07 PM

Executing LEAVE:

Screen Shot 2014-06-16 at 4.59.40 PM

Domain-specific hostname is created, service location record (SRV) is created with proper port and hostname. When host leaves the service group, domain-specific hostname is removed, so is the entry in the SRV record, or the whole record if this is the last entry.

Fully automated setup

UserData will be used to fully automate the process. There are many options: Puppet, Chef, Salt, Ansible and all of those can be used, but the UserData solution is with reduced complexity, no external dependencies and can be directly utilized by other AWS Services like CloudFormation, AutoScalingGroups, etc.

The full UserData content is as follows:

Copy of the code can be downloaded from

Starting 3 test instances to verify functionality:

Resulting changes to Route53:
Screen Shot 2014-06-16 at 7.21.05 PM

Three new boxes self-registered in the Service group. Stopping manually one leads to de-registration:
Screen Shot 2014-06-16 at 7.22.19 PM

Elastic systems are possible to implement with DNS-SD! Note however, that the DNS records are limited to 65536 bytes, so the amount of entries that can go into SRV record, although big, is limited!

Client code

To demonstrate DNS-SD resolution, the following sample code was created:

Copy of the code can be downloaded from

Why would that be better?! Yes, there is added complexity in the name resolution process. But, more importantly, details needed to find the service are agnostic to its location, or specific to the client. Service-specific infrastructure can change, but the client will not be affected, as long as the discovery process is performed.

Sample run:

VoilĂ ! Reliable Service Discovery in elastic systems!

Additional Notes

Some additional notes and well-knowns:

  • Examples in this article could be extended to support fail-over or more sophisticated forms of load-balancing. Current random.choice() solution should be good enough for the generic case;

  • More complex setup with different priorities and weights can be demonstrated too;

  • Service health-check before DNS-SD registration can be demonstrated too;

  • Non-HTTP service can be demonstrated to use DNS-SD. Technology is application-agnostic.

  • TXT contents are not used throughout this article. Those can be used to carry additional meta-data (NOTE: This is public! Anyone can query your DNS TXT records with this setup!).


Quick implementation of DNS-SD with AWS Route53, IAM and EC2 was presented in this article. It can be used as a bare-bone setup that can be further extended and productized. It solves common problem in elastic systems: Service Discovery! All key components are implemented in either Python or Shell script with minimal dependencies (sudo aptitude install awscli, python-boto, python-requests, python-dnspython), although the implementation is not dependent on a particular programming language.


The “Aggressive DevOps” Concept

2014/06/09 DevOps, theCloud, Xi Group Ltd. , , , ,

Xi Group Ltd. is involved in operational and DevOps activities for several years now and during that period several lessons became clear:

  • Not all architectures are cloud-friendly;
  • You need to know your data flows to benefit from the cloud services;
  • Many companies use theCloud only as easy provisioning technology;
  • Elasticity is hard to implement!

Lets leave the former two for another blog posts and concentrate on the the latter.

Many companies use theCloud only as easy provisioning technology

… and it is natural when you come from a static infrastructure world. However, this is not what theCloud is all about. Yes, you can use it for that, in same way you can use a steam roller to iron your shirts. Yes, there are other arguments why you may want to put something in theCloud, but just copying your infrastructure in theCloud and stating that this makes it resilient of fault-tolerant is plain stupid! AWS had outages, by extension Heroku had outages, Joyent had outages … you need to design for reliability to achieve reliability. “Easy provisioning” is good, but if it’s your main reason, stick with the physical infrastructure. It is probably cheaper in the long run. However, if you want to fully utilize this technology, keep reading!

Elasticity is hard to implement!

Many workloads are elastic in nature, from website traffic to batch processing. The elastic nature of the workload may look like a sine wave or like a spike, the logic is the same. You’d generally want to adjust the amount of resources you allocate so that it’s enough to cover the workload, but also minimize the price you pay. And for any practical problem it is damn hard to do so! Why?! We identified the following reasons:

  • Software environment is chaotic;
  • Software deployment is non-trivial task;
  • Tightly-coupled architectures;
  • Data store functionality misaligned with the nature of the data;
  • Data stores inherently rigid;
  • Lack of operational information;
  • Monitoring wrong resources;
  • Monitoring returns unusable information;
  • There is no ‘control plane’ in the system;
  • ‘Control plane’ depends on infrastructure stability;
  • … and many, many others …

Those helped identify the following “pillars” as the foundation for large-scale elastic deployments:

  1. Build & Deployment Automation
  2. Full-stack Application Monitoring
  3. Data-driven Control Plane

Aggressive DevOps is the process of implementing the pillars in a business environment!

It is our company mission to develop and provide the tooling, the services and the know-how for others to implement large-elastic deployments.

Over a series of blog posts we will go into further details about each of Aggressive DevOps pillars. Stay tuned!