DevOps

Small Tip: AWS announces T2 instance types

2014/07/04 AWS, Development, DevOps, Operations, Small Tip AWS, DevOps, instance types, instances, t2, t2.medium, t2.micro, t2.small

One of the oldest and probably one of the most popular instance types, the t1.micro was recently upgraded by AWS. Three new instance types were introduced to fill the gap between t1.micro and the current-next, m3.medium. The new generation is called T2, uses only HVM based virtualization and comes with EBS only store support. There are three new instance types:

t2.micro
t2.small
t2.medium

Those instance types are all “Burstable Performance Instances” which means they are suitable for unsustained loads. This is also supported by the EBS Only store, which effectively means that high-volume I/O is out of the question. The fact that those instances are all using HVM-based virtualization, however, supports quick SCALE-UP to more potent instance types, if needs arise. One notable remark here is that T2 instances are VPC-only, which is a strong indication of the will to move everything into VPCs nowadays. AWS wants you to start using VPCs from the start!

The instance resource matrix now looks like this:

Instance Type	Virtualization Type	CPU Cores	Memory	Storage
t1.micro	PV	1	0.613 GB	EBS Only
t2.micro	HVM	1	1 GB	EBS Only
m1.small	PV	1	1.7 GB	EBS Only
t2.small	HVM	1	2 GB	EBS Only
m3.medium	HVM	1	3.75 GB	EBS + SSD
t2.medium	HVM	2	4 GB	EBS Only

As stated by AWS, the target uses for the new, T2 instance type family, includes:

Development environments;
Private experimentation;
Educational use;
Build servers / Code repositories;
Low-traffic web applications;
Small databases.

To evaluate the meaning of “Burstable Performance Instances“, here are CPU benchmark results on several instance instance types:

Instance Type	DES crypts/s	MD5 crypts/s	Blowfish crypts/s	Generic crypts/s
t1.micro	~ 2 407 000	~ 6 869	~ 442	~ 187 257
t2.micro	~ 4 757 000	~ 14 164	~ 851	~ 344 928
m1.small	~ 1 218 000	~ 3 480	~ 222	~ 92 870
t2.small	~ 4 993 000	~ 14 245	~ 854	~ 347 961
m3.medium	~ 2 272 000	~ 6 429	~ 386	~ 158 342
t2.medium	~ 5 045 000	~ 14 592	~ 878	~ 356 544

All instances use detault settings for storage, Amazon Linux AMI 2014.03.2, John The Ripper 1.8.0, measuring real crypts with many salts! The test is fairly synthetic, but answers the key question: What difference does it make to have a Burstable instance type? And the answer: If CPU load is not sustainable, it’s more than twice as fast!

Price-wise the new instance types are also better. Cost reduction of On Demand prices of more than 35% allows you to run t2.micro for less than 10 USD/m! Watch out, DigitalOcean! Obviously, Amazon wants change the already established “AWS for business, DigitalOcean for home” mantra into “AWS Everywhere”.

In conclusion, the new, T2 instance type family, closes the gap between unacceptably low performance instance type (t1.micro) and too expensive instances types (m1.small, m3.medium) which creates the sweet-spot for entry users, cloud enthusiast and home users. As someone said: “Now you have an instance type to run WordPress on!”

DevOps Shell Script Template

2014/07/03 Development, DevOps, Operations No comments cloud, DevOps, heterogenous systems, linux, shell script, template

In everyday life of a DevOps engineer you will have to create multiple pieces of code. Some of those will be run once, others … well others will live forever. Although it may be compelling to just put all the commands in a text editor, save the result and execute it, one should always consider the “bigger picture”. What will happen if your script is run on another OS, on another Linux distribution, or even on a different version of the same Linux distribution?! Another point of view is to think what will happen if somehow your neat 10-line-script has to be executed on say 500 servers?! Can you be sure that all the commands will run successfully there? Can you even be sure that all the commands will even be present? Usually … No!

Faced with similar problems on a daily basis we started devising simple solutions and practices to address them. One of those is the process of standardizing the way different utilities behave, the way they take arguments and report errors. Upon further investigation it became clear that a pattern can be extracted and synthesized in a series of template, one can use in daily work to keep common behavior between different utilities and components.

Here is the basic template used in shell scripts:

#!/bin/sh
#
# DESCRIPTION: ... Include functional description ...
#
# Requiresments:
#	awk
#	... 
#	uname
#
# Example usag:
#	$ template.sh -h 
#	$ template.sh -p ARG1 -q ARG2
#

RET_CODE_OK=0
RET_CODE_ERROR=1

# Help / Usage function
print_help() {
	echo "$0: Functional description of the utility"
	echo ""
	echo "$0: Usage"
	echo "    [-h] Print help"
	echo "    [-p] (MANDATORY) First argument"
	echo "    [-q] (OPTIONAL) Second argument"
	exit $RET_CODE_ERROR;
}

# Check for supported operating system
p_uname=`whereis uname | cut -d' ' -f2`
if [ ! -x "$p_uname" ]; then
	echo "$0: No UNAME available in the system"
	exit $RET_CODE_ERROR;
fi
OS=`$p_uname`
if [ "$OS" != "Linux" ]; then
	echo "$0: Unsupported OS!";
	exit $RET_CODE_ERROR;
fi

# Check if awk is available in the system
p_awk=`whereis awk | cut -d' ' -f2`
if [ ! -x "$p_awk" ]; then
	echo "$0: No AWK available in the system!";
	exit $RET_CODE_ERROR;
fi

# Check for other used local utilities
#	bc
#	curl
#	grep 
#	etc ...

# Parse command line arguments
while test -n "$1"; do
	case "$1" in
	--help|-h)
		print_help
		exit 0
		;;
	-p)
		P_ARG=$2
		shift
		;;
	-q)
		Q_ARG=$2
		shift
		;;
	*)
		echo "$0: Unknown Argument: $1"
		print_help
		exit $RET_CODE_ERROR;
		;;
	esac
	
	shift
done

# Check if mandatory argument is present?
if [ -z "$P_ARG" ]; then
	echo "$0: Required parameter not specified!"
	print_help
	exit $RET_CODE_ERROR;
fi

# ... 

# Check if optional argument is present and if not, initialize!
if [ -z "$Q_ARG" ]; then
	Q_ARG="0";
fi

# ... 

# DO THE ACTUAL WORK HERE 

exit $RET_CODE_OK;

#!/bin/sh

# DESCRIPTION: ... Include functional description ...

# Requiresments:

# awk

# ...

# uname

# Example usag:

# $ template.sh -h

# $ template.sh -p ARG1 -q ARG2

RET_CODE_OK=0

RET_CODE_ERROR=1

# Help / Usage function

print_help() {

echo "$0: Functional description of the utility"

echo ""

echo "$0: Usage"

echo " [-h] Print help"

echo " [-p] (MANDATORY) First argument"

echo " [-q] (OPTIONAL) Second argument"

exit $RET_CODE_ERROR;

}

# Check for supported operating system

p_uname=`whereis uname | cut -d' ' -f2`

if [ ! -x "$p_uname" ]; then

echo "$0: No UNAME available in the system"

exit $RET_CODE_ERROR;

OS=`$p_uname`

if [ "$OS" != "Linux" ]; then

echo "$0: Unsupported OS!";

exit $RET_CODE_ERROR;

# Check if awk is available in the system

p_awk=`whereis awk | cut -d' ' -f2`

if [ ! -x "$p_awk" ]; then

echo "$0: No AWK available in the system!";

exit $RET_CODE_ERROR;

# Check for other used local utilities

# bc

# curl

# grep

# etc ...

# Parse command line arguments

while test -n "$1"; do

case "$1" in

--help|-h)

print_help

exit 0

;;

-p)

P_ARG=$2

shift

;;

-q)

Q_ARG=$2

shift

;;

echo "$0: Unknown Argument: $1"

print_help

exit $RET_CODE_ERROR;

;;

esac

shift

done

# Check if mandatory argument is present?

if [ -z "$P_ARG" ]; then

echo "$0: Required parameter not specified!"

print_help

exit $RET_CODE_ERROR;

# ...

# Check if optional argument is present and if not, initialize!

if [ -z "$Q_ARG" ]; then

Q_ARG="0";

# ...

# DO THE ACTUAL WORK HERE

exit $RET_CODE_OK;

Nothing fancy. Basic framework that does the following:

Lines 3 – 13: Make sure basic documentation, dependency list and example usage patterns are provided with the script itself;
Lines 15 – 16: Define meaningful return codes to allow other utils to identify possible execution problems and react accordingly;
Lines 18 – 27: Basic help/usage() function to provide the user with short guidance on how to use the script;
Lines 29 – 52: Dependency checks to make sure all utilities the script needs are available and executable in the system;
Lines 54 – 77: Argument parsing of everything passed on the command line that supports both short and long argument names;
Lines 79 – 91: Validity checks of the argument values that should make sure arguments are passed contextually correct values;
Lines 95 – N: Actual programming logic to be implemented …

This template is successfully used in a various scenarios: command line utilities, Nagios plugins, startup/shutdown scripts, UserData scripts, daemons implemented in shell script with the help of start-stop-daemon, etc. It is also used to allow deployment on multiple operating systems and distribution versions. Resulting utilities and system components are more resilient, include better documentation and dependency sections, provide the user with similar and intuitive way to get help or pass arguments. Error handling is functional enough to go beyond the simple OK / ERROR state. And all of those are important feature when components must be run in highly heterogenous environments such as most cloud deployments!

Small Tip: How to run non-deamon()-ized processes in the background with SupervisorD

2014/06/26 Development, DevOps, Operations, Small Tip background process, daemon, DevOps, linux, long-running, lts, supervisor, supervisord, Ubuntu

The following article will demonstrate how to use Ubuntu 14.04 LTS and SupervisorD to manage the not-so-uncommon case of long running services that expect to be running in active console / terminal. Those are usually quickly / badly written pieces of code that do not use daemon(), or equivalent function, to properly go into background but instead run forever in the foreground. Over the years multiple solutions emerged, including quite the ugly ones (nohup … 2>&1 logfile &). Luckily, there is a better one, and it’s called SupervisorD. With Ubuntu 14.04 LTS it even comes as a package and it should be part of your DevOps arsenal of tools!

In a typical Python / Web-scale environment multiple components will be implemented in a de-coupled, micro-services, REST-based architecture. One of the popular frameworks for REST is Bottle. And there are multiple approaches to build services with Bottle when full-blown HTTP Server is available (Apache, NginX, etc.) or if performance matters. All of those are valid and somewhat documented. But still, there is the case (and it more common than one would think) when developer will create Bottle server to handle simple task and it will propagate into production, using ugly solution like Screen/TMUX or even nohup. Here is a way to put this under proper control.

Test Server code: test-server.py

#!/usr/bin/env python

# Description: Demo Bottle Server to demonstrate use of SupervisorD
#
# How to run:
#       test-server.py -c test-server.conf
#
# Exepects the following configuration file:
#
#       server:
#               bind_ip: 0.0.0.0
#               bind_port: 8080
#
#       configuration_variable: true
#

import argparse
import time
import yaml
import sys

from bottle import route, run, template

# GET: /
@route('/')
def index():
        static_page = """
<html>
<head>
        <title>Test Server</title>
</head>
<body>
        <center><h2>Test Server is working!</h2></center>
</body>
</html>
        """
        return static_page

# Return the server->bind_ip value from the parsed configuration
def get_bind_ip(config):
        if config:
                return config['server']['bind_ip']
        else:
                return None

# Return the server->bind_port value from the parsed configuation
def get_bind_port(config):
        if config:
                return config['server']['bind_port']
        else:
                return None

# Return sample configuration variable
def get_config_data(config):
        if config:
                return config['configuration_variable']
        else:
                return None

# Main entry point for the application
def main():
        """ Main Entry Point for the appliation """

        # Parse command line arguments
        parser = argparse.ArgumentParser(description='Demo Server using Bottle')
        parser.add_argument('-c', '--config', type=str, required=True, dest='config', help='Configuration File Location')

        args = parser.parse_args()
        conf_file = args.config

        # Check config file accessibility
        try:
                conf_fd = open(conf_file, 'r')
        except IOError as e:
                if e.errno == errno.EACCES or e.errno == errno.ENOENT:
                        print("{progname}: Unable to read the configuration file ({config})!".format(progname=sys.argv[0], config=conf_file))
                        sys.exit(1)
        else:
                with conf_fd:
                        config = yaml.load(conf_fd)
                        conf_fd.close()

        # Get configuration data
        bind_ip = get_bind_ip(config)
        bind_port = get_bind_port(config)

        if bind_ip == None or bind_port == None:
                print("{progname}: Required configuration variable is unavailable!".format(progname=sys.argv[0]))
                sys.exit(1)

        config_data = get_config_data(config)

        # Run the web-server
        if config_data == True:
                run(host=bind_ip, port=bind_port)

if __name__ == '__main__':
    main()

#!/usr/bin/env python

# Description: Demo Bottle Server to demonstrate use of SupervisorD

# How to run:

# test-server.py -c test-server.conf

# Exepects the following configuration file:

# server:

# bind_ip: 0.0.0.0

# bind_port: 8080

# configuration_variable: true

import argparse

import time

import yaml

import sys

from bottle import route, run, template

# GET: /

@route('/')

def index():

static_page = """

<html>

<head>

<title>Test Server</title>

</head>

<body>

<center><h2>Test Server is working!</h2></center>

</body>

</html>

"""

return static_page

# Return the server->bind_ip value from the parsed configuration

def get_bind_ip(config):

if config:

return config['server']['bind_ip']

else:

return None

# Return the server->bind_port value from the parsed configuation

def get_bind_port(config):

if config:

return config['server']['bind_port']

else:

return None

# Return sample configuration variable

def get_config_data(config):

if config:

return config['configuration_variable']

else:

return None

# Main entry point for the application

def main():

""" Main Entry Point for the appliation """

# Parse command line arguments

parser = argparse.ArgumentParser(description='Demo Server using Bottle')

parser.add_argument('-c', '--config', type=str, required=True, dest='config', help='Configuration File Location')

args = parser.parse_args()

conf_file = args.config

# Check config file accessibility

try:

conf_fd = open(conf_file, 'r')

except IOError as e:

if e.errno == errno.EACCES or e.errno == errno.ENOENT:

print("{progname}: Unable to read the configuration file ({config})!".format(progname=sys.argv[0], config=conf_file))

sys.exit(1)

else:

with conf_fd:

config = yaml.load(conf_fd)

conf_fd.close()

# Get configuration data

bind_ip = get_bind_ip(config)

bind_port = get_bind_port(config)

if bind_ip == None or bind_port == None:

print("{progname}: Required configuration variable is unavailable!".format(progname=sys.argv[0]))

sys.exit(1)

config_data = get_config_data(config)

# Run the web-server

if config_data == True:

run(host=bind_ip, port=bind_port)

if __name__ == '__main__':

main()

Test server configuration file: test-server.conf

# Sample configuration file in YAML format for test-server.py

server:
    bind_ip: 0.0.0.0
    bind_port: 8080

configuration_variable: true

# Sample configuration file in YAML format for test-server.py

server:

bind_ip: 0.0.0.0

bind_port: 8080

configuration_variable: true

Manual execution of the server code will looks like this:

ubuntu@ip-10-67-161-137:~/test-server$ ./test-server.py -c test-server.conf
Bottle v0.12.0 server starting up (using WSGIRefServer())...
Listening on http://0.0.0.0:8080/
Hit Ctrl-C to quit.

94.155.194.28 - - [23/Jun/2014 12:34:39] "GET / HTTP/1.1" 200 126
^C
ubuntu@ip-10-67-161-137:~/test-server$

ubuntu@ip-10-67-161-137:~/test-server$ ./test-server.py -c test-server.conf

Bottle v0.12.0 server starting up (using WSGIRefServer())...

Listening on http://0.0.0.0:8080/

Hit Ctrl-C to quit.

94.155.194.28 - - [23/Jun/2014 12:34:39] "GET / HTTP/1.1" 200 126

ubuntu@ip-10-67-161-137:~/test-server$

When the controlling terminal is lost the server will be terminated. Obviously, this is neither acceptable, nor desirable behavior.

With SupervisorD (sudo aptitude install supervisor) the service can be properly managed using simple configuration file.

Example SupervisorD configuration file: /etc/supervisor/conf.d/test-server.conf

[program:test-server]
command=/home/ubuntu/test-server/test-server.py -c /home/ubuntu/test-server/test-server.conf
user=ubuntu
redirect_stderr=true

[program:test-server]

command=/home/ubuntu/test-server/test-server.py -c /home/ubuntu/test-server/test-server.conf

user=ubuntu

redirect_stderr=true

To start the service, execute:

ubuntu@ip-10-67-161-137:~$ sudo supervisorctl start test-server
test-server: started
ubuntu@ip-10-67-161-137:~$

ubuntu@ip-10-67-161-137:~$ sudo supervisorctl start test-server

test-server: started

ubuntu@ip-10-67-161-137:~$

To verify successful service start:

ubuntu@ip-10-67-161-137:~$ ps ax
. . . 
 4353 ?        Ss     0:00 /usr/bin/python /usr/bin/supervisord -c /etc/supervisor/supervisord.conf
 4355 ?        S      0:00 python /home/ubuntu/test-server/test-server.py -c /home/ubuntu/test-server/test-server.conf
. . .
ubuntu@ip-10-67-161-137:~$

ubuntu@ip-10-67-161-137:~$ ps ax

. . .

4353 ? Ss 0:00 /usr/bin/python /usr/bin/supervisord -c /etc/supervisor/supervisord.conf

4355 ? S 0:00 python /home/ubuntu/test-server/test-server.py -c /home/ubuntu/test-server/test-server.conf

. . .

ubuntu@ip-10-67-161-137:~$

SupervisorD will redirect stdout and stderr to properly named log files:

ubuntu@ip-10-67-161-137:~$ sudo cat /var/log/supervisor/test-server-stdout---supervisor-ssaGXP.log
Bottle v0.12.0 server starting up (using WSGIRefServer())...
Listening on http://0.0.0.0:8080/
Hit Ctrl-C to quit.

94.155.194.28 - - [23/Jun/2014 13:31:19] "GET / HTTP/1.1" 200 126
ubuntu@ip-10-67-161-137:~$

ubuntu@ip-10-67-161-137:~$ sudo cat /var/log/supervisor/test-server-stdout---supervisor-ssaGXP.log

Bottle v0.12.0 server starting up (using WSGIRefServer())...

Listening on http://0.0.0.0:8080/

Hit Ctrl-C to quit.

94.155.194.28 - - [23/Jun/2014 13:31:19] "GET / HTTP/1.1" 200 126

ubuntu@ip-10-67-161-137:~$

Those log files can be integrated with a centralized logging architecture or processed for error / anomaly detection separately.

SupervisorD also comes with handy, command-line control utility, supervisorctl:

ubuntu@ip-10-67-161-137:~$ sudo supervisorctl status test-server
test-server                      RUNNING    pid 4355, uptime 0:11:40
ubuntu@ip-10-67-161-137:~$

ubuntu@ip-10-67-161-137:~$ sudo supervisorctl status test-server

test-server RUNNING pid 4355, uptime 0:11:40

ubuntu@ip-10-67-161-137:~$

With some additional effort SupervisorD can react to various types of events (http://supervisord.org/events.html) which bring it one step closer to full process monitoring & notification solution!

References

SupervisorD Homepage: http://supervisord.org
Bottle Web Framework: http://bottlepy.org/docs/dev/index.html

Small Tip: EBS volume allocation time is linear to the size and unrelated to the instance type

2014/06/23 AWS, DevOps, Operations, Small Tip allocation time, AWS, AWS CLI, DevOps, EBS, volume

Due to fluctuations in startup times for instances in AWS, it was speculated that allocation of EBS volumes may be the reason for the nondeterministic behavior. This led to an interesting discussion and finally to a small test to determine how volume size of an EBS volume allocated with an instance affect its startup time.

To gather some results the following script was created: https://s3-us-west-2.amazonaws.com/blog.xi-group.com/aws-ebs-allocation-times/aws-single.sh. It will create one instance of the specified type with N GB of Root EBS volume, wait for the instance to properly start and then terminate it. The time for the whole process is measured (e.g. full ‘time-to-service’).

The script was run multiple times for each instance type and EBS volume size. Results are presented in the following table:

	t1.micro	c1.xlarge	m3.xlarge	m3.2xlarge	m2.4xlarge
20 GB	~ 1m 50s	~ 1m 45s	~ 1m 50s	~ 2m 15s	~ 3m 20s
50 GB	~ 2m 45s	~ 2m 40s	~ 2m 50s	~ 2m 40s	~ 3m 10s
100 GB	~ 3m 45s	~ 3m 30s	~ 3m 30s	~ 4m 20s	~ 5m 00s
200 GB	~ 6m 00s	~ 6m 10s	~ 9m 00s	~ 5m 45s	~ 7m 30s

Graphical representation:

As shown, instance start time grows linearly with the size of the EBS Root volume. Moral of the story:

The more EBS storage you allocate at boot, the slower the instance will start!

NOTE: The whole procedure is reasonably time consuming if you gather multiple data points (in this case, for each instance type / volume size the script was run 3 times and the average value is shown). It will cost money, since all EC2 allocations will be charged for at least an hour. The script, provided here is ‘AS IS’ and can be used as reference. Be sure to understand it and properly modify it before running it!

How to implement Service Discovery in the Cloud

2014/06/17 AWS, Development, DevOps, theCloud No comments AWS, AWS CLI, cloud, DevOps, distributed systems, dns, dns-sd, elastic computing, service discovery

Introduction

Service Discovery is not new technology. Unfortunately, it is barely understood and rarely implemented. It is a problem that many system architects face and it is key to multiple desirable qualities of a modern, cloud enabled, elastic distributed system such as reliability, availability, maintainability. There are multiple ways to approach service discovery:

Hardcode service locations;
Develop proprietary solution;
Use existing technology.

Hardcoding is still the common case. How often do you encounter hardcoded URLs in configuration files?! Developing proprietary solution becomes popular too. Multiple companies decided to address Service Discovery by implementing some sort of distributed key-value store. Amongst the popular ones: Etsy’s etcd, Heroku’s Doozer, Apache ZooKeeper, Google’s Chubby. Even Redis can used for such purposes. But for many cases additional software layers and programming complexity is not needed. There is already existing solution based on DNS. It is called DNS-SD and is defined in RFC6763.

DNS-SD utilizes PTR, SRV and TXT DNS records to provide flexible service discovery. All major DNS implementations support it. All major cloud providers support it. DNS is well established technology, well understood by both Operations and Development personnel with strong support in programming languages and libraries. It is highly-available by replication.

How does DNS-SD work?

DNS-SD uses three DNS records types: PTR, SRV, TXT:

PTR record is defined in RFC1035 as “domain name pointer”. Unlike CNAME records no processing of the contents is performed, data is returned directly.
SRV record is defined in RFC2782 as “service locator”. It should provide protocol agnostic way to locate services, in contrast to the MX records. It contains four components: priority, weight, port and target.
TXT record is defined in RFC1035 as “text string”.

There are multiple specifics around protocol and service naming conventions that are beyond the scope of this post. For more information please refer to RFC6763. For the purposes of this article, it is assumed that a proprietary TCP-based service, called theService that has different reincarnations runs on TCP port 4218 on multiple hosts. The basic idea is:

Create a pointer record for _theSerivce that contains all available incarnations of the service;
For each incarnation create SRV record (where the service is located) and TXT record (any additional information for the client) that specify the service details.

This is what sample configuration looks like in AWS Route53 for the unilans.net. domain:

Using nslookup results can be verified:

:~> nslookup -q=PTR _theService._tcp.unilans.net.
Server:         8.8.8.8
Address:        8.8.8.8#53

Non-authoritative answer:
_theService._tcp.unilans.net    name = _incarnation1._theService._tcp.unilans.net.
_theService._tcp.unilans.net    name = _incarnation2._theService._tcp.unilans.net.

Authoritative answers can be found from:

:~> nslookup -q=any _incarnation1._theService._tcp.unilans.net.
Server:         8.8.8.8
Address:        8.8.8.8#53

Non-authoritative answer:
_incarnation1._theService._tcp.unilans.net      text = "txtvers=1\; data=sampledata\;"
_incarnation1._theService._tcp.unilans.net      service = 0 0 4218 host1.unilans.net.

Authoritative answers can be found from:

:~>

:~> nslookup -q=PTR _theService._tcp.unilans.net.

Server: 8.8.8.8

Address: 8.8.8.8#53

Non-authoritative answer:

_theService._tcp.unilans.net name = _incarnation1._theService._tcp.unilans.net.

_theService._tcp.unilans.net name = _incarnation2._theService._tcp.unilans.net.

Authoritative answers can be found from:

:~> nslookup -q=any _incarnation1._theService._tcp.unilans.net.

Server: 8.8.8.8

Address: 8.8.8.8#53

Non-authoritative answer:

_incarnation1._theService._tcp.unilans.net text = "txtvers=1\; data=sampledata\;"

_incarnation1._theService._tcp.unilans.net service = 0 0 4218 host1.unilans.net.

Authoritative answers can be found from:

:~>

Now a client that wants to use incarnation1 of theService has means to access it (Host: host1.unilans.net, Port: 4218).

Load-balaing can be implementing by adding another entry in the service locator record with the same priority and weight:

Resulting DNS lookup:

:~> nslookup -q=any _incarnation1._theService._tcp.unilans.net.
Server:         8.8.8.8
Address:        8.8.8.8#53

Non-authoritative answer:
_incarnation1._theService._tcp.unilans.net      text = "txtvers=1\; data=sampledata\;"
_incarnation1._theService._tcp.unilans.net      service = 0 0 4218 host1.unilans.net.
_incarnation1._theService._tcp.unilans.net      service = 0 0 4218 host100.unilans.net.

Authoritative answers can be found from:

:~>

:~> nslookup -q=any _incarnation1._theService._tcp.unilans.net.

Server: 8.8.8.8

Address: 8.8.8.8#53

Non-authoritative answer:

_incarnation1._theService._tcp.unilans.net text = "txtvers=1\; data=sampledata\;"

_incarnation1._theService._tcp.unilans.net service = 0 0 4218 host1.unilans.net.

_incarnation1._theService._tcp.unilans.net service = 0 0 4218 host100.unilans.net.

Authoritative answers can be found from:

:~>

In a similar way, fail-over can be implemented by using different priority (or load distribution using different weights):

Resulting DNS lookup:

:~> nslookup -q=any _incarnation1._theService._tcp.unilans.net.
Server:         8.8.8.8
Address:        8.8.8.8#53

Non-authoritative answer:
_incarnation1._theService._tcp.unilans.net      text = "txtvers=1\; data=sampledata\;"
_incarnation1._theService._tcp.unilans.net      service = 0 0 4218 host1.unilans.net.
_incarnation1._theService._tcp.unilans.net      service = 1 0 4218 host100.unilans.net.

Authoritative answers can be found from:

:~>

:~> nslookup -q=any _incarnation1._theService._tcp.unilans.net.

Server: 8.8.8.8

Address: 8.8.8.8#53

Non-authoritative answer:

_incarnation1._theService._tcp.unilans.net text = "txtvers=1\; data=sampledata\;"

_incarnation1._theService._tcp.unilans.net service = 0 0 4218 host1.unilans.net.

_incarnation1._theService._tcp.unilans.net service = 1 0 4218 host100.unilans.net.

Authoritative answers can be found from:

:~>

NOTE: With DNS the client is the one to implement the load-balacing or the fail-over (although there are exceptions to this rule)!

Benefits of using DNS-SD for Service Discovery

This technology can be used to support multiple version of a service. Using the built-in support for different reincarnations of the same service, versioning can be implemented in clean granular way. Common problem in REST system, usually solved by nasty URL schemes or rewriting URLs. With DNS-SD required metadata can be passed through the TXT records and multiple versions of the communication protocol can be supported, each in contained environment … No name space pollution, no clumsy URL schemes, no URL rewriting …

This technology can be utilized to reduce complexity while building distributed systems. The clients will most certainly go through the process of name resolution anyway, so why not incorporate service discover in it?! Instead of dealing with external system (installation, operation, maintenance) and all the possible issues (hard to configure, hard to maintain, immature, fault-intollerant, requires additional libraries in the codebase, etc), incorporate this with the name resolution. DNS is well supported on virtually all operating systems and with all programming languages that provide network programming abilities. System architecture complexity is reduced because subsystem that already exists is providing additional services, instead of introducing new systems.

This technology can be utilized to increase reliability / fault-tolerance. Reliability / fault-tolerance can be easily increased by serving multiple entries with the service locator records. Priority can be used by the client to go through the list of entries in controlled manner and weight to balance the load between the service providers on each priority level. The combination of backend support (control plane updating DNS-SD records) and reasonably intelligent clients (implementing service discovery and priority/weight parsing) should give granular control over the fail-over and load-balancing processes in the communication between multiple entities.

This technology supports system elasticity. Modern cloud service providers have APIs to control DNS zones. In this article, AWS Route53 will be used to demonstrate how elastic service can be introduced through DNS-SD to clients. Backend service scaling logic can modify service locator records to reflect current service state as far as DNS zone modification API is available. This is just part of the control plane for the service …

Bonus point: DNS also gives you simple, replicated key-value store through TXT records!

Implementation of Service Discovery with DNS-SD, AWS Route53, AWS IAM and AWS EC2 UserData

Following is a set of steps and sample code to implement Service Discovery in AWS, using Route53, IAM and EC2.

Manual configuration

1. Create PTR and TXT Records for theService in Route53:

This is a simple example for one service with one incarnation (v1).

NOTE: There is no SRV since the service is currently not running anywhere! Active service providers will create/update/delete SRV entries.

2. Create IAM role for EC2 instances to be able to modify DNS records in desired Zone:

Use the following policy:

{
   "Version": "2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "route53:ListHostedZones"
         ],
         "Resource":"*"
      },
      {
         "Effect":"Allow",
         "Action":[
            "route53:GetHostedZone", 
            "route53:ListResourceRecordSets",
            "route53:ChangeResourceRecordSets"
         ],
         "Resource":"arn:aws:route53:::hostedzone/XXXXYYYYZZZZ"
      },
      {
         "Effect":"Allow",
         "Action":[
            "route53:GetChange"
         ],
         "Resource":"arn:aws:route53:::change/*"
      }
   ]
}

{

"Version": "2012-10-17",

"Statement":[

{

"Effect":"Allow",

"Action":[

"route53:ListHostedZones"

"Resource":"*"

{

"Effect":"Allow",

"Action":[

"route53:GetHostedZone",

"route53:ListResourceRecordSets",

"route53:ChangeResourceRecordSets"

"Resource":"arn:aws:route53:::hostedzone/XXXXYYYYZZZZ"

{

"Effect":"Allow",

"Action":[

"route53:GetChange"

"Resource":"arn:aws:route53:::change/*"

}

]

}

… where XXXXYYYYZZZZ is your hosted zone ID!

Automated JOIN/LEAVE in service group

Manual settings, outlined in the previous section give the basic framework of the DNS-SD setup. There is no SRV record since there are no active instances providing the service. Ideally, each active service provider will register/de-register with the service when available. This is key here: DNS-SD can be integrated cleanly with the elastic nature of the cloud. Once this integration is at place, all clients will only need to resolve DNS records in order to obtain list of active service providers. For demonstration purposes the following script was created:

#!/usr/bin/env python

# The following code modifies AWS Route53 entries to demonstrate usage of DNS-SD in cloud environments
#
# To JOIN Service group:
# 	dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 join
#
# To LEAVE Service group:
#	dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 leave
#
# NOTE: THIS IS FOR DEMONSTRATION PURPOSES ONLY! ERROR HANDLING IS ABSOLUTE MINIMAL! THIS IS *NOT* PRODUCTION CODE!

import sys
import copy
import argparse

import requests
import boto.route53

def main():
	"""
	Main entry point
	"""

	# Parse command line arguments
	parser = argparse.ArgumentParser(description='Example code to update service records in Route53 hosted DNS zones')
	parser.add_argument('-z', '--zone', type=str, required=True, dest='zone', help='Zone Name')
	parser.add_argument('-s', '--service', type=str, required=True, dest='service', help='Service Name')
	parser.add_argument('-p', '--port', type=int, required=True, dest='port', help='Service Port')
	parser.add_argument('operation', metavar='OPERATION', type=str, help='Operation [join|leave]', choices=['join', 'leave'])

	args = parser.parse_args()
	operation = args.operation
	zone = args.zone
	service = args.service
	port = args.port

	# Establish connection to Route53 API
	conn = boto.route53.connection.Route53Connection()

	# Get zone handler
	z = conn.get_zone(zone)
	if not z:
		print "{progname}: Wrong or inaccessible zone!".format(progname=sys.argv[0])
		sys.exit(-1)

	# Get EC2 Public IP Address
	response = requests.get('http://169.254.169.254/latest/meta-data/public-ipv4')
	if response.status_code == 200:
		public_ipv4 = response.text
	else:
		print "{progname}: Unable to obtain public IP address from AWS!".format(progname=sys.argv[0])
		sys.exit(-1)

	# Generate domain-specific hostname
	fqdn_hostname = '{hostname}.{zone}'.format(hostname=public_ipv4.replace(".", "-"), zone=zone)

	# Act, based on operation request (join | leave)
	if operation.upper() == 'join'.upper():
		# Create A record
		z.add_a(fqdn_hostname, public_ipv4, ttl=60)

		# Obtain service locator records
		r = z.find_records(service, 'SRV')
		if not r:
			# Create SRV record
			srv_value = u'0 0 {port} {fqdn}'.format(port=port, fqdn=fqdn_hostname)
			z.add_record('SRV', service, srv_value, ttl=60)
		else:
			# Add to SRV record
			srv_value = u'0 0 {port} {fqdn}'.format(port=port, fqdn=fqdn_hostname)
			tmp_r = copy.deepcopy(r)
			tmp_r.resource_records.append(srv_value)
			z.update_record(r, tmp_r.resource_records)

	elif operation.upper() == 'leave'.upper():
		# Remove entry from the SRV record
		r = z.find_records(service, 'SRV')
		if r:
			tmp_r = copy.deepcopy(r)
			for record in tmp_r.resource_records:
				if fqdn_hostname in record:
					tmp_r.resource_records.remove(record)

			if len(tmp_r.resource_records) == 0:
				# Remove the SRV entry itself
				z.delete_record(r)
			else:
				# Update the SRV record
				z.update_record(r, tmp_r.resource_records)

		# Remove A record
		r = z.find_records(fqdn_hostname, 'A')
		if r:
			z.delete_record(r)

	else:
		print "{progname}: Wrong operation!".format(progname=sys.argv[0])
		sys.exit(-1)

if __name__ == '__main__':
	main()

100

101

102

103

#!/usr/bin/env python

# The following code modifies AWS Route53 entries to demonstrate usage of DNS-SD in cloud environments

# To JOIN Service group:

# dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 join

# To LEAVE Service group:

# dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 leave

# NOTE: THIS IS FOR DEMONSTRATION PURPOSES ONLY! ERROR HANDLING IS ABSOLUTE MINIMAL! THIS IS *NOT* PRODUCTION CODE!

import sys

import copy

import argparse

import requests

import boto.route53

def main():

"""

Main entry point

"""

# Parse command line arguments

parser = argparse.ArgumentParser(description='Example code to update service records in Route53 hosted DNS zones')

parser.add_argument('-z', '--zone', type=str, required=True, dest='zone', help='Zone Name')

parser.add_argument('-s', '--service', type=str, required=True, dest='service', help='Service Name')

parser.add_argument('-p', '--port', type=int, required=True, dest='port', help='Service Port')

parser.add_argument('operation', metavar='OPERATION', type=str, help='Operation [join|leave]', choices=['join', 'leave'])

args = parser.parse_args()

operation = args.operation

zone = args.zone

service = args.service

port = args.port

# Establish connection to Route53 API

conn = boto.route53.connection.Route53Connection()

# Get zone handler

z = conn.get_zone(zone)

if not z:

print "{progname}: Wrong or inaccessible zone!".format(progname=sys.argv[0])

sys.exit(-1)

# Get EC2 Public IP Address

response = requests.get('http://169.254.169.254/latest/meta-data/public-ipv4')

if response.status_code == 200:

public_ipv4 = response.text

else:

print "{progname}: Unable to obtain public IP address from AWS!".format(progname=sys.argv[0])

sys.exit(-1)

# Generate domain-specific hostname

fqdn_hostname = '{hostname}.{zone}'.format(hostname=public_ipv4.replace(".", "-"), zone=zone)

# Act, based on operation request (join | leave)

if operation.upper() == 'join'.upper():

# Create A record

z.add_a(fqdn_hostname, public_ipv4, ttl=60)

# Obtain service locator records

r = z.find_records(service, 'SRV')

if not r:

# Create SRV record

srv_value = u'0 0 {port} {fqdn}'.format(port=port, fqdn=fqdn_hostname)

z.add_record('SRV', service, srv_value, ttl=60)

else:

# Add to SRV record

srv_value = u'0 0 {port} {fqdn}'.format(port=port, fqdn=fqdn_hostname)

tmp_r = copy.deepcopy(r)

tmp_r.resource_records.append(srv_value)

z.update_record(r, tmp_r.resource_records)

elif operation.upper() == 'leave'.upper():

# Remove entry from the SRV record

r = z.find_records(service, 'SRV')

if r:

tmp_r = copy.deepcopy(r)

for record in tmp_r.resource_records:

if fqdn_hostname in record:

tmp_r.resource_records.remove(record)

if len(tmp_r.resource_records) == 0:

# Remove the SRV entry itself

z.delete_record(r)

else:

# Update the SRV record

z.update_record(r, tmp_r.resource_records)

# Remove A record

r = z.find_records(fqdn_hostname, 'A')

if r:

z.delete_record(r)

else:

print "{progname}: Wrong operation!".format(progname=sys.argv[0])

sys.exit(-1)

if __name__ == '__main__':

main()

Copy of the code can be downloaded from https://s3-us-west-2.amazonaws.com/blog.xi-group.com/aws-route53-iam-ec2-dns-sd/dns-sd.py

This code, given DNS zone, service name and service port, will update necessary DNS records to join or leave the service group.

Starting with initial state:

Executing JOIN:

dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 join

1	dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 join

Result:

Executing LEAVE:

dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 leave

1	dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 leave

Result:

Domain-specific hostname is created, service location record (SRV) is created with proper port and hostname. When host leaves the service group, domain-specific hostname is removed, so is the entry in the SRV record, or the whole record if this is the last entry.

Fully automated setup

UserData will be used to fully automate the process. There are many options: Puppet, Chef, Salt, Ansible and all of those can be used, but the UserData solution is with reduced complexity, no external dependencies and can be directly utilized by other AWS Services like CloudFormation, AutoScalingGroups, etc.

The full UserData content is as follows:

#!/bin/bash -ex

# Debian apt-get install function
apt_get_install()
{
	DEBIAN_FRONTEND=noninteractive apt-get -y \
	-o DPkg::Options::=--force-confdef \
	-o DPkg::Options::=--force-confold \
	install $@
}

# Mark execution start
echo "STARTING" > /root/user_data_run

# Some initial setup
set -e -x
export DEBIAN_FRONTEND=noninteractive
apt-get update && apt-get upgrade -y

# Install required packages
apt_get_install python-boto python-requests
apt_get_install nginx

# Create test html page
mkdir /var/www
cat > /var/www/index.html << "EOF"
<html>
	<head>
		<title>Demo Page</title>
	</head>

	<body>
		<center><h2>Demo Page</h2></center><br>
		<center>Status: running</center>
	</body>
</html>
EOF

# Configure NginX
cat > /etc/nginx/conf.d/demo.conf << "EOF"
# Minimal NginX VirtualHost setup
server {
	listen 8080;

	root /var/www;
	index index.html index.htm;

	location / {
		try_files $uri $uri/ =404;
	}
}
EOF

# Restart NginX with the new settings
/etc/init.d/nginx restart

# Create dns-sd.py
cat > /usr/local/sbin/dns-sd.py << "EOF"
#!/usr/bin/env python

# The following code modifies AWS Route53 entries to demonstrate usage of DNS-SD in cloud environments
#
# To JOIN Service group:
# 	dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 join
#
# To LEAVE Service group:
#	dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 leave
#
# NOTE: THIS IS FOR DEMONSTRATION PURPOSES ONLY! ERROR HANDLING IS ABSOLUTE MINIMAL! THIS IS *NOT* PRODUCTION CODE!

import sys
import copy
import argparse

import requests
import boto.route53

def main():
	"""
	Main entry point
	"""

	# Parse command line arguments
	parser = argparse.ArgumentParser(description='Example code to update service records in Route53 hosted DNS zones')
	parser.add_argument('-z', '--zone', type=str, required=True, dest='zone', help='Zone Name')
	parser.add_argument('-s', '--service', type=str, required=True, dest='service', help='Service Name')
	parser.add_argument('-p', '--port', type=int, required=True, dest='port', help='Service Port')
	parser.add_argument('operation', metavar='OPERATION', type=str, help='Operation [join|leave]', choices=['join', 'leave'])

	args = parser.parse_args()
	operation = args.operation
	zone = args.zone
	service = args.service
	port = args.port

	# Establish connection to Route53 API
	conn = boto.route53.connection.Route53Connection()

	# Get zone handler
	z = conn.get_zone(zone)
	if not z:
		print "{progname}: Wrong or inaccessible zone!".format(progname=sys.argv[0])
		sys.exit(-1)

	# Get EC2 Public IP Address
	response = requests.get('http://169.254.169.254/latest/meta-data/public-ipv4')
	if response.status_code == 200:
		public_ipv4 = response.text
	else:
		print "{progname}: Unable to obtain public IP address from AWS!".format(progname=sys.argv[0])
		sys.exit(-1)

	# Generate domain-specific hostname
	fqdn_hostname = '{hostname}.{zone}'.format(hostname=public_ipv4.replace(".", "-"), zone=zone)

	# Act, based on operation request (join | leave)
	if operation.upper() == 'join'.upper():
		# Create A record
		z.add_a(fqdn_hostname, public_ipv4, ttl=60)

		# Obtain service locator records
		r = z.find_records(service, 'SRV')
		if not r:
			# Create SRV record
			srv_value = u'0 0 {port} {fqdn}'.format(port=port, fqdn=fqdn_hostname)
			z.add_record('SRV', service, srv_value, ttl=60)
		else:
			# Add to SRV record
			srv_value = u'0 0 {port} {fqdn}'.format(port=port, fqdn=fqdn_hostname)
			tmp_r = copy.deepcopy(r)
			tmp_r.resource_records.append(srv_value)
			z.update_record(r, tmp_r.resource_records)

	elif operation.upper() == 'leave'.upper():
		# Remove entry from the SRV record
		r = z.find_records(service, 'SRV')
		if r:
			tmp_r = copy.deepcopy(r)
			for record in tmp_r.resource_records:
				if fqdn_hostname in record:
					tmp_r.resource_records.remove(record)

			if len(tmp_r.resource_records) == 0:
				# Remove the SRV entry itself
				z.delete_record(r)
			else:
				# Update the SRV record
				z.update_record(r, tmp_r.resource_records)

		# Remove A record
		r = z.find_records(fqdn_hostname, 'A')
		if r:
			z.delete_record(r)

	else:
		print "{progname}: Wrong operation!".format(progname=sys.argv[0])
		sys.exit(-1)

if __name__ == '__main__':
	main()

EOF

# Make dns-sd.py executable
chmod +x /usr/local/sbin/dns-sd.py

# Create startup job
cat > /etc/init.d/dns-sd << "EOF"
#! /bin/bash
#
# Author: Ivo Vachkov (ivachkov@xi-group.com)
#
### BEGIN INIT INFO
# Provides: DNS-SD Service Group Registration / De-Registration
# Required-Start:
# Should-Start:
# Required-Stop:
# Should-Stop:
# Default-Start:  2 3 4 5
# Default-Stop:   0 1 6
# Short-Description:    Start / Stop script for DNS-SD
# Description:          Use to JOIN/LEAVE DNS-SD Service Group
### END INIT INFO

set -e
umask 022

# Configuration details
DNS_SD="/usr/local/sbin/dns-sd.py"
DNS_ZONE="unilans.net"
SERVICE_NAME="_v1._theservice._tcp.unilans.net."
SERVICE_PORT="8080"

. /lib/lsb/init-functions

export PATH="${PATH:+$PATH:}/usr/sbin:/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin"

# Default Start function
dns_sd_join () {
	$DNS_SD -z $DNS_ZONE -s $SERVICE_NAME -p $SERVICE_PORT join
}

# Default Stop function
dns_sd_leave () {
	$DNS_SD -z $DNS_ZONE -s $SERVICE_NAME -p $SERVICE_PORT leave
}

case "$1" in
start)
	log_daemon_msg "Joining $DNS_ZONE|$SERVICE_NAME:$SERVICE_PORT ... " || true
	dns_sd_join
	;;
stop)
	log_daemon_msg "Leaving $DNS_ZONE|$SERVICE_NAME:$SERVICE_PORT ... " || true
	dns_sd_leave
	;;
restart)
	log_daemon_msg "Restarting ... " || true
	dns_sd_leave
	dns_sd_join
	;;
*)
	log_action_msg "Usage: $0 {start|stop|restart}" || true
	exit 1
esac

exit 0
EOF

# Make /etc/init.d/dns-sd executable
chmod +x /etc/init.d/dns-sd

# Set automatic execution on start/shutdown
update-rc.d dns-sd defaults 99

# Execute initial service group JOIN
/etc/init.d/dns-sd start

# Mark execution end
echo "DONE" > /root/user_data_run

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

#!/bin/bash -ex

# Debian apt-get install function

apt_get_install()

{

DEBIAN_FRONTEND=noninteractive apt-get -y \

-o DPkg::Options::=--force-confdef \

-o DPkg::Options::=--force-confold \

install $@

}

# Mark execution start

echo "STARTING" > /root/user_data_run

# Some initial setup

set -e -x

export DEBIAN_FRONTEND=noninteractive

apt-get update && apt-get upgrade -y

# Install required packages

apt_get_install python-boto python-requests

apt_get_install nginx

# Create test html page

mkdir /var/www

cat > /var/www/index.html << "EOF"

<html>

<head>

</head>

<body>

<center>Status: running</center>

</body>

</html>

EOF

# Configure NginX

cat > /etc/nginx/conf.d/demo.conf << "EOF"

# Minimal NginX VirtualHost setup

server {

listen 8080;

root /var/www;

index index.html index.htm;

location / {

try_files $uri $uri/ =404;

}

EOF

# Restart NginX with the new settings

/etc/init.d/nginx restart

# Create dns-sd.py

cat > /usr/local/sbin/dns-sd.py << "EOF"

#!/usr/bin/env python

# The following code modifies AWS Route53 entries to demonstrate usage of DNS-SD in cloud environments

# To JOIN Service group:

# dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 join

# To LEAVE Service group:

# dns-sd.py -z unilans.net -s _v1._theservice._tcp.unilans.net. -p 8080 leave

# NOTE: THIS IS FOR DEMONSTRATION PURPOSES ONLY! ERROR HANDLING IS ABSOLUTE MINIMAL! THIS IS *NOT* PRODUCTION CODE!

import sys

import copy

import argparse

import requests

import boto.route53

def main():

"""

Main entry point

"""

# Parse command line arguments

parser = argparse.ArgumentParser(description='Example code to update service records in Route53 hosted DNS zones')

parser.add_argument('-z', '--zone', type=str, required=True, dest='zone', help='Zone Name')

parser.add_argument('-s', '--service', type=str, required=True, dest='service', help='Service Name')

parser.add_argument('-p', '--port', type=int, required=True, dest='port', help='Service Port')

parser.add_argument('operation', metavar='OPERATION', type=str, help='Operation [join|leave]', choices=['join', 'leave'])

args = parser.parse_args()

operation = args.operation

zone = args.zone

service = args.service

port = args.port

# Establish connection to Route53 API

conn = boto.route53.connection.Route53Connection()

# Get zone handler

z = conn.get_zone(zone)

if not z:

print "{progname}: Wrong or inaccessible zone!".format(progname=sys.argv[0])

sys.exit(-1)

# Get EC2 Public IP Address

response = requests.get('http://169.254.169.254/latest/meta-data/public-ipv4')

if response.status_code == 200:

public_ipv4 = response.text

else:

print "{progname}: Unable to obtain public IP address from AWS!".format(progname=sys.argv[0])

sys.exit(-1)

# Generate domain-specific hostname

fqdn_hostname = '{hostname}.{zone}'.format(hostname=public_ipv4.replace(".", "-"), zone=zone)

# Act, based on operation request (join | leave)

if operation.upper() == 'join'.upper():

# Create A record

z.add_a(fqdn_hostname, public_ipv4, ttl=60)

# Obtain service locator records

r = z.find_records(service, 'SRV')

if not r:

# Create SRV record

srv_value = u'0 0 {port} {fqdn}'.format(port=port, fqdn=fqdn_hostname)

z.add_record('SRV', service, srv_value, ttl=60)

else:

# Add to SRV record

srv_value = u'0 0 {port} {fqdn}'.format(port=port, fqdn=fqdn_hostname)

tmp_r = copy.deepcopy(r)

tmp_r.resource_records.append(srv_value)

z.update_record(r, tmp_r.resource_records)

elif operation.upper() == 'leave'.upper():

# Remove entry from the SRV record

r = z.find_records(service, 'SRV')

if r:

tmp_r = copy.deepcopy(r)

for record in tmp_r.resource_records:

if fqdn_hostname in record:

tmp_r.resource_records.remove(record)

if len(tmp_r.resource_records) == 0:

# Remove the SRV entry itself

z.delete_record(r)

else:

# Update the SRV record

z.update_record(r, tmp_r.resource_records)

# Remove A record

r = z.find_records(fqdn_hostname, 'A')

if r:

z.delete_record(r)

else:

print "{progname}: Wrong operation!".format(progname=sys.argv[0])

sys.exit(-1)

if __name__ == '__main__':

main()

EOF

# Make dns-sd.py executable

chmod +x /usr/local/sbin/dns-sd.py

# Create startup job

cat > /etc/init.d/dns-sd << "EOF"

#! /bin/bash

# Author: Ivo Vachkov (ivachkov@xi-group.com)

### BEGIN INIT INFO

# Provides: DNS-SD Service Group Registration / De-Registration

# Required-Start:

# Should-Start:

# Required-Stop:

# Should-Stop:

# Default-Start: 2 3 4 5

# Default-Stop: 0 1 6

# Short-Description: Start / Stop script for DNS-SD

# Description: Use to JOIN/LEAVE DNS-SD Service Group

### END INIT INFO

set -e

umask 022

# Configuration details

DNS_SD="/usr/local/sbin/dns-sd.py"

DNS_ZONE="unilans.net"

SERVICE_NAME="_v1._theservice._tcp.unilans.net."

SERVICE_PORT="8080"

. /lib/lsb/init-functions

export PATH="${PATH:+$PATH:}/usr/sbin:/sbin:/usr/bin:/usr/local/bin:/usr/local/sbin"

# Default Start function

dns_sd_join () {

$DNS_SD -z $DNS_ZONE -s $SERVICE_NAME -p $SERVICE_PORT join

}

# Default Stop function

dns_sd_leave () {

$DNS_SD -z $DNS_ZONE -s $SERVICE_NAME -p $SERVICE_PORT leave

}

case "$1" in

start)

log_daemon_msg "Joining $DNS_ZONE|$SERVICE_NAME:$SERVICE_PORT ... " || true

dns_sd_join

;;

stop)

log_daemon_msg "Leaving $DNS_ZONE|$SERVICE_NAME:$SERVICE_PORT ... " || true

dns_sd_leave

;;

restart)

log_daemon_msg "Restarting ... " || true

dns_sd_leave

dns_sd_join

;;

log_action_msg "Usage: $0 {start|stop|restart}" || true

exit 1

esac

exit 0

EOF

# Make /etc/init.d/dns-sd executable

chmod +x /etc/init.d/dns-sd

# Set automatic execution on start/shutdown

update-rc.d dns-sd defaults 99

# Execute initial service group JOIN

/etc/init.d/dns-sd start

# Mark execution end

echo "DONE" > /root/user_data_run

Copy of the code can be downloaded from https://s3-us-west-2.amazonaws.com/blog.xi-group.com/aws-route53-iam-ec2-dns-sd/userdata.sh

Starting 3 test instances to verify functionality:

aws ec2 run-instances --image-id ami-018c9568 --count 3 --instance-type t1.micro --key-name test-key --security-groups test-sg --iam-instance-profile Name=DNS-SD-Route53-EC2-Role --user-data file://userdata.sh

1	aws ec2 run-instances --image-id ami-018c9568 --count 3 --instance-type t1.micro --key-name test-key --security-groups test-sg --iam-instance-profile Name=DNS-SD-Route53-EC2-Role --user-data file://userdata.sh

Resulting changes to Route53:

Three new boxes self-registered in the Service group. Stopping manually one leads to de-registration:

Elastic systems are possible to implement with DNS-SD! Note however, that the DNS records are limited to 65536 bytes, so the amount of entries that can go into SRV record, although big, is limited!

Client code

To demonstrate DNS-SD resolution, the following sample code was created:

#!/usr/bin/env python

# The following code demonstrates how to resolve DNS-SD Service Descriptions
#
# Example execution:
#	client.py -z unilans.net. -s theService -p tcp -v v1
#
# NOTE: THIS IS FOR DEMONSTRATION PURPOSES ONLY! ERROR HANDLING IS ABSOLUTE MINIMAL! THIS IS *NOT* PRODUCTION CODE!

import sys
import random
import argparse

import requests
import dns.resolver

def main():
	"""
	Main entry point
	"""

	# Parse command line arguments
	parser = argparse.ArgumentParser(description='Example code to resolve DNS-SD service descriptions')
	parser.add_argument('-z', '--zone', type=str, required=True, dest='zone', help='Zone Name')
	parser.add_argument('-s', '--service', type=str, required=True, dest='service', help='Service Name')
	parser.add_argument('-p', '--protocol', type=str, required=True, dest='protocol', help='Service Transport Protoco [tcp|udp]', choices=['tcp', 'udp'])
	parser.add_argument('-v', '--version', type=str, required=True, dest='version', help='Service Version')

	args = parser.parse_args()
	zone = args.zone
	service = args.service
	protocol = args.protocol
	version = args.version

	# Obtain PTR Record
	service_id = '_{service}._{protocol}.{zone}'.format(service=service, protocol=protocol, zone=zone)
	answer = dns.resolver.query(service_id, 'PTR')

	# Find the service incarnation
	if answer:
		for record in answer.rrset:
			r = str(record.target).split('.')
			if version in r[0]:
				service_version = str(record.target)

	# Discover and consume the actual service
	if service_version:
		# Get SRV and TXT
		answer_srv = dns.resolver.query(service_version, 'SRV')
		answer_txt = dns.resolver.query(service_version, 'TXT')

		service_addr = ''
		service_port = 0

		# If those are valid get random service location entry
		if answer_srv and answer_txt:
			srv_entry = random.choice(answer_srv.rrset.items)
			if srv_entry:
				service_addr = srv_entry.target
				service_port = srv_entry.port

	service_uri = 'http://{host}:{port}/'.format(host=service_addr, port=service_port)
	r = requests.get(service_uri)
	if r.status_code == 200:
		print r.text

if __name__ == '__main__':
	main()

#!/usr/bin/env python

# The following code demonstrates how to resolve DNS-SD Service Descriptions

# Example execution:

# client.py -z unilans.net. -s theService -p tcp -v v1

# NOTE: THIS IS FOR DEMONSTRATION PURPOSES ONLY! ERROR HANDLING IS ABSOLUTE MINIMAL! THIS IS *NOT* PRODUCTION CODE!

import sys

import random

import argparse

import requests

import dns.resolver

def main():

"""

Main entry point

"""

# Parse command line arguments

parser = argparse.ArgumentParser(description='Example code to resolve DNS-SD service descriptions')

parser.add_argument('-z', '--zone', type=str, required=True, dest='zone', help='Zone Name')

parser.add_argument('-s', '--service', type=str, required=True, dest='service', help='Service Name')

parser.add_argument('-p', '--protocol', type=str, required=True, dest='protocol', help='Service Transport Protoco [tcp|udp]', choices=['tcp', 'udp'])

parser.add_argument('-v', '--version', type=str, required=True, dest='version', help='Service Version')

args = parser.parse_args()

zone = args.zone

service = args.service

protocol = args.protocol

version = args.version

# Obtain PTR Record

service_id = '_{service}._{protocol}.{zone}'.format(service=service, protocol=protocol, zone=zone)

answer = dns.resolver.query(service_id, 'PTR')

# Find the service incarnation

if answer:

for record in answer.rrset:

r = str(record.target).split('.')

if version in r[0]:

service_version = str(record.target)

# Discover and consume the actual service

if service_version:

# Get SRV and TXT

answer_srv = dns.resolver.query(service_version, 'SRV')

answer_txt = dns.resolver.query(service_version, 'TXT')

service_addr = ''

service_port = 0

# If those are valid get random service location entry

if answer_srv and answer_txt:

srv_entry = random.choice(answer_srv.rrset.items)

if srv_entry:

service_addr = srv_entry.target

service_port = srv_entry.port

service_uri = 'http://{host}:{port}/'.format(host=service_addr, port=service_port)

r = requests.get(service_uri)

if r.status_code == 200:

print r.text

if __name__ == '__main__':

main()

Copy of the code can be downloaded from https://s3-us-west-2.amazonaws.com/blog.xi-group.com/aws-route53-iam-ec2-dns-sd/client.py

Why would that be better?! Yes, there is added complexity in the name resolution process. But, more importantly, details needed to find the service are agnostic to its location, or specific to the client. Service-specific infrastructure can change, but the client will not be affected, as long as the discovery process is performed.

Sample run:

:~> client.py -z unilans.net. -s theService -p tcp -v v1
<html>
        <head>
                <title>Demo Page</title>
        </head>

        <body>
                <center><h2>Demo Page</h2></center><br>
                <center>Status: running</center>
        </body>
</html>
:~>

:~> client.py -z unilans.net. -s theService -p tcp -v v1

<html>

<head>

</head>

<body>

<center>Status: running</center>

</body>

</html>

:~>

Voilà! Reliable Service Discovery in elastic systems!

Additional Notes

Some additional notes and well-knowns:

Examples in this article could be extended to support fail-over or more sophisticated forms of load-balancing. Current random.choice() solution should be good enough for the generic case;
More complex setup with different priorities and weights can be demonstrated too;
Service health-check before DNS-SD registration can be demonstrated too;
Non-HTTP service can be demonstrated to use DNS-SD. Technology is application-agnostic.
TXT contents are not used throughout this article. Those can be used to carry additional meta-data (NOTE: This is public! Anyone can query your DNS TXT records with this setup!).

Conclusion

Quick implementation of DNS-SD with AWS Route53, IAM and EC2 was presented in this article. It can be used as a bare-bone setup that can be further extended and productized. It solves common problem in elastic systems: Service Discovery! All key components are implemented in either Python or Shell script with minimal dependencies (sudo aptitude install awscli, python-boto, python-requests, python-dnspython), although the implementation is not dependent on a particular programming language.

References

Small Tip: Partitioning disk drives from within UserData script

2014/06/11 AWS, DevOps, Small Tip 2 comments AWS, DevOps, fdisk, instance store, linux, partitioning

In a recent upgrade to the new generation of instances we faced an interesting conundrum. Previous generations came with quite the amount of disk spaces. Usually instance stores are mounted on /mnt. And it is all good and working. The best part, one can leave the default settings for the first instance store and do anything with the second. And “anything” translated to enabling swap on the second instance store. With the new instance types, however the number (and the size) of the instance stores is reduced. It is SSD, but m2.4xlarge comes with 2 x 840 GB, while the equivalent in the last generation, r3.2xlarge, comes with only one 160 GB instance store partition.

Not a problem, just a challenge!

We prefer to use UserData for automatic server setup. After some attempts it became clear that partitioning disks from a shell script is not exactly trivial tasks under Linux in AWS. BSD-based operating systems come with disklabel and fdisk and those will do the job. Linux comes with fdisk by default and that tool is somewhat limited …

Luckily, fdisk reads data from stdin so quick-and-dirty solution quickly emerged!

The following UserData is used to modify the instance store of a m3.large instance, creating 8GB swap partition and re-mounting the rest as /mnt:

#!/bin/bash -ex

# Mark execution start
echo "STARTING" > /root/user_data_run

# Unmount /dev/xvdb if already mounted
umount -f /dev/xvdb

# Partition the disk (8GB for SWAP / Rest for /mnt)
(echo n; echo p; echo 1; echo 2048; echo +8G; echo t; echo 82; echo n; echo p; echo 2; echo; echo; echo w) | fdisk /dev/xvdb

# Make and enable swap
mkswap /dev/xvdb1
swapon /dev/xvdb1

# Make /mnt partition and mount it
mkfs.ext4 /dev/xvdb2
mount /dev/xvdb2 /mnt
sed -i s/xvdb/xvdb2/g /etc/fstab

# Mark execution end
echo "DONE" > /root/user_data_run

#!/bin/bash -ex

# Mark execution start

echo "STARTING" > /root/user_data_run

# Unmount /dev/xvdb if already mounted

umount -f /dev/xvdb

# Partition the disk (8GB for SWAP / Rest for /mnt)

(echo n; echo p; echo 1; echo 2048; echo +8G; echo t; echo 82; echo n; echo p; echo 2; echo; echo; echo w) | fdisk /dev/xvdb

# Make and enable swap

mkswap /dev/xvdb1

swapon /dev/xvdb1

# Make /mnt partition and mount it

mkfs.ext4 /dev/xvdb2

mount /dev/xvdb2 /mnt

sed -i s/xvdb/xvdb2/g /etc/fstab

# Mark execution end

echo "DONE" > /root/user_data_run

Execute it with AWS CLI (Using stock Ubuntu 14.04 HVM AMI):

aws ec2 run-instances --image-id ami-1d8c9574 --count 1 --instance-type m3.large --key-name test-key --security-groups test-sg --user-data file://userdata.sh

1	aws ec2 run-instances --image-id ami-1d8c9574 --count 1 --instance-type m3.large --key-name test-key --security-groups test-sg --user-data file://userdata.sh

The result:

:~> ssh ubuntu@ec2-54-197-66-121.compute-1.amazonaws.com "df -h"
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.8G  765M  6.6G  11% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            3.7G   12K  3.7G   1% /dev
tmpfs           749M  336K  748M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            3.7G     0  3.7G   0% /run/shm
none            100M     0  100M   0% /run/user
/dev/xvdb2       22G   44M   21G   1% /mnt
:~> ssh ubuntu@ec2-54-197-66-121.compute-1.amazonaws.com "free -h"
             total       used       free     shared    buffers     cached
Mem:          7.3G       276M       7.0G       352K       8.6M       177M
-/+ buffers/cache:        90M       7.2G
Swap:         8.0G         0B       8.0G
:~>

:~> ssh ubuntu@ec2-54-197-66-121.compute-1.amazonaws.com "df -h"

Filesystem Size Used Avail Use% Mounted on

/dev/xvda1 7.8G 765M 6.6G 11% /

none 4.0K 0 4.0K 0% /sys/fs/cgroup

udev 3.7G 12K 3.7G 1% /dev

tmpfs 749M 336K 748M 1% /run

none 5.0M 0 5.0M 0% /run/lock

none 3.7G 0 3.7G 0% /run/shm

none 100M 0 100M 0% /run/user

/dev/xvdb2 22G 44M 21G 1% /mnt

:~> ssh ubuntu@ec2-54-197-66-121.compute-1.amazonaws.com "free -h"

total used free shared buffers cached

Mem: 7.3G 276M 7.0G 352K 8.6M 177M

-/+ buffers/cache: 90M 7.2G

Swap: 8.0G 0B 8.0G

:~>

There it is, 8GB swap partition (/dev/xvdb1) and the rest (/dev/xvdb2) mounted as /mnt. Note that /etc/fstab is also updated to account for the device name change!

The “Aggressive DevOps” Concept

2014/06/09 DevOps, theCloud, Xi Group Ltd. Aggressive DevOps, data driven control plane, DevOps, full-stack monitoring, process automation

Xi Group Ltd. is involved in operational and DevOps activities for several years now and during that period several lessons became clear:

Not all architectures are cloud-friendly;
You need to know your data flows to benefit from the cloud services;
Many companies use theCloud only as easy provisioning technology;
Elasticity is hard to implement!

Lets leave the former two for another blog posts and concentrate on the the latter.

Many companies use theCloud only as easy provisioning technology

… and it is natural when you come from a static infrastructure world. However, this is not what theCloud is all about. Yes, you can use it for that, in same way you can use a steam roller to iron your shirts. Yes, there are other arguments why you may want to put something in theCloud, but just copying your infrastructure in theCloud and stating that this makes it resilient of fault-tolerant is plain stupid! AWS had outages, by extension Heroku had outages, Joyent had outages … you need to design for reliability to achieve reliability. “Easy provisioning” is good, but if it’s your main reason, stick with the physical infrastructure. It is probably cheaper in the long run. However, if you want to fully utilize this technology, keep reading!

Elasticity is hard to implement!

Many workloads are elastic in nature, from website traffic to batch processing. The elastic nature of the workload may look like a sine wave or like a spike, the logic is the same. You’d generally want to adjust the amount of resources you allocate so that it’s enough to cover the workload, but also minimize the price you pay. And for any practical problem it is damn hard to do so! Why?! We identified the following reasons:

Software environment is chaotic;
Software deployment is non-trivial task;
Tightly-coupled architectures;
Data store functionality misaligned with the nature of the data;
Data stores inherently rigid;
Lack of operational information;
Monitoring wrong resources;
Monitoring returns unusable information;
There is no ‘control plane’ in the system;
‘Control plane’ depends on infrastructure stability;
… and many, many others …

Those helped identify the following “pillars” as the foundation for large-scale elastic deployments:

Build & Deployment Automation
Full-stack Application Monitoring
Data-driven Control Plane

Aggressive DevOps is the process of implementing the pillars in a business environment!

It is our company mission to develop and provide the tooling, the services and the know-how for others to implement large-elastic deployments.

Over a series of blog posts we will go into further details about each of Aggressive DevOps pillars. Stay tuned!

Small Tip: Use AWS CLI to create instances with bigger root partitions

2014/06/05 AWS, DevOps, Small Tip 2 comments AWS, AWS CLI, bigger, DevOps, linux, root partition

On multiple occasions we had to deal with instances running out of disk space for the root file system. AWS provides you reasonable amount of storage, but most operating systems without additional settings will just use the root partition for everything. Which is usually sub-optimal, since default root partition is 8GB and you may have 160GB SSD just mounted on /mnt and never used. With the AWS Web interface, it is easy. Just create bigger root partitions for the instances. However, the AWS CLI solution is not obvious and somewhat hard to find. If you need to regularly start instances with non-standard root partitions, manual approach is not maintainable.

There is a solution. It lies in the –block-device-mappings parameter that can be passed to aws ec2 run-instances command.

According to the documentation this parameter uses JSON-encoded block device mapping to adjust different parameter of the instances that are being started. There is a simple example that shows how to attach additional volume:

--block-device-mappings "[{\"DeviceName\": \"/dev/sdh\",\"Ebs\":{\"VolumeSize\":100}}]"

1	--block-device-mappings "[{\"DeviceName\": \"/dev/sdh\",\"Ebs\":{\"VolumeSize\":100}}]"

This will attach additional 100GB EBS volume as /dev/sdb. The key part: “Ebs”: {“VolumeSize”: 100}

By specifying your instance’s root partition you can adjust the root partition size. Following is an example how to create Amazon Linux instance running on t1.micro with 32GB root partition:

aws ec2 run-instances --image-id ami-fb8e9292 --count 1 --instance-type t1.micro --key-name test-key --security-groups test-sg --block-device-mapping "[ { \"DeviceName\": \"/dev/sda1\", \"Ebs\": { \"VolumeSize\": 32 } } ]"

1	aws ec2 run-instances --image-id ami-fb8e9292 --count 1 --instance-type t1.micro --key-name test-key --security-groups test-sg --block-device-mapping "[ { \"DeviceName\": \"/dev/sda1\", \"Ebs\": { \"VolumeSize\": 32 } } ]"

The resulting volume details show the requested size and the fact that this is indeed root partition:

Confirming, that the instance is operating on the proper volume:

:~> ssh ec2-user@ec2-50-16-57-145.compute-1.amazonaws.com "df -h"
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       32G  1.1G   31G   4% /
devtmpfs        282M   12K  282M   1% /dev
tmpfs           297M     0  297M   0% /dev/shm
:~>

:~> ssh ec2-user@ec2-50-16-57-145.compute-1.amazonaws.com "df -h"

Filesystem Size Used Avail Use% Mounted on

/dev/xvda1 32G 1.1G 31G 4% /

devtmpfs 282M 12K 282M 1% /dev

tmpfs 297M 0 297M 0% /dev/shm

:~>

There is enough space in the root partition now. Note: This is EBS volume, additional charges will apply!

References

aws ec2 run-instances help

Small Tip: AWS announces T2 instance types

Related Posts

Small Tip: How to run non-deamon()-ized processes in the background with SupervisorD

Related Posts

Small Tip: EBS volume allocation time is linear to the size and unrelated to the instance type

Related Posts

Small Tip: Partitioning disk drives from within UserData script

Related Posts

The “Aggressive DevOps” Concept

Related Posts

Small Tip: Use AWS CLI to create instances with bigger root partitions

Related Posts

Categories

Recent Posts