AWS SNS Retry Policy – New Feature – Push Based Messaging Instead of SQS Polling

The good guys at Amazon Web Services just announced a new feature in Simple Notification Service (SNS) which allows setting a retry policy for SNS notifications.

Up until now SNS had an unknown publish retry policy (and maybe a non-existent one). I always suspected it had some logic for the different subscription types (Email, HTTP, Text, SQS), but it was never documented anywhere.

(Image: “PUSH” by Ed Russel)

The new retry policy feature allows you to define the number of retries and the wait period between them (you can even choose between a linear and an exponential backoff!), as well as set a throttling policy so that if your server is currently down it won’t get flooded with notifications once it’s back up.
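
As a rough sketch of how setting such a policy might look in code, here is an example using the boto library (which I use elsewhere on this blog). The topic ARN and all the numbers are made up for illustration, and the policy schema may still change as the feature matures:

```python
# A sketch only: the topic ARN and every number here are hypothetical,
# and the delivery policy schema may change as the feature matures.
import json
import boto

sns = boto.connect_sns()  # credentials come from the boto config/environment

delivery_policy = {
    "http": {
        "defaultHealthyRetryPolicy": {
            "minDelayTarget": 20,       # seconds before the first retry
            "maxDelayTarget": 600,      # cap on the delay between retries
            "numRetries": 10,           # total number of retries
            "numMaxDelayRetries": 3,    # retries performed at the maximum delay
            "backoffFunction": "exponential"  # or "linear"
        },
        "defaultThrottlePolicy": {
            "maxReceivesPerSecond": 5   # don't flood a server that just came back up
        }
    }
}

sns.set_topic_attributes(
    "arn:aws:sns:us-east-1:123456789012:my-topic",  # hypothetical topic ARN
    "DeliveryPolicy",
    json.dumps(delivery_policy))
```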

This allows for some very interesting patterns. Most notable is a push-based messaging mechanism: instead of writing a dedicated process to poll an SQS queue, you can use SNS as a sort of ad-hoc push queue that will post the messages to an HTTP/S URL. Setting reasonable retry and throttling policies will also ensure that if your server is down, messages won’t get lost.

I posted a while back a suggestion for a hack which utilized SNS as a notification mechanism to start polling SQS; now that SNS has a retry policy, it’s a good candidate for letting you handle your async tasks using your regular HTTP servers, with all the goodies of logging, multi-threading, debugging, etc.

Before you run off to implement push-based messaging (or re-implement Google App Engine’s TaskQueue Push Queue API), there are certain things which are still unknown and/or require further consideration:

  • SQS has a 15-day storage policy, so you have up to 15 days to fix a bug or set up a system that will empty a queue. With the new SNS retry policy you may be able to reach a similarly long period of time; however, the maximum values you can set in the policy are not yet known and may pose a limit.
  • I am not aware of, and couldn’t find, any documentation relating to the HTTP status code of a message pushed to an HTTP/S subscriber (other than for answering the subscription request). How can you tell SNS that a message pushed to an HTTP subscriber failed due to an HTTP error? Will SNS consider a non-200 HTTP status a failed delivery and apply the retry policy?
  • What happens if a message pushed to an HTTP/S subscriber takes a long time to process due to load or any other reason? When will SNS decide that the request failed due to a timeout?
I hope some of these questions will get cleared up so that SNS can become a viable push-based messaging mechanism.
I previously recommended always pushing via SNS (even if it’s just to an SQS queue) to get the added benefits of easier debugging (subscribe an SQS queue AND send an email or HTTP request for debugging, etc.). The new feature only proves that SNS is becoming a very interesting building block to use inside and/or outside of AWS.

A cool example of SaaS for developers by @mza and @jeffbarr

In a recent post on the AWS blog, Jeff Barr and Matt Wood showed the architecture and code they wrote to list the most interesting AWS-related jobs from the Amazon Jobs site.

It serves as a rather good example of how service components such as the ones AWS provides (SNS, SQS and S3, to name a few that can also be used outside of AWS) form a great set of building blocks that let you focus on writing the code you really need to write.

I found the auto-scaling policy for spinning machines up and down just to tweet a bit of an overkill at first (Jeff could have easily added the code to the same instance running the cron); however, thinking about it a bit more and considering the various pricing strategies, it actually makes a lot of sense.

Membase Cluster instead of ElastiCache in 5 minutes

Want to have an ElastiCache-like service in all regions, not just US-EAST?
Want to utilize the new reserved instances utilization model to lower your costs?
Want to have your cache persistent and easily backup and restore it?
Want to add (or remove) servers from your cache cluster with no down time?

We have just the solution for you: a simple CloudFormation template to create a Membase cluster which gives you all of the ElastiCache benefits and a lot more, including:

  • Support for ALL regions (not just US-EAST; apparently Amazon beat me to the punch with support in US West (N. California), EU West (Dublin), Asia Pacific (Singapore) and Asia Pacific (Tokyo))
  • Support for reserved instances including the new utilization based model
  • Supports adding and removing servers to/from the cluster with no downtime and automatic rebalancing of keys among the cluster’s servers
  • Supports persistence (if you wish)
  • Supports multi-tenancy and SASL authentication
  • Supports micro instances (not recommended, but good for low-volume and testing environments)
  • Easy backup and restore
  • Install the cluster in multiple availability zones and regions for maximum survivability and have all of them talk to each other
  • When using a vBucket-aware memcached client or running through a Moxi proxy, changes in topology are communicated automatically with no need for code or configuration changes (see the sketch after this list)!
  • No need for a single address (the CNAME you get from ElastiCache), because if you are using a vBucket-aware client (or going through a Moxi proxy) topology changes are communicated to your client.
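
To make the last two points concrete, here is a minimal sketch of talking to the cluster through a local Moxi proxy using the python-memcached library; the key and value are, of course, hypothetical:

```python
# A sketch assuming a local Moxi proxy on its default port (11211);
# the key and value are just for illustration.
import memcache

# Always talk to the local Moxi proxy; it forwards each request to the
# right cluster node and absorbs topology changes for us.
mc = memcache.Client(["127.0.0.1:11211"])

mc.set("greeting", "hello from membase")
print(mc.get("greeting"))
```
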
Based on the CloudFormation scripts created by Couchbase Labs, this script is more generic and utilizes the maximum recommended share of the available RAM (80%) for each instance type.

Notes:

  • All instances run the latest Amazon Linux (2011-09)
  • 64-bit instances use instance-store for storage
  • t1.micro uses a 32-bit AMI to utilize the maximum amount of RAM (we don’t want those 64-bit pointers eating our RAM)
  • m1.small is 32-bit and, while it does have instance-store support, we wanted one CloudFormation script to rule them all, so it uses an EBS-backed AMI
  • The CloudFormation script changes the names of the instances to their public DNS names so they are reachable from anywhere in the world and from any Availability Zone and Region in AWS, so you can have replication across zones and regions

Security (VERY IMPORTANT!):

  • The default script will create a security group for you which allows access to ALL of the servers.
  • If you created a default bucket via the script, that bucket, which uses port 11211, is open to everyone. Make sure to protect it, or delete it and create a bucket with SASL protection.
  • In general, if you are creating a non-SASL-protected bucket, make sure to protect the port by updating the security group!

Download:

Instructions:

  1. Grab one of the existing templates from the GitHub repository or run gen-pack.py to generate a template with any number of instances in it.
  2. Go to the CloudFormation Console
  3. Click on “Create New Stack”
  4. Name your stack
  5. Select “Upload a Template File”
  6. Click “Browse” and select the template file (use one of the defaults named “membase-pack-X”)
  7. Click “Continue”
  8. In the “Specify Parameters” step, the minimum parameters you are required to fill in are:
    • RESTPassword – the password for the REST API and management console
    • KeyName – the name of the KeyPair you are using when creating instances (usually the name of the .pem file without the “.pem” suffix)
    • Instance Type – choose the size of the instance (t1.micro by default)
  9. Click “Continue”
  10. Sit back and relax. CloudFormation will do the rest of the work for you
  11. Select the “Outputs” tab (click refresh if it doesn’t show anything). It should contain the URL of the Membase management console (you can also fetch it programmatically; see the sketch after this list)
  12. Clap your hands with joy!
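
If you prefer to skip the console for step 11, here is a minimal sketch of grabbing the stack outputs with boto; the stack name is hypothetical:

```python
# A sketch using boto; "membase-pack-4" is a hypothetical stack name.
import boto

cfn = boto.connect_cloudformation()
stack = cfn.describe_stacks("membase-pack-4")[0]

# One of the outputs should be the URL of the Membase management console
for output in stack.outputs:
    print("%s = %s" % (output.key, output.value))
```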

Script to update Beanstalkd work queue statistics in CloudWatch

I’ve written a small Python script which uses the excellent boto library to monitor Beanstalkd server statistics as well as the statistics of a specific beanstalkd tube (queue).

(Image: “Beanstalkd” by Evil Cheese Scientist (Back Intact!))

You can get the code here: https://github.com/erans/beanstalkdcloudwatch

The easiest way to run it is via a cron job. I run it every minute to monitor the “reserved” and “buried” states of a few of the tubes I use (if you want to read more about how beanstalkd works, I suggest reading the protocol document).
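
Here is a minimal sketch of the idea (not the actual script): read the tube statistics with the beanstalkc library and push them to CloudWatch with boto. The tube name and the CloudWatch namespace are hypothetical:

```python
# A sketch of the idea, not the actual script; the tube name and the
# CloudWatch namespace are hypothetical.
import beanstalkc
import boto

beanstalk = beanstalkc.Connection(host="localhost", port=11300)
cloudwatch = boto.connect_cloudwatch()

stats = beanstalk.stats_tube("my-tube")
for state in ("current-jobs-reserved", "current-jobs-buried"):
    cloudwatch.put_metric_data(
        namespace="Beanstalkd",
        name=state,
        value=stats[state],
        unit="Count",
        dimensions={"Tube": "my-tube"})
```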

I highly recommend checking beanstalkd out if you need a queue for offloading tasks to other processes/machines. It’s really geared towards that task and has a couple of dedicated features, such as the buried state, time-to-live and time-to-run, which make managing this really painless.

AWS Elastic (accidental) Load Balancer Man-in-the-middle Attack

I just read a post on Slashdot about a poor guy getting a huge chunk of Netflix traffic to his server.

The problem seems to have been caused by the nature of IP addresses in EC2, which are quite fluid and get reassigned when you spin machines up and down. The same goes for Elastic Load Balancers (ELB), which are managed by Amazon and may switch IP addresses as well (that’s why they ask you to map to the ELB’s CNAME record instead of its IP).

In the Slashdot post, there is a link to this article, which describes the problem and lists some possible implications and possible ways of avoiding leaking data such as passwords and session ids when such a problem occurs.

The article mostly talks about what happens if someone hijacks your ELB, but the original problem reported was accidentally getting someone else’s traffic. This can lead to some other severe consequences:

  • Your servers crashing (in which case you should probably notice that rather quickly. Duh!)
  • If you are running some kind of content site that depends on SEO and crawlers pick up the wrong IP, you might end up with a HUGE SEO penalty because another site’s content will be crawled on your domain
There is a very simple and quick solution for the problem I am describing above: make sure you configure your web server to answer only to YOUR host names. Your servers will return responses ONLY for a preconfigured set of hostnames, so if you get Netflix traffic, which will probably carry a netflix.com hostname, your server will reject it immediately.

You can easily configure that in Nginx, Apache, or in a caching proxy such as Varnish or Squid.
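
For completeness, the same check can also be enforced at the application level. Here is a sketch as a Python WSGI middleware (a web-server-level check in Nginx or Apache is the simpler route, since it runs before your application code); the hostname list is hypothetical:

```python
# An application-level sketch of the same check, as a WSGI middleware;
# the allowed hostname list is hypothetical.
ALLOWED_HOSTS = ("www.example.com", "api.example.com")

def reject_unknown_hosts(app):
    def middleware(environ, start_response):
        # Strip an optional :port and compare against our own hostnames
        host = environ.get("HTTP_HOST", "").split(":")[0].lower()
        if host not in ALLOWED_HOSTS:
            start_response("400 Bad Request",
                           [("Content-Type", "text/plain")])
            return ["Unknown host\n"]  # traffic meant for someone else
        return app(environ, start_response)
    return middleware
```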

A better solution for this problem is to add hostname checks support to ELB itself. I’ve posted a feature request on the AWS EC2 forum with the hopes that it will get implemented.

URL Considerations When Using Amazon CloudFront Origin Pull

CloudFront is a great, cost-effective Content Delivery Network (CDN). When it first started it only supported files located on Amazon’s Simple Storage Service (S3); in November 2010 Amazon released the “Origin Pull” feature, which allows defining a CDN distribution that pulls content directly from a preconfigured site (a preconfigured hostname) instead of from S3.

(Image: “Cloud Front” by outdoorPDK)

The benefits of using the Origin Pull feature include:

  • No need to sync an S3 bucket with your static resources (CSS, images, JavaScript files)
  • You can serve dynamically generated content (like modified images or text files) via the CDN without pre-generating it and putting it inside an S3 bucket.
One of the problems that may occur when introducing any caching mechanism is the need to invalidate all or part of the data. CloudFront provides an invalidation API; however, it has various limitations such as:
  • You need to call it on each object
  • First 1,000 requests are free, each additional one will cost $0.005.
  • It may take up to 15 minutes for the cache to actually clear from all edge locations
There are some techniques to avoid calling the invalidation API by using versioned URLs.

What are versioned URLs?

A versioned URL contains a version part, e.g. “http://cdn.example.com/1.0/myimage.jpg”. The version part doesn’t affect the content of the URL, but since the URL to the resource is different, systems using the URL as a cache key will treat URLs with 2 different versions as 2 different resources.
It’s a nice trick to use when you want to quickly invalidate URLs and make a client pull a different/modified version of a resource.

Versioned URLs granularity

You can determine the granularity of the version value to suit your needs. The granularity determines whether you invalidate as little as one file, or every file served via the origin pull in your application.

Common granularity levels are:
  • A value determined by the build version (i.e. invalidate all static CSS, JS and images on every new build deployed)
  • A value in the configuration, updated automatically or manually to invalidate part or all of the objects
  • An automatically generated value per file, determined by the file content via a hash function (see the sketch after this list)
  • An automatically generated value per file determined by its last modification date
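
As an illustration of the hash-based flavor, here is a minimal Python sketch that derives the version part from the file’s content, so the URL changes exactly when the file changes; the CDN hostname and path layout are hypothetical and match the path-style examples discussed below:

```python
# A sketch; the CDN hostname and path layout are hypothetical.
import hashlib

def versioned_url(local_path, prefix, filename):
    # Derive the version from the file content, so the URL changes
    # exactly when the file changes (instant, implicit invalidation)
    with open(local_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()[:8]
    return "http://cdn.example.com/%s/v%s/%s" % (prefix, digest, filename)

# e.g. http://cdn.example.com/css/v3f2a1b9c/myfile.css
print(versioned_url("static/css/myfile.css", "css", "myfile.css"))
```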

CloudFront will disregard URL query string versioning

Amazon CloudFront (and quite a few other CDN providers) disregards the query string of a URL (the part after the question mark), whether the content is served from an S3 bucket or via an origin pull. This means you will have to rewrite your URLs to contain the version part inside the URL path itself. For example:
  • CloudFront will disregard a versioned URL of the following format and consider both URLs the same resource:
    • http://cdn.example.com/css/myfile.css?v123
    • http://cdn.example.com/css/myfile.css?v333
  • CloudFront will consider these 2 URLs 2 different resources:
    • http://cdn.example.com/css/v123/myfile.css
    • http://cdn.example.com/css/v333/myfile.css
You can easily use the Apache rewrite module or Nginx URL rewriting to quickly rewrite the URL http://cdn.example.com/css/v123/myfile.css to http://cdn.example.com/css/myfile.css.
Some common web frameworks put the versioning part in the query string. Be mindful of that, and change the code appropriately to place the version part somewhere in the URL path (before the question mark).
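
To illustrate the rewrite your origin server needs to perform, here is the same logic sketched in Python (in practice this is a one-line Apache mod_rewrite or Nginx rewrite rule); the version-segment format, a “v” followed by hex characters, is just the hypothetical convention from the examples above:

```python
# The origin-side rewrite sketched in Python; in practice this is a
# one-line Apache or Nginx rewrite rule.
import re

VERSION_SEGMENT = re.compile(r"/v[0-9a-f]+/")

def strip_version(path):
    # /css/v123/myfile.css -> /css/myfile.css
    return VERSION_SEGMENT.sub("/", path, count=1)

print(strip_version("/css/v123/myfile.css"))
```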

I would recommend using CloudFront, or any other CDN supporting origin pull, in any project: it will significantly reduce the loading time of your pages at minimal cost and reduce the load on your servers. It’s a great, quick and easy way to make your site (or even your API) work much better.

The cloud is more than just the hardware – it’s also about services!

When talking about the cloud, most people talk about running the servers on the cloud. They talk about the fact that they can start or stop virtual servers with a simple API call.

(Image: “Cloud Services Toolbox” taken by keepthebyte)

A lot of cloud providers do provide you with virtual servers, virtual load balancers and infinite (and/or scalable) storage, but these are all building blocks of servers. You still need to do the heavy lifting of taking a bunch of servers and making them do the work for you. You need to install the software (not just the software you wrote), manage it, handle disaster recovery, backups, logging, etc.

What makes Amazon’s cloud unique, on top of the servers and storage it provides, is the set of cloud services that can be used as black boxes for your application/service, reducing the code you need to write and maintain as well as the number of servers you need to administer. These services are also available for use outside of Amazon’s cloud, but there are benefits to using them from within your EC2 instances.

The Amazon non-hardware cloud services offering is split into two categories:

  • Off-the-shelf services – Preconfigured services where Amazon takes most of the management hassle off your hands. Such services are:
    • Amazon Relational Database Service (RDS) – Hosted MySQL which Amazon manages and patches, and which helps you back up and restore as well as deploy in a smart way across multiple availability zones.
    • Amazon ElastiCache – Hosted Memcached which Amazon manages, allows you to resize (add or remove servers) and keeps patched with the latest software.
  • Development Building Blocks (Black Boxes) – Web services that provide functionality which you can mix and match to create your service, removing the need for you to handle the load, machines and configuration of these services. Such services are:
    • Amazon SimpleDB – An Amazon-written key/value datastore that is hosted and operated by Amazon. Scalable and simple. It just works.
    • Amazon Simple Queue Service (SQS) – A reliable, scalable, persistent queue service. Great for talking with distributed async components.
    • Amazon Simple Email Service (SES) – A scalable email service which allows you to send email (to humans and machines).
    • Amazon Simple Notification Service (SNS) – A web service to send notifications to people/machines/services/etc. It allows sending the notification out as an email or as an HTTP request, as well as posting it to an SQS queue.

What I love the most about Amazon Web Services is that when they do provide a certain Development Building Block, such as Simple Email Service (SES), they do so without killing or harming the competition. Other email services such as Mailgun, SendGrid and MailChimp still offer enough value and features to co-exist with SES.

Not stepping (too much) on web service developers’ toes is not something to dismiss, and I would love to see the innovation that comes out of cloud-based web services in the future.

Go cloud services! 🙂

ec2ssh – Quicker SSH-ing Into Your EC2 Instances

Ever since Amazon introduced tags in EC2, it has been such a relief to be able to name my instances and actually remember which one is which.

It took a while for Name tags to find their way into various aspects of the AWS Console; even so, connecting to machines still requires looking up the IP address in the console or via the command line using the EC2 API Tools.

I thought it would be much easier for me and others to utilize the Name tags to connect to the machines.

Initially, the script was a simple bash script which utilized the ec2-describe-instances command with a flag that matched the Name attribute. However, it also had to manage various other command line parameters, such as the user name to connect with (Ubuntu images, for example, use the ‘ubuntu’ user instead of ‘root’; the Amazon Linux AMI uses ‘ec2-user’, etc.), so I decided to rewrite it in Python using the magnificent boto library.

This allowed better argument handling, and removed the need to install Java and the EC2 API Tools just to access the EC2 API.
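
To give a feel for the idea, here is a simplified sketch (not the actual script) of resolving a Name tag to a public DNS name with boto and exec-ing ssh; the tag value and user name are hypothetical:

```python
# A simplified sketch of the idea, not the actual ec2ssh script; the
# tag value and user name are hypothetical.
import os
import boto

def ssh_by_name(name, user="ec2-user"):
    ec2 = boto.connect_ec2()
    # Find a running instance whose Name tag matches and exec ssh to it
    for reservation in ec2.get_all_instances(filters={"tag:Name": name}):
        for instance in reservation.instances:
            if instance.state == "running":
                os.execvp("ssh", ["ssh", "%s@%s" % (user, instance.public_dns_name)])
    raise SystemExit("no running instance named %s" % name)

ssh_by_name("web-1")
```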

Grab the code here

Don’t forget to send some feedback!


Recommendation: Consider Submitting Messages to Amazon’s Simple Queue Service (SQS) via Simple Notification Service (SNS)

Thinking a bit more about my last post, I’ve come to the conclusion that unless there is a really good excuse, one should always submit items to Amazon’s Simple Queue Service (SQS) by publishing to a topic in Simple Notification Service (SNS).

In the simplest case, you’ll publish to a topic and that topic will have a single subscriber that will post the message to the queue.

However, since SNS allows multiple subscribers, you can get a couple of features free of charge without changing a single line of code (see the sketch after this list). For example, you can:

  • Temporarily add an Email address to get the messages going to the queue via Email as well, for easy debugging (you can easily and quickly unsubscribe via the link at the end of the Email)
  • Add additional logging by adding an HTTP/S subscriber that receives the message and performs some logging on it
  • Notify other monitoring systems that a certain process has started
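
As a sketch of how little code this wiring takes with boto (the topic and queue names and the email address are hypothetical):

```python
# A sketch with boto; the topic/queue names and the email address are
# hypothetical.
import boto

sns = boto.connect_sns()
sqs = boto.connect_sqs()

arn = sns.create_topic("tasks")["CreateTopicResponse"]["CreateTopicResult"]["TopicArn"]
queue = sqs.create_queue("tasks-queue")

# The queue remains the "real" consumer...
sns.subscribe_sqs_queue(arn, queue)
# ...while a temporary email subscription gives free debugging visibility
sns.subscribe(arn, "email", "me@example.com")

sns.publish(arn, "hello, subscribers")
```
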
I know that from now on I’ll try to think really hard if I really need to publish directly to a queue instead of using SNS.

Using Amazon’s Simple Notification Service (SNS) & Simple Queue Service (SQS) For a Reliable Push Based Processing of Messages

I’ve recently encountered a use case where I need to reliably send a message to a worker process that should handle the following requirements:

  • Message should persist until the action it specified is done.
  • Message should be processed (or wait to be processed) even when the worker processes have a bug or are down
  • Message processing should be as fast as possible – process the message ASAP.

Using Amazon’s Simple Queue Service (SQS)

SQS can provide the persistency needed. The message will be available until it is deleted (at least up to ~3 days), even if the worker processes are down or have a bug.

The problem with using SQS is that it requires polling, which introduces a delay between the time a message is published and the time it is processed. That delay can be small, a couple of seconds, but can easily reach 30 seconds or more (depending on the limitations of SQS polling and the polling interval used).

Using Amazon’s Simple Notification Service (SNS)

SNS can provide a push mechanism that notifies a subscriber in near-real-time that a message has been published.
However, SNS only guarantees a single delivery to each subscriber of a given topic. This means that if there was a bug or a problem processing the message, and there was no specific code to save it somewhere, the message is lost.

The Solution

SQS and SNS can be combined to produce a PUSH mechanism to reliably handle messages.
The general configuration steps are:
  • Create a topic
  • Subscribe an SQS queue to that topic
  • Subscribe a worker process that works via HTTP/S to that topic (for increased reliability this can be an Elastic Load Balancer (ELB) that hides a set of machines)
Then the flow goes like this:
  • Submit a message to the topic
  • SNS will publish the message to the SQS queue
  • SNS will then notify via HTTP/S POST to the handler to start processing the message
When the worker process gets the HTTP/S POST from SNS, it will poll the queue until there are no items left, and then finish the HTTP request.
To handle failures, when the worker process has a bug, is down, or did not get the SNS notification, a regular worker process can poll the queue at regular, longer intervals to make sure all messages get processed and nothing falls behind.
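
A minimal sketch of the worker’s drain loop using boto might look like this; the queue name and the process() handler are hypothetical:

```python
# A sketch of the worker's drain loop using boto; the queue name and
# the process() function are hypothetical. This runs inside the HTTP
# handler that receives the SNS POST (and in the fallback poller).
import boto

def drain_queue(queue_name="tasks-queue"):
    queue = boto.connect_sqs().get_queue(queue_name)
    while True:
        messages = queue.get_messages(num_messages=10)
        if not messages:
            break  # nothing left; finish the HTTP request
        for message in messages:
            process(message.get_body())    # your handler (hypothetical)
            queue.delete_message(message)  # delete only after success
```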

This solution covers the original 3 requirements: message persistence, handling cases where workers are down or have bugs, and processing messages as soon as they are sent.