Amazon recently introduced the AWS Reserved Instances Marketplace. The idea is great – allow people to sell their reserved instances which they don’t need for whatever reason instead of losing the reservation money (or if you are in heavy utilization – the complete run cost of the instance 24 x 7 x Number of years you reserved).
Before you can sell a reserved instance you need to setup various details to which Amazon will wire the money – however if you are not located in the US or have a US bank account you are out of luck. Unfortunately for me – I’m located in Israel with no US bank account.
Instead of messing with various taxing issues I would like to suggest AWS to simply give back AWS credits. That is – if I sell my reserved instance for $100 I should have the option of directly crediting my AWS account with $100 which I can then use on various AWS services.
I know AWS has some mechanism to work with such a thing since they do give out gift/trial credits all the time. I also know that the Amazon Associates program for referring customers to Amazon can give you back Amazon gift certificates instead of actual money.
Just a thought that would keep the money inside the AWS ecosystem while making non US customers happy.
Hosted solutions pricing plans use varied parameters such as price/server, price/metric, retention policy, # of metrics tracked, realtime-ness, etc.
Poll based method – where collecting server polls the other servers/service vs. Push – where you have a client on the server that pushes locally collected data to logging/monitoring server
Allowing custom metrics – not all systems allows monitoring, plotting, sending and alert on custom data (at least not in a easy manner)
Some of these systems are better suited to some tasks more than the others but in general none of them provides a (good) solution for handling todays monitoring needs that spans from operational to applicative.
My ideal monitoring system for any application that have servers running in the background should have the following features:
Hosted – for when it doesn’t make sense for me to run the operations of this
Open Source – for when the sweet spot leans towards me taking control of the operations and management of collecting my own statistics with a CLEAR path of migration between the hosted solution and my installed one
Suitable for a cloud / virtual server environment – where servers go up and down and each server simply reports its data to the monitoring system without the need to pre-register it with the system. This suggests a small client running on each machine collecting local stats and relaying it to a central system
Supports custom metrics – allowing me to add whatever stats I want be it operational (CPU, network bandwidth, disk space) or application related (such as number of sign ups or a specific function run time in milliseconds)
Understand more than numbers – not all stats are equal. Some are counters which I just want to say “increment” or “decrement”. Others are single data points that I need to simply store. Others are special data points with a unit of measure (such as “milliseconds”, “Bytes/second”, etc)
Locally installed client must handle network failures – if there is a network failure or a collecting server down time, stats will be stored locally and relayed to the collecting server when its available again
Locally installed client should collect custom metrics – if I want to send some custom metrics from my app – say when a user signs up – my code would talk with the locally installed client and that client will relay the data to the collecting server. This ensures minimum configuration and my app code can assume that there is always a locally installed client which can communicate with the collecting server be it via UNIX sockets, UDP datagram, shared memory or anything else that is suitable for the job
Data should be query-able – that is, I really want to query and filter more than just the timeframes of the data and general statistics on it (i.e. group by server, show specific servers, show values higher than X, etc)
Reporting Console – somewhere to plot all these statistics which has embeddable graphs (for those who likes building their own dashboards)
Built-in near real-time alerts – I want to be able to set alerts that go out near real time when collecting the data to a diverse set of outlets be it Email, Text Messages, Push Notifications, WebHook (for automating some auto handling of failures or problems), etc.
API – because everything needs it 🙂
It is very important to me in almost any hosted SaaS (Software-as-a-Service) solution I use that I will have a clear migration path if (or when) the time comes and I need to host a certain sub-system on my own. Sometimes I do have to compromise and use a system that I may not have the ability to migrate (or at least not easily) but the decision is made consciously.
From an architecture point of view, I would like to see these main building blocks:
Storage – Reliable, scalable, can handle lots of writes fast and handle any size of dataset for any reasonable retention period
Collectors – clients push data to these collectors which gets it and pass it on the processors
Processors – Handle incoming data to be written. Aggregate data for quicker reporting.
Reporting – something that will enable easy querying and filtering of the data
Real time alerts monitoring – handle preconfigured alerts and figuring in near real time if certain conditions are met to issue the relevant alerts/actions
Web Console – for configuration, querying and real-time plotting of data
API for querying
API for real time plotting – to be used for integration with other apps, embeddable chunks of code, etc.
While I’m sure with a little more thought more requirements can be added or some of these requirements can be merged and minimized, this set of features will create a system a lot of people would love to use and feel comfortable using.
Would you use such a system? Do you have anything else to add to the feature set?
The good guys at Amazon Web Services just announced a new feature in Simple Notifications Service (SNS) which allows settings a retry policy for SNS notifications.
Up until now SNS had an unknown publish retry policy (and maybe non existing). I always suspected it had some logic for different subscription types (Email, HTTP, Text, SQS) but it was never mentioned anywhere.
The new retry policy feature allows you to define the number of retries and the wait period between them (even if its a linear or exponential wait!) as well as set a throttling policy so that if your server is currently down it won’t get flooded with notifications once its back up.
This allows for some very interesting patterns. Most notably is push based messaging mechanism in which instead of writing a dedicated process to poll an SQS queue you can use SNS as sort of an ad-hoc push queue that will post the messages to an HTTP/S URL. Setting a reasonable retry policy and throttling policy will also ensure that if your server is down, messages won’t get lost.
I posted a while back a suggestion for a hack which utilized SNS as a notification mechanism to start polling SQS, however now that SNS has a retry policy its a good candidate for allowing you to handle your async tasks using your regular HTTP servers with all the goodies of logging, multi-threading, debugging, etc.
Before you run to start implementing a push based messaging (or re-implement Google AppEngine’s TaskQueue Push Queue API), there are certain things which are yet unknown and/or require further consideration:
SQS has a 15 days storage policy, so you have up to 15 days to fix a bug or setup a system that will empty a queue. In the new SNS retry policy you may reach a similar long period of time however, the maximum values to set in the policy are not yet known and may pose a limit.
I am no aware and couldn’t find any documentation which related to the HTTP status code of a message pushed to an HTTP/S subscriber (other than answering the subscription request). How can you tell SNS that if a message was pushed to an HTTP subscriber, the subscriber failed due to an HTTP error? In that case will SNS consider a non HTTP Status 200 request a failed request and will do the retry policy?
What happens if a message pushed to an HTTP/S subscriber takes a long time to process due to load or any other reason? When will SNS decide if the request failed due to timeout?
I hope some of these questions will get cleared up and SNS can become a viable push based messaging mechanism.
I previously recommended to always push via SNS (even if its just to an SNS queue) just to get the added benefits of easier debugging (subscribe to an SQS queue AND send email or HTTP request to debug, etc). The new features only proves that its becoming a very interesting building block to use inside and/or outside of AWS.
In a recent post on the AWS blog, Jeff Barr and Matt Wood, showed the architecture and code they wrote which lists the most interesting AWS related jobs from the Amazon Jobs site.
It serves as a rather good example of how service components such as the ones AWS provides (SNS, SQS, S3 to name a few that are AWS agnostic) a great set of building blocks that can easily help you focus on writing the code you really need to write.
I found the auto scaling policy for spinning up & down machines just to tweet a bit of an over kill at first (and Jeff could have easily added the code on the same instance running the cron), however thinking about it a bit more and considering the various pricing strategies it actually makes a lot sense.
I just read a post on Slashdot about a poor guy getting a huge chunk of Netflix traffic to his server.
The problem seemed to have been caused by the nature of IP address in EC2 which are quite fluid and gets reassigned when you spin up and down a machine. The same goes for Elastic Load Balancers (ELB) which are managed by Amazon and may switch the IP address as well (that’s why they ask to map to their CNAME record for the ELB instead of the IP).
In the Slashdot post, there is a link to this article, which describes the problem and lists some possible implications and possible ways of avoiding leaking data such as passwords and session ids when such a problem occurs.
The article mostly talks about what happend if someone hijacks your ELB, but the original problem reported was that you accidentally got someone elses traffic. This can lead to some other severe consequences:
Your servers crashing (in which case you should probably notice that rather quickly. Duh!)
If you are running some kind of a content site that depends on SEO and crawlers picked on the wrong IP, you might end up with a HUGE SEO penalty because another site’s content will be crawled on your domain
There is a very simple and quick solution for the problem I am describing above. Make sure you configure your web server to answer only to YOUR host names. Your servers will return response ONLY for a preconfigured set of hostnames, so if you get Netflix traffic, which probably has netflix.com hostname, your server will reject it immediately.
You can easily configured that in Nginx, Apache or if you have a caching proxy such as Varnish or squid.
A better solution for this problem is to add hostname checks support to ELB itself. I’ve posted a feature request on the AWS EC2 forum with the hopes that it will get implemented.
When talking about the cloud, most people talk about running the servers on the cloud. They talk about the fact that they can start or stop virtual servers with a simple API call.
A lot of Cloud Providers do provide you with virtual servers, virtual load balancer and infinite (and/or scalable) storage, but these are all building block of servers. You still need to do the heavy lifting of taking a bunch of servers and making it do the work for you. You need to install the software (not just the one you wrote), mange, handling disaster recovery, backups, logging, etc.
What makes Amazon’s cloud unique on top of the servers and storage they provide is the fact that it provides a set of cloud services that can be used as black boxes for your application/service reducing the code you need to write/maintain as well as the number of servers you need to administer. These services are also available for use outside of Amazon’s cloud but there are benefits of using them within your EC2 instances.
The Amazon non-hardware cloud services offering is split into two categories:
Off the shelf services – These are preconfigured services that Amazon takes most of the hassle off of managing it. Such services are
Amazon ElastiCache – Hosted Memcached which Amazon manages, allows to resize (add or remove servers) as well as patch with the latest software.
Development Building Blocks (Black Boxes) – Web services that provides functionality which you can mix and match to create you service, removing the need for you to handle the load, machines and configuration of these services. Such services are:
Amazon SimpleDB – an Amazon written key/value datastore that is hosted and operated by Amazon. Scalable and simple. It just works
Amazon Simple Notification Service (SNS) – A web service to send notifications to people/machines/service/etc. It allows to send the notification out as an Email or as an HTTP request as well as post it to an SQS queue
What I love the most about Amazon Web Services is the fact that when they do provide a certain Development Building Block such as Simple Email Service (SES), they do so without killing or harming the competition. There is still enough value and features in other Email services such as mailgun, SendGrid and MailChimp for them to co-exist with SES.
Not stepping (too much) on web services developers toes is not something to dismiss and I would love to see the innovation that comes out of Cloud based web services in the future.