Thinking a bit more about my last post, I’ve come to the conclusion that unless there is a really good reason not to, one should always submit items to Amazon’s Simple Queue Service (SQS) by publishing to a topic in Simple Notification Service (SNS).
In the simplest case, you’ll publish to a topic and that topic will have a single subscriber that will post the message to the queue.
However, since SNS allows multiple subscribers, you get a couple of extra features without changing a single line of code. For example, you can:
- Temporarily add an Email address as a subscriber to receive the messages going to the queue via Email for easy debugging (you can quickly unsubscribe via the link at the end of the Email)
- Add additional logging by adding an HTTP/S subscriber that receives the message and performs some logging on it
- Notify other monitoring systems that a certain process has started
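To make the idea concrete, here is a minimal sketch of the pattern using Python and boto3 (my choice for illustration, the post itself doesn't prescribe a library). The helper names `publish_job` and `add_subscriber`, the ARNs and the Email address are all made up for the example.

```python
import json


def publish_job(sns_client, topic_arn, payload):
    """Publish a job to the topic instead of writing to the queue directly."""
    response = sns_client.publish(TopicArn=topic_arn, Message=json.dumps(payload))
    return response["MessageId"]


def add_subscriber(sns_client, topic_arn, protocol, endpoint):
    """Attach another subscriber, e.g. ("sqs", queue_arn) or ("email", address)."""
    response = sns_client.subscribe(
        TopicArn=topic_arn, Protocol=protocol, Endpoint=endpoint
    )
    return response["SubscriptionArn"]


# Example wiring (hypothetical ARNs and address; needs boto3 and AWS credentials):
#   import boto3
#   sns = boto3.client("sns")
#   topic = "arn:aws:sns:us-east-1:123456789012:jobs"
#   add_subscriber(sns, topic, "sqs", "arn:aws:sqs:us-east-1:123456789012:jobs")
#   add_subscriber(sns, topic, "email", "me@example.com")  # temporary, for debugging
#   publish_job(sns, topic, {"task": "resize", "id": 42})
```

Two things to keep in mind with this setup: the queue needs an access policy that allows the topic to send messages to it, and by default the message arrives in the queue wrapped in an SNS JSON envelope, so the worker has to unwrap it unless raw message delivery is enabled on the subscription.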
I know that from now on I’ll think really hard about whether I really need to publish directly to a queue instead of going through SNS.
UPDATE (2011-07-16): I just got a newsletter Email from Amazon stating that they have added SQS and SNS to CloudWatch, which allows monitoring SQS queues not just for the length of the queue but for other metrics as well, so there is no real need for my script. Unless you really really want to use it 🙂
All you have to do is select SQS in the metric type drop-down and you will see a set of metrics to choose from for all of your existing queues.
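If you prefer to read the built-in metric from code rather than from the console, something along these lines should work. It is only a sketch using boto3; the function name `latest_queue_depth` is mine, while `AWS/SQS`, `ApproximateNumberOfMessagesVisible` and the `QueueName` dimension are the names CloudWatch uses for SQS queues.

```python
from datetime import datetime, timedelta, timezone


def latest_queue_depth(cloudwatch_client, queue_name, minutes=10):
    """Return the most recent average of visible messages, or None if no data."""
    end = datetime.now(timezone.utc)
    response = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=300,  # seconds per datapoint
        Statistics=["Average"],
    )
    # Datapoints are not guaranteed to come back in order, so sort by time.
    datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else None


# Usage (needs boto3 and AWS credentials; "jobs" is a hypothetical queue name):
#   import boto3
#   print(latest_queue_depth(boto3.client("cloudwatch"), "jobs"))
```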
Amazon’s CloudWatch is a great tool for monitoring various aspects of your service. Last May Amazon introduced custom metrics to CloudWatch, which let you send any metric data you wish to CloudWatch. You can then store it, plot it and also create CloudWatch alerts based on it.
One of the things missing from CloudWatch is Simple Queue Service (SQS) monitoring, so I’ve written a small script to update a queue’s count in a CloudWatch custom metric.
Having the queue’s count in CloudWatch allows adding alerts and actions based on the queue’s length.
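The core of such a script boils down to two API calls: read the approximate message count from SQS and push it to CloudWatch as a custom metric. The sketch below shows the idea with boto3; it is not the script linked at the end of this post, and the `Custom/SQS` namespace, the `QueueLength` metric name and the function names are assumptions made for the example.

```python
def queue_length(sqs_client, queue_url):
    """Read the approximate number of messages waiting in the queue."""
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["ApproximateNumberOfMessages"]
    )
    # SQS returns attribute values as strings.
    return int(response["Attributes"]["ApproximateNumberOfMessages"])


def report_queue_length(sqs_client, cloudwatch_client, queue_url, queue_name,
                        namespace="Custom/SQS"):
    """Push the current queue length to a CloudWatch custom metric."""
    count = queue_length(sqs_client, queue_url)
    cloudwatch_client.put_metric_data(
        Namespace=namespace,  # hypothetical namespace, pick your own
        MetricData=[{
            "MetricName": "QueueLength",  # hypothetical metric name
            "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
            "Value": count,
            "Unit": "Count",
        }],
    )
    return count


# Usage (needs boto3 and AWS credentials; the queue URL is a placeholder):
#   import boto3
#   report_queue_length(boto3.client("sqs"), boto3.client("cloudwatch"),
#                       "https://sqs.us-east-1.amazonaws.com/123456789012/jobs", "jobs")
```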
For example, if the queue’s length stays above a certain amount for a certain period of time, one of 2 things has happened:
- There is a bug in the code causing the worker processes that handle the queue’s messages to fail
- There is a higher than usual load on the system, causing the queue to fill up with more and more messages while there aren’t enough worker processes to handle them in a reasonable time
If the load is higher than usual, you can easily have a CloudWatch alert add an additional machine instance running more worker processes, or simply send an Email alert saying there is something wrong.
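Such an alert (CloudWatch calls it an alarm) could be defined roughly like this with boto3. The helper below only builds the parameters for `put_metric_alarm`; the metric and namespace names are the hypothetical ones you would have chosen when publishing the custom metric, and the action ARNs would point at an SNS topic (for the Email) or an Auto Scaling policy (for the extra instance).

```python
def queue_alarm_params(queue_name, threshold, minutes, action_arns,
                       namespace="Custom/SQS", metric_name="QueueLength"):
    """Build put_metric_alarm arguments: fire when the average queue length
    stays above `threshold` for `minutes` consecutive 1-minute periods."""
    return {
        "AlarmName": f"{queue_name}-backlog",
        "Namespace": namespace,      # assumed custom namespace
        "MetricName": metric_name,   # assumed custom metric name
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": minutes,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": list(action_arns),
    }


# Usage (needs boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   params = queue_alarm_params("jobs", 100, 5,
#                               ["arn:aws:sns:us-east-1:123456789012:ops-alerts"])
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```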
The script is very easy to use and can be run from a cron job. I’m running it as a cron job at 1-minute intervals and have set up various CloudWatch alerts to better monitor my queue.
Grab the script on GitHub: SQS Cloud Watch Queue Count.