My ideal monitoring system

The server and application monitoring field is crowded with options these days, and the solutions vary greatly in feature set and in how they are managed.

There are various parameters distinguishing between these systems:

  • Hosted (CloudKick, ServerDensity, CloudWatch, RevelCloud and others) vs. Installed (Nagios, Munin, Ganglia, Cacti)

  • Pricing – hosted solutions price their plans on varied parameters such as price/server, price/metric, retention policy, # of metrics tracked, realtime-ness, etc.
  • Poll vs. Push – with polling, a collecting server polls the other servers/services; with pushing, a client on each server pushes locally collected data to the logging/monitoring server
  • Allowing custom metrics – not all systems allow monitoring, plotting and alerting on custom data (at least not in an easy manner)

Some of these systems are better suited to certain tasks than others, but in general none of them provides a (good) solution for today's monitoring needs, which span from the operational to the application level.

My ideal monitoring system for any application that has servers running in the background should have the following features:

  • Hosted – for when it doesn’t make sense for me to run the operations of this
  • Open Source – for when the sweet spot leans towards me taking control of the operations and management of collecting my own statistics with a CLEAR path of migration between the hosted solution and my installed one
  • Suitable for a cloud / virtual server environment – where servers go up and down and each server simply reports its data to the monitoring system without the need to pre-register with it. This suggests a small client running on each machine, collecting local stats and relaying them to a central system
  • Supports custom metrics – allowing me to add whatever stats I want be it operational (CPU, network bandwidth, disk space) or application related (such as number of sign ups or a specific function run time in milliseconds)
  • Understands more than numbers – not all stats are equal. Some are counters, for which I just want to say “increment” or “decrement”. Others are single data points that I simply need to store. Others are special data points with a unit of measure (such as “milliseconds”, “Bytes/second”, etc.)
  • Locally installed client must handle network failures – if there is a network failure or the collecting server is down, stats will be stored locally and relayed to the collecting server when it's available again
  • Locally installed client should collect custom metrics – if I want to send some custom metrics from my app – say when a user signs up – my code would talk to the locally installed client and that client would relay the data to the collecting server. This ensures minimum configuration, and my app code can assume that there is always a locally installed client which can communicate with the collecting server, be it via UNIX sockets, UDP datagrams, shared memory or anything else that is suitable for the job
  • Data should be query-able – that is, I really want to query and filter more than just the timeframes of the data and general statistics on it (i.e. group by server, show specific servers, show values higher than X, etc)
  • Reporting Console – somewhere to plot all these statistics, with embeddable graphs (for those who like building their own dashboards)
  • Built-in near real-time alerts – I want to be able to set alerts that go out in near real time, as the data is collected, to a diverse set of outlets, be it Email, Text Messages, Push Notifications, WebHooks (for automating the handling of failures or problems), etc.
  • API – because everything needs it 🙂
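
As a rough illustration of the client-side requirements above (typed metrics, a local agent the app talks to, and buffering through network failures), here is a minimal Python sketch. Everything in it is hypothetical: the `MetricsRelay` class, its method names and the JSON record layout are made up for this example and do not belong to any existing product. The `send` callable stands in for whatever transport fits (a UDP datagram, a UNIX socket, etc.) and is assumed to raise `OSError` when the collecting server cannot be reached.

```python
import json
import time
from collections import deque


class MetricsRelay:
    """Hypothetical local agent: accepts typed metrics and relays them upstream."""

    def __init__(self, send, max_buffered=10_000):
        # `send` delivers one encoded record to the collecting server and is
        # assumed to raise OSError on a network failure (an assumption of
        # this sketch, not a property of any particular transport).
        self._send = send
        # Bounded buffer: once full, the oldest records are dropped first.
        self._pending = deque(maxlen=max_buffered)

    def increment(self, name, by=1):
        self._record("counter", name, by, None)

    def decrement(self, name, by=1):
        self._record("counter", name, -by, None)

    def gauge(self, name, value, unit=None):
        # A single data point, optionally with a unit such as "Bytes/second".
        self._record("gauge", name, value, unit)

    def timing(self, name, milliseconds):
        self._record("timing", name, milliseconds, "ms")

    def pending(self):
        """Number of records still waiting to be delivered."""
        return len(self._pending)

    def flush(self):
        """Send buffered records in order; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                return False  # keep the record and retry on the next flush
            self._pending.popleft()
        return True

    def _record(self, kind, name, value, unit):
        record = {"type": kind, "name": name, "value": value,
                  "unit": unit, "ts": time.time()}
        self._pending.append(json.dumps(record))
        self.flush()
```

With something like this in place, application code would only ever call `relay.increment("signups")` or `relay.timing("render_page", 12.5)`, and would not need to know whether the collecting server is currently reachable.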

With almost any hosted SaaS (Software-as-a-Service) solution I use, it is very important to me to have a clear migration path if (or when) the time comes and I need to host a certain sub-system on my own. Sometimes I do have to compromise and use a system that I may not be able to migrate away from (or at least not easily), but that decision is made consciously.

From an architecture point of view, I would like to see these main building blocks:

  • Storage – Reliable, scalable, can handle lots of writes fast and handle any size of dataset for any reasonable retention period
  • Collectors – clients push data to these collectors, which receive it and pass it on to the processors
  • Processors – Handle incoming data to be written. Aggregate data for quicker reporting.
  • Reporting – something that will enable easy querying and filtering of the data
  • Real time alerts monitoring – evaluates preconfigured alerts and determines in near real time whether certain conditions are met, in order to issue the relevant alerts/actions
  • Web Console – for configuration, querying and real-time plotting of data
  • API for querying
  • API for real time plotting – to be used for integration with other apps, embeddable chunks of code, etc.
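
To make the processor, alerting and reporting blocks a bit more concrete, here is a small in-memory Python sketch. It is only an illustration under my own assumptions: the `Processor` and `AlertRule` names, the fixed 60-second buckets and the "value above threshold" rule are all hypothetical, and a real system would write to the storage layer instead of a dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str        # metric name the rule applies to
    threshold: float   # alert when a single value exceeds this
    message: str


class Processor:
    """Hypothetical processor: aggregates data points and checks alert rules."""

    def __init__(self, rules, notify, bucket_seconds=60):
        self._rules = rules
        self._notify = notify  # callable(rule, server, value): email, webhook...
        self._bucket_seconds = bucket_seconds
        self._buckets = defaultdict(
            lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})

    def handle(self, server, metric, value, timestamp):
        # Aggregate into fixed time buckets for quicker reporting.
        start = int(timestamp // self._bucket_seconds) * self._bucket_seconds
        bucket = self._buckets[(server, metric, start)]
        bucket["count"] += 1
        bucket["sum"] += value
        bucket["max"] = max(bucket["max"], value)
        # Evaluate alerts as the data arrives (near real time).
        for rule in self._rules:
            if rule.metric == metric and value > rule.threshold:
                self._notify(rule, server, value)

    def query(self, metric, server=None, max_above=None):
        """Filter buckets by server and/or by peak value higher than X."""
        rows = []
        for (srv, name, start), b in sorted(self._buckets.items()):
            if name != metric:
                continue
            if server is not None and srv != server:
                continue
            if max_above is not None and b["max"] <= max_above:
                continue
            rows.append({"server": srv, "start": start, "count": b["count"],
                         "avg": b["sum"] / b["count"], "max": b["max"]})
        return rows
```

The `query` method hints at the kind of filtering mentioned in the feature list (specific servers, values higher than X); the query and plotting APIs would sit on top of something like it.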

While I’m sure with a little more thought more requirements can be added or some of these requirements can be merged and minimized, this set of features will create a system a lot of people would love to use and feel comfortable using.

Would you use such a system? Do you have anything else to add to the feature set?

AWS Elastic (accidental) Load Balancer Man-in-the-middle Attack

I just read a post on Slashdot about a poor guy getting a huge chunk of Netflix traffic to his server.

The problem seems to have been caused by the nature of IP addresses in EC2, which are quite fluid and get reassigned when you spin machines up and down. The same goes for Elastic Load Balancers (ELB), which are managed by Amazon and may switch IP addresses as well (that’s why Amazon asks you to map to the ELB's CNAME record instead of its IP).

In the Slashdot post, there is a link to this article, which describes the problem and lists some possible implications and possible ways of avoiding leaking data such as passwords and session ids when such a problem occurs.

The article mostly talks about what happens if someone hijacks your ELB, but the original problem reported was that you accidentally get someone else's traffic. This can lead to some other severe consequences:

  • Your servers crashing (in which case you should probably notice that rather quickly. Duh!)
  • If you are running some kind of content site that depends on SEO and crawlers pick up the wrong IP, you might end up with a HUGE SEO penalty because another site’s content will be crawled on your domain

There is a very simple and quick solution for the problem I am describing above: make sure you configure your web server to answer only to YOUR host names. Your servers will return a response ONLY for a preconfigured set of hostnames, so if you get Netflix traffic, which presumably carries a netflix.com hostname, your server will reject it immediately.

You can easily configure this in Nginx or Apache, or in a caching proxy such as Varnish or Squid if you have one.
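
The web server or proxy is the natural place for this check, but to show the idea itself, here is a minimal sketch of the same hostname check written as a Python WSGI middleware. The `HostAllowlist` class and the `example.com` hostnames are hypothetical, and the 400 status is just one reasonable choice for the rejection.

```python
class HostAllowlist:
    """Hypothetical WSGI middleware: serve only requests for known hostnames."""

    def __init__(self, app, allowed_hosts):
        self._app = app
        self._allowed = {host.lower() for host in allowed_hosts}

    def __call__(self, environ, start_response):
        # The Host header may carry a port ("example.com:8080"); drop it.
        # IPv6 literals are not handled in this sketch and get rejected.
        host = environ.get("HTTP_HOST", "").split(":", 1)[0].lower()
        if host not in self._allowed:
            body = b"Unknown host\n"
            start_response("400 Bad Request", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self._app(environ, start_response)


def hello_app(environ, start_response):
    # Stand-in for the real application.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


# Only these (made-up) hostnames are answered; anything else is rejected.
guarded_app = HostAllowlist(hello_app, ["example.com", "www.example.com"])
```

Traffic that arrives with someone else's hostname never reaches the application, so it can neither be served under your domain nor be crawled as your content.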

A better solution for this problem would be to add hostname check support to ELB itself. I’ve posted a feature request on the AWS EC2 forum in the hope that it will get implemented.