My ideal monitoring system

The server / application monitoring field is filled with options these days, and the solutions vary greatly in feature set and management approach.

There are various parameters distinguishing between these systems:

  • Hosted (CloudKick, ServerDensity, CloudWatch, RevelCloud and others) vs Installed (Nagios, Munin, Ganglia, Cacti)

  • Hosted solutions’ pricing plans use varied parameters such as price/server, price/metric, retention policy, number of metrics tracked, how real-time the data is, etc.
  • Poll based method – where the collecting server polls the other servers/services, vs. Push – where a client on each server pushes locally collected data to a logging/monitoring server
  • Allowing custom metrics – not all systems allow monitoring, plotting, sending and alerting on custom data (at least not in an easy manner)
Some of these systems are better suited to certain tasks than others, but in general none of them provides a (good) solution for handling today’s monitoring needs, which span from the operational to the applicative.

My ideal monitoring system for any application that has servers running in the background should have the following features:

  • Hosted – for when it doesn’t make sense for me to run the operational side of this myself
  • Open Source – for when the sweet spot leans towards me taking control of the operations and management of collecting my own statistics with a CLEAR path of migration between the hosted solution and my installed one
  • Suitable for a cloud / virtual server environment – where servers go up and down and each server simply reports its data to the monitoring system without the need to pre-register it with the system. This suggests a small client running on each machine collecting local stats and relaying them to a central system
  • Supports custom metrics – allowing me to add whatever stats I want be it operational (CPU, network bandwidth, disk space) or application related (such as number of sign ups or a specific function run time in milliseconds)
  • Understand more than numbers – not all stats are equal. Some are counters for which I just want to say “increment” or “decrement”. Others are single data points that I need to simply store. Others are special data points with a unit of measure (such as “milliseconds”, “Bytes/second”, etc)
  • Locally installed client must handle network failures – if there is a network failure or the collecting server is down, stats will be stored locally and relayed to the collecting server when it’s available again
  • Locally installed client should collect custom metrics – if I want to send some custom metrics from my app – say when a user signs up – my code would talk with the locally installed client and that client would relay the data to the collecting server. This ensures minimum configuration, and my app code can assume that there is always a locally installed client which can communicate with the collecting server, be it via UNIX sockets, UDP datagrams, shared memory or anything else that is suitable for the job (see the sketch after this list)
  • Data should be query-able – that is, I really want to query and filter more than just the timeframes of the data and general statistics on it (e.g. group by server, show specific servers, show values higher than X, etc)
  • Reporting Console – somewhere to plot all these statistics, with embeddable graphs (for those who like building their own dashboards)
  • Built-in near real-time alerts – I want to be able to set alerts that go out in near real time as the data is collected, to a diverse set of outlets be it Email, Text Messages, Push Notifications, WebHooks (for automating the handling of failures or problems), etc.
  • API – because everything needs it :-)
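
To make the client side of this concrete, here is a rough sketch of what such a locally installed client could look like. It is purely illustrative: the class name, the JSON-over-UDP wire format, the collector address and the spool file path are all my own assumptions, not the API of any existing tool. Note that plain UDP is fire-and-forget, so a real agent would need an acknowledged transport (or would always write through a local queue) to reliably detect the failures it spools around; the except branch below only illustrates the spool-and-relay idea.

```python
# Hypothetical local agent client: application code talks to it, and it relays
# metrics to the collecting server, spooling to disk when sending fails.
import json
import socket
import time
from pathlib import Path

COLLECTOR_ADDR = ("collector.internal", 8125)          # assumed collector endpoint
SPOOL_FILE = Path("/var/spool/metrics/outbox.jsonl")   # assumed spool location


class LocalMetricsClient:
    def __init__(self):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Counters: metrics I just want to "increment" or "decrement" (e.g. sign ups)
    def increment(self, name, value=1):
        self._send({"type": "counter", "name": name, "value": value})

    # Single data points that simply need to be stored (e.g. free disk space)
    def gauge(self, name, value, unit=None):
        self._send({"type": "gauge", "name": name, "value": value, "unit": unit})

    # Data points with a unit of measure (e.g. a function's run time in milliseconds)
    def timing(self, name, millis):
        self._send({"type": "timer", "name": name, "value": millis, "unit": "ms"})

    def _send(self, metric):
        metric["ts"] = time.time()
        payload = json.dumps(metric).encode("utf-8")
        try:
            self.sock.sendto(payload, COLLECTOR_ADDR)
        except OSError:
            # Network failure (e.g. name resolution down): store locally and relay
            # later, for example from a background thread that drains the spool file.
            SPOOL_FILE.parent.mkdir(parents=True, exist_ok=True)
            with SPOOL_FILE.open("a") as f:
                f.write(payload.decode("utf-8") + "\n")


# Application code only ever talks to the local client:
metrics = LocalMetricsClient()
metrics.increment("signups")
metrics.timing("render_homepage", 42)
```

The point of the sketch is that the application’s only integration point is the local client, which is what keeps configuration to a minimum.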

For almost any hosted SaaS (Software-as-a-Service) solution I use, it is very important to me to have a clear migration path if (or when) the time comes and I need to host a certain sub-system on my own. Sometimes I do have to compromise and use a system that I may not have the ability to migrate (or at least not easily), but that decision is made consciously.

From an architecture point of view, I would like to see these main building blocks:

  • Storage – Reliable, scalable, can handle a high volume of fast writes and any size of dataset for any reasonable retention period
  • Collectors – clients push data to these collectors, which receive it and pass it on to the processors (see the sketch after this list)
  • Processors – Handle incoming data to be written. Aggregate data for quicker reporting.
  • Reporting – something that will enable easy querying and filtering of the data
  • Real time alerts monitoring – handle preconfigured alerts and figure out in near real time whether certain conditions are met in order to issue the relevant alerts/actions
  • Web Console – for configuration, querying and real-time plotting of data
  • API for querying
  • API for real time plotting – to be used for integration with other apps, embeddable chunks of code, etc.
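
As a companion to the client sketch above, here is an equally hypothetical sketch of the Collectors and real time alerts pieces, assuming the same JSON-over-UDP format, a made-up webhook URL and a hard-coded rule table. In a real system the collector would hand the parsed metrics to separate processors and durable storage, and alert delivery would fan out to Email, Text Messages and Push Notifications as well.

```python
# Hypothetical collector: receives the JSON-over-UDP metrics from the client
# sketch above, would normally hand them on to processors/storage, and checks a
# hard-coded alert rule, firing a WebHook when it matches.
import json
import socket
import urllib.request

LISTEN_ADDR = ("0.0.0.0", 8125)                       # assumed collector port
ALERT_WEBHOOK = "https://example.com/hooks/alerts"    # assumed webhook target

# Preconfigured alert rules: metric name -> (threshold, comparison)
ALERT_RULES = {
    "disk_used_percent": (90, "gt"),
}


def check_alerts(metric):
    rule = ALERT_RULES.get(metric["name"])
    if rule is None:
        return
    threshold, cmp = rule
    fired = (cmp == "gt" and metric["value"] > threshold) or \
            (cmp == "lt" and metric["value"] < threshold)
    if fired:
        body = json.dumps({"alert": metric["name"], "metric": metric}).encode("utf-8")
        req = urllib.request.Request(ALERT_WEBHOOK, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)  # WebHook outlet; Email/SMS/Push would hang off here too


def run_collector():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LISTEN_ADDR)
    while True:
        payload, _addr = sock.recvfrom(65535)
        metric = json.loads(payload)
        # Here the collector would pass the metric on to the processors
        # (aggregation + writes to storage); this sketch only evaluates alerts.
        check_alerts(metric)


if __name__ == "__main__":
    run_collector()
```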

While I’m sure that with a little more thought more requirements could be added, or some of these requirements merged and trimmed down, this set of features would create a system a lot of people would love to use and feel comfortable using.

Would you use such a system? Do you have anything else to add to the feature set?

  • Chung

    Interesting post. I was a product manager with a management tool vendor so I had been thinking about this subject before. Some of the capabilities that you mentioned, such as custom metrics or the ability for the locally installed client (agent?) to handle network failure with both custom and built-in metric collection, exist out there. Are they not in the tools that you use? What tools do you use?

    • http://eran.sandler.co.il/ Eran Sandler

      My main problem is that there are multiple tools that do various jobs, some of which are overlapping.

      Almost ALL tools are not meant to work well in a cloud environment where servers go up and down all the time.

      Some tools require specifically adding a server to be monitored (like Cacti); others can handle servers pushing data, but then it’s only partial and doesn’t include the rest of the metrics I’d like to track.

      There are a few hosted tools like CloudKick and ServerDensity which work well for the cloud and can handle default metrics like CPU, disk space, etc. as well as custom metrics, but they don’t have a migration path for when I can handle the work of running my own tools instead of using a hosted solution.

  • http://www.facebook.com/profile.php?id=609015520 Andrew McGrath

    Not bad, but how do you handle notifications? A lot of these systems only output emails at best. You should check out http://www.verelo.com – we use them

  • http://twitter.com/rberger Robert J. Berger

    Check out Sensu from Sonian. It’s looking like a nice framework to hang many of these ideas off of. https://github.com/sonian/sensu

    • http://eran.sandler.co.il/ Eran Sandler

      Thanks. It looks interesting, however the way it stores this data in Redis makes it a bit less easy to query the data in whatever way I would wish. It’s similar to how Graphite is VERY good at storing time series and plotting them, however there comes a time when you want the raw data to do whatever analysis you want.