The server / application monitoring field is filled with lots of options these days. The solutions vary greatly with feature set and management in mind.
There are various parameters distinguishing between these systems:
- Hosted (CloudKick, ServerDensity, CloudWatch, RevelCloud and others) vs Installed (Nagios, Munin, Ganglia, Cacti)
- Hosted solutions pricing plans use varied parameters such as price/server, price/metric, retention policy, # of metrics tracked, realtime-ness, etc.
- Poll based method – where collecting server polls the other servers/service vs. Push – where you have a client on the server that pushes locally collected data to logging/monitoring server
- Allowing custom metrics – not all systems allows monitoring, plotting, sending and alert on custom data (at least not in a easy manner)
My ideal monitoring system for any application that have servers running in the background should have the following features:
- Hosted – for when it doesn’t make sense for me to run the operations of this
- Open Source – for when the sweet spot leans towards me taking control of the operations and management of collecting my own statistics with a CLEAR path of migration between the hosted solution and my installed one
- Suitable for a cloud / virtual server environment – where servers go up and down and each server simply reports its data to the monitoring system without the need to pre-register it with the system. This suggests a small client running on each machine collecting local stats and relaying it to a central system
- Supports custom metrics – allowing me to add whatever stats I want be it operational (CPU, network bandwidth, disk space) or application related (such as number of sign ups or a specific function run time in milliseconds)
- Understand more than numbers – not all stats are equal. Some are counters which I just want to say “increment” or “decrement”. Others are single data points that I need to simply store. Others are special data points with a unit of measure (such as “milliseconds”, “Bytes/second”, etc)
- Locally installed client must handle network failures – if there is a network failure or a collecting server down time, stats will be stored locally and relayed to the collecting server when its available again
- Locally installed client should collect custom metrics – if I want to send some custom metrics from my app – say when a user signs up – my code would talk with the locally installed client and that client will relay the data to the collecting server. This ensures minimum configuration and my app code can assume that there is always a locally installed client which can communicate with the collecting server be it via UNIX sockets, UDP datagram, shared memory or anything else that is suitable for the job
- Data should be query-able – that is, I really want to query and filter more than just the timeframes of the data and general statistics on it (i.e. group by server, show specific servers, show values higher than X, etc)
- Reporting Console – somewhere to plot all these statistics which has embeddable graphs (for those who likes building their own dashboards)
- Built-in near real-time alerts – I want to be able to set alerts that go out near real time when collecting the data to a diverse set of outlets be it Email, Text Messages, Push Notifications, WebHook (for automating some auto handling of failures or problems), etc.
- API – because everything needs it
It is very important to me in almost any hosted SaaS (Software-as-a-Service) solution I use that I will have a clear migration path if (or when) the time comes and I need to host a certain sub-system on my own. Sometimes I do have to compromise and use a system that I may not have the ability to migrate (or at least not easily) but the decision is made consciously.
From an architecture point of view, I would like to see these main building blocks:
- Storage – Reliable, scalable, can handle lots of writes fast and handle any size of dataset for any reasonable retention period
- Collectors – clients push data to these collectors which gets it and pass it on the processors
- Processors – Handle incoming data to be written. Aggregate data for quicker reporting.
- Reporting – something that will enable easy querying and filtering of the data
- Real time alerts monitoring – handle preconfigured alerts and figuring in near real time if certain conditions are met to issue the relevant alerts/actions
- Web Console – for configuration, querying and real-time plotting of data
- API for querying
- API for real time plotting – to be used for integration with other apps, embeddable chunks of code, etc.
While I’m sure with a little more thought more requirements can be added or some of these requirements can be merged and minimized, this set of features will create a system a lot of people would love to use and feel comfortable using.
Would you use such a system? Do you have anything else to add to the feature set?