Image courtesy of Spirit-Fire on Flickr
I think I'd like to mirror a session title from the recent ThwackCamp and subtitle this particular post "Don't Hate Your Monitoring." We all face an enormous challenge in monitoring our systems and infrastructure, and in part that's caused by an underlying conflict:
Image Courtesy D Sharon Pruitt
This is a serious problem for everybody. We want to monitor everything we possibly can. We NEED to monitor everything we can, because heaven help us if we miss something important because we don't have the data available. At the same time, we cannot possibly cope with the volume of information coming into our monitoring systems; it's overwhelming, and trying to manually sift through it to find the alerts or data that actually matter to your business is a losing battle. And then we wonder why people are stressed, and why we have a love/hate relationship with our monitoring systems!
How can the chaos be minimized? Well, some manual labor is required up front, and after that it will be an iterative process that's never truly complete.
Decide what actually needs to be monitored
It's tempting to monitor every port on every device, but do you really need to monitor every access switch port? Even if you want to maintain logs for those ports for other reasons, you'll want to filter alerts for those ports so that they don't show up in your day-to-day monitoring. If somebody complains about bad performance, then digging in to the monitoring and alerting is a good next step (maybe the port is fully utilized, or spewing errors constantly), but that's not business critical, unless perhaps it's your CEO's switchport.
Focus on which alerts you generate in the first place
- Use Custom Properties to attach custom labels that identify related systems, so that alerts can be generated in an intelligent way.
- Before diving into the Alert Suppression tab to keep things quiet, look carefully at Trigger Conditions and try to add intelligent queries in order to minimize the generation of alerts in the first place. The trigger conditions allow for some quite complex nested logic which can really help make sure that only the most critical alerts hit the top of your list.
- Use trigger conditions to suppress downstream alerts (e.g., if a site router is down, don't trigger alerts from devices behind that router that are now inaccessible).
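To make the idea of nested trigger logic a little more concrete, here is a minimal sketch in plain Python. It is not Orion's alert engine or API; the `Node` class, the `criticality` and `site_router` labels (standing in for Custom Properties), and the `should_trigger` function are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a monitored node."""
    name: str
    status: str                                   # e.g. "Up", "Down", "Warning"
    custom: dict = field(default_factory=dict)    # stand-in for Custom Properties


def should_trigger(node: Node, nodes_by_name: dict) -> bool:
    """Nested condition: node is Down AND high criticality AND
    (it has no known site router OR that router is Up)."""
    if node.status != "Down":
        return False
    if node.custom.get("criticality") != "high":
        return False
    router = nodes_by_name.get(node.custom.get("site_router"))
    # If the upstream router is itself not Up, treat this as downstream noise.
    if router is not None and router is not node and router.status != "Up":
        return False
    return True


router = Node("edge1", "Down", {"criticality": "high"})
server = Node("srv1", "Down", {"criticality": "high", "site_router": "edge1"})
nodes = {n.name: n for n in (router, server)}

for n in nodes.values():
    print(n.name, should_trigger(n, nodes))
```

In this sketch only the router raises an alert while it is down; the server behind it stays quiet until the router comes back and the server is still failing.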
Suppress Alerts!
I know I just said not to dive into Alert Suppression, but it's still useful as the cherry on top of the cream that is carefully managed triggers.
- It's better in general to create appropriate rules governing when an alert is triggered than to suppress it afterwards. Alert suppression is in some ways rather a blunt tool; if the condition is true, all alerts are suppressed.
- One way to achieve downstream alert suppression is to add a suppression condition to devices on a given site that queries for the status of that site's edge router; if the router status is not "Up", the condition becomes true, and it should suppress the triggered alerts from that end device. This could also be achieved using Trigger Conditions, but it's cleaner in my opinion to do it in the Alert Suppression tab. Note that I said "not Up" for the node status rather than "Down"; that means that the condition will evaluate to true for any status except Up, rather than explicitly requiring it to be only "Down". The more you know, etc...
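The difference between "not Up" and "Down" is easy to see in a small sketch. This is plain Python rather than anything Orion-specific, and the list of status names is only an illustrative assumption, not a complete or authoritative list of node states.

```python
# Illustrative status names only; the real set of node states may differ.
STATUSES = ["Up", "Down", "Warning", "Unreachable", "Unknown"]


def suppress_if_not_up(router_status: str) -> bool:
    """Suppress when the edge router is in any state other than Up."""
    return router_status != "Up"


def suppress_if_down(router_status: str) -> bool:
    """Suppress only when the edge router is explicitly Down."""
    return router_status == "Down"


for status in STATUSES:
    print(f"{status:12} not-Up: {suppress_if_not_up(status)!s:6} "
          f"Down-only: {suppress_if_down(status)}")
```

With the "Down"-only version, a router sitting in some in-between state would not suppress anything, and the end devices behind it could still flood your alert list.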
Other features that may be helpful
- Use dependencies! Orion is smart enough to know the implicit dependencies of, say, CPU and Memory on the Host in which they are found, but site or application-level dependencies are just a little bit trickier for Orion to guess. The Dependencies feature allows you to create relationships between groups of devices so that if the 'parent' group is down, alerts from the 'child' group can be automatically suppressed. This is another way to achieve downstream alert suppression at a fairly granular level.
- Time-based monitoring may help for sites where the cleaner unplugs the server every night (or the system has a scheduled reboot), for example.
- Where appropriate, consider using the "Condition must exist for more than <x> minutes" option within Trigger Conditions to avoid getting an alert for every little blip in a system. This theoretically slows down your notification time, but can help clear out transient problems before they disturb you.
- Think carefully about where each alert type should be sent. Which ones are pager-worthy, for example, versus ones that should just be sent to a file for historical record keeping?
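The "condition must exist for more than x minutes" idea mentioned above can be sketched as a tiny state machine. This is an illustration of the general technique, not how Orion implements it; the `SustainedCondition` class and its `update` method are hypothetical names.

```python
from datetime import datetime, timedelta


class SustainedCondition:
    """Fires only once a condition has been continuously true for longer than `hold`."""

    def __init__(self, hold: timedelta):
        self.hold = hold
        self._since = None  # when the condition first became true, or None

    def update(self, condition_true: bool, now: datetime) -> bool:
        if not condition_true:
            self._since = None      # any recovery resets the timer
            return False
        if self._since is None:
            self._since = now
        return now - self._since > self.hold


start = datetime(2024, 1, 1, 0, 0)
check = SustainedCondition(hold=timedelta(minutes=5))

# A short blip: true for three minutes, then clear. No alert is raised.
print(check.update(True, start))
print(check.update(True, start + timedelta(minutes=3)))
print(check.update(False, start + timedelta(minutes=4)))
```

The trade-off is the one already noted: a real outage is reported `hold` minutes later than it otherwise would be, in exchange for not being paged about transient blips.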
Performance and Capacity Monitoring
- Baselining. As I discussed in a previous post, if you don't know what the infrastructure is doing when things are working correctly, it makes it even harder to figure out what's wrong when there's a problem. This might apply to element utilization, network routing issues, and more. This information doesn't have to be in your face all the time, but having it to hand is very valuable.
BUT!
Everything so far talks about how to handle alerting when events occur. This is "reactive" monitoring, and it's what most of us end up doing. However, to achieve true inner peace we need to look beyond the triggers and prevent the event from happening in the first place. Obviously there's not much that can be done about power outages or hardware failures, but in other ways we can help ourselves by being proactive.
Proactive monitoring basically means preempting avoidable alerts. SolarWinds software offers a number of features to forecast and plan for capacity issues before they become alerts. For example, Virtualization Manager can warn of impending doom for VMs and their hosts; Storage Resource Monitor tracks capacity trends for storage devices; Network Performance Monitor can forecast exhaustion dates on the network; User Device Tracker can monitor switch port utilization. Basically, we need to use the forecasting/trending tools provided to look for any measurement that looks like it's going to hit a threshold, check with the business to determine any additional growth expected, then make plans to mitigate the issue before it becomes one.
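As a rough illustration of what "forecasting an exhaustion date" means, the sketch below fits a straight line to some utilization samples and estimates when that line crosses a threshold. It is a deliberately simple least-squares trend in plain Python, not the forecasting method used by any SolarWinds product; the function name `days_until_threshold` and the sample data are made up for the example.

```python
def days_until_threshold(samples, threshold):
    """Estimate days from the latest sample until a linear trend reaches `threshold`.

    `samples` is a list of (day_number, utilization_percent) pairs.
    Returns None when there is too little data or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None                      # all samples taken on the same day
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None                      # not growing, so no exhaustion forecast
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, crossing_day - last_day)


# Made-up data: disk utilization growing two points per day from 50%.
history = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
print(days_until_threshold(history, threshold=90.0))
```

A straight line ignores seasonality and step changes, which is exactly why the forecast still needs to be checked against what the business expects in terms of growth.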
Hating Our Monitoring
We don't have to hate our monitoring. Sadly, the tools tend to do exactly what we tell them to, and we sometimes expect a little too much from them in terms of having the intelligence to know which alerts are important, and which are not. However, we have the technology at our fingertips, and we can make our infrastructure monitoring dance, if not to our tune (because sometimes we need something that just isn't possible at the moment), then at least to the same musical genre. With careful tuning, alerting can largely be mastered and minimized. With proactive monitoring and forecasting, we can avoid some of those alerts in the first place. After all -- and without wishing to sound too cheesy -- the best alert is the one that never triggers.