Zen and the Art of Network Management: Achieving Inner Peace

Image courtesy of Spirit-Fire on Flickr

I'd think I'd like to mirror a session title from the recent ThwackCamp and subtitle this particular post "Don't Hate Your Monitoring." We all face an enormous challenge in monitoring our systems and infrastructure, and in part that's caused by an underlying conflict:

Image Courtesy D Sharon Pruitt

This is a serious problem for everybody. We want to monitor everything we possibly can. We NEED to monitor everything we can, because heaven help us if we miss something important because we don't have the data available. At the same time, we cannot possibly cope with the volume of information coming into our monitoring systems; it's overwhelming, and trying to manually sift through to find the alerts or data that actually matter to your business. And then we wonder why people are stressed, and why we have a love/hate relationship with our monitoring systems!

How can the chaos be minimized? Well, some manual labor is required up front, and after that it will be an iterative process that's never truly complete.

Decide what actually needs to be monitored

It's tempting to monitor every port on every device, but do you really need to monitor every access switch port? Even if you want to maintain logs for those ports for other reasons, you'll want to filter alerts for those ports so that they don't show up in your day to day monitoring. If somebody complains about bad performance, then digging in to the monitoring and alerting is a good next step (maybe the port is fully utilized, or spewing errors constantly), but that's not business critical, perhaps unless that's your CEO's switchport.

Focus on which alerts you generate in the first place

Use Custom Properties to allow identification of related systems so that alerts can be generated in an intelligent way using custom labels to identify related systems.
Before diving into the Alert Suppression tab to keep things quiet, look carefully at Trigger Conditions and try to add intelligent queries in order to minimize the generation of alerts in the first place. The trigger conditions allow for some quite complex nested logic which can really help make sure that only the most critical alerts hit the top of your list.
Use trigger conditions to suppress downstream alerts (e.g if a site router is down, don't trigger alerts from devices behind that router that are now inaccessible)

Suppress Alerts!

I know I just said not to dive into Alert Suppression, but it's still useful as the cherry on top of the cream that is carefully managed triggers.

It's better in general to create appropriate rules governing when an alert is triggered than to suppress it afterwards. Alert suppression is in some ways rather a blunt tool; if the condition is true, all alerts are suppressed.
One way to achieve downstream alert suppression is to add a suppression condition to devices on a given site that queries for the status of that site's edge router; if the router status is not "Up", the condition becomes true, and it should suppress the triggered alerts from that end device. This could also be achieved using Trigger Conditions, but it's cleaner in my opinion to do it in the Alert suppression tab. Note that I said "not Up" for the node status rather than "Down"; that means that the condition will evaluate to true for any status except Up, rather than explicitly requiring it to be only "Down". The more you know, etc...

Other features that may be helpful

Use dependencies! Orion is smart enough to know the implicit dependencies of, say, CPU and Memory on the Host in which they are found, but site or application-level dependencies are just a little bit trickier for Orion to guess. The Dependencies feature allows you to create relationships between groups of devices so that if the 'parent' group is down, alerts from the 'child' group can be automatically suppressed. This is another way to achieve downstream alert suppression at a fairly granular level.
Time-based monitoring may help for sites where the cleaner unplugs the server every night (or the system has a scheduled reboot), for example.
Where approptiate, consider using the "Condition must exist for more than <x> minutes" option within Trigger Conditions to avoid getting an alert for every little blip in a system. This theoretically slows down your notification time, but can help clear out transient problems before they disturb you.
Think carefully about where each alert type should be sent. Which ones are pager-worthy, for example, versus ones that should just be sent to a file for historical record keeping?

Performance and Capacity Monitoring

Baselining. As I discussed in a previous post, if you don't know what the infrastructure is doing when things are working correctly, it makes it even harder to figure out what's wrong when then there's a problem. This might apply to element utilization, network routing issues, and more. This information doesn't have to be in your face all the time, but having it to hand is very valuable.

BUT!

Everything so far talks about how to handle alerting when events occur. This is "reactive" monitoring, and it's what most of us end up doing. However, to achieve true inner peace we need to look beyond the triggers and prevent the event from happening in the first place. Obviously there's not much that can be done about power outages or hardware failures, but in other ways we can help ourselves by proactively.

Proactive monitoring basically means preempting avoidable alerts. Solarwinds software offers a number of features to forecast and plan for capacity issues before they become alerts. For example, Virtualization Manager can warn of impending doom for VMs and their hosts; Storage Resource Monitor tracks capacity trends for storage devices; Network Performance Manager can forecast exhaustion dates on the network; User Device Tracker can monitor switch port utilization. Basically, we need to use the forecasting/trending tools provided to look for any measurement that looks like it's going to hit a threshold, check with the business to determine any additional growth expected, then make plans to mitigate the issue before it becomes one.

Hating Our Monitoring

We don't have to hate our monitoring. Sadly, the tools tend to do exactly what we tell them to, and we sometimes expect a little too much from them in terms of having the intelligence to know which alerts are important, and which are not. However, we have the technology at our fingertips, and we can make our infrastructure monitoring dance, if not to our tune (because sometimes we need something that just isn't possible at the moment), then at least to the same musical genre. With careful tuning, alerting can largely be mastered and minimized. With proactive monitoring and forecasting, we can avoid some of those alerts in the first place. After all -- and without wishing to sound too cheesy -- the best alert is the one that never triggers.

Zen and the Art of Network Management: Achieving Inner Peace

Decide what actually needs to be monitored

Focus on which alerts you generate in the first place

Suppress Alerts!

Other features that may be helpful

Performance and Capacity Monitoring

BUT!

Hating Our Monitoring

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112