As a longtime IT professional, I find that responding to problems in systems
of one type or another is old hat. It comes with the job description, and you
tend to develop habits and methodologies early on in your career. That
doesn't mean that I, or anyone else, have developed the best habits,
however, and often our methods are quite ineffective indeed.
Standard practice in the industry is for a network operations center (NOC)
to monitor some portion of the network for immediate or impending troubles.
Companies spend millions of dollars on entire rooms filled with
beautiful monitors mounted on walls, desks and workstations built to look
as futuristic as possible, low lights at just the right hue, and
comprehensive monitoring suites to keep track of it all. The trouble is,
this setup often makes people aware of a problem but offers nothing in the
way of a troubleshooting methodology or tool to actually fix it.
As often as not, when some sort of event worth responding to grabs the
attention of a NOC engineer, they either call someone or start a trouble
ticket, or both. The lucky recipient of the aforementioned prodding then
digs into the problem or passes it on to the next person in the chain,
with each successive person having to start over in their own domain
(compute, network, security, etc.) with new tools and limited information.
This entire approach may seem logical and even expedient, though I suspect
that's largely due to a little bit of Stockholm syndrome and the
ever-popular "but this is the way we've always done it" argument. I'm not
saying that this is a bad approach--or at least that it has always been
a bad approach--given the historical dearth of cross-silo troubleshooting
tools available on the market. Most of us instinctively knew that this was
inefficient, but didn't have a good sense of what we could do about it.
Various tools and paradigms that attempted to fill the full-stack
troubleshooting void were suggested, developed, sold, and subsequently
put on shelves. Comprehensive network tools are among the
favorites, offering a truly staggering array of dashboards, widgets,
alerts, and beautiful graphics in a noble attempt to present the most
information possible to the engineers tasked with fixing the relevant
problems. Many tools also exist for doing the same thing inside virtual
environments, on storage arrays, in the cloud, and so on, and many are very
good at what they do... but they don't do what we need, which is to
collapse the silos between IT disciplines into one unified system. At
least, they didn't until now.
SolarWinds NPM, part of the Orion suite of products, has long been the
darling of NOCs everywhere, and with good reason. It is a comprehensive
and well thought out approach to network and systems monitoring.
Collapsing the silos in IT, however, requires more than just a great tool
for the NOC, or even a great tool for the network and systems teams. It
requires a tool that is not only useful for all of these teams, but also
preserves the chain of troubleshooting data as it moves between
specialties. In other words, if I'm the systems guy, I want to see the
data that the network team is seeing, and the steps they've taken to
resolve the problem; and I want to see it in the system, not in a hastily
or poorly crafted email, which is the equivalent of tossing a flaming bag
of excrement over the wall on our way out.
NPM 12.1 has taken a stab--a good stab--at solving these problems with the
inclusion of a tool called PerfStack. I'll be exploring what this tool can
do, and where in the troubleshooting process it fits, in a series of blog
posts over the coming weeks. I'll likely also toss in some of my own
personal horror stories of troubleshooting problems, as I've had more than my
fair share in my past, and confession is cathartic. In the meantime,
I'd encourage everyone to check out this already fantastic series of posts
on the new tool:
https://thwack.solarwinds.com/community/solarwinds-community/product-blog/blog