You're in Network Operations. The boss is on the phone, asking why the network isn't operating. Do you have the information to confidently give an answer? Or, like so many others in your situation, are you forced to admit you aren't sure?
“I don't know” is never an easy thing to say. But the complexity and scale of today's IT infrastructures create blind spots, knowledge gaps and delays that make answers hard to come by.
Modern networks no longer consist of just cables, boxes and a database. They're big collections of hardware and software – from smartphones to virtualized server farms – over which critical business services run. Any of these elements can crash, bog down, or mess up in countless ways. Network Operations staff work with the inevitability of those failings. They also live with the reality of having to say “I don't know” more than they'd like.
Today's performance monitoring questions are not so much new as they are harder and more urgent than ever before. The three posed here, while generalized, are recurring, pressing questions drawn from the real experience of operations staff at some of the world's biggest companies, with the most demanding IT infrastructures.
In reading further, ask yourself: Do you have the answers? If not, are you able to get them? Do you have the data that makes answering them even possible?
Question 1. Where's the Problem?
You know what it's like: high-stress “war room” scenarios where a crowd tries to sift through monitoring tools with different interfaces to different systems in an attempt to isolate a problem as fast as possible.
Network Operations staff likely have a dozen or more monitoring tools, most of which are vendor-specific to a given box, application or service. These tools may be excellent for monitoring that vendor's product, but often they don't play well with each other.
The ways these tools monitor infrastructure elements can add to the difficulties even further, by creating false alerts and unintended blind spots that hamper getting at the root cause. Furthermore, you often lack the data that can tell you whether your own service – firewalls, primary rate interface trunk capacity, Wi-Fi signal strength – is performing normally.
At the same time, there's more and more pressure to isolate problems faster. Is the login problem for end-users in the database infrastructure, a specific server, or somewhere else? Why is voice service at the call center choking when legacy tools show an average bandwidth consumption that seems acceptable?
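The call center question above can be illustrated with a small sketch. The utilization figures below are hypothetical, chosen only to show how an average can look acceptable while short bursts saturate a link; the function name and the nearest-rank percentile method are assumptions for illustration, not a description of any particular monitoring product.

```python
from statistics import mean

def summarize_utilization(samples_pct):
    """Summarize link utilization samples (percent) as mean, 95th percentile and peak."""
    ordered = sorted(samples_pct)
    # Nearest-rank 95th percentile, computed with integer math: ceil(0.95 * n).
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean": mean(ordered),
        "p95": ordered[rank - 1],
        "peak": ordered[-1],
    }

# Hypothetical data: 16 quiet samples at 30% and 4 bursts at 98%.
samples = [30] * 16 + [98] * 4
summary = summarize_utilization(samples)
print(summary)
```

In this made-up data set the mean stays under 50 percent even though the link is nearly saturated during the bursts, which is the kind of detail an average-only view hides from voice traffic troubleshooting.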
In order to be able to confidently say “I know where the problem is” you need a new generation of infrastructure monitoring tools designed to deal with that complexity.
Infrastructure Performance Monitoring should be able to quickly pinpoint where problems originate.
In this screen, an Enterprise Wireless Status dashboard reveals if problems reside in the Controller, Access Points, or WLAN.
Question 2. What's Changed?
There are two types of changes affecting IT infrastructures: the changes you deliberately make, and the ones that just happen due to a glitch, a bug, an accident or a mistake. The rise of Software-Defined Networks (SDN) will, by definition and intent, make fast and frequent changes a feature of IT infrastructures.
When performance goes off track, people often don't realize the effect of even a single change. And backtracking that change is a tough job when you have to do it with multiple tools that don't work well together.
You might see, for example, that one application is suddenly consuming a much higher percentage of network bandwidth. You finally discover the problem lies on a Java VM, on a specific server VM, on a specific hardware server: a bit of application code was changed. But the configuration management and testing regimes didn't catch the increased risk to performance.
If legacy monitoring tools don't or can't baseline the performance of all the elements that comprise a given business service, it will take longer to recognize anomalies – the early warning system that can alert experienced Network Operations staff to a developing problem and relate it to something that has changed in the infrastructure.
Understanding what's “normal” takes time, even when your tools automatically baseline infrastructure elements. What's normal may change with time of day, day of the week or season of the year. For example, in a certain scenario a one-second response time might be considered normal. A sudden doubling or tripling would be out of the ordinary, but perfectly acceptable. Therefore, the definition of “abnormal” in this case might be “a sustained period beyond two to three seconds.”
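The idea of a time-aware baseline plus a "sustained" threshold can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the per-hour median baseline, the factor of three and the minimum run of three samples are all hypothetical choices, and the sample data is invented. It also assumes every hour seen in live data already has a baseline.

```python
from collections import defaultdict
from statistics import median

def build_hourly_baseline(history):
    """history: iterable of (hour_of_day, response_seconds) pairs."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    # One "normal" value per hour of day, so normal can differ by time of day.
    return {hour: median(values) for hour, values in buckets.items()}

def sustained_anomalies(samples, baseline, factor=3.0, min_consecutive=3):
    """Return (start_index, run_length) for each run of at least
    min_consecutive samples that exceed factor * baseline for their hour."""
    runs = []
    run_start = None
    for i, (hour, value) in enumerate(samples):
        breached = value > factor * baseline[hour]
        if breached and run_start is None:
            run_start = i
        elif not breached and run_start is not None:
            if i - run_start >= min_consecutive:
                runs.append((run_start, i - run_start))
            run_start = None
    # Close a run that is still open at the end of the data.
    if run_start is not None and len(samples) - run_start >= min_consecutive:
        runs.append((run_start, len(samples) - run_start))
    return runs

# Hypothetical data: about one second is normal at 09:00.
history = [(9, 1.0), (9, 1.1), (9, 0.9)]
live = [(9, v) for v in [1.0, 3.5, 1.1, 3.2, 3.4, 3.6, 3.3, 1.0]]
anomalies = sustained_anomalies(live, build_hourly_baseline(history))
print(anomalies)
```

With this data the single spike at index 1 is ignored as a momentary blip, while the four consecutive slow samples starting at index 3 are reported as one sustained anomaly.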
Today's performance monitoring tools need an architecture as dynamic as today's IT infrastructures, so Network Operations staff get the insight they need to say “I know what's changed.”
To understand what's changed with your infrastructure, your performance monitoring platform should compare historical baselines to real-time performance of any metric you collect.
Question 3. How Do We Monitor Something This Big?
Legacy performance monitoring tools struggle to keep up with the massive scale of today's networks, which may be used by hundreds of thousands of users, or support millions of transactions, every day, and sometimes even every hour. This unprecedented scale – in both numbers and in complexity – is a prime cause of stress for Network Operations staff.
The limitations are painfully familiar. The tools bog down, to the point where you don't want to use them because they're so slow. You can only get a sampling of data – and it's outdated. Polling 30,000 interfaces to build a report seems endless. Your vendors take forever to certify a custom MIB.
If this sounds familiar, ask yourself: Are you running simulations of your infrastructure to characterize performance, when what you really want is current data about how the infrastructure elements are actually performing? Are you monitoring only a subset of the infrastructure elements and hoping that, if there is a problem, it's in that subset? Do you have to reboot your monitoring system twice a week?
The next generation of performance monitoring tools must be designed for performance. They have to be able to handle much larger numbers so that Network Operations staff can say, “I can track something this big.”
Infrastructure Performance Monitoring - The Key To Finding Answers
Infrastructure performance monitoring tools are the key to answering the three questions posed above. Today's vendors, whether established legacy players or recently minted startups, are making big changes to address increased complexity and scale.
In different ways, and to different degrees, they're working to characterize data traffic, tag it and track it; plug into APIs, log data, Active Directory, DNS and other network services; store massive amounts of current and historical data; pool all the data relevant to a given business service; and create baselines for performance and behavior. Then they're directing this data into dashboard-style visualizations, to give Network Operations staff an easy way to see what's actually going on. But beyond these general trends, to answer these three questions confidently, you need an infrastructure monitoring system designed for complexity and scale. So, before choosing one, ask yourself:
- Can it work across a multivendor infrastructure?
- Is it able to handle your range of protocols and standards?
- Can it reach into log data and correlate that data with other performance metrics?
- Will it efficiently access, store and process the mountains of data needed for accurate and fast reporting?
- Can it reliably baseline component and service performance and alert you when something has changed?
- Will it be able to help you map business services – identifying all components involved in a critical service, so they can be treated and tracked as a single system?
- Is the vendor able to certify custom MIBs expeditiously?
No one likes having to say “I don't know.” But the scale and complexity of today's IT infrastructures, and the limits of legacy performance monitors, make blind spots, knowledge gaps and delays inevitable. The right performance monitoring platform can give you the flexibility, speed, integration, scale and data that you need to confidently answer today's toughest questions.