Whether you’re looking to improve your application and service delivery, consolidate existing performance monitoring tools and responsibilities, or justify the impact of a new technology deployment, the following six steps can help you create the fundamental building blocks of an effective performance monitoring strategy.
Any network performance management tool can draw you a pretty graph. What’s required is pulling together a comprehensive strategy that looks at what data you’re gathering and how that data will be treated during each of these six steps.
Breaking down your strategy into these components will make it easier to understand, articulate and reach consensus on the network and infrastructure performance monitoring requirements for your business. This whitepaper also highlights the questions you should ask performance monitoring vendors to ensure they support your strategy.
Any performance monitoring strategy starts with data collection. If you can’t monitor it, you can’t manage it. It sounds simple enough, but there are many obstacles that prevent proper data collection and leave you with visibility gaps. To prevent this, look for a network and infrastructure monitoring platform that supports the following data collection criteria:
100% of the Data
Your network performance monitoring platform should be data agnostic. It should be able to ingest any and all time-series data, regardless of source. This includes support for collecting data via standards such as SNMP, NetFlow, IP SLA, WMI, JMX, NBAR, and more. But your platform should also support where the market is headed. Next-generation networks, both in the WAN and the data center, have begun to move away from supporting standard protocols like SNMP in favor of vendor-specific APIs as well as network streaming telemetry.
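One way to picture a data-agnostic platform is a common sample record that every collector feeds into, regardless of protocol. The sketch below is illustrative only; the record fields and the adapter function are assumptions, not any particular vendor's data model.

```python
from dataclasses import dataclass

# Illustrative source-agnostic sample record: whatever the collector
# (SNMP, NetFlow, vendor API, streaming telemetry) produces is
# normalized into one shape before storage. Field names are assumptions.
@dataclass
class Sample:
    timestamp: float   # epoch seconds
    source: str        # e.g. "snmp", "netflow", "vendor-api", "telemetry"
    device: str
    metric: str
    value: float

def normalize_snmp(device, oid_name, raw_value, ts):
    """Hypothetical adapter from one collection method into the common record."""
    return Sample(timestamp=ts, source="snmp", device=device,
                  metric=oid_name, value=float(raw_value))
```

With each protocol behind its own adapter, adding a new telemetry source never changes the storage or reporting layers.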
High frequency polling
Traditional five-minute polling cycles are insufficient in many instances. A spike in traffic that lasts only a couple of seconds represents less than 1% of a five-minute polling cycle. The anomaly will be completely flattened and undetectable when averaged over that time span. Yet this brief spike can disrupt business transactions, video communications, and other latency-sensitive applications. Many leading customers today gather data at least every minute. Be sure your monitoring platform strategy enables you to poll for performance data at the frequency your business requires. Don’t get trapped into a strategy that limits what you collect and how often based on the polling limits of an inferior monitoring platform.
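To see how a short burst disappears under coarse polling, here is a minimal sketch with hypothetical per-second utilization samples: a three-second 950 Mbps burst on an otherwise quiet 1 Gbps link.

```python
# Illustrative only: per-second utilization samples (Mbps) on a quiet
# link carrying a brief three-second burst.
def average_utilization(samples_per_sec, window_sec):
    """Average per-second samples over one polling window."""
    window = samples_per_sec[:window_sec]
    return sum(window) / len(window)

samples = [10.0] * 300          # 5 minutes of ~idle traffic
samples[100:103] = [950.0] * 3  # a 3-second burst near saturation

five_min_avg = average_utilization(samples, 300)      # one 5-minute poll
one_min_avg = average_utilization(samples[60:], 60)   # the 1-minute poll covering the burst
```

Averaged over five minutes the burst barely registers (about 19 Mbps on a 1 Gbps link), while the one-minute window containing it reports 57 Mbps, an anomaly a baseline-aware alert can actually catch.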
Raw data retention
Granular data collection is only useful when you can maintain that data for a sufficient time frame. Some monitoring solutions may average and consolidate historical data over time for storage reasons. This results in a lesser understanding of historical events. It also weakens your ability to forecast future capacity needs with accuracy. Many organizations seek a performance monitoring platform that maintains a year of as-polled data. You should not have to invest in extra storage capacity to do this.
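As a rough illustration of what a year of as-polled data implies, the sketch below estimates sample counts and raw storage. The 16-bytes-per-sample figure and the metric count are assumptions for back-of-the-envelope sizing, not vendor numbers.

```python
# Back-of-the-envelope sizing for a year of as-polled data.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def samples_per_year(poll_interval_sec):
    """Number of samples one metric accumulates in a year."""
    return SECONDS_PER_YEAR // poll_interval_sec

def raw_storage_gb(num_metrics, poll_interval_sec, bytes_per_sample=16):
    """Estimate uncompressed storage, assuming ~16 bytes per
    (timestamp, value) pair. All figures are illustrative."""
    total = num_metrics * samples_per_year(poll_interval_sec) * bytes_per_sample
    return total / 1e9

one_minute = samples_per_year(60)        # 525,600 samples per metric per year
estimate = raw_storage_gb(100_000, 60)   # hypothetical: 100k metrics at 1-minute polling
```

Even 100,000 metrics at one-minute granularity land in the high hundreds of gigabytes before compression, which is why retention should be a platform feature rather than a storage project for your team.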
We live in the age of Big Data. Applications, systems and network devices produce massive volumes of performance data. Your performance monitoring platform must scale with your data collection needs. Inability to scale your solution forces you to make tough decisions about what you will and won’t monitor. This creates a visibility gap. You never know what might go wrong in your environment. So taking a broad approach to data collection supported by a highly scalable platform is the best strategy.
Once you’ve collected the broadest set of performance data at the required granularity, it’s time to establish a baseline for “normal” performance. Your network and infrastructure performance monitoring platform should do this automatically for every metric you collect.
Baselines provide a historical reference point for every 15-minute time frame, every day of the week. With baselines, you can compare real-time network and infrastructure performance to historical norms. You’re also able to view capacity trends and deviations that cause performance-impacting events.
Virtualized data centers with rapid elasticity make baseline technology that much more important. Users may provision and retire resources commensurate with demand and without human intervention. This rapid fluctuation in your environment presents monitoring challenges. It’s imperative to understand what normal looks like at any given moment. Baselines then become your basis for a more effective alerting method.
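One way such baselines might be implemented is to bucket history by day of week and 15-minute slot of the day. This is a simplified sketch of the idea, not any particular vendor's algorithm.

```python
from collections import defaultdict
from datetime import datetime

def slot_key(ts: datetime):
    """Bucket a timestamp into (weekday, 15-minute slot of the day)."""
    return ts.weekday(), (ts.hour * 60 + ts.minute) // 15

class Baseline:
    """Running mean per (weekday, 15-minute slot) bucket. Illustrative:
    a real platform would also track spread, seasonality, and decay."""
    def __init__(self):
        self._sums = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]

    def observe(self, ts, value):
        bucket = self._sums[slot_key(ts)]
        bucket[0] += value
        bucket[1] += 1

    def expected(self, ts):
        """Historical mean for this weekday/slot, or None if never seen."""
        bucket = self._sums.get(slot_key(ts))
        return bucket[0] / bucket[1] if bucket else None
```

The key property is that 3:00 a.m. on a Monday is only ever compared with prior Mondays at 3:00 a.m., so routine scheduled load becomes part of "normal."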
You may employ two types of alerts: those based on static thresholds and those based on deviation from baseline performance.
Static thresholds are useful in cases such as wanting to know when a CPU exceeds 95% utilization for a period of 15 minutes or more. Some performance monitoring tools may only allow you to set an upper-level threshold. You should look for a solution that allows you to specify both an upper- and lower-level range. For example, you’d want to know when the voltage on a UPS falls outside of a specified range. Or when the temperature of a device exceeds a high or low manufacturer recommendation.
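A static range check is straightforward to express. The voltage limits in the sketch below are illustrative, not manufacturer values.

```python
# Minimal sketch of a static range check with both upper and lower bounds.
def check_range(metric, value, low, high):
    """Return an alert string when a reading falls outside [low, high],
    or None when the reading is acceptable."""
    if value < low:
        return f"{metric} below range: {value} < {low}"
    if value > high:
        return f"{metric} above range: {value} > {high}"
    return None

# Hypothetical example: UPS output voltage expected within 114-126 V.
alert = check_range("ups_voltage", 110.0, 114.0, 126.0)
```

A tool that only supports upper thresholds cannot express the first branch at all, which is exactly the gap to probe vendors on.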
But you may not always understand what an acceptable range is for performance of a specific device or metric. Often, administrators guess at threshold values. This results in a significant increase in noise from false positive alerts.
In an environment that generates a lot of noise, it’s more effective to alert when performance of any metric deviates from historical norms. For example, if your company always runs a backup procedure at 3:00 a.m., you don’t want a daily alert about high bandwidth usage. But you would want an alert when an unexpected spike occurs during working hours due to a unique user-initiated action. You should be able to specify how many standard deviations you consider acceptable for any metric. This requires an understanding of baseline historical performance for all metrics monitored.
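A deviation-based alert can be sketched as follows, with the standard-deviation multiplier `k` as the tunable sensitivity. The history values would come from the per-slot baselines established in Step 2; the sample numbers here are hypothetical.

```python
from statistics import mean, stdev

def deviates(history, value, k=3.0):
    """True when value is more than k standard deviations away from the
    mean of the historical readings for the same time slot."""
    if len(history) < 2:
        return False  # not enough history to judge normal behavior
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu  # perfectly flat history: any change is notable
    return abs(value - mu) > k * sigma

# Hypothetical working-hours bandwidth history for one 15-minute slot (Mbps).
history = [10.0, 12.0, 11.0, 9.0, 10.0]
```

Because the 3:00 a.m. backup is part of that slot's own history, it never alerts, while an unexpected daytime spike does.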
This method provides a more reliable predictor of service-impacting events. It answers the most critical question: “What is happening in my environment right now that is unique that I need to know about?”
Collecting network and infrastructure performance data is hard. Reporting on it should be the easy part. When developing your network and infrastructure performance monitoring strategy, verify that your vendors can answer “yes” to all of the following questions.
Many vendors deliver canned reports that reveal the most utilized interfaces, highest packet loss and other key metrics. While these reports are helpful, they may not allow for the level of manipulation required. Troubleshooting the more challenging performance scenarios demands extreme report flexibility.
Be sure to ask the following questions of any performance monitoring vendor:
- Can you experience immediate value from out-of-the-box, day-one reporting, with a series of auto-populating and fully editable reports of custom metric and flow data for common network performance reporting needs?
- Can you easily edit any of the out-of-the-box reports, or create new reports from scratch (in both dark and light mode), then reuse them for different regions or offices, saving hours of report building?
- Can you reuse and share any report to enable consistent insights across multi-site operations teams?
- Can you quickly pivot and easily visualize related metric, flow and alert data by chaining visualizations together, so that selecting data in one visualization drives the content of the rest of the dashboard?
- Can you create troubleshooting workflows by linking multiple reports together, and then share those workflows across teams?
Finally, you need to understand how increases in the number of devices and the amount of metric and flow data you are collecting and analyzing impact the speed of your reporting platform. Reports that fail to provide near real-time information are unacceptable. Performance monitoring solutions that rely on a centralized database architecture suffer significant degradation in reporting speed as your monitored domain expands. The centralized reporting database becomes a bottleneck, unable to process report requests quickly. That is why reports sometimes take minutes or hours rather than seconds. It is best to maintain information in a distributed fashion and have the system query the data when needed.
Data analysis and visualization may take a number of unique forms. At the end of the day, you’re looking for actionable insight that allows you to:
- Detect and avoid performance events before they impact end users or customers
- Fine tune your infrastructure to make the most of current resources
- Make more informed decisions about the impact the infrastructure has on your business
Shift to Proactive
Proactive analysis and troubleshooting means you have a better chance of avoiding service-impacting events. For example, in Step 3 we reviewed the value of receiving alerts based on performance deviation (as opposed to static thresholds). But what if you could receive a notification when a key WAN link has hit a bursting threshold multiple times over an hour? This type of automated analysis helps you stay a step ahead of your end users. Look for ways to make the shift from reactive troubleshooting to proactive analysis that helps you avoid performance events in the first place.
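The bursting example above might be automated along these lines. The threshold, window length, minimum burst count, and sample format are all illustrative parameters, not a prescription.

```python
# Sketch: flag a WAN link whose utilization crosses a bursting threshold
# repeatedly within a sliding one-hour window.
def burst_alert(samples, threshold, min_bursts=3, window_sec=3600):
    """samples: time-ordered list of (epoch_seconds, utilization) tuples.
    Returns True if the threshold was crossed at least min_bursts times
    within any sliding window of window_sec."""
    burst_times = [t for t, v in samples if v > threshold]
    start = 0
    for end in range(len(burst_times)):
        # shrink the window until it spans at most window_sec
        while burst_times[end] - burst_times[start] > window_sec:
            start += 1
        if end - start + 1 >= min_bursts:
            return True
    return False
```

Running a check like this on every poll turns a pattern a human would only spot in hindsight into a notification that arrives before users complain.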
Many organizations struggle with performance analysis because their platform fails to provide dashboards and reports that can present disparate data sources in a single view. This makes correlation problematic. You end up referencing multiple screens in an attempt to understand causation. Plan a strategy that allows you to automatically pivot from traditional performance metrics, such as SNMP or IP SLA, to insightful flow data.
Capacity teams expect data analysis to reveal which resources are near exhaustion and how much time they have to complete a resource upgrade. How much bandwidth does your business need? How close to maximum utilization are your servers? Which network interfaces will be most utilized 30 days from now?
The ultimate tool for WAN and Data Center capacity teams is a report that reveals the number of days until specific resources reach a user-defined threshold. For example, a capacity planner would like to see all resources that will exceed 80% utilization in 30 days or less. This gives them the ability to plan upgrades based on an understanding of the lead time required to complete the upgrade.
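A "days until threshold" report boils down to trend projection. The sketch below fits a simple least-squares line over daily utilization samples; production capacity tools would apply more robust models, and the data here is hypothetical.

```python
# Sketch: project daily utilization (as fractions of capacity) forward
# and report the days until a user-defined threshold is reached.
def days_until_threshold(daily_utilization, threshold):
    """Fit a least-squares line to daily samples and return the projected
    days from the latest sample until threshold, or None if utilization
    is flat or falling."""
    n = len(daily_utilization)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(daily_utilization) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_utilization))
    den = sum((x - x_mean) ** 2 for x in xs)
    slope = num / den
    if slope <= 0:
        return None  # no growth trend: threshold never reached
    intercept = y_mean - slope * x_mean
    # day index at which the fitted line crosses the threshold,
    # relative to the most recent sample (day n - 1)
    days = (threshold - intercept) / slope - (n - 1)
    return max(0.0, days)
```

For instance, a link growing 1% of capacity per day from 53% would cross an 80% threshold in roughly 27 days, comfortably inside a planner's "30 days or less" filter.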
Remember, when it comes to capacity planning, volume isn’t everything. You must also comprehend the composition of that traffic. Understanding the type of activity taking place can make a big difference in investment plans and monetization strategy. Capacity planning is a numbers game. But the best projection models take into account the value of different types of traffic.
You now have a strategy to collect, baseline, alert, report and analyze your network and infrastructure performance data. But who can benefit from your insight? Your audience needs may vary:
- Customers through a secure self-service portal
- Co-workers with a PDF report
- Capacity teams with a CSV export
- Management and executives via real-time dashboards
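For example, a CSV export for a capacity team can be produced directly from normalized metric records. The column names and sample record below are hypothetical.

```python
import csv
import io

def to_csv(records):
    """Render metric records (dicts with timestamp, device, metric, value)
    as CSV text suitable for a capacity team's spreadsheet tooling."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["timestamp", "device", "metric", "value"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

# Hypothetical record for illustration.
out = to_csv([{"timestamp": "2024-01-01T00:00:00Z", "device": "edge-rtr-1",
               "metric": "if_util", "value": 0.42}])
```

The same records could just as easily feed a PDF renderer or a dashboard; the point is that each audience gets the format it actually works in.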
Providing data for the sake of data is not a productive strategy. The most important lesson to keep in mind when it comes to sharing performance data is know your audience. A CTO does not want to get bogged down in granular metrics about a particular link’s capacity or some other device status. They’re much more interested in a service-level or even market-level view of performance. For example, “What percent of time are we within SLA compliance?” or “How do my multiple carriers impact performance of my top applications?” Often, a dashboard featuring a simple status map with red/yellow/green indicators for current service health suffices.
Sharing information also means sharing data with other platforms, such as a fault or configuration management solution. It should be just as easy to export data as it is to ingest it. Does your vendor supply an API for communicating with other platforms as part of your maintenance plan, or is that an additional expense?
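A northbound integration typically starts with building an event payload the other platform can accept. The sketch below only constructs a hypothetical JSON event; the real endpoint, field names, and schema would come from the fault-management vendor's API documentation.

```python
import json

def build_event_payload(device, metric, value, severity="warning"):
    """Assemble a hypothetical JSON event for a downstream fault or
    configuration management system. All field names are assumptions."""
    return json.dumps({
        "source": "perf-monitoring",
        "device": device,
        "metric": metric,
        "value": value,
        "severity": severity,
    })
```

Whether this payload is then delivered over a REST call, a message bus, or a webhook is exactly the kind of detail to confirm with the vendor, along with whether the API is included in your maintenance plan.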
The journey from raw performance monitoring data to actionable insight about the health of your infrastructure is a critical path. The more you can refine the process to eliminate wasted time and energy, the more successful you’ll be. We often think of this cycle as Mean Time to Repair (MTTR), but it can be more accurately stated as Mean Time to Action. Ultimately, you’re looking to make decisions based on the best possible data in the shortest amount of time. This requires a clear understanding of the steps in the process and how you can shave critical moments off each stage.
Breaking down your performance monitoring process into these six fundamental components provides clarity around your strategy. It allows you to answer questions like:
- “How much and what types of data should I prepare to collect?”
- “Do I understand what normal performance looks like in my environment?”
- “How can I see when something is happening right now that I need to know about?”
- “Can I understand exactly what happened at any time in the past year?”
- “Am I able to comprehend the future capacity needs of my organization?”
- “Who needs this information and how can I make it actionable for them?”
By understanding the core requirements of a monitoring strategy, you also arm yourself with the knowledge to make an informed buying decision when evaluating performance monitoring vendors.