The Maturity Model for Infrastructure Monitoring
The Maturity Model for Infrastructure Monitoring traces a path through five levels, each of which adds functionalities and capabilities that improve, streamline and automate monitoring while reducing risk and lowering cost. The details of each level describe the impact on IT staff and end users as an organization matures from ad hoc monitoring and resource availability to the ultimate goal of optimized service delivery.
Infrastructure Monitoring - A New Model for a More Complex Environment
Monitoring today’s IT infrastructures has become so difficult that most organizations only detect poor performance when something goes wrong. The reason for this challenge lies in the complexity of modern applications and networks, which are often the result of expansions that occur over time to cope with growth and advances in technology. These infrastructures contain both physical and virtual components from multiple vendors, usually in numerous locations, including both private and public clouds, and operating on a variety of systems and platforms.
To make sense of infrastructure performance monitoring in this complex environment, it helps to break down individual tasks into distinct goals, functionalities and capabilities. The Maturity Model for Infrastructure Monitoring describes the stages of controlled monitoring required to track, report, react to and resolve infrastructure performance elements comprehensively, regardless of the complexity of the network.
The Maturity Model helps:
- Reduce risk by closing visibility gaps
- Assist IT staff so they can become more efficient and eliminate human error
- Decrease CAPEX and OPEX costs
- Lower the impact of performance issues on customers
- Reduce customer churn
- Provide a better handle on controlling infrastructure performance
This white paper provides an overview of the Maturity Model for Infrastructure Monitoring, and details the five levels and the advantages of moving through them. It also points out the drawbacks and dangers of leaving performance monitoring to basic tools. Lastly, it offers ways to reach a state of optimized service delivery.
The Maturity Model's Five Levels
Starting with the ad hoc performance monitoring tools in Level One and the basic availability tools in Level Two, the Maturity Model describes the critical advances that come with the sophisticated standardization and consolidation found in Level Three, the advanced visibility that comes with Level Four, and the final optimized service delivery that results from the monitoring platform in Level Five.
What’s at stake for an organization if it doesn’t take steps to move to Level Five? Customer experience, application performance and capacity planning, among other things. IT staff workloads will inevitably grow more onerous, and problems from human error will remain on the rise. Expansion and innovation will suffer. And, as a result of all these unaddressed issues, IT costs will escalate across the board.
The Maturity Model provides insight into how to gain control over all aspects of a network while reducing both risk and costs. For example, achieving Level Five, as shown in Figure 1, will result in significant cost savings in CAPEX and OPEX due to automation and the consolidation of tools into a platform that addresses 80% or more of the monitoring needs, thereby eliminating several redundant maintenance contracts. Risk also diminishes as visibility gaps are closed and reliable multivariate analytics are added. Returns and revenue may increase as well if the savings are used to fund innovation and new initiatives. And finally, there will be significant savings in IT staff time, which studies have shown leads to improved employee satisfaction.
Level 1. Ad Hoc Monitoring
Hardware vendors often supply monitoring tools for their products. Unfortunately, these tools have limited functionality, and provide little in the way of effective performance monitoring. Instead, they frequently result in significant application and service disruptions because they fail to account for the interaction of the product with other components on the network. Additionally, their lack of insight into effects on the overall infrastructure makes capacity forecasting impossible.
In a typical Level One scenario, less than 20% of the infrastructure is visible, and staff focus on partial coverage of critical application delivery. Views are five-minute snapshots, which mask activity spikes that occur at sub-minute intervals. Alerts occur only at upper limit thresholds, and false positives generate a lot of noise. Ad hoc reports are run only after performance events have happened, and vendors’ canned reports are limited in scope and restrict understanding of what’s going on. There is no service awareness, almost no automation, and zero confidence that monitoring tools can be scaled to cover a larger infrastructure.
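The masking effect of coarse snapshots can be illustrated with a minimal sketch. The sample values, the 10-second sampling rate and the 90% threshold below are hypothetical, chosen only to show how averaging hides a short burst; they are not taken from any particular monitoring tool.

```python
from statistics import mean

def rollup(samples, window):
    """Average consecutive samples in fixed-size windows (one value per window)."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]

# Hypothetical link utilization (%), sampled every 10 seconds for 5 minutes.
samples = [20.0] * 30
samples[12] = 98.0  # a 20-second burst near saturation
samples[13] = 97.0

THRESHOLD = 90.0  # assumed upper-limit alert threshold

five_minute_view = rollup(samples, 30)  # one averaged point per 5 minutes
one_minute_view = rollup(samples, 6)    # one averaged point per minute

print("5-minute average:", round(five_minute_view[0], 1))
print("highest 1-minute average:", round(max(one_minute_view), 1))
print("raw peak:", max(samples))
print("raw peak crosses threshold:", max(samples) > THRESHOLD)
print("any averaged view crosses threshold:",
      max(five_minute_view + one_minute_view) > THRESHOLD)
```

In this sketch the raw peak crosses the threshold, yet neither averaged view does. Shorter polling intervals narrow the gap, but averaged views still understate short bursts.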
At this level IT staff are operating blind, completing everyday tasks at a slow pace, and dealing with significant, unplanned downtime and capacity issues. They find themselves frequently troubleshooting in the dark. Inefficiencies are extremely costly, and IT functioning level is chaotic at best.
Three ways to move to Level Two:
- Ditch the hardware vendor tools in favor of solutions that function in multi-vendor environments
- Ratchet up polling to one-minute intervals for more granular views of infrastructure performance
- Define the components of services that need monitoring
Level 2. Basic Availability
Adding tools to fill in the gaps caused by inadequate hardware vendor tools, while a well-intended fix, actually results in more drastic problems and decreased visibility. Demands on IT staff grow because they now have to monitor more input sources. Costs often remain high and may increase as even more new tools are purchased. This type of reactive firefighting with multiple small hoses hooked to an array of disparate monitoring tools can lead to a lot of smoke, while unresolved problems continue to smolder.
Infrastructure visibility may now reach as high as 40%, but that still leaves the majority of IT in the dark. Polling may tighten from five-minute cycles in Level One to one-minute cycles when required, but performance data are averaged over time, resulting in poor capacity planning data and a lack of historical reporting granularity. Too many false positive alerts and swivel-chair troubleshooting across disparate tools still plague staff and consume far too much of their time. Dashboards bring together different components of service-related performance reporting, but offer no true correlation. Overlapping and incomplete tools require costly and redundant maintenance contracts, and agent-based monitoring adds to the administrative burden and limits scalability.
Here in Level Two, IT staff is mainly reactive. They’re constantly putting out fires rather than detecting sparks, still at the mercy of limited, purpose-built tools. For end users, service is unreliable, causing a high rate of customer churn. Staff are overworked, and job satisfaction is low. Innovation and new initiatives are distant dreams.
Three ways to move to Level Three:
- Baseline all metrics and trigger alerts when there’s a deviation from normal performance
- Correlate performance metrics with flow data to better understand consumption of resources
- Crank up interoperability and automation by integrating with help desk solutions
Level 3. Standardization and Consolidation
At this stage, there’s end-to-end visualization of network, compute and storage by business unit or customer, and it’s possible to view both physical and virtual resources on one screen. Reports can be customized on the fly, because they derive from a real-time, single source of truth. A central, scalable monitoring platform addresses 80% or more of monitoring needs, with point solutions covering specific services. Although point solutions will always have a place, the majority of infrastructure monitoring at this stage is done without agents or probes, significantly decreasing the administrative burden. Finally, integration with help desk solutions such as ServiceNow, Salesforce and Zendesk allows information to move seamlessly between platforms, resulting in faster issue resolution.
However, there’s still room for improvement. For forecast needs and capacity planning, staff continues to gather data from a number of sources and manually enter them into spreadsheets. The ability to scale to current monitoring demands has vastly improved, but at a significant price tag because of investments in hardware like high-end servers, pollers, data collectors and centralized database infrastructure.
Level Three also requires baselines for every metric collected. This provides an accurate view of what’s “normal” at any given time. When performance deviates from historical norms, an alert is sent. Understanding change in this way is a key component of Level Three because often these changes are not only a symptom of problems; they’re a direct or indirect cause as well.
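As a rough illustration of baseline-driven alerting, the sketch below builds a per-hour baseline from historical samples and flags values that stray too far from it. The function names, the hour-of-day bucketing and the three-standard-deviation rule are assumptions made for this example; a production platform may compute baselines quite differently.

```python
from statistics import mean, stdev

def build_baseline(history):
    """history: list of (hour_of_day, value). Returns {hour: (mean, stdev)}."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    # stdev() needs at least two samples, so sparser hours are skipped.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def deviates(baseline, hour, value, n_sigmas=3.0):
    """True when value is more than n_sigmas standard deviations from that hour's norm."""
    if hour not in baseline:
        return False  # no baseline yet, so nothing to compare against
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) > n_sigmas * sigma

# Hypothetical CPU utilization (%) observed at 09:00 and 03:00 on four days.
history = [(9, 60.0), (9, 64.0), (9, 62.0), (9, 58.0),
           (3, 10.0), (3, 12.0), (3, 11.0), (3, 9.0)]
baseline = build_baseline(history)

print(deviates(baseline, 9, 61.0))  # typical for 09:00
print(deviates(baseline, 3, 45.0))  # far above the 03:00 norm
```

A reading of 45% at 03:00 would never trip a static upper-limit threshold, but it is far outside what is normal for that hour, which is exactly the kind of change a baseline is meant to surface.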
In Level Three, conditions across the infrastructure are normalized. Staff is more comfortable and in greater control since they can see more than half of the infrastructure at any given time. Future-proofing is in place, and a measurable reduction in costs has begun to take place.
Three ways to move to Level Four:
- Incorporate visibility of applications and service delivery as opposed to monitoring only individual infrastructure components
- Link your alerts to log analysis to spot unique logs or trending conditions
- Define, monitor and alert on custom KPIs that don’t exist in the MIB of monitored devices
Level 4. Advanced Visibility
At Level Four, service-level views and cross-platform processes ensure that reliable metrics are the basis of business decision-making. Mean time to repair (MTTR) is also reduced significantly, resulting in fewer staff-hours devoted to troubleshooting and issue resolution.
Here, 80% to 90% of the infrastructure is visible, including application and service delivery instead of just component monitoring. Automated discovery of L2 and L3 topology is available, and it’s possible to view real-time status and SLA instrumentation, including packet loss, jitter and congestion. Log analysis now triggers alerts, working with accurate baselines and thresholds based on each unique environment. Single clicks get staff from metric to flow to logs within the same interface, greatly facilitating troubleshooting and reducing MTTR. In fact, the monitoring platform makes it possible to resolve half of all issues proactively before they produce any discernible impact. Organizations know what’s happening on the network, where it’s happening, and when it’s happening -- end to end.
At this level capacity planning and trending can be performed from a single platform. For example, using reports like “days until threshold” and log data analysis, staff can anticipate how user behavior on individual applications will impact capacity needs of the underlying infrastructure. They can then make necessary adjustments to avoid any user impact. These proactive capacity planning insights can be especially helpful when rolling out new applications or services.
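One simple way to picture a “days until threshold” report is a straight-line extrapolation of recent usage. The sketch below is only an illustration under that assumption: the function name, the least-squares trend and the sample data are hypothetical, and real platforms may use more sophisticated forecasting.

```python
def days_until_threshold(daily_values, threshold):
    """Fit a least-squares line to daily samples and extrapolate to the threshold.

    Returns 0.0 if the fitted value for the latest day is already at or above
    the threshold, and None when there is too little data or the trend is
    flat or falling.
    """
    n = len(daily_values)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(daily_values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_values))
    slope = sxy / sxx
    fitted_today = y_mean + slope * ((n - 1) - x_mean)
    if fitted_today >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - fitted_today) / slope

# Hypothetical disk utilization (%) over the last five days.
usage = [50.0, 52.0, 54.0, 56.0, 58.0]
print(days_until_threshold(usage, 80.0))
```

With usage growing two points a day from 58%, the sketch estimates 11 days before the 80% threshold is reached, which gives staff a concrete window in which to add capacity.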
These reports reliably support business decisions, offering insights based on KPIs defined by the organization. All time series data can be ingested, regardless of source, and seamlessly graphed with other metrics, such as SNMP and IP SLA. For example, an organization could correlate footfall traffic to demand on a wireless network, or correlate transaction volume to the stress it places on the underlying infrastructure.
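The footfall example can be sketched with a plain Pearson correlation between two time series sampled at the same intervals. The data below are invented for illustration, and the helper is a generic textbook formula rather than any specific platform's analytics.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series.

    Returns None when either series has no variance, since the
    coefficient is undefined in that case.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

# Hypothetical hourly store footfall and connected wireless clients.
footfall = [120, 340, 560, 610, 480, 200]
wifi_clients = [35, 110, 190, 205, 150, 60]

r = pearson(footfall, wifi_clients)
print(f"correlation: {r:.2f}")
```

A coefficient close to 1 suggests that wireless demand tracks footfall closely, so a business forecast of visitor numbers can feed directly into wireless capacity planning.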
At Level Four, it’s possible to view all object metrics down to one-second granularity, with zero degradation to the speed of reports, no matter the size of the monitored domain. Ingestion of daily log volumes greater than a terabyte is possible, with flows-per-second in the hundreds of thousands. Without the need for human intervention, the platform allows new devices to be added to the configuration management database and integrated with data center orchestration and tools such as Ansible, Puppet and Chef.
IT strategy and operations have been streamlined and are now proactive. Interoperability means greatly reduced MTTR, and cost savings are dramatically evident. The effects are now being felt by customers and IT staff alike. Customers are seeing consistently reliable service, and employees are experiencing the relief that comes with responsible automation. The result is a positive impact on overall business. But there’s still one more threshold to cross.
Three ways to move to Level Five:
- Tie alerts to multivariate analysis to spot trouble due to multiple, related events
- Collect sub-second views of infrastructure performance from probe-based solutions and report on these metrics from the monitoring platform
- Incorporate service-centric status maps to create awareness of all the components required to deliver the service successfully
Level 5. Optimized Service Delivery
Level Five is the ultimate goal in the Maturity Model for Infrastructure Performance Monitoring. At this stage an effective platform incorporates extensive automated functions and multivariate analytics. There’s also full understanding and control of the entire infrastructure, end to end, including hybrid cloud elements and all the on- and off-premises components that make up the network.
With comprehensive automation and reliable real-time analytics, infrastructure performance undergoes continuous improvement. Optimal performance drives innovation and creative expansion, and with less of their attention obligated to mundane monitoring tasks, IT staff have more time to spend on innovation, which often leads to greater job satisfaction.
Visibility is now at maximum -- a full 99%. Even “shadow IT” is no longer in the dark, so there’s awareness of everything impacting the infrastructure resources. Organizations have insight into how environmental conditions, transaction volume and energy consumption affect the underlying infrastructure. For example, the platform monitors the temperature inside and outside a datacenter, noting any differential in trending conditions as a possible indication of an issue. The platform can even monitor power strips to detect inefficient servers that draw more energy than is necessary or normal.
At this level, probe-based solutions deliver sub-second performance views, including agent-based end-user experience metrics, and correlate them with infrastructure performance. Comparative analysis and multivariate metrics and analytics rule, providing increased forecasting accuracy. The completely virtualized, all-in-one monitoring platform also has the ability to spin up new monitoring capacity on demand and as needed.
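To give a sense of why multivariate analysis catches what single-metric thresholds miss, the sketch below combines per-metric deviations into one score. Everything here is an assumption for illustration: the metric names, the baseline values, the limits, and the simplification that the metrics are independent. It is not a description of any vendor's analytics.

```python
from math import sqrt

def z_score(value, mu, sigma):
    """Distance of a reading from its baseline mean, in standard deviations."""
    return (value - mu) / sigma

def combined_deviation(readings, baselines):
    """Root-sum-square of per-metric z-scores.

    Treats the metrics as independent, which is a simplification;
    correlated metrics would call for a covariance-aware measure.
    """
    return sqrt(sum(z_score(readings[name], *baselines[name]) ** 2
                    for name in readings))

# Hypothetical baselines as (mean, standard deviation) per metric.
baselines = {
    "latency_ms": (20.0, 4.0),
    "packet_loss_pct": (0.5, 0.2),
    "cpu_pct": (40.0, 10.0),
}
readings = {"latency_ms": 30.0, "packet_loss_pct": 1.0, "cpu_pct": 65.0}

PER_METRIC_LIMIT = 3.0  # assumed single-metric alert limit, in sigmas
COMBINED_LIMIT = 4.0    # assumed limit for the combined score

single_alerts = [name for name in readings
                 if abs(z_score(readings[name], *baselines[name])) > PER_METRIC_LIMIT]
score = combined_deviation(readings, baselines)

print("single-metric alerts:", single_alerts)
print("combined alert:", score > COMBINED_LIMIT)
```

In this sketch no individual metric crosses its own limit, yet the three moderate deviations occurring together push the combined score over its limit, flagging trouble due to multiple, related events.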
The platform detects and rectifies 80% of performance issues prior to any significant end user impact, thanks primarily to service-centric status maps that reveal all the components involved in successfully delivering the service. Software defined network (SDN) controllers subscribe to the performance monitoring platform in order to receive recommendations for optimizing the performance of the virtual infrastructure. Cloud, virtualization and an all-IP connected environment means it’s possible to intuitively scale with massive data collection at will. Adding monitoring capacity is as simple as spinning up a new VM on demand.
At this stage, organizations have nearly fully automated the monitoring of their infrastructure performance, resulting in an unprecedented level of confidence. With a renewed sense of job satisfaction, staff can now spend time fine-tuning the particulars, and are free to create and pursue continuous improvement of the application and service delivery. Having maximized the value of the monitoring platform — and saved considerable CAPEX and OPEX funds in doing so — it’s now possible to explore new revenue streams through savings-funded innovation.
Determining Levels on the Maturity Model
So, how can organizations find out what level they’re at? And what do they need to do to get to Level Five?
First, they must recognize that they might be far along the model in some areas but lagging behind in others. For example, a datacenter might be able to collect time series data at an advanced level, but when an issue arises, staff may be swivel-chair troubleshooting for hours, even days, with a variety of disparate vendor tools. A good motivation for moving an area up from the lower level is that the non-optimal area may be negatively influencing areas already farther up the model.
This quick Online Assessment helps organizations find their level on the Maturity Model. The information gained from this assessment provides a sense of what’s already being done well and helps identify what can be improved.
Choosing a Comprehensive Performance Monitor
When looking for a performance monitoring platform that will move an organization up the levels on the Maturity Model, it’s important to find one with functionalities and capabilities that cover all five levels. Often, simple monitoring solutions will only cover certain aspects of the first couple of levels.
Staff must also examine all aspects of infrastructure monitoring -- the inputs and outputs, as well as the various platforms in the network, including SDNs and any specialty applications unique to the organization.
When moving through the levels, it’s important to take full advantage of the various functionalities and capabilities of whatever performance monitoring system is implemented. Moving up one, or even two levels, can take as little as a couple of weeks. During this process, organizations can implement functionalities on their own or take advantage of technical help from system specialists.
Currently, no single solution covers everything needed to reach the top of the Maturity Model. However, SevOne’s comprehensive infrastructure monitoring platform comes close -- very close. By providing a fully automated, comprehensive solution, it’s helping some of today’s largest, most connected companies gain the security, cost savings and peace of mind that come with approaching Level Five.