
17 Feb

Why Performance Alerts Require More Intelligence

[Figure: SevOne graph depicting baselines and thresholds]

When it comes to performance monitoring tools, we hear the same complaint over and over from operations teams: “We get too many false positive alerts.”

The primary culprit? A process that generates alerts based on static, user-defined threshold violations. Don’t get me wrong: there are times when static thresholds serve a purpose. For example, you may want to know when a CPU exceeds 95% utilization for 15 minutes or more, or when the voltage on a UPS falls outside a specified range. But generally speaking, static, user-defined thresholds suffer from several problems:

  • They are often “best guesses” at what acceptable performance should be
  • They do not understand the context of your unique infrastructure
  • They lack the intelligence to know whether a violation is significant

When it comes to alerts, what you really want to know is, “What’s happening in my environment right now that’s unusual and that I need to know about?” Static thresholds fail to provide that insight.

For example, how do you determine acceptable upper- and lower-limit threshold values for the following?

  • The number of connections to your firewall at any time of day
  • The number of failed user logins for a specific application over a 15-minute period
  • The number of Apache processes spawned by an application

Maybe you simply want to know when bandwidth utilization is much higher than normal (users streaming YouTube and Netflix) or much lower (a network backup failed). Again, static thresholds come up short.

Too Much Noise

Let’s consider the example below. A monitored metric has exceeded its predetermined, static threshold value seven times over a 30-minute period. The result: seven redundant alerts, contributing to the noise you have to wade through.

[Figure: SevOne alerting functionality]
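To make the problem concrete, here is a minimal Python sketch of that naive approach (the threshold and sample values are hypothetical): every sample that crosses the fixed limit fires its own alert, so a single 30-minute incident produces seven notifications.

```python
# Naive static-threshold alerting (illustrative sketch, hypothetical data).
THRESHOLD = 95.0  # static, user-defined limit (e.g., % CPU utilization)

# Roughly 30 minutes of samples as (minute, value) pairs
samples = [(0, 96.2), (4, 97.1), (8, 95.6), (12, 93.0),
           (16, 96.8), (20, 98.3), (24, 95.9), (28, 96.4)]

alerts = []
for minute, value in samples:
    if value > THRESHOLD:  # every violation fires, with no memory of the last
        alerts.append(f"t+{minute}m: {value}% exceeded {THRESHOLD}% threshold")

print(f"{len(alerts)} alerts fired")  # -> 7 alerts for one incident
```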

Too Wrong for Too Long

Now let’s consider a more intelligent approach to this scenario. Let’s assume your monitoring platform automatically baselines the performance of every metric it collects, establishing a reference point for normal behavior at any given time of day and day of the week.
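One way such a baseline can be built (an assumed approach for illustration, not necessarily SevOne’s implementation) is to keep a running mean and standard deviation per metric, bucketed by day of week and hour of day, so that “normal” is defined relative to this moment in the weekly cycle:

```python
# Running per-bucket baseline using Welford's online algorithm (sketch).
from collections import defaultdict
from datetime import datetime
import math

class Baseline:
    def __init__(self):
        # (weekday, hour) -> [count, mean, M2] for Welford's method
        self.buckets = defaultdict(lambda: [0, 0.0, 0.0])

    def update(self, ts: datetime, value: float) -> None:
        stats = self.buckets[(ts.weekday(), ts.hour)]
        stats[0] += 1
        delta = value - stats[1]
        stats[1] += delta / stats[0]             # update running mean
        stats[2] += delta * (value - stats[1])   # update sum of squared deviations

    def norm(self, ts: datetime):
        """Return (mean, stddev) of past behavior for this weekday and hour."""
        count, mean, m2 = self.buckets[(ts.weekday(), ts.hour)]
        return mean, (math.sqrt(m2 / count) if count > 1 else 0.0)
```

Bucketing by (weekday, hour) captures weekly seasonality: Monday 9 a.m. traffic is compared with previous Monday mornings rather than with a global average.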

If we compare real-time performance to historical norms for this moment, we see that the first four spikes are within an acceptable range. Perhaps one spike exceeded a standard deviation from normal performance, but not for a prolonged period. No need to alert just yet.

But as we move further along the timeline, we notice something relevant. The monitored metric exceeds its allowable deviation from baseline performance four times within a sliding 15-minute window. Now the monitoring system can be confident there is a genuine anomaly, and it alerts once.

[Figure: SevOne alerting functionality]
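Here is a minimal sketch of that windowed logic (the window size, violation count, and deviation multiplier are illustrative assumptions, not SevOne’s actual parameters). It pairs naturally with the Baseline class sketched above:

```python
# Baseline-aware alerting with a sliding violation window (sketch).
from collections import deque
from datetime import timedelta

WINDOW = timedelta(minutes=15)
VIOLATIONS_TO_ALERT = 4
K = 2.0  # allowed deviation from baseline, in standard deviations

def watch(samples, baseline_for):
    """samples: iterable of (timestamp, value) pairs;
    baseline_for: callable returning (mean, stddev) for a timestamp,
    e.g. the Baseline.norm method sketched above."""
    violations = deque()  # timestamps of recent out-of-range samples
    alerted = False
    for ts, value in samples:
        mean, std = baseline_for(ts)
        if std > 0 and abs(value - mean) > K * std:
            violations.append(ts)
        # discard violations that have slid out of the window
        while violations and ts - violations[0] > WINDOW:
            violations.popleft()
        if not alerted and len(violations) >= VIOLATIONS_TO_ALERT:
            alerted = True  # one alert for the whole anomaly, not one per spike
            print(f"{ts}: anomaly confirmed "
                  f"({len(violations)} deviations within 15 minutes)")
```

Compared with the static-threshold loop earlier, the only notification here is the one that fires after the fourth sustained deviation; further violations within the same anomaly stay silent.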

This method is a more reliable predictor of service-impacting events. There’s no noise to sort through and no false positive alerts to wake you up. Instead, you get insight into real-time changes in your infrastructure’s performance.

Looking for more common-sense tips for monitoring the performance of your infrastructure? Download our free whitepaper on 6 Steps to an Effective Performance Monitoring Strategy.

Written by Scott Frymire
Director of Content Marketing

Scott Frymire joined SevOne in September 2012 and currently serves as Director of Content Marketing. His primary interest is interpreting how IT trends in the enterprise and service provider markets – such as cloud, software-defined everything, and the Internet of Things – impact the performance monitoring landscape. Prior to SevOne, Scott spent 16 years marketing business-to-business software and services for ERP solution providers including Prophet 21, Activant, and Epicor.
