7 Ways to Use Log Data for Proactive Performance Monitoring
Over the past decade, the value of log data for monitoring and diagnosing complex networks has become increasingly obvious, and many operations teams have changed their IT practices as a result. That progress, however, has been hampered by the limitations of existing log search and analysis tools.
Fortunately, new technologies are now driving a more effective approach to using log data. Instead of gathering and examining information about the past, log analytics can now focus on the present and even the future. This white paper outlines seven examples of how this can be accomplished using the SevOne Performance Log Appliance (PLA).
How the SevOne PLA Addresses Today's Attitudes and Practices
Created for high-volume processing, storing and indexing of log data, the SevOne PLA does more than make log details accessible to complex, after-the-fact search queries. Algorithms identify patterns of log activity and create a picture of what's "normal" behavior. When log entries vary from that baseline, the PLA sends an alert. Operations staff can then use the PLA web interface to drill down to the relevant logs to see what changed and why. SevOne PLA also interfaces with the SevOne performance monitoring platform. For the first time, operations staff can automatically correlate polled performance metrics on networks, servers, applications, storage and more with the corresponding log details on device or application actions and changes in state.
The need for a solution like SevOne PLA is clear. The attitudes and practices of network operations groups have been evolving for years. In fact, 59 percent of IT respondents in a December 2014 report by Enterprise Management Associates said they consider log analytics a “strategic,” not merely “tactical,” effort. These strategic users were two times more likely than tactical users to say that log data is the “most important” of all network management data sources. Large enterprise organizations, in particular, cited log data as “the first place we turn to” when dealing with infrastructure monitoring issues.
The evolution of IT attitudes and practices is reflected in improved log analysis tools, especially search tools with sophisticated proprietary query languages, and more muscular analytics applications. Yet two major problems remain with even the best of these tools: they are after-the-fact responses to network issues, and users have to know what they're looking for in order to frame the search queries. These limitations, coupled with the sheer volume of log data and log-based devices, mean it typically takes hours to sift through log data to identify the causes of infrastructure problems.
Fortunately, a combination of new technologies can overcome these limitations:
- A scalable architecture using distributed, parallel processing to handle the data volumes
- Algorithms to analyze the normal behavior (baseline) of devices and applications
- An alerting feature that detects variations from the baseline, or first occurrences of unique logs, and sends warnings to operations staff
- A user interface that uses identified key performance metrics and indexing to simplify navigating log messages, without the need to learn complex query languages
- Automatic correlation of performance metric changes with the related log data, for faster discovery of root causes

The SevOne Performance Log Appliance (PLA) provides exactly these capabilities. It also links tightly with the SevOne infrastructure monitoring platform, allowing users to highlight performance anomalies and, with one mouse click, bring up the corresponding PLA log data in the same window.
Here are seven ways to use this powerful new approach to log analytics:
1. Leveraging New Log Fields
One online shopping site wanted to track how long it took for shoppers’ search queries to return results. Minimizing that time boosts the number of transactions and customer satisfaction, and improves the site’s scoring by Google Analytics. The first step was to add a field to the Apache custom log format, for “duration.” This allowed the log to track how long it took for the application server to respond to a web server query and enabled the operations team to see, at any given moment, the duration of specific queries.
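Apache's mod_log_config module exposes request duration through the %D directive (the time taken to serve the request, in microseconds). A custom log format along these lines, with an illustrative format name, could capture the kind of duration field the team added:

```apache
# Illustrative custom log format: the standard common-log fields
# plus %D, the request duration in microseconds (mod_log_config).
LogFormat "%h %l %u %t \"%r\" %>s %b %D" duration_log
CustomLog "logs/access_log" duration_log
```

Each access-log entry then ends with the request's service time, which downstream tools can parse and baseline.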
Next, they took this new log entry further with the SevOne PLA, which analyzes log data to identify patterns and creates a picture of what constitutes normal behavior (in this case, search queries). This allowed them to categorize variations from this baseline into several groups, including response times under 5 seconds, under 30 seconds, more than 60 seconds, and so on. From here, they used the PLA to create alerts triggered by these variations.
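The bucketing step can be sketched in a few lines. This is an illustrative sketch, not the PLA's internal logic; the threshold values come from the example above:

```python
# Bucket parsed response times and flag the slowest ones.
# Thresholds (seconds) taken from the example buckets in the text.
THRESHOLDS_S = [5, 30, 60]

def bucket(duration_s: float) -> str:
    """Return the bucket label for a single response time."""
    for limit in THRESHOLDS_S:
        if duration_s < limit:
            return f"under {limit}s"
    return "over 60s"

def slow_queries(durations, warn_bucket="over 60s"):
    """Return the durations that land in the alert-worthy bucket."""
    return [d for d in durations if bucket(d) == warn_bucket]
```

In practice the alerting side would fire when the share of responses in the slow buckets departs from the learned baseline, rather than on every slow query.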
The results were immediate. The operations team quickly tracked down a group of long-running search queries and re-coded them for faster responses. Moving forward, they plan to correlate the log data with performance metrics from server CPUs, memory and disk storage, with the goal of re-configuring these resources to optimize search times.
2. Receiving Automatic Alerts for First-Time Log Events
Being able to get an automatic alert each time a never-before-seen log message code appears can be a huge advantage. Often, these messages act like a “canary in the coalmine,” warning of a change that could be the forerunner of a big problem. This capability is called “first value occurrence” in the SevOne PLA.
One example of the importance of first value occurrence is its ability to illuminate obscure or rare log messages in a population of hundreds or thousands of Cisco routers and switches. Each of these devices is capable of sending hundreds of unique message types. It's impractical to create alerts for each of these because the resulting flood of log-based alerts would be more confusing than illuminating. Furthermore, because many of them are rare, they aren't the kind of data that is routinely and regularly searched for in log analytics tools.
Because the PLA builds an accurate picture of normal behavior, it knows when a previously unknown router message appears, and sends an alert. By identifying a first-time message code like “low memory resource,” the PLA proactively tells operations teams there is an anomaly. This kind of information is actionable because it tells the team, right away, what changed and where. That change can then be correlated with other log messages and with performance metrics.
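The core idea behind first value occurrence is simple to illustrate, even though the PLA's actual implementation is not public. A minimal sketch: remember every message code seen so far, and surface any code appearing for the first time:

```python
# Illustrative sketch of "first value occurrence" detection:
# report each message code the first time it is ever seen.
def first_occurrences(message_codes):
    seen = set()       # codes observed so far
    firsts = []        # codes that triggered a first-time alert
    for code in message_codes:
        if code not in seen:
            seen.add(code)
            firsts.append(code)
    return firsts
```

A real system would persist the seen-set across restarts and attach device, timestamp and severity to each first-time alert.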
3. Monitoring Spikes and Drops in Application Message Volumes
Most applications and devices have a regular pattern in the number of log messages they send during a given amount of time. A firewall might generate a thousand messages per second; a router might generate a thousand per day. Spikes or drops in these patterns can reveal underlying problems. For example, one operations group configured a backup that ended up going through a firewall instead of remaining, as intended, on the LAN. The number of firewall connections soared and the firewall crashed because it wasn’t configured for that many connections.
The SevOne PLA can baseline these message flows, and instantly alert when behavior changes from the norm. For example, after a software upgrade, a server cluster’s logs showed a jump in API error messages, in one instance jumping from 100 messages per second to over 2,000. The PLA detected the spike and sent an alert. Operations staff called up the relevant log data in the PLA user interface, and saw that certain software processes had been thrown out of alignment due to a simple misconfiguration in the upgrade.
In addition, with the SevOne PLA, it's possible to be even more fine-grained by identifying and tracking individual fields within the log messages. So within a surge of messages, users can discover, for example, that a given SNMP daemon is spiking because of too-frequent polling. This level of detail could be buried in the volume of other messages and difficult to find without the PLA alerting capability.
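The spike-and-drop idea can be sketched with a simple moving-average baseline. This is an assumption-laden illustration, not the PLA's algorithm; the window size and factors are arbitrary:

```python
from collections import deque
from typing import Optional

class RateBaseline:
    """Illustrative message-rate baseline: compare each sample to the
    moving average of recent samples and flag sharp deviations."""

    def __init__(self, window=60, spike_factor=5.0, drop_factor=0.2):
        self.samples = deque(maxlen=window)  # recent msgs/sec samples
        self.spike_factor = spike_factor
        self.drop_factor = drop_factor

    def observe(self, msgs_per_sec: float) -> Optional[str]:
        alert = None
        if self.samples:
            baseline = sum(self.samples) / len(self.samples)
            if msgs_per_sec > baseline * self.spike_factor:
                alert = "spike"
            elif msgs_per_sec < baseline * self.drop_factor:
                alert = "drop"
        self.samples.append(msgs_per_sec)
        return alert
```

Fed the figures from the upgrade example (a steady 100 messages per second jumping past 2,000), this would fire a "spike" alert on the first elevated sample.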
4. Knowing When Log Data Dries Up
One of the simplest and most basic questions in log analytics is the one that, until now, was very difficult to answer: when does a device suddenly stop sending log data, and therefore, stop providing current information about its state?
Most log tools today, including sophisticated search products, can’t answer this question because they don’t provide a baseline of what is normal. And IT staff can only find out by manually, continuously and laboriously checking each device.
With baselining and alerting features, the SevOne PLA treats this problem as just another message volume alert. It identifies the baseline message volume it receives over a given time from a specific device. If the volume drops by some number or drops to zero, the PLA sends an alert. Critical devices and servers can no longer fall silent about their state without being noticed.
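Silence detection can also be sketched generically: learn each device's typical gap between messages, then flag any device quiet for much longer than that. The dictionary shapes and the factor of 3 below are illustrative assumptions, not PLA behavior:

```python
import time

def silent_devices(last_seen, baseline_gap_s, now=None, factor=3.0):
    """Illustrative silence check.

    last_seen: {device: unix timestamp of its most recent log message}
    baseline_gap_s: {device: typical seconds between its messages}
    Returns devices quiet for more than factor * their baseline gap.
    """
    now = time.time() if now is None else now
    return [device for device, ts in last_seen.items()
            if now - ts > factor * baseline_gap_s.get(device, 60.0)]
```

Run periodically, this turns "the firewall stopped logging an hour ago" from something discovered by accident into an alert.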
5. Monitoring VoIP Call Quality
Understanding a VoIP telephony environment and its call quality scores is not typically seen as a use case for log data. However, some operations groups are already exploring such applications.
One organization was using a popular log search tool to monitor the call logs of their Cisco call managers. The search tool let them create a report that used various log statistics and data, such as mean opinion scores, R values (a score designed to express the subjective quality of speech) and jitter (the variation in the delay of received packets), to understand the quality of VoIP calls at a given moment or over a span of time.
But that isolated result didn’t show the trend of quality issues over time, because the tool lacked a way to baseline the log data from the call managers. The only way to do that was to manually and tediously compare separate call quality log reports of comparable time periods from day to day, or week to week.
The SevOne PLA is able to analyze the influx of data to automatically identify the relevant baseline behaviors. Using these baselines, the PLA can be used to alert operations teams to things like a gradual increase in jitter, or recurring jitter spikes at certain times or on certain days. Log data that previously could only be used as a forensic tool to analyze quality data at an isolated moment is becoming an operational tool for monitoring that data in real-time.
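A gradual increase in jitter is essentially a trend across the daily baselines. As a hedged illustration of that idea (not the PLA's analytics), a least-squares slope over daily average jitter surfaces drift that isolated reports would miss:

```python
def jitter_trend(daily_avg_jitter_ms):
    """Illustrative trend check: least-squares slope of daily average
    jitter, in milliseconds gained (or lost) per day."""
    n = len(daily_avg_jitter_ms)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_avg_jitter_ms) / n
    num = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(xs, daily_avg_jitter_ms))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A persistently positive slope across a week of call-manager logs is the kind of "gradual increase in jitter" worth alerting on before users notice call quality degrading.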
Coupling this event-based data with performance metrics, such as CPU and memory utilization and drive space availability, creates a picture of resource dependencies, and guides actions to optimize these resources and preserve quality objectives.
6. Managing Configuration and Policy Changes
On a platform like the SevOne PLA, log data can be used to manage not only unexpected changes, but also the wealth of deliberately planned changes to infrastructure: a bug fix, an OS update, a server upgrade, a new forwarding rule or policy change, a new application or a cluster reconfiguration, for example. In this way, log data becomes a way to measure and manage these changes; to confirm that a change has achieved its performance objectives; or to identify how a change may have triggered a cascade of unexpected issues.
The SevOne PLA also enables operations staff to create a fine-grained before-and-after comparison of log events when launching a network change. They can capture when the configuration change occurs in different network devices, and even see what specific changes have occurred. For example, adding a new QOS scheme and changing out a given access list can each be separately captured, seen and analyzed.
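The before-and-after comparison described here can be illustrated with a plain text diff of captured configuration lines. This is a generic sketch using Python's standard difflib, not the PLA's change view:

```python
import difflib

def config_diff(before_lines, after_lines):
    """Illustrative before/after comparison of device configuration
    lines: return only the added (+) and removed (-) lines."""
    return [line for line in difflib.unified_diff(
                before_lines, after_lines, lineterm="")
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]
```

Applied to configuration snapshots taken before and after a change window, the output isolates exactly which access lists, QoS policies or other statements were touched.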
The impact of this data is even greater when it’s coupled with the SevOne performance monitoring platform, which polls devices and applications via an array of protocols, to collect and baseline performance statistics such as network or CPU utilization. These metrics create a view of the overall health of the network or a service, before and after a planned change. Correlating these metrics with the associated PLA log data shows how, and how well, the change is improving or degrading overall performance.
7. Using Log Data as a Basis for Capacity Planning
By combining log data with performance metrics, users can accurately forecast future growth in network activity and usage. Those projections can then become the basis for network changes and upgrades to handle that growth.
Log data makes it possible to capture user activity at a granular level. Using this information, the SevOne PLA can baseline behaviors for the average number of users, or for peak number of users. Performance metrics then reveal how much of the network resources and processes are associated with each. This combination of capabilities makes it possible to do things like “deconstruct” online shopping activities once users press the “checkout” button. Operations teams can see how long the transaction takes and measure that against the backend CPU load and other metrics. By bringing together detailed, event-based log data with overall performance metrics, they can also see the “weight” a single customer puts on the infrastructure, and on the overall health of an application.
Operations staff can then project future demand and load based on adoption rate and the increase in user numbers. These projections are no longer based on best guesses, but on the measurable historical trend of real users who are creating the actual load on the network infrastructure. The result is greater accuracy, cost-effectiveness and confidence in capacity planning decisions.
A New Era in Leveraging Log Analytics
The above examples embody a new approach that recognizes log data as a valuable real-time resource for network operations. Event-based logs have immediacy and detail about activities; they disclose state changes; and they capture the history of these records. All of these qualities are needed for proactive management of today’s complex IT infrastructures.
Until now, these qualities could only be partially exploited. The SevOne PLA brings a scalable architecture to store, index, analyze and baseline vast amounts of log data. As a result, operations staff no longer have to search for problems: an alerting engine detects variances and anomalies and sends a warning. In addition, the PLA links with the SevOne infrastructure monitoring platform to illuminate changes in performance metrics with the corresponding event-level log data.