How to Improve Service Delivery with Integrated Metric, Flow, and Log Monitoring



Hi, I'm Steve Mahoney, and I manage SevOne's newest product line, our Performance Log Appliance. I want to spend a few minutes going over a fictitious example of how mature IT organizations are using a variety of different technologies, tightly integrated together, to better monitor their network and application infrastructure. I'm sure we're all familiar with these technologies: performance metrics from SNMP, IP SLA, and ICMP, flow technologies like J-Flow, sFlow, and Cisco NetFlow, and even log data. All of these things are super important, but when you tightly integrate them together, they can tell different facets of the same story and help us reduce time to repair, identify anomalies, and so on.

In this fictitious example, I have a web application that my network is serving. I have users remotely accessing it. They're coming in through the cloud (it looks like a freshly baked loaf of bread), and they are accessing our network equipment. We have a router behind the cloud, our F5 load balancer, and a pool of servers all delivering up the content. In this scenario we have the users, hundreds of thousands of them, and many are complaining about a degradation in performance. The question is, how do we identify what's going on, what's wrong in the scenario, and then how do we mitigate that issue?

Immediately we can go in and start looking at some simple performance statistics about the router, the load balancer, and even the servers. We can look at things like CPU, memory, and disk, and we can trend those out over the period of time when the users have actually been making their complaints. This will only tell us a little bit about how the infrastructure itself is performing, and it's likely that if there was an issue of any kind we would have already gotten an alert. Another technology that we have, thanks to this F5 load balancer, is sFlow. sFlow is a flow technology, and much like other flow technologies, it gives us the ability to understand a little more deeply what traffic is actually being generated: which applications, which types of servers, and the direction that things are moving. In this instance we can actually see, through a tightly integrated report, that there are a number of different IP addresses, or users, accessing a number of different ports in our pool of servers. That will tell us the highest load, we'll get an idea of the top users, and it will really give us an indicator of where we need to go next. That's as far as it has taken us so far: an indication of where we need to go next.
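To make the "top users" and "top ports" idea concrete, here is a minimal sketch of that kind of aggregation. The record layout, the IP addresses, and the `top_n` helper are all hypothetical and chosen only for illustration; real sFlow samples carry many more fields, and this is not how any particular product builds its report.

```python
from collections import Counter

# Hypothetical, simplified flow records: (source IP, destination port, bytes).
# Real sFlow data is sampled, so byte counts would normally be scaled
# by the sampling rate before being compared.
flows = [
    ("203.0.113.10", 443, 120_000),
    ("203.0.113.11", 443, 80_000),
    ("203.0.113.10", 80, 30_000),
    ("198.51.100.7", 8080, 200_000),
    ("203.0.113.11", 443, 50_000),
]

def top_n(records, key_index, n=3):
    """Sum bytes per key (column key_index) and return the n largest totals."""
    totals = Counter()
    for record in records:
        totals[record[key_index]] += record[2]
    return totals.most_common(n)

top_users = top_n(flows, 0)  # heaviest source IPs by bytes
top_ports = top_n(flows, 1)  # busiest destination ports by bytes

print(top_users)
print(top_ports)
```

A ranking like this shows who is generating the load and where it lands, which is the pointer to "where we need to go next" rather than the root cause itself.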

The flow data is really not going to tell us what is causing this problem; it's going to tell us the amount of traffic that's being generated. Again, we might have found our issue, because there could be some sort of congestion in the network, some high utilization that we need to mitigate, but it's likely that the traffic is okay, that it's normal, and we need to dive a little bit deeper. The next level that we want to look into is this pool of servers here. Again, with the standard statistics, we can gather some CPU and memory, and we can understand the health and overall performance of the servers themselves, but with log data we can go a little bit deeper. In a brief moment, I'm going to show you a little bit more about the logs that we can gather from these servers that will help us understand the performance and really try to find the root cause of this degradation.

Now we want to take a little bit of a deeper look into the kinds of log data that we can extract from those web servers. We've already identified from performance statistics and from our flow data that it's not necessarily a network related issue right now that's causing our degradation. Instead what we want to do is look at some of the log data and see if there's anything about the application itself or the servers that indicates there might be a problem and then mitigate it from there.

In this example, let's imagine that we're dealing with an Apache server. Apache logs are extremely customizable. In this instance, the fields that I really care about are the duration times. This is the duration in which the servers are delivering content to the user. It's the entirety of that duration, so not just calculations or any graphs or things that might be generated; it's the actual generation and then delivery of that content. In mod_log_config, we can actually customize the format with two different variables, %T and %D. These respectively indicate the number of seconds and the number of microseconds that it takes to deliver a page. Microseconds, for anyone who might not be familiar, are in between milliseconds and nanoseconds. Milliseconds are 10 to the -3, microseconds are 10 to the -6, and nanoseconds are 10 to the -9, so you can see how they fall in the middle there.
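As a rough illustration of extracting that duration field, the sketch below assumes the server was configured with a format such as `LogFormat "%h %l %u %t \"%r\" %>s %b %D"`, that is, the common log format with the microsecond duration appended as the last field. That format string, the sample log line, and the `parse_duration` helper are assumptions made for this example, not something taken from the talk.

```python
import re

# Assumed format: LogFormat "%h %l %u %t \"%r\" %>s %b %D"
# The final field is %D, the time taken to serve the request in microseconds.
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) (?P<micros>\d+)$'
)

def parse_duration(line):
    """Return (request line, duration in seconds), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    micros = int(match.group("micros"))
    return match.group("request"), micros / 1_000_000

# Hypothetical log line, invented for this example.
sample = (
    '203.0.113.10 - - [10/Oct/2015:13:55:36 -0700] '
    '"GET /index.html HTTP/1.1" 200 2326 1532871'
)

parsed = parse_duration(sample)
print(parsed)
```

Keeping the value in microseconds until the last step preserves the granularity that a whole-second field would lose.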

Because we are trying to deliver our content as fast as possible, the microsecond value is probably the number that we want to go with. Seconds are usually too coarse, and we generally don't get enough granularity out of that data. Once we have a solution that's actually extracting these fields, we don't necessarily want to look at the individual times of every single page or query or whatever it is that we're trying to deliver. Instead, we want to bucketize and categorize these durations into different segments that we can track and that we can create SLAs or SLOs around.

Once I've actually received my logs, I can sort them into each one of these buckets. Maybe this bucket is sub-second, maybe this other bucket is 1 to 5 seconds, and finally 5+ seconds. The logs are going to tell us the different types of content that have been delivered. This could be queries, it could be web pages, it could be any kind of content that the web servers might have. We obviously don't want to see our content falling into the 5+ category. This is something that we want to be alerted to, and now, when we peel it back, we can actually see that there might be some degradation at this point in the web application.
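The bucketing described above can be sketched in a few lines. The bucket labels, the choice to put exactly 5 seconds into the 5+ bucket, the sample durations, and the 10% alert threshold are all assumptions made for illustration; an actual SLA or SLO would set its own boundaries and thresholds.

```python
from collections import Counter

def bucket_for(seconds):
    """Map a duration in seconds to one of the three buckets from the talk."""
    if seconds < 1.0:
        return "sub-second"
    if seconds < 5.0:
        return "1-5s"
    return "5s+"  # assumption: exactly 5 seconds counts as 5+

def bucketize(durations_micros):
    """Count requests per bucket, given durations in microseconds (Apache %D)."""
    counts = Counter({"sub-second": 0, "1-5s": 0, "5s+": 0})
    for micros in durations_micros:
        counts[bucket_for(micros / 1_000_000)] += 1
    return counts

def slow_share(counts):
    """Fraction of requests that landed in the 5+ second bucket."""
    total = sum(counts.values())
    return counts["5s+"] / total if total else 0.0

# Hypothetical durations in microseconds, invented for this example.
durations = [250_000, 900_000, 1_500_000, 4_999_999, 5_000_000, 7_200_000]
counts = bucketize(durations)

ALERT_THRESHOLD = 0.10  # hypothetical SLO: at most 10% of requests in 5s+
should_alert = slow_share(counts) > ALERT_THRESHOLD
print(dict(counts), should_alert)
```

Tracking the share of requests in each bucket over time, rather than every individual duration, is what makes it practical to alert on and to attach an SLA or SLO to.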

Here we now understand who the users accessing this web application are and where the performance degradation might be. We have looked at the performance statistics of the router, the firewall, and the servers underneath that pool. We understand statistics like CPU and memory. We've dived into the sFlow, and that has helped us understand our top users, the applications, the destinations within the pool, even the ports. We've actually looked at the logs now, and we can understand the content that's being delivered, how fast or how slow it's being delivered, and the places where we might need to take mitigation actions.

SevOne does a really great job of integrating all of these different technologies together. It can help you reduce MTTR, and it can help you gain significant visibility into all of these different pieces of your network infrastructure and your applications. Really, it's all about gaining the most value from all of these different technologies, and you can do that if it's all in one platform.

If you'd like more information, you can visit today. Thanks.