Best Practices for Cutting MTTR in Half: Session 1


Join us as we discuss the best practices for cutting mean time to repair in half during this live demonstration. Downtime degrades the user experience and jeopardizes revenue, thereby causing widespread business effects and possible strategic failures. Learn specifically how SevOne not only reduces downtime, but additionally cuts repair costs.


Okay, everyone. Good morning. Thank you for joining us for today's demo on Cutting MTTR in Half. Just a few housekeeping items. If you have any questions throughout the presentation, please chat them to me, SevOne Marketing, and I'll be sure to work with Luke to make sure that your questions get answered. With that, I'm going to turn it over to Luke.

Hi, everybody. My name's Luke. Just a bit of an intro. This is a condensed series of "Demo with Dave" that we're running in the EMEA region, and I'll hopefully take you through some of the great things that SevOne can do to help organizations with mean time to repair, the MTTR. Without further ado, let's go through a few introduction slides, and then at some point, in about 5 or 10 minutes, I'll get into a demo in a live platform just to make things a little bit more real.

Here's a little introduction to SevOne for those who may never have seen these things before. Who is SevOne? In a nutshell, we are the industry-leading end-to-end performance management solution. Any technology that you can think of in an IT infrastructure, we can deliver performance management around.

The solution is delivered in an appliance, so everything you need to deliver performance management capability comes in a single appliance. Everything from the collection capability to the day-to-day content and everything else, it all comes in a single box, so it's taken care of by SevOne.

One of the key differentiators of SevOne from a technical standpoint is the SevOne cluster. This is what is really taking SevOne to the next level.

It lets us scale to very large networks without ever compromising the speed and performance of the solution, so the customer's view of performance is never compromised, whether you have an environment of a hundred devices or a hundred thousand devices. The SevOne cluster, as depicted here, is the reason why we never slow down.

I think you'll see evidence of that as we go through, and the speed at which the data is processed is very, very quick.

A recent survey revealed that the average reported incident in IT networking lasts eighty-six minutes, and with the various cost calculations around the impact on service and the resource time it takes to fix those incidents, the cost of that downtime is close to seven hundred thousand dollars. Now obviously, these numbers are alarming, but I think everybody on this call has probably been in a situation where an incident has caused chaos within an operator's environment, and it's with tools like SevOne that you can really start to get more dependability from the network and help to relieve these downtimes and ultimately the cost to the business.

SevOne to date. Some of the figures that we've got from our existing customers show that the visibility and the speed that SevOne offers have reduced the mean time to repair by forty-seven percent. These figures would have come from service management platforms and incident management feeds within these organizations, who put the timings around these things and calculate and track them.

There's me. If you see me at a trade show or something, you'll be able to recognize me in the future.

I've got two more slides, where we'll start to talk about where SevOne can add some value in this area. I thought it would be best to explain these things in slides rather than in the product, just because then I can put annotations around them and spend more time explaining them in a bit more detail.

The very first thing you look at with incident resolution in a network environment is how the incident gets detected, really. That's where performance management comes in, where SevOne comes in. It's looking to be a little bit more proactive about conditions or events that might happen within your network, and that's by looking at performance metrics over time and then triggering thresholds based on those metrics. Most performance management solutions out there will have what is on this slide now, which is a fairly standard threshold. It's a simple threshold where somebody, an administrator in effect, gives a value as the comparison, so in this case it's twenty percent over five minutes.

Now this is fine and it works. This kind of alert works okay for a very simple metric like CPU, but in actual terms this alert here is not amazingly useful. The reason for that is because I have had to choose and say that twenty percent is a good static threshold to set. What you get with SevOne is the ability to use compound rules in the threshold definitions. I can have multiple conditions. You can see here I've mapped A, B, C, and D, all the way down to as many conditions as I like, and then joined them with AND or OR.

This then gives me the ability to alert only on real, potentially service-impacting events. I can start to avoid alert storms and being overwhelmed by alerts.
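To make the compound rule idea concrete, here is a minimal sketch in Python. It is not SevOne's implementation or syntax: the metric names, limits and helper functions are all hypothetical, and the "over five minutes" duration part of a threshold is left out for brevity.

```python
from typing import Callable, Dict

Sample = Dict[str, float]
Condition = Callable[[Sample], bool]


def above(metric: str, limit: float) -> Condition:
    """Condition that is true when a metric exceeds a fixed limit."""
    return lambda sample: sample[metric] > limit


def all_of(*conditions: Condition) -> Condition:
    """AND: every condition must be true."""
    return lambda sample: all(c(sample) for c in conditions)


def any_of(*conditions: Condition) -> Condition:
    """OR: at least one condition must be true."""
    return lambda sample: any(c(sample) for c in conditions)


# Hypothetical conditions A, B and C on made-up metric names.
cond_a = above("cpu_pct", 20.0)
cond_b = above("mem_pct", 90.0)
cond_c = above("packet_loss_pct", 1.0)

# Alert only when A is true AND (B OR C) is true.
rule = all_of(cond_a, any_of(cond_b, cond_c))

quiet = {"cpu_pct": 35.0, "mem_pct": 40.0, "packet_loss_pct": 0.0}
noisy = {"cpu_pct": 35.0, "mem_pct": 95.0, "packet_loss_pct": 0.0}
print(rule(quiet), rule(noisy))
```

With a single static CPU threshold both samples would alert; with the compound rule only the second one does, which is the point about avoiding alert storms.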

The second great thing that SevOne does is we baseline all of our performance metrics. The baseline is a separate calculation which we do after the fact, with no administration overhead, and this allows us to understand what normal behavior is within your environment. It goes across every single performance counter that we collect.

As condition B shows in this slide, it then gives us the ability to use the learned analytics that we're applying to set your threshold alerts. So in this case I've got one that says alert me, or condition B is true, if octets go more than three standard deviations above the baseline, above normal. We can see some interesting visualizations for this one. Baselining is very, very powerful, and the way we actually calculate the baseline at SevOne is by breaking up the data into fifteen-minute chunks. We take all of your data points in that fifteen minutes and compare them to the same fifteen minutes on the same day of the week in the previous week, going back x number of weeks, ten by default, and as you can tell by that, it gives us time-of-day visibility in our baseline. We will know the difference between how busy your network is on a Sunday afternoon, when most of your employees are at home enjoying their weekend, as opposed to Monday morning at nine o'clock, when everybody gets to the office at the beginning of the week and logs on. We want to track that intelligently and then get a very good understanding of what normal is.
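The time-of-day baseline described here can be sketched roughly as follows. This is an illustration under assumptions, not SevOne's actual algorithm: it assumes the baseline is a plain average per fifteen-minute slot and weekday over the last ten weeks, and the function names and sample data are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

BUCKET_MINUTES = 15


def bucket_key(ts: datetime):
    """Return (weekday, slot-of-day) for the 15-minute bucket containing ts."""
    return ts.weekday(), (ts.hour * 60 + ts.minute) // BUCKET_MINUTES


def build_baseline(samples, now: datetime, weeks: int = 10):
    """Average the samples per (weekday, slot) over the last `weeks` weeks."""
    cutoff = now - timedelta(weeks=weeks)
    buckets = defaultdict(list)
    for ts, value in samples:
        if cutoff <= ts < now:
            buckets[bucket_key(ts)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical history: busy Monday mornings, quiet Sunday afternoons.
now = datetime(2024, 3, 11)  # a Monday at midnight
samples = []
for w in range(1, 11):
    monday_9am = now - timedelta(weeks=w) + timedelta(hours=9)
    sunday_2pm = monday_9am - timedelta(hours=19)
    samples.append((monday_9am, 80.0 + w))
    samples.append((sunday_2pm, 10.0))

baseline = build_baseline(samples, now)
print(baseline[bucket_key(datetime(2024, 3, 11, 9, 0))])   # Monday 09:00 slot
print(baseline[bucket_key(datetime(2024, 3, 10, 14, 0))])  # Sunday 14:00 slot
```

Because the key includes the weekday and the slot, the Monday 09:00 traffic is only ever compared with earlier Monday 09:00 traffic, never with a quiet Sunday.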

I'm going to go straight into looking at a report from SevOne. Before we go into the details of the report, it's always good to note how fast these reports render. Here you see we can mix and match different views of the data. The top one is an alerts view next to the graph, and you'll see some other very good interactions in a second. This particular report is one that's been saved. I could schedule this on a regular basis to be sent by PDF, I could run it in real time, I could have it on the network operations screen as my home page, my landing page for first-line support or second-line support or whatever. It's highly customizable by the user. Any number of reports can be created in SevOne, so you're not limited to just this one.

As we go through, what we're looking at here is our internet firewall, so the part of our own SevOne network that's accessing the outside world of the internet. This report has a number of graphs. The last one here is a secondary firewall, which is showing zero, as I'd expect. These two graphs are the primary. As we were talking about baselines, you can see on these two graphs we have a couple of views of the primary information. It's the same data, but the one on the left, the larger one, shows the dotted line as the baseline.

This is what I'm talking about. Normally at this time of day, what is it, five o'clock in the morning, we would have about fifty meg as normal behavior, but in this case, at this time of day, so last Friday, or sorry, today actually, we see a spike has taken this up to a multiple of that usage. Now if that was a concerning deviation from normal, I could set a threshold alert to detect it. It just means that we are able to track what is normal and only alert you on abnormal behavior, and that's where the proactive part comes in.

The similar graph on the right-hand side is a visual representation of standard deviations from baseline. The shaded area depicts a number of standard deviations away from our baseline, so within those limits perhaps there isn't anything to worry about. Limits around baselines can be a number away from the baseline, so x number of megabytes away from the baseline or whatever the unit is, a percentage of deviation from baseline, or a number of standard deviations from baseline, standard deviation being from the statistical bell curve that we all saw in a math lesson at some point in our life.
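As a rough illustration of the shaded band, the sketch below computes limits at k standard deviations either side of the average for one time slot and flags values outside them. The history values are hypothetical, and the use of a population standard deviation is an assumption; the product's exact calculation is not described in this session.

```python
from statistics import mean, pstdev


def deviation_band(history, k: float):
    """Lower and upper limits at k standard deviations around the average."""
    center = mean(history)
    spread = pstdev(history)
    return center - k * spread, center + k * spread


def is_abnormal(value: float, history, k: float = 3.0) -> bool:
    """True when value falls outside the band built from earlier weeks."""
    low, high = deviation_band(history, k)
    return value < low or value > high


# Hypothetical Mbps readings for the same slot in earlier weeks.
history = [48.0, 52.0, 50.0, 49.0, 51.0]

print(deviation_band(history, 3.0))
print(is_abnormal(53.0, history))   # inside the band, treated as normal
print(is_abnormal(120.0, history))  # far above the band
print(is_abnormal(10.0, history))   # far below the band
```

The same check catches a value that is unusually low, which matters for cases where an expected spike fails to appear.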

What I'm really able to do is catch only real critical or real potentially service-impacting events and not give us alert storms. You know, fifteen alerts when we only need one, or when we can just ignore it because it's normal for that time of day.

Another thing you'll notice on these graphs is that these actual indicators are being polled every sixty seconds. For those of you who are familiar with most tools, the default would normally be five minutes, as it is at SevOne as well. What SevOne can do over and above many other products on the market is actually poll down to one second if required. This ability of high-frequency polling, or micro-polling as we call it, means that we can get very granular visibility and spot micro-spikes much more efficiently than anybody else. And again, even if we are potentially polling this every ten seconds, the speed at which the report renders, and at which we process the data to produce an alert, is instant. It's always instant thanks to the appliance from SevOne.

As we go down, thinking about a real-life workflow: I've mentioned baselines, I've mentioned alerting and I've mentioned how this all helps you be very proactive, but in the real world, for example, I might have an alert which gets triggered. In this case I haven't created an alert for the sake of it, but say we did have an alert which this spike triggered. I could then drill down from the alert message, which would be in the window at the top, down through to the line graph, which then gives me more specifics on which time of day this happened, so I can see quite clearly that this little chunk of today has a bit of a spike in it. Now, concerned with that, I can choose a time period on my line graph and drill in. As you can see, everything happens very quickly indeed.

The next thing you want to know from a troubleshooting perspective is: okay, I've got a spike, which is fine, but what is making up the traffic in that spike? Why does that spike exist? Now, with the click of a button, and SevOne is very unique in this innovation, I can do what I call report chaining, which links the SNMP data graph to a NetFlow data graph. As you can see here, I've gone from seeing total traffic in that spike to now understanding who the top talkers are in that spike, and if I change the NetFlow view to give slightly more detailed information, I can go down to almost seeing the actual conversation that is going on behind this spike.

Here we see the red and the blue sections are my big spike at that time of day, and I can see very clearly the two culprits with the biggest conversations as the traffic pops back up. Not so one but this otherwise address, and then I'm putting in my score. That's a lot of data being grabbed by these two IP addresses, and if this had triggered an alert and it was causing performance degradation, I would know straight away where to begin my evaluation. It's this speed and this ability to drill down, to view all sorts of behavior from the high-level alert which gets triggered all the way down to which conversation is impacting my network or which conversation is causing my incident, that allows the mean time to repair numbers to come down, along with the time it takes to investigate.

As a quick sort of whistle-stop demo, that's all I have to show around the reports. One thing I will just touch on is that in a report like this, or a dashboard which has one, two, three, four, five graphs in it, you can dress these up any way you want. It's all there to open. Everything is drag and droppable; you can move things around, arrange them where you want, and save the result for use later.

Just to finish off talking about our reporting, another great thing at SevOne, which is obviously very important to the end user, is how you create one of these, one of these graphs. Does each graph have a little configuration-type box that goes with it? The simple answer is no. Everything you saw on the previous page is driven through this very intuitive and easy-to-use wizard, so I can choose any view of my data, or any of my devices or any of my server control, data center locations or applications to create one of these reports, and because of the speed, it's very easy either to do them on an ad hoc basis, when you need, as you need, or to create a few of these as standard reports. You can allow various different functions including organization.

Now the final thing I just want to say is about the use case that I went through in this demo and this report: we found a spike and we can see which conversation is behind it. This is a very real-world example of something that can happen, and I have customers, one in particular, that had almost this exact example, where there was a database backup happening every Sunday night at a particular time, and in the baseline calculation in SevOne there was always a big traffic spike on Sunday morning at one o'clock in the morning.

What actually happened was that one week this data backup didn't happen. This is customer data that's being backed up by this particular SevOne customer, and they caught it because of the threshold alerts they had set up. A baseline threshold said: if I go a certain amount above the baseline or below the baseline, then give me an alert. So two conditions joined by an OR statement.

In this case, the metrics dropped well below the baseline because their backup didn't happen, so essentially there was no spike. Using SevOne and this exact workflow, they very quickly figured out: we normally see this big spike, and the baseline shows us this big spike, so if it didn't happen we can surmise really quickly that the database backup didn't happen. That was hugely beneficial to them. It got to them first thing Monday morning that the backup had not run, and they got the database backup device fixed straight away before it became customer impacting. Again, that's mean time to repair, but it's also being that little bit more proactive, to detect something and fix something before it becomes an incident or before it becomes service impacting. Those are the kinds of use cases and examples where SevOne really allows customers to be very aware and have the visibility.

That's all I have in terms of our demonstration and if there are any questions then I will have a quick look and try and open the instant message chat window.

One question that has come in is about the data points. The question is, looking at the sort of SNMP data, what is the range of the sort of KPIs that we collect? That's a really good question actually, so let me touch on that.

As you can see, the example that I've shown you in this dashboard is an interface utilization metric, so we're looking at the firewall device for the internet and total bandwidth. This is gathered by SNMP, and SevOne will actually collect and poll any metric that belongs to an object ID in SNMP. What is also key in SevOne is that all of the reporting is done on demand. If I refresh this report now, it might go back to its original format.

Every time you run a graph or run a report, SevOne actually goes out and makes the database queries on demand to populate the report. As you can see, the latest polled data is instantly available. When you see things in SevOne, collateral of our rebuilt Niagara, that's what we mean, and what it means is that even though we're looking at total octets, this view, this graph, is absolutely one hundred percent valid for any metric in the system, be it CPU utilization, be it memory, be it jitter, be it packet loss, be it disk utilization or anything like that. SevOne can have that coverage and it can produce that data instantly all of the time.

To take it even a step further, SevOne also has the ability to create what we call synthetic indicators or calculation objects. By taking natively polled metrics like bandwidth utilization or disk utilization or whatever, and being able to then put some kind of mathematical formula around them, with maybe other metrics or just a formula, we can then start to calculate, track, graph and alert on custom KPIs that you want as well. For example, if you have a WAN pipe coming in from a local supplier which is variable in the amount of bandwidth that they provide to you, and you want to know how that compares to how much you are using, you can create a calculation that says: take the indicator for my incoming line bandwidth against what I'm using going out, and create me a percentage calculation. That can be my throughput percentage KPI, if you like.

Once a calculation object like that has been created, it behaves in exactly the same way as the natively polled indicators, so we track it, we baseline it, we store twelve months of historic data on it, and we ultimately allow you to alert on it and use these kinds of workflows on these custom KPIs as well. Very good question, and maybe I slightly expanded on it further. I hope it makes sense.
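The throughput percentage example amounts to a small formula over two polled values. A minimal sketch, assuming both values arrive in bits per second at each poll (the function name and the sample figures are hypothetical, and this is not how calculation objects are actually defined in the product):

```python
from typing import Optional


def throughput_pct(used_bps: float, available_bps: float) -> Optional[float]:
    """Bandwidth in use as a percentage of what the supplier is providing.

    Returns None when the provided bandwidth is zero or missing, so a bad
    poll does not turn into a division error or a misleading value.
    """
    if available_bps <= 0:
        return None
    return 100.0 * used_bps / available_bps


# Hypothetical polls: (bandwidth in use, bandwidth provided), in bits per second.
polls = [(25e6, 100e6), (40e6, 80e6), (10e6, 0.0)]
kpi_series = [throughput_pct(used, provided) for used, provided in polls]
print(kpi_series)
```

The resulting series can then be treated like any other indicator, which is what allows it to be baselined and alerted on.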

Any other questions from anybody?

Okay, so there's one more question that I've got which talks about sort of visualizations. Okay. Two more questions. Alert visualizations is what I'll cover first.

As you can see in this particular report we've been touching on, we have what you'd expect, which is the standard alerts table. What we also have in SevOne is a bunch of other views. Another useful one is this alert summary, so this is where we can group things in terms of the service or location or something like that, and it gives me the overall performance of that group. In this case I'm looking at everything, but as I drill down through everything and I can see this is red, I can then look in more and more detail at which components are causing this group to go red. This "everything" group could be, you know, service customer A for example, and then it drills down through the levels. I can look, from a much higher-level view, at which performance metrics are causing this damage or this potential impact.

That's another nice view, and here we actually have a metric of mean time to repair, which is just looking at the SevOne alerts and the average time between them being raised and them being cleared. That's a nice view of alerts. We also have a status map view as well, not something that I've prepared, but in the status map view we look at the same kind of thing, where we have various different colors of nodes and links that make up our alerts view.
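The mean time to repair figure described here, the average time between an alert being raised and being cleared, can be computed as in the sketch below. The alert records are made up, and skipping alerts that are still open is an assumption rather than something stated in the session.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

Alert = Tuple[datetime, Optional[datetime]]  # (raised, cleared or None if open)


def mean_time_to_repair(alerts: Iterable[Alert]) -> Optional[timedelta]:
    """Average raised-to-cleared duration over alerts that have been cleared."""
    durations = [cleared - raised for raised, cleared in alerts if cleared is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical alert history.
alerts = [
    (datetime(2024, 3, 11, 9, 0), datetime(2024, 3, 11, 9, 30)),    # 30 minutes
    (datetime(2024, 3, 11, 13, 0), datetime(2024, 3, 11, 14, 30)),  # 90 minutes
    (datetime(2024, 3, 11, 16, 0), None),                           # still open
]
print(mean_time_to_repair(alerts))
```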

The last question I have is again quite a good one actually, because I think I breezed over that very quickly. So, this graph that I looked at originally with the standard deviation from baseline: what does it mean to be x number of standard deviations away from normal? Standard deviation is a statistical measure.

What it means is it looks at lots of data over time and it gives you what would be a reasonable upper and lower limit for a baseline. We have our baseline because we've always got that covered. If we say that my metric is within two standard deviations of the baseline, that means we are within a reasonable limit above and below the baseline. There are actually six levels of standard deviation, one through six, six being the widest margin, if you like, away from normal and one being the closest margin away from normal.

The reason that was written into the product, and the reason it is very useful, is that it takes care of a very noisy or changeable metric. If I have an interface that's constantly going up and down, then a standard baseline threshold, a normal percentage-from-baseline threshold, might trigger fairly often. By setting an upper and lower limit based on standard deviation, I am effectively cancelling out the noise above and below the baseline, which is the shaded area here on the graph on top.

Hopefully that answered the question but really to cut a long story short, standard deviation helps you manage very noisy metrics or very noisy data that go up and down and are very, very changeable where the baseline might not always capture it, standard deviation will.

One more question that's come in, let me just read this in its entirety before I start to answer it.

This is a good question. The question is about baselines and how we can calculate them without creating new incidents. The default baseline calculation in SevOne takes a fifteen-minute average over ten weeks of history. We don't start alerting on that baseline until a minimum of ten weeks of data has been calculated for it. If you add a new metric tomorrow and you have a default baseline calculation that automatically applies to that new counter, you won't actually get alerts on it until ten weeks have gone by, because we don't think it's a reasonable baseline until that point.

We also have in the product a baseline manager, which allows you to reset baselines either on a schedule or ad hoc, so you can actually manage the baselines per interface or something. Where that becomes useful is if you have an interface that's been alive but not been busy for four weeks, sorry, ten weeks, and then suddenly you're going to put two hundred users through that interface. At that point you might want to reset the baseline and start the calculation again, because you think there's going to be a significant change and you want to start from that point.

Essentially we don't use the baselines until there's enough historic data to make it a valid baseline and that ties into the minimum, the historic week setting which by default is ten. I hope that answered the question.
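The rule of not alerting until enough history exists, plus the option to reset, can be pictured with a small sketch. The class and method names are hypothetical and this is only an illustration of the behavior described, not the baseline manager itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BaselineState:
    """Tracks when baseline history for one indicator started accumulating."""
    started: datetime
    min_weeks: int = 10  # the default historic weeks setting mentioned above

    def ready(self, now: datetime) -> bool:
        """Baseline alerts are only allowed once enough history exists."""
        return now - self.started >= timedelta(weeks=self.min_weeks)

    def reset(self, now: datetime) -> None:
        """Discard the old history window and start learning again from now."""
        self.started = now


state = BaselineState(started=datetime(2024, 1, 1))
print(state.ready(datetime(2024, 1, 1) + timedelta(weeks=9)))   # too early
print(state.ready(datetime(2024, 1, 1) + timedelta(weeks=10)))  # enough history

# Hypothetical reset on the day 200 users are moved onto the interface.
state.reset(datetime(2024, 4, 1))
print(state.ready(datetime(2024, 4, 2)))
```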

If there are any more? Doesn't look like there are in the chat.

I think that's it for questions for today. Thank you, everyone, for joining us. Our next demo with Luke will be December 20th. You can go to our website and find the registration page there. I hope everyone has a wonderful weekend, and again, thank you for joining us. If you have any additional questions, feel free to reach out to either Luke or myself separately after today. Thank you.

Thank you everyone. Bye.