Detecting Issues with Load Balancers


This SevOne video tutorial centers on the discussion and usage of load balancers. Gain an inside edge by learning about standard deployments, useful metrics, and troubleshooting details related to load balancers.


All right. Hopefully you are able to see my screen. I've got a nice little dashboard up that we've done specifically for the F5 system. I wanted to start off with a brief introduction of myself. My name is Bill, and I'm going to be essentially your Dave for the day. Dave is in fact out of town, and he asked me to step in and give this presentation on his behalf.

Today we're going to be talking about load balancers and basically what SevOne can do to help you get better visibility into whether or not they're doing their job properly. So, we're going to talk a little bit about some common deployments, useful metrics that we're pulling out of the system, ways to present that information, and then finally we're going to get into some troubleshooting tidbits that may be useful to you.

The first thing I want to talk about is the standard deployment for these load balancers. Our example for the day is from F5, which is one of the major brands of load balancers today, but essentially, most of them are going to work the same way. The way they work is you set up either a virtual IP address, or a virtual IP address plus port number, that is exposed to the outside world. Internally, the load balancer does a fairly good job, based on your choice of algorithm, of distributing the load that comes in on that virtual IP address and virtual port number and mapping it to the physical devices that are actually behind it. The idea is that it wants to keep the load roughly even, hence the name load balancing, between these physical devices through the virtual IP address.
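The mapping just described can be sketched in a few lines of Python. The VIP, pool addresses, and client addresses below are made up for illustration, and a real load balancer does far more (health checks, NAT, persistence), but the core round-robin idea looks like this:

```python
from itertools import cycle

# Hypothetical deployment: one virtual IP/port exposed to the outside
# world, fronting a pool of real servers behind it.
VIRTUAL = ("203.0.113.10", 443)
POOL = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

backend_picker = cycle(POOL)  # simplest balancing algorithm: round robin

def route(client_addr):
    """Map an incoming connection on the VIP to the next pool member."""
    return next(backend_picker)

# Six connections land evenly across the three real servers.
assignments = [route(("198.51.100.%d" % i, 50000 + i)) for i in range(6)]
print(assignments)
```

Each pass through the pool hands out every member once before repeating, which is what keeps the load roughly even.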

An interesting tidbit that usually comes with this deployment is that reaching the virtual IP addresses and the physical, the real, IP addresses doesn't usually happen from the same spot in the network. There's usually a layer of firewalls in between the load balancers and the actual equipment that allows only specific traffic through. So, normally we would get the chance to see either the virtual addresses or the real addresses, but not both. Typically, the real devices behind each virtual address are organized into pools. A pool is essentially a collection of real devices that will respond in a predictable way when messages are passed through. The load balancer, in many cases, does active checks on a regular basis to make sure that those participants are still there and responding.

It will actually remove participants from that pool any time there is an outage with one of those members. So, it automatically removes and adds members of the pool depending on the needs of that particular situation. You can look at the built-in checks, usually something simple like a TCP connectivity check or ICMP only, to the systems to ensure that items are being removed from and added to that pool as needed.
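A minimal sketch of that active-monitor idea, assuming a plain TCP connectivity check (the pool addresses are hypothetical; real load balancers offer richer monitors such as HTTP content checks):

```python
import socket

# Hypothetical pool members mapped to the port being health-checked.
POOL = {"10.0.0.11": 80, "10.0.0.12": 80}

def tcp_check(host, port, timeout=2.0):
    """Active monitor: a member is 'up' if a TCP connect succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def active_members(pool):
    """Members failing the check are dropped from rotation until they recover."""
    return [host for host, port in pool.items() if tcp_check(host, port)]

# Usage (would attempt real connections):
#   up = active_members(POOL)
```

Running `active_members` on each monitor interval is essentially what drives the automatic add/remove behavior described above.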

Let's talk about some of the basic metrics that we're presenting on the screen. If you look here, one of the things we're highlighting is the connections per second. What we're looking at is essentially the virtual server that's presented. What we've done is we've set up a couple of our own SevOne appliances behind this virtual F5, and essentially it's balancing HTTP and HTTPS, so port 80 and port 443. In this case what we're looking at is the number of open connections at any given time and the number of connections per second. This is kind of an interesting metric, right? And again, this is for the virtual address, so we're seeing essentially the combined connections that are coming through regardless of how they're being balanced. This tells us some interesting tidbits. It tells us that for the most part the traffic is fairly static, except we had some interesting drops right around 8:30pm last night.
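As an aside on how a rate like connections per second is typically derived: SNMP-style counters are monotonically increasing totals, so the monitoring system divides the delta between two polls by the poll interval. A tiny sketch with made-up numbers:

```python
def rate(prev_count, curr_count, interval_seconds):
    """Convert two samples of a monotonically increasing counter to a rate."""
    return (curr_count - prev_count) / interval_seconds

# Illustrative example: two polls 300 seconds apart, 9,000 new connections.
cps = rate(1_200_000, 1_209_000, 300)
print(cps)  # 30.0 connections per second
```

A sustained drop in this derived rate is exactly the kind of dip visible in the dashboard around 8:30pm.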

Now I believe, looking at this data, that the reason that happened is we actually put the load balancers under maintenance. Or actually not the load balancers, but the systems behind them, under maintenance. That would account for the availability drop that's occurring at that moment in time. Similarly we can see interesting tidbits here like the time for connections, the time for DNS lookups, the time for downloading. All of these are really useful metrics that give a good indication of how this appliance is actually working and the traffic patterns that are occurring. So, as an example, if we start to see that the number of connections per second is dropping while some of the other metrics over here, related to the time it takes to connect, are actually going up, we may realize that there is a problem with the appliance, or we can look a little deeper and see if there's anything behind it that's causing problems as well.

So, let's look for a second at these virtual server in and out metrics. You can see that there are input statistics. We're showing the amount of traffic that's flowing here, and you can see that at the same time we had the number of connections drop off, we also had a similar drop in the input and the output metrics. This is not unexpected, but for the most part this seems to be doing a fairly good job of balancing. The traffic pattern is fairly consistent; we notice that for the most part it's staying roughly the same amount in and out. So, whoever is requesting the traffic is following a fairly predictable pattern here.

I wanted to show one other piece in here that I found particularly interesting. So, we're going to look here for a second. What I've done is I've taken two pools in particular, the pools that represent the HTTP and HTTPS traffic, and I've subdivided them. Each of these systems has the notion of pools and entries, which represent the physical devices that are actually being load balanced. If I look closely here, you'll notice that I have the names of the systems, Sev Demo HA and Sev Demo 5, listed in here as the elements that are being load balanced. What I'm looking to do with these particular graphs is see how well the traffic is being balanced between these hosts. And again, this ties very closely to the type of algorithm that's being used to balance the traffic between these systems. Normally it's sort of a persistent-connection, round-robin approach. For web connections we want them to be somewhat persistent so that there's the notion of a session: as a user from the outside world makes subsequent connections to the same page, we want to tie them to the same member of the pool. Any new connection picks one of the entries out of the list and sticks to it. So what happened here is it turns out that the majority of the traffic is actually flowing to one member of the pool, so this may be an indication to us that we need to change the technique by which we're load balancing the system.
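The persistent round-robin behavior just described can be sketched as follows. This is a simplification under the assumption that persistence is keyed on client IP (real load balancers can also key on cookies or SSL session IDs); the member names come from the demo dashboard:

```python
from itertools import cycle

POOL = ["SevDemoHA", "SevDemo5"]  # the two members shown on the dashboard
picker = cycle(POOL)
sessions = {}  # client IP -> pinned pool member

def balance(client_ip):
    """Round robin for new clients; existing sessions stick to their member."""
    if client_ip not in sessions:
        sessions[client_ip] = next(picker)
    return sessions[client_ip]

# Repeat requests from one client always land on the same member,
# while a new client gets the next member in rotation.
a1 = balance("198.51.100.7")
a2 = balance("198.51.100.7")
b1 = balance("198.51.100.8")
print(a1, a2, b1)
```

This also shows how imbalance creeps in: if one persistent client happens to generate most of the traffic, its pinned member carries most of the load regardless of the rotation.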

Similarly, we've got a disproportionate amount of traffic flowing out of the system as well. We can see that Sev Demo HA, in this particular case, has 98% of the traffic going outbound, which tells me that something is really terribly wrong here. It means that we either have a small number of sessions, in which case the users are sending oodles and oodles of data and they're persistently tied to Sev Demo HA, so there's a lot of request-response traffic. Particularly if this is a reporting appliance, like it is in this particular case, most of the traffic will actually be outbound rather than inbound. You can see this in the absolute numbers here as well: the peak on the inbound side was 6K, the peak on the outbound side was 220K. So, this is not atypical for the traffic, but it may represent an extremely small sample size, meaning a very small number of sessions to the appliance, or it could be indicative of a larger problem, indicating that we need to go and change the way we're balancing the traffic between these systems.
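The per-member share is just each member's byte count over the pool total. The byte figures below are illustrative, chosen so one member carries roughly 98% of the outbound traffic as in the demo (only the 220K peak for Sev Demo HA is from the dashboard; the other value is a placeholder):

```python
def traffic_share(out_bytes):
    """Percent of outbound traffic handled by each pool member."""
    total = sum(out_bytes.values())
    return {member: round(100 * b / total, 1) for member, b in out_bytes.items()}

shares = traffic_share({"SevDemoHA": 220_000, "SevDemo5": 4_500})
print(shares)  # SevDemoHA carries ~98% of the outbound bytes
```

Watching this ratio over time is a quick way to decide whether the balancing algorithm needs to change.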

This is all information that we're collecting today. I did want to point out a couple other things that are of interest. In here I also put together, and again, apologies, because I don't have a ton of data on this since I just put it together this morning, a group availability report. In this group availability report, I'm showing the number of active members, which is the line in blue, and the number of actual members in the pool. So, this tells me, for that given pool, how many members are currently working as expected. Now, interestingly enough, one of our pools actually went offline last night. It didn't recover until about 7am this morning.

So, this is basically where that line is being drawn. So you see that we actually lost the HTTP and HTTPS members of the pool and then they recovered right around 9:17 in the morning.

This is particularly interesting for us because we can look and say from the F5 perspective, yeah, it's great, my website was always up, everything was doing exactly what it was supposed to do but it doesn't necessarily have a good sense of how close I am to any given failure unless I look at, essentially, how the members are doing at any given time. If I'm starting to see spikes and valleys in availability, it may be indicative that I have a system problem that I need to go address with the components of that pool.

This could be an extremely powerful tool to be able to look deeper into whether or not I'm starting to have availability problems that haven't necessarily manifested in something my consumers can see yet.

Moving on, there are also metrics that are specific to the device itself. For example, we're looking at some of the health statistics. This is a compute system like anything else. It has CPUs, which are responsible for detecting availability of the members of the pool; it's supposed to rewrite the packets as they come in so they get redirected to the right host, and it has to make calculations about which host to redirect them to. In turn, it's also got available memory and available disk. Right? These are all important aspects of whether or not this system is going to be doing what it's supposed to do. You can tell for the most part that this particular system is being very, very consistent in its behavior. I don't see any memory spikes, I don't see any troughs. The amount of disk space being used is very, very consistent. This is all indicative that this system is doing exactly what it's supposed to do, which is really what we want.

Now, I'm going to pass along a couple words of warning with respect to the F5 systems, and for that I'm going to give you an overly brief summary of SNMP. With SNMP, when you publish a MIB, normally that tends to be somewhat immutable, meaning that the elements of the MIB are fairly static and the indexing for the tables stays the same. You're allowed to make refinements to the MIB, but they're usually additive. Right? So, I can add new elements to the end of a table, I can add new tables that were not there before. While it's not expressly written this way, it's normally considered bad form to go change tables that have already been published. However, our dear friends at F5 apparently didn't read that rule. You'll find that moving between different versions of the firmware on the F5 systems, the table structures, and particularly their indexes, are dramatically different from one version to the next. Previously, things like the LTM pool, or I'm sorry, the LTM Nodes, which I'm displaying here, were indexed by essentially the version of the IP address and the IP address of the member of the pool. That was two indexes, all dotted decimal.

Unfortunately, with version 9 and above, they decided they wanted to change that to be an extremely long string that basically was this name/common/SevDemoHA. It's the name of the member of that pool. So, this tends to be pretty irritating when it comes to representing these statistics, especially if you have multiple versions of the F5s running out there. You need to maintain some sort of consistency between them and you need to be able to segment that based on the firmware version that's present. In order to do that for this particular demo I actually used keys based on the operating system to identify basically which way I wanted that to behave.
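One way to keep the two indexing schemes consistent is to normalize both into a single member identifier. This is only a sketch under the assumptions stated in the comments; the exact OID index layouts vary by firmware version, and these shapes simply mirror the two styles described in the talk:

```python
def parse_member_index(index_parts):
    """
    Normalize two hypothetical F5 MIB index styles into one identifier.
      pre-v9:  [ip_version, a, b, c, d]          -> dotted-decimal IP
      v9+:     ASCII codes of "/Common/<name>"   -> member name string
    The 5-element check is a rough heuristic for this sketch, not a
    guaranteed discriminator for real firmware.
    """
    if len(index_parts) == 5 and all(p < 256 for p in index_parts):
        # First sub-identifier is the IP version; the rest is the address.
        return ".".join(str(p) for p in index_parts[1:])
    # v9+ encodes the member name as a string of character codes.
    return "".join(chr(p) for p in index_parts)

old_style = parse_member_index([1, 10, 0, 0, 11])
new_style = parse_member_index([ord(c) for c in "/Common/SevDemoHA"])
print(old_style, new_style)
```

Keying the normalization on the detected firmware or operating system version, as done in the demo, avoids relying on a shape heuristic like the one above.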

Beyond that, there are other key metrics in here. One of the other things that you may want to look at moving forward is the number of virtual address pools that are present. Those tend to be very significant in telling whether or not the system is really close to its limit. You really want to watch out for those kinds of behaviors. And again, the redundancy here is really the basic functionality of the system. I don't want to belabor the point too much, but you can go in and add additional metrics. At this point I'm looking at in and out server bytes; I would also want to look at the number of sessions that are applicable for each one of them.

In fact, let me go ahead and add that just for the purpose of this particular demo.

So, again, we do have standard ICMP for the F5 device. I suspect that when you do deploy this configuration, the certification for the F5s, you're going to find that while we can discover the virtual address of the system, actually getting to it is going to prove a bit of a challenge depending on your placement. You can usually get to the systems that are being load balanced, but not necessarily the virtual address, due to firewall restrictions.

I think that covers everything that I had wanted to cover in this particular presentation. Again, F5 systems and A10 systems are certified and part of the product as we know it today. So, you'll find that most of these capabilities are already present and the data collection is already there. So, again just be careful on the version that you have and make sure that if you do run into problems with the data that's being presented, one of the main issues may be a disparity in the MIB definition.