There is an overabundance of products and services on the market geared toward monitoring availability. They range from large commercial products to cloud-based services and smaller open source products. Almost all of them use the same underlying monitoring techniques (HTTP, SNMP, and ICMP queries are perhaps the most common), yet they typically have different back ends (databases) and reporting front ends. One thing nearly all of them share is an overly simplified method of reporting “availability.”
Availability, in the broadest sense, is typically divided into three categories:
- Network Availability
- System Availability
- Application Availability
In the case of the first two items, network and system availability, the most common metric used to determine availability is an ICMP ping. Application availability is most commonly determined with an HTTP query. Network devices such as routers are also often measured using SNMP queries to determine the status of their interface components.
Though these are all useful techniques for measuring specific metrics, using any one of them alone does not, and cannot, provide a true representation of availability. Take, for example, a remote server running a network application that, for some reason, a remote user is unable to access. What does basic monitoring tell us? Let’s look at the possibilities:
- An HTTP query/web pull of the application fails. This may indicate that the application availability is 0%. However, the application owner insists that the application is running perfectly fine.
- An ICMP ping of the server indicates the server is unavailable. This indicates that the server is down. However, the system administrator, like the application owner, insists this is not the case; the server is running, yet system availability is reported as 0%.
- An ICMP ping of the switch that the server is connected to also fails. The LAN availability report is now negatively affected, although the LAN engineer confirms the switch is fine and all ports are fine.
- Further investigation reveals that the nearest router to the application has failed. Network availability is definitely impacted.
In this scenario the impact point is the router, yet all availability metrics are impacted. There are arguments to be made as to why all metrics should be impacted, but there are also arguments to be made as to why they should not.
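One way to argue that not every metric should be impacted is to take the dependency chain into account when interpreting probe results. The following is a minimal sketch of that idea, not a feature of any particular monitoring product. The `DEPENDS_ON` topology, the `classify` function, and the three status labels are hypothetical names chosen for illustration.

```python
from typing import Dict, Optional

# Hypothetical topology for the scenario above: each element maps to the
# element the monitoring station must pass through to reach it
# (None means it is reached directly).
DEPENDS_ON: Dict[str, Optional[str]] = {
    "router": None,
    "switch": "router",
    "server": "switch",
    "application": "server",
}


def classify(probe_ok: Dict[str, bool]) -> Dict[str, str]:
    """Label each element 'up', 'down', or 'unreachable'.

    A failed probe counts as 'down' only when everything upstream answered.
    If something upstream is not 'up', the element's real state is unknown,
    so it is labeled 'unreachable' instead of being charged with downtime.
    """
    status: Dict[str, str] = {}

    def resolve(name: str) -> str:
        if name in status:
            return status[name]
        parent = DEPENDS_ON[name]
        parent_status = resolve(parent) if parent is not None else "up"
        if parent_status != "up":
            status[name] = "unreachable"
        elif probe_ok[name]:
            status[name] = "up"
        else:
            status[name] = "down"
        return status[name]

    for name in DEPENDS_ON:
        resolve(name)
    return status


# Every probe fails, as in the scenario, but only the router is marked down.
all_failed = {name: False for name in DEPENDS_ON}
print(classify(all_failed))
```

With this interpretation the router is the only element recorded as down, while the switch, server, and application are recorded as unreachable. Whether unreachable time should count against an element's availability remains a decision for whoever defines the metric.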
This type of scenario is further compounded in a distributed architecture where applications are distributed, users are distributed, or both. A local issue preventing the users at one site from reaching a remote application would readily be attributed to "network availability/unavailability." However, if we turn the equation around and describe it as a local issue affecting the application's ability to receive remote requests, we may be more inclined to call it "application availability/unavailability."
The primary issue surrounding availability metrics and availability reporting is the precise definition of availability and the terminology associated with it. Defining availability must be left to the business and not to the monitoring solution, which can often supply only the tools necessary to provide the data. Monitoring tools do not do themselves any favors by providing generic reports with the name “availability” attached to them.
A more precise set of definitions from a monitoring-solution-centric viewpoint might do away with the term availability entirely in favor of terms like the following:
| Term | Definition |
|---|---|
| ICMP Reachability | The ability of a remote user, device, or service to reach the device in question with an ICMP ping. |
| Monitoring Reachability | The ability of a remote management system to monitor the device or service using the chosen monitoring method, such as SNMP, WMI, HTTP queries, and so forth. |
| Protocol Reachability | The ability of a remote management system to reach the application or service using the protocol that the application or service runs on. This would include TCP ports 80 and 443 for web applications, for instance. |
| Service Readiness | The time the device is ready to serve data, voice, or whatever service it is required to serve, regardless of any user’s ability to reach the device. This can often be derived by monitoring the system uptime of a device via SNMP. |
| Application Readiness | The ability of the service to actually respond to incoming requests. |
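As an illustration of how narrow one of these terms is, Protocol Reachability for a TCP-based service can be approximated by simply attempting a TCP connection to the service port. The sketch below uses only Python's standard `socket` module. The function name `protocol_reachable` and the default timeout are assumptions made for this example.

```python
import socket


def protocol_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This measures Protocol Reachability only. A successful connection says
    nothing about whether the application behind the port answers requests
    correctly, which is what Application Readiness is meant to capture.
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A call such as `protocol_reachable("www.example.com", 443)` (a placeholder host) would then feed the Protocol Reachability figure alone, leaving the other four terms to their own measurements.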
If we take these five terms and begin to use them in place of the more generic availability terms, we’ll be able to generate a more meaningful Service Availability metric.
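To make the idea concrete, the sketch below records the five measurements separately for each polling interval and reports one percentage per term instead of a single "availability" number. The `Sample` class, its field names, and the `percentages` function are hypothetical, and the sample data is invented purely to show the arithmetic. It also assumes that readiness during an outage can be established after the fact, for example from system uptime.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class Sample:
    """One polling interval, with each of the five terms recorded separately."""
    icmp_reachable: bool
    monitoring_reachable: bool
    protocol_reachable: bool
    service_ready: bool
    application_ready: bool


def percentages(samples: List[Sample]) -> Dict[str, float]:
    """Return the percentage of intervals in which each term was satisfied."""
    if not samples:
        raise ValueError("at least one sample is required")
    return {
        f.name: 100.0 * sum(getattr(s, f.name) for s in samples) / len(samples)
        for f in fields(Sample)
    }


# Invented data: four intervals, one of which falls during a router failure.
# The server and application kept running, so only reachability suffers.
healthy = Sample(True, True, True, True, True)
router_outage = Sample(False, False, False, True, True)
report = percentages([healthy, healthy, router_outage, healthy])
for term, value in report.items():
    print(f"{term}: {value:.1f}%")
```

In this invented example the three reachability terms come out at 75.0% while both readiness terms stay at 100.0%, which separates the router failure from the health of the server and the application. How these figures are combined into a single Service Availability number is left to the business definition.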