General Notes About Uptime Checking and Monitoring
Why bother checking uptime? Aren't modern hardware and software already highly reliable? Yes, that's generally true for each individual item. However, two other factors make frequent checking highly desirable.
- Equipment Aggregation: While each item in a network system may be very reliable, the system as a whole combines a great many individual items to deliver its function. Because of this aggregation, the reliability of the function provided by the entire system is much lower than that of any single element in it.
- Human Intervention: Many studies have shown that the largest impact on the reliability of a system comes from human intervention. Today's networks, including the Internet itself, are constantly in flux: hardware and software are upgraded, replaced and expanded; operators adapt to external factors such as legislative and competitive changes and changes in standards; and staff are trained, retrained and newly hired, among other influences. Taken together, these actions are more likely to cause a system outage than the simple failure of any one item, as several recent high-profile outages have illustrated.
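The aggregation effect can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes the function needs every item to be up and that items fail independently, which real networks only approximate. The function name and the figures (50 items at 99.9% each) are hypothetical.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a function that needs every listed item to be up.

    Assumes independent failures, which is a simplification.
    """
    return prod(availabilities)


# Hypothetical example: 50 items, each up 99.9% of the time.
items = [0.999] * 50
print(f"{series_availability(items):.4f}")  # prints 0.9512
```

Under these assumptions, fifty items that are each up 99.9% of the time yield a function that is up only about 95% of the time.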
Uptime vs. Functional Checking
Why just check uptime? Wouldn't it be better to check more aspects of the functionality? Yes, that is possible and is often done, but there are tradeoffs.
- Delay and Latency: Checking the details of the functionality inevitably takes longer than just checking uptime. The overall up or down state of the aggregate function is the most critical factor because it affects all users at once, so plain uptime is the one alert you should hear about in the fastest, clearest way.
- Systems Loading: Checking the details of the functionality inevitably loads both the network and the targets more heavily, since more data is exchanged purely for monitoring. This loading reduces your systems' capacity for their revenue-producing functions.
- Complexity: Checking the details of the functionality inevitably involves more complexity than a simple uptime check. This complexity increases the risk of false, ambiguous, misleading or incomplete signals, which are best kept cleanly separated from uptime monitoring. Focusing purely on a fast, simple uptime check keeps this kind of miscommunication from muddying the waters.
- Service Level Agreements: SLAs for uptime or downtime are common. Compliance is demonstrated most cleanly and clearly by a simple, easy-to-use, widely available, well-accepted and ultra-reliable application focused exclusively on uptime. At times it is important that both sides of the agreement can easily obtain the same results by running a simple app that is freely available to all.
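To show why a single, unambiguous uptime figure suits an SLA, the following sketch converts an uptime target into the downtime it permits. It is a minimal illustration only: the function name, the 30-day period and the example targets are assumptions, not terms taken from any particular agreement.

```python
def downtime_budget_minutes(uptime_percent, period_days=30):
    """Minutes of downtime permitted over the period at the given uptime target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# Hypothetical targets; a real SLA defines its own figures and period.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows {budget:.1f} minutes down per 30 days")
```

With a budget of only a few minutes per month at the higher targets, both parties benefit from measuring the same simple up or down state in the same way.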
This section will discuss the effects of network latency (delays) on the choice of monitoring configuration parameters.
Alert Ack versus Alert Slowdown
This section will discuss the tradeoffs between the alert ack and alert slowdown mechanisms.