Of course, one of the top priorities for any site manager is to ensure that the resource runs smoothly. Still, occasional site crashes are inevitable, and what matters is spotting and fixing the problem in time.
Obviously, nobody can monitor a site's performance around the clock. Moreover, the resource may be unavailable only in a particular region, in which case the manager will never notice it at all.
HostTracker, a service that monitors site availability, is designed to solve exactly these problems. It records site outages, analyzes the problem, and sends an alert to the administrator or to the resource's management.
It is obvious that nobody needs false alarms, and "better safe than sorry" is not the best strategy in this case. That is why the service must be extremely accurate and measured when assessing problems.
HostTracker therefore has several critical tasks: to detect problems and notify the customer in time, to avoid false alarms, and to calculate uptime under both best-case and worst-case scenarios.
So how does the service detect and log an actual outage?
And what are the best-case and worst-case scenarios?
As soon as a customer adds a website, the system starts sending requests to it at a fixed interval, anywhere from one minute to one hour. The checks are performed from independent servers scattered all over the world, providing geographically distributed monitoring. At the moment there are more than fifty such servers, and a specific agent is chosen at random for each check.
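The scheduling logic described above can be sketched as follows. This is a minimal illustration, not HostTracker's actual code: the agent names, the pool, and the clamping of the check interval to the stated one-minute-to-one-hour range are all assumptions for the example.

```python
import random

# Hypothetical pool of geographically distributed monitoring agents
# (the article says there are more than fifty in reality).
AGENTS = ["agent-" + region for region in
          ("us-east", "eu-west", "ap-south", "sa-east", "af-north")]

def clamp_interval(requested_seconds):
    """Keep the customer-chosen check interval within 1 minute .. 1 hour."""
    return min(max(requested_seconds, 60), 3600)

def pick_agent(pool=AGENTS):
    """Randomly select one agent to perform the next scheduled check."""
    return random.choice(pool)
```

Random agent selection spreads the load across the pool and means consecutive checks naturally come from different vantage points.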
If a check returns an error, the test is repeated from five to seven more independent agents. If the majority of them confirm the problem, the resource is considered down. If the other agents detect no problems, the failure is attributed to a local issue on that particular agent.
The same algorithm is applied to determine when the site is back up. This virtually eliminates false alarms and protects the peace of mind of the service's customers: a resource is declared unavailable only after multiple checks at a certain interval.
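The majority-vote confirmation step can be expressed as a small function. Again this is a sketch under the assumptions stated in the text (an initial error triggers a re-check from five to seven other agents, and a simple majority decides), not the service's real implementation:

```python
def confirm_outage(initial_check_failed, recheck_results):
    """Decide whether a site is really down.

    initial_check_failed: True if the randomly chosen agent got an error.
    recheck_results: list of booleans from 5-7 independent agents,
                     True meaning the agent reached the site successfully.

    An outage is declared only when a majority of the re-checking agents
    also fail; a lone failure is treated as a local problem on that agent.
    """
    if not initial_check_failed:
        return False
    failures = sum(1 for ok in recheck_results if not ok)
    return failures > len(recheck_results) / 2
```

Because recovery detection uses the same rule, a site is also not declared "up" again on the word of a single agent.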
Of course, it is impossible to guarantee one hundred percent what state the site was in between checks. Most likely, the site went down somewhere within the interval between the last successful check and the one that returned an error. Likewise, once recovery begins after an error, the resource may already have been working between checks. This assumption is the basis for the optimistic uptime calculation, while the assumption that the site was down for the entire interval between checks is the starting point for the pessimistic one.
The optimistic scenario is used for statistical analysis, while client notifications are based on the pessimistic scenario.
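The two bounds can be computed from the check log itself. The sketch below is an illustration of the idea described above, with assumed data shapes: each check is a (timestamp, ok) pair, the optimistic bound counts downtime only between the first and last failing checks of an outage, and the pessimistic bound extends it to the neighbouring successful checks.

```python
def uptime_bounds(checks, period_seconds):
    """Return (optimistic_uptime_pct, pessimistic_uptime_pct).

    checks: chronologically sorted list of (timestamp_seconds, ok) pairs
            collected over a window of period_seconds.
    """
    opt_down = 0.0   # site assumed up between checks where possible
    pess_down = 0.0  # site assumed down for whole intervals around errors
    i, n = 0, len(checks)
    while i < n:
        if not checks[i][1]:
            # Find the last check of this run of consecutive failures.
            j = i
            while j + 1 < n and not checks[j + 1][1]:
                j += 1
            # Optimistic: downtime spans only the failing checks themselves.
            opt_down += checks[j][0] - checks[i][0]
            # Pessimistic: extend from the last good check to the next one.
            start = checks[i - 1][0] if i > 0 else checks[i][0]
            end = checks[j + 1][0] if j + 1 < n else checks[j][0]
            pess_down += end - start
            i = j + 1
        else:
            i += 1
    return (100 * (1 - opt_down / period_seconds),
            100 * (1 - pess_down / period_seconds))
```

For checks every minute with two consecutive failures, the optimistic estimate charges one minute of downtime, while the pessimistic one charges the full three-minute span from the last success before the outage to the first success after it.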
Thus, by calculating both variants and performing careful, comprehensive monitoring, the service gives the customer timely notifications only about real problems, along with a complete and reliable picture of what is happening.