Basic Principles of monitoring with Check_MK
Last updated: March 18, 2016
Until now we have been concerned with the installation and implementation of Check_MK. Now it is time to begin explaining the basic concepts and definitions of monitoring (with Check_MK), prior to immersing ourselves in its technical details. In this article terms such as states, events, alarms, notifications, downtimes, acknowledgements, hosts, services, checks and many more will be covered.
Check_MK adopted its structure from Nagios and is compatible with
Nagios in many ways. The
1. States and events
It is important to understand the basic difference between states and events, because it has very practical consequences. Most classic IT monitoring systems revolve around events. An event is something that occurs once at a particular point in time. A good example would be a SCSI timeout when accessing drive X. Typical sources of events are syslog messages, SNMP traps, the Windows Event Log, and log file entries. Events occur more or less spontaneously (self-generated, asynchronous).
In contrast, a state describes a sustained situation, e.g. drive X is online. In order to observe the state of something, the monitoring system must poll it regularly. As the example shows, in monitoring it is often possible to choose whether to work with events or with states.
The Check_MK Monitoring System can accommodate both states and events, but where the choice is available you should always prefer state-based monitoring. The reason for this lies in the numerous advantages of this method. Some of these are:
It is fair to say that state-based monitoring is the norm in Check_MK. For the processing of events, the Check_MK Event Console is also available. It is specialised in the correlation and evaluation of large numbers of events and is seamlessly integrated into the Check_MK platform.
2. Hosts and services
Everything in Check_MK revolves around hosts and services. A host can be many things, e.g.:
In monitoring a host always has one of the following states:
Alongside the state, a host has a number of other attributes that can be configured by the user, e.g.:
In order for the monitoring to be able to determine the UNREACH state, it must know via which path every individual host can be reached. For this purpose one or more so-called parent hosts may be specified for every host. If, e.g., server A as seen from the monitoring is only accessible via router B, then B is a parent of A. Only direct parents are configured. This results in a tree-like structure with the monitoring server at its centre:
In this example, if the host switch-intern is DOWN, the monitoring automatically assumes that a possible failure of tauschzone.de can be explained simply by the host no longer being reachable from the monitoring. Whether it really has failed cannot be determined, so it will be classified as UNREACH. The monitoring of tauschzone.de will nevertheless continue! If this host answers, it will be shown as UP in any case.
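The parent logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Check_MK's implementation: the function name `classify_hosts`, the dictionary layout and the assumption that switch-intern is a direct parent of tauschzone.de are all hypothetical.

```python
from typing import Dict, List, Set


def classify_hosts(parents: Dict[str, List[str]], responding: Set[str]) -> Dict[str, str]:
    """Return "UP", "DOWN" or "UNREACH" for every host.

    parents maps each host to its direct parents; an empty list means the
    host is reached directly from the monitoring server. The parent
    relationships are assumed to be free of cycles.
    """
    states: Dict[str, str] = {}

    def state_of(host: str) -> str:
        if host in states:
            return states[host]
        if host in responding:
            # A host that answers is UP, whatever its parents look like.
            states[host] = "UP"
        else:
            host_parents = parents.get(host, [])
            # Without parents, or with at least one parent UP, the path to
            # the host is intact, so the failure is attributed to the host.
            if not host_parents or any(state_of(p) == "UP" for p in host_parents):
                states[host] = "DOWN"
            else:
                states[host] = "UNREACH"
        return states[host]

    for host in parents:
        state_of(host)
    return states


# Hypothetical topology: tauschzone.de is reached via switch-intern.
topology = {"switch-intern": [], "tauschzone.de": ["switch-intern"]}
print(classify_hosts(topology, responding=set()))
```

With no host answering, switch-intern is reported as DOWN and tauschzone.de as UNREACH. As soon as tauschzone.de is in the `responding` set, it is reported as UP regardless of its parent.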
A host has a number of services. A service can be anything - please don't confuse this with services in Windows. A service is any part or aspect of the host that can be OK or not OK. Naturally, its state can only be determined if the host is UP.
A service being monitored can have the following states:
When determining which condition is 'worse', Check_MK utilises the following sequence:
3. Host and Service Groups
Hosts and services can be grouped to provide a better overview. A host or service can be in more than one group. These groups are purely optional and not required for the configuration. Host groups can be useful when an additional grouping is desired alongside the folder structure in which the hosts are managed. If, for example, you have built a folder structure by geographic location, it could be useful to have a host group Linux-Server that lists all Linux servers regardless of their location.
4. Contacts and contact groups
Contacts and contact groups offer the possibility of assigning persons to hosts
and services. A contact corresponds to a user name in the web interface. The
assignment to hosts and services does not occur directly, however, but rather
via contact groups. Firstly, a contact (e.g. harri) is assigned
to a contact group (e.g.
These assignments are useful for a number of reasons:
By the way - the user omdadmin (cmkadmin from version 1.4.0 onwards), who is created automatically together with an instance, is always permitted to view all hosts and services, even when not a contact for them. This is determined by this user's administrator role.
5. Users and roles
Whereas the persons who are responsible or authorised for a particular host or service are defined through contacts and contact groups, their privileges are controlled via roles. Check_MK is supplied with three roles from which further roles can be later derived. Each role defines a series of rights which may be customised. The standard roles have the following meanings:
6. Problems, alarms and notifications
6.1. Handled and unhandled problems
Check_MK identifies every host that is not UP, and every service that is not OK as a problem. A problem can have two states: unhandled and handled. The procedure is that a new problem is first treated as unhandled. As soon as someone confirms (acknowledges) the problem it is then flagged as handled. It can also be said that unhandled problems are those which nobody has attended to. The tactical overview in the sidebar therefore differentiates the two types of problems:
By the way: service problems on hosts that are currently not UP are not counted as problems.
Further details about acknowledgements can be found in their own article.
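The distinction between handled and unhandled problems can be expressed as a small decision function. The following is a minimal sketch; the `Service` class and its field names are assumptions made for illustration and do not reflect Check_MK's internal data model.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    state: str                  # "OK", "WARN", "CRIT" or "UNKNOWN"
    host_state: str             # "UP", "DOWN" or "UNREACH"
    acknowledged: bool = False  # set once someone has confirmed the problem


def problem_status(service: Service) -> str:
    """Classify a service as "no problem", "unhandled" or "handled"."""
    # Services on hosts that are not UP are not counted as problems.
    if service.state == "OK" or service.host_state != "UP":
        return "no problem"
    return "handled" if service.acknowledged else "unhandled"


print(problem_status(Service("Filesystem /", "CRIT", "UP")))
print(problem_status(Service("Filesystem /", "CRIT", "UP", acknowledged=True)))
print(problem_status(Service("Filesystem /", "CRIT", "DOWN")))
```

The three calls cover a new problem, an acknowledged problem, and a service whose host is itself not UP.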
6.2. Alerts and notifications
When the state of a host or service changes (e.g. from OK to CRIT), Check_MK registers an alert. Such an alert may or may not generate a notification. By default, Check_MK is set up so that whenever a host or service has a problem, an email is sent to the object's contacts (please note that omdadmin/cmkadmin is, by default, not a contact for any object). This behaviour can be customised very flexibly, however. Whether a notification is sent also depends on a number of conditions. It is simplest to look at the cases in which notifications are not sent. Notifications are suppressed ...
If none of these prerequisites for suppressing notifications is satisfied, the monitoring core creates a notification, which in a second step passes through a chain of rules. In these rules you can define further exclusion criteria, and decide who should be notified and in what form (email, SMS, etc.).
All particulars concerning alerts can be found in their own article.
6.3. Flapping hosts and services
It sometimes happens that a service changes its state continuously and quickly. In order to avoid continuous notifications, Check_MK switches such a service into the flapping state, which is indicated by its own symbol. When a service starts flapping, a notification is generated which informs the user of the change, and further notifications are silenced. After a suitable time, if no further rapid changes occur and a final (good or bad) state is evident, the flapping state is cleared and normal notifications resume.
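The text does not describe how flapping is actually detected, so the following is only a simplified sketch of the idea: count the state changes within a sliding window of recent check results, and use two thresholds so that the flapping state does not itself flap. The class name, the window size and both thresholds are hypothetical values chosen for illustration.

```python
from collections import deque


class FlapDetector:
    """Simplified flap detection over a sliding window of check results."""

    def __init__(self, window: int = 20, high: float = 0.5, low: float = 0.25):
        self.window = window          # number of state transitions considered
        self.high = high              # change ratio at which flapping starts
        self.low = low                # change ratio at which flapping ends
        self.history = deque(maxlen=window + 1)
        self.flapping = False

    def record(self, state: str) -> bool:
        """Store a new check result and return the current flapping flag."""
        self.history.append(state)
        states = list(self.history)
        # Count how often consecutive results differ from each other.
        changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
        ratio = changes / self.window
        if not self.flapping and ratio >= self.high:
            self.flapping = True
        elif self.flapping and ratio <= self.low:
            self.flapping = False
        return self.flapping


detector = FlapDetector(window=4)
for result in ["OK", "CRIT", "OK", "OK", "OK", "OK"]:
    print(result, detector.record(result))
```

Using separate start and stop thresholds means a service has to stay stable for a while before normal notification handling resumes.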
6.4. Scheduled downtimes
If you perform maintenance work on a server, device or software, you will normally want to avoid potential problem notifications during this time. In addition, you will probably want to advise your colleagues that problems appearing in monitoring during this time may be temporarily ignored.
For this purpose you can set a scheduled downtime on a host or service. This can be done directly before starting the work, or in advance. Scheduled downtimes are illustrated by the symbols:
While a host or service has a scheduled downtime:
Additionally, when you wish to later document statistics on the availability of hosts and services it is a good idea to include scheduled downtimes. These can be factored into later availability evaluations.
7. Timeperiods
Timeperiods define regular, weekly recurring time ranges that are used in various places in the monitoring configuration. A typical timeperiod could be called workhours and could contain the time from 8:00 to 17:00 on all weekdays except Saturday and Sunday. The period 24X7 simply includes all times and is predefined. Timeperiods can also include exceptions for particular calendar days - e.g. Bavarian public holidays.
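To make the idea concrete, here is a minimal Python sketch of a weekly timeperiod with calendar-day exceptions. The data layout, the function name `in_timeperiod` and the chosen holiday are assumptions for illustration only; Check_MK stores and evaluates timeperiods in its own way.

```python
from datetime import date, datetime, time

# Weekly definition: weekday number (0 = Monday) -> list of (start, end) ranges.
WORKHOURS = {day: [(time(8, 0), time(17, 0))] for day in range(5)}

# Exceptions replace the weekly definition on specific calendar days.
# An empty list means the timeperiod is not active on that day at all.
EXCEPTIONS = {date(2016, 12, 26): []}  # illustrative public holiday


def in_timeperiod(moment: datetime, weekly: dict, exceptions: dict) -> bool:
    """Return True if the given moment lies inside the timeperiod."""
    day = moment.date()
    if day in exceptions:
        ranges = exceptions[day]
    else:
        ranges = weekly.get(moment.weekday(), [])
    # The end of a range is treated as exclusive.
    return any(start <= moment.time() < end for start, end in ranges)


print(in_timeperiod(datetime(2016, 3, 18, 9, 0), WORKHOURS, EXCEPTIONS))   # a Friday morning
print(in_timeperiod(datetime(2016, 3, 19, 9, 0), WORKHOURS, EXCEPTIONS))   # a Saturday
print(in_timeperiod(datetime(2016, 12, 26, 9, 0), WORKHOURS, EXCEPTIONS))  # the exception day
```

A timeperiod such as 24X7 would, in this layout, simply map every weekday to one range covering the whole day.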
Some important situations which use time periods are:
8. Check interval, check attempts and check period
In state-based monitoring checks are executed at fixed intervals. Check_MK uses one minute as its default, so every check is performed once per minute. This can be altered in the configuration:
By defining a check period other than 24X7, the execution of checks can be suspended during specified time frames. The service's state will then no longer be updated, and the service will be flagged as stale.
In combination with a long check interval one can ensure that a check is performed once per day at a specified time. If you set an interval of e.g. 24 hours and the check period to 02:00 - 02:01 on every day (only one minute per day), then Check_MK will ensure that the check really is executed within this short time frame.
With the aid of max check attempts you can avoid notifications in the case of sporadic errors. In this way you are effectively making a check less sensitive. If the check attempts are set to e.g. 3 and the corresponding service becomes CRIT, then initially no notification will be generated. Only if the next two checks also produce a result other than OK will the number of current attempts increase to 3 and a notification be sent.
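The effect of max check attempts can be sketched as a simple counter. This is a hedged illustration rather than the monitoring core's actual logic: the class name `AttemptCounter` is hypothetical, and recovery notifications are deliberately left out.

```python
class AttemptCounter:
    """Delay the notification until a problem has persisted long enough."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self.attempt = 0        # consecutive non-OK results seen so far
        self.notified = False   # whether this problem was already reported

    def process(self, state: str) -> bool:
        """Return True if this check result should trigger a notification."""
        if state == "OK":
            # An OK result resets the counter; sporadic errors are forgotten.
            self.attempt = 0
            self.notified = False
            return False
        self.attempt = min(self.attempt + 1, self.max_attempts)
        if self.attempt == self.max_attempts and not self.notified:
            self.notified = True
            return True
        return False


counter = AttemptCounter(max_attempts=3)
for result in ["CRIT", "CRIT", "CRIT", "CRIT", "OK", "CRIT", "OK"]:
    print(result, counter.process(result))
```

In this run only the third consecutive CRIT triggers a notification. The single CRIT after the recovery never reaches three attempts and therefore stays silent.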
9. Passive Checks
If you look at the Check_MK interface you can see that for some services a green double-arrow is shown, but a grey arrow for most others. The services with the green arrow are active checks. These are executed by Check_MK directly. Services with a grey arrow are passive checks, whose results are determined by the active check Check_MK. This is done for performance reasons and illustrates a special feature of Check_MK:
So that the target system (server, network device, etc.) does not have to be contacted anew for every single service, Check_MK collects all important data in one pass once per interval. From this data it calculates new results for all passive checks in a single action. This conserves CPU resources on both systems and is an important factor in Check_MK's high performance and scalability.
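The principle of one data collection feeding many passive checks can be sketched as follows. All names, values and thresholds here are hypothetical; in particular `fetch_agent_data` is a stand-in that returns fixed sample data instead of contacting a real host.

```python
def fetch_agent_data(host: str) -> dict:
    """Hypothetical stand-in for the single data collection per interval."""
    return {"cpu_load": 0.7, "disk_used_percent": 93.0, "mem_used_percent": 40.0}


def threshold_check(value: float, warn: float, crit: float) -> str:
    """Map a measured value to a service state using two thresholds."""
    if value >= crit:
        return "CRIT"
    if value >= warn:
        return "WARN"
    return "OK"


# Each passive check only evaluates data that has already been collected.
PASSIVE_CHECKS = {
    "CPU load": lambda data: threshold_check(data["cpu_load"], 4.0, 8.0),
    "Filesystem /": lambda data: threshold_check(data["disk_used_percent"], 80.0, 90.0),
    "Memory": lambda data: threshold_check(data["mem_used_percent"], 80.0, 90.0),
}


def run_active_check(host: str) -> dict:
    """Contact the host once and derive the results of all passive checks."""
    data = fetch_agent_data(host)
    return {name: check(data) for name, check in PASSIVE_CHECKS.items()}


print(run_active_check("myserver"))
```

However many passive checks are added to the dictionary, the host is still contacted only once per call.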
10. Overview of the most important host and service icons
The following table provides a short overview of the most important status icons appearing beside hosts and services: