Basic Principles of monitoring with Check_MK

Last updated: March 18, 2016


Until now we have been concerned with the installation and implementation of Check_MK. Now it is time to begin explaining the basic concepts and definitions of monitoring (with Check_MK), prior to immersing ourselves in its technical details. In this article terms such as states, events, alarms, notifications, downtimes, acknowledgements, hosts, services, checks and many more will be covered.

Check_MK adopted its structure from Nagios and is compatible with Nagios in many ways. The Check_MK Raw Edition is still built upon the proven Nagios core. The Check_MK Enterprise Edition, however, uses its own, completely newly developed core: the Check_MK micro core. The Enterprise Edition can also be switched over to the Nagios core. This handbook nevertheless requires no previous knowledge of Nagios or related systems.

1. States and events

It is important to understand the basic difference between states and events, because it has very practical consequences. Most classic IT monitoring systems revolve around events. An event is something that occurs once at a particular time. A good example would be a SCSI timeout when accessing drive X. Typical sources of events are syslog messages, SNMP traps, the Windows Event Log, and log file entries. Events occur spontaneously and asynchronously.

In contrast a state describes a sustained situation, e.g. drive X is online. In order to observe the state of something, the monitoring system must regularly poll it. As the example shows, in monitoring it is often possible to choose to work with events or with states.

The Check_MK Monitoring System can accommodate both states and events, but, where the choice is available, it always prioritizes state-based monitoring. The reason for this lies in the numerous advantages of this method. Some of these are:

  • A failure in the monitoring itself will be immediately recognised.
  • Regular checking in a fixed time-frame enables the capturing of performance data.
  • Check_MK itself can control the rate at which states are polled. There is no risk of an event storm in global error situations.
  • Even in chaotic situations - e.g. a power failure in a computer centre - one always has a reliable overall status.

State-based monitoring is thus the norm in Check_MK. For the processing of events, the Check_MK Event Console is also available. It specialises in the correlation and evaluation of large numbers of events and is seamlessly integrated into the Check_MK platform.

2. Hosts and services

2.1. Hosts

Everything in Check_MK revolves around hosts and services. A host can be many things, e.g.:

  • A server
  • A network device (switch, router, load balancer)
  • A measuring device with an IP connection (thermometer, hydrometer)
  • Anything else with an IP address
  • A cluster of several hosts

In monitoring a host always has one of the following states:

State   | Colour | ID | Meaning
UP      | green  | 0  | The host is accessible via the network (this generally means that it answers a PING).
DOWN    | red    | 1  | The host does not answer network enquiries and is not accessible.
UNREACH | orange | 2  | The path to the host is currently blocked to monitoring, because a router or switch in the path has failed.
PEND    | grey   | -  | The host has been newly included in the monitoring, but has never been polled before. Strictly speaking this is not really a state.

Alongside the state, a host has a number of other attributes that can be configured by the user, e.g.:

  • A unique name
  • An IP address
  • Optional - an alias, which need not be unique
  • Optional - one or more parents

2.2. Parents

In order for the monitoring to be able to assess the UNREACH state, it must know via which path every individual host can be reached. To this end, one or more so-called parent hosts may be specified for every host. If, for example, server A, seen from the monitoring server, is only accessible via router B, then B is a parent of A. Only direct parents are configured. This results in a tree-like structure with the monitoring server at its centre.

For example, if the host switch-intern, which lies on the path to tauschzone.de, shows the DOWN state, the monitoring automatically assumes that a possible failure of tauschzone.de can be explained simply by its no longer being accessible to monitoring. Whether it really has failed cannot be determined. It will therefore be classified as UNREACH in monitoring. But the monitoring of tauschzone.de will nevertheless continue! If this host answers, then UP will be shown in any case.

The most important purpose of parents is to recognise network failures and to avoid mass false alarms in such situations.
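The parent logic described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Check_MK's actual implementation: the function `effective_host_state`, the two dictionaries and the host names `server-a` and `router-b` are hypothetical.

```python
def effective_host_state(host, answers_ping, parents):
    """Return "UP", "DOWN" or "UNREACH" for `host`.

    answers_ping: dict mapping host name -> bool (result of the last check)
    parents:      dict mapping host name -> list of direct parent names
    """
    if answers_ping[host]:
        # A host that answers is always UP, whatever its parents do.
        return "UP"
    direct_parents = parents.get(host, [])
    # The host only counts as unreachable if every path to it is blocked,
    # i.e. none of its direct parents is UP itself.
    if direct_parents and all(
        effective_host_state(p, answers_ping, parents) != "UP"
        for p in direct_parents
    ):
        return "UNREACH"
    return "DOWN"


# Server A is only reachable via router B, so B is the parent of A.
parents = {"server-a": ["router-b"], "router-b": []}
print(effective_host_state("server-a", {"server-a": False, "router-b": False}, parents))  # UNREACH
print(effective_host_state("server-a", {"server-a": False, "router-b": True}, parents))   # DOWN
```

Because only direct parents are stored, the recursion walks up the tree towards the monitoring server by itself.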

2.3. Services

A host has a number of services. A service can be anything - please don't confuse this with services in Windows. A service is any part or aspect of the host that can be OK, or not OK. Naturally, the state of a service can only be determined if the host is UP.

A service being monitored can have the following states:

State   | Colour | ID | Meaning
OK      | green  | 0  | The service is fully in order. All values are in their allowed range.
WARN    | yellow | 1  | The service is functioning normally, but its parameters are outside their optimal range.
CRIT    | red    | 2  | The service has failed.
UNKNOWN | orange | 3  | The service's status cannot be correctly determined. The monitoring agent has delivered defective data or the element being monitored has disappeared.
PEND    | grey   | -  | The service has been newly included and has so far not provided monitoring data.

When determining which condition is 'worse', Check_MK utilises the following sequence:

OK → WARN → UNKNOWN → CRIT
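Note that this sequence differs from the numeric IDs in the table, where UNKNOWN (3) is higher than CRIT (2). A minimal sketch of picking the worst of several states, assuming a hypothetical helper `worst_state` and a hand-written ranking table:

```python
# Severity order from the text: OK < WARN < UNKNOWN < CRIT.
# This is deliberately not the same as the numeric state IDs.
SEVERITY = {"OK": 0, "WARN": 1, "UNKNOWN": 2, "CRIT": 3}


def worst_state(states):
    """Return the worst of the given service states."""
    return max(states, key=SEVERITY.__getitem__)


print(worst_state(["OK", "UNKNOWN", "WARN"]))  # UNKNOWN
print(worst_state(["UNKNOWN", "CRIT", "OK"]))  # CRIT
```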

3. Host and Service Groups

Hosts and services can be grouped to give a better overview. A host or service can be in more than one group. These groups are purely optional and not required for the configuration. Host groups can be useful when an additional grouping is desired alongside the folder structure in which the hosts are managed. If, for example, you have built a folder structure by geographic location, then it could be useful to have a Linux-Server host group that lists all Linux servers regardless of their location.

4. Contacts and contact groups

Contacts and contact groups offer the possibility of assigning persons to hosts and services. A contact corresponds to a user of the web interface. The assignment to hosts and services is not made directly, however, but via contact groups. First, a contact (e.g. harri) is assigned to a contact group (e.g. linux-admins). Then hosts - or, as required, individual services - are assigned to the contact group. Users, as well as hosts and services, can be assigned to multiple contact groups.

These assignments are useful for a number of reasons:

  • Who is permitted to view something?
  • Who is authorised to configure and control which hosts and services?
  • Who receives notifications for which problems?

By the way - the user omdadmin (or cmkadmin from version 1.4.0), who is created automatically together with an instance, is always permitted to view all hosts and services, even when this user is not a contact for them. This follows from the user's role as administrator.
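The indirect assignment via contact groups can be pictured as two lookups and a set intersection. The following is a minimal sketch, not Check_MK's data model: only harri and linux-admins come from the text, while the host names, the group windows-admins and the function `is_contact` are hypothetical.

```python
# Hypothetical assignments; only "harri" and "linux-admins" appear in the text.
contact_groups_of_user = {"harri": {"linux-admins"}}
contact_groups_of_host = {
    "srv-linux-01": {"linux-admins"},
    "srv-win-01": {"windows-admins"},
}


def is_contact(user, host):
    """A user is a contact for a host if both share at least one contact group."""
    user_groups = contact_groups_of_user.get(user, set())
    host_groups = contact_groups_of_host.get(host, set())
    return bool(user_groups & host_groups)


print(is_contact("harri", "srv-linux-01"))  # True
print(is_contact("harri", "srv-win-01"))    # False
```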

5. Users and roles

Whereas the persons who are responsible or authorised for a particular host or service are defined through contacts and contact groups, their privileges are controlled via roles. Check_MK is supplied with three roles from which further roles can be later derived. Each role defines a series of rights which may be customised. The standard roles have the following meanings:

Role  | Meaning
admin | May view all, has all privileges
user  | May only view that for which he/she is a contact. May manage hosts in folders assigned to him/her. Is not permitted to make global settings
guest | May view all, but may not configure and may not influence monitoring
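How roles and contact assignments combine can be sketched as follows. This only models the three standard roles as described in the table; the function names `may_view` and `may_change_global_settings` are hypothetical and do not correspond to Check_MK's real permission names.

```python
def may_view(role, is_contact):
    """admin and guest see everything; a user only sees objects he/she is a contact for."""
    if role in ("admin", "guest"):
        return True
    return role == "user" and is_contact


def may_change_global_settings(role):
    """Of the three standard roles, only admin may make global settings."""
    return role == "admin"


print(may_view("user", is_contact=False))   # False
print(may_view("guest", is_contact=False))  # True
print(may_change_global_settings("user"))   # False
```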

6. Problems, alarms and notifications

6.1. Handled and unhandled problems

Check_MK identifies every host that is not UP, and every service that is not OK, as a problem. A problem can have two states: unhandled and handled. A new problem is first treated as unhandled. As soon as someone confirms (acknowledges) the problem, it is flagged as handled. In other words, unhandled problems are those which nobody has attended to yet. The tactical overview in the sidebar therefore differentiates between the two types of problems.

By the way: service problems from hosts that are currently not UP are not identified as problems.
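The rules of this section can be condensed into a small classification function. This is a sketch with a hypothetical function name, `classify_service`, meant only to make the three cases explicit:

```python
def classify_service(service_state, host_state, acknowledged):
    """Return None (no problem), "unhandled" or "handled" for a service."""
    if service_state == "OK":
        return None
    if host_state != "UP":
        # Service problems on hosts that are not UP are not counted as problems.
        return None
    return "handled" if acknowledged else "unhandled"


print(classify_service("CRIT", "UP", acknowledged=False))    # unhandled
print(classify_service("CRIT", "UP", acknowledged=True))     # handled
print(classify_service("CRIT", "DOWN", acknowledged=False))  # None
```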

Further details about acknowledgements can be found in their own article.

6.2. Alerts and notifications

When the state of a host or service changes (e.g. from OK to CRIT), Check_MK registers an alert. Such an alert may or may not generate a notification. Check_MK is designed so that whenever a host or service has a problem, an email is sent to the object's contacts (please note that omdadmin/cmkadmin, by default, is not a contact for any objects). This can be customised very flexibly, however. Whether a notification is sent also depends on a number of parameters. It is simplest to look at the cases in which notifications are not sent. Notifications are suppressed ...

  • ...when notifications have been globally deactivated in the master control
  • ...when notifications have been deactivated for the host/service
  • ...when notification is deactivated for a particular state of the host/service (e.g. no notification for WARN)
  • ...when the problem affects a service whose host is DOWN or UNREACH
  • ...when the problem affects a host, whose parents are all DOWN or UNREACH
  • ...when for the host/service a notification period has been set that is not currently active (see below)
  • ...when the host/service is currently flapping (see below)
  • ...when the host/service is currently in a scheduled downtime (see below)

If none of these conditions for suppressing notifications is met, the monitoring core creates a notification, which in a second step passes through a chain of rules. In these rules you can define further exclusion criteria, and decide who should be notified and in what form (email, SMS, etc.).
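The suppression conditions listed above amount to a series of checks that must all pass before the core creates a notification. A minimal sketch, assuming a hypothetical `Situation` record and function `suppression_reason` (neither exists in Check_MK under these names):

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical snapshot of everything the suppression conditions look at."""
    globally_enabled: bool = True
    enabled_for_object: bool = True
    enabled_for_state: bool = True
    host_is_up: bool = True             # services only: the host is not DOWN/UNREACH
    any_parent_up: bool = True          # hosts only: not all parents are DOWN/UNREACH
    in_notification_period: bool = True
    flapping: bool = False
    in_downtime: bool = False


def suppression_reason(s):
    """Return why a notification is suppressed, or None if one is created."""
    rules = [
        (not s.globally_enabled, "globally deactivated"),
        (not s.enabled_for_object, "deactivated for host/service"),
        (not s.enabled_for_state, "deactivated for this state"),
        (not s.host_is_up, "host is DOWN or UNREACH"),
        (not s.any_parent_up, "all parents are DOWN or UNREACH"),
        (not s.in_notification_period, "outside notification period"),
        (s.flapping, "flapping"),
        (s.in_downtime, "scheduled downtime"),
    ]
    for applies, reason in rules:
        if applies:
            return reason
    return None


print(suppression_reason(Situation()))                  # None
print(suppression_reason(Situation(in_downtime=True)))  # scheduled downtime
```

Only when the function returns None would the notification go on to the second step, the chain of notification rules.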

All particulars concerning alerts can be found in their own article.

6.3. Flapping hosts and services

It sometimes happens that a service changes its state continuously and quickly. In order to avoid continuous notifications, Check_MK switches such a service into the flapping state, which is indicated by its own symbol. When a service enters the flapping state, a notification is generated which informs the user of the change, and further notifications are silenced. After a suitable time, if no further rapid changes are occurring and a final (good or bad) state is evident, the flapping state disappears and normal alerting resumes.
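One simple way to picture flap detection is to count how often the state changed within the most recent check results. The following sketch is a deliberately simplified stand-in and not the algorithm the monitoring core uses; the function `is_flapping` and the 50% threshold are assumptions chosen for illustration.

```python
def is_flapping(history, threshold=0.5):
    """Return True if the share of state changes in `history` reaches `threshold`.

    history: list of recent states, oldest first, e.g. ["OK", "CRIT", "OK"].
    """
    if len(history) < 2:
        return False
    changes = sum(1 for a, b in zip(history, history[1:]) if a != b)
    return changes / (len(history) - 1) >= threshold


print(is_flapping(["OK", "CRIT", "OK", "CRIT", "OK"]))   # True
print(is_flapping(["OK", "OK", "OK", "CRIT", "CRIT"]))   # False
```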

6.4. Scheduled downtimes

If you perform maintenance work on a server, device or software, you will normally want to avoid potential problem notifications during this time. In addition, you will probably want to advise your colleagues that problems appearing in monitoring during this time may be temporarily ignored.

For this purpose you can set a scheduled downtime on a host or service. This can be done directly before starting the work, or in advance. Scheduled downtimes are indicated by two symbols, with the following meanings:

  • The host/service is in a scheduled downtime
  • The host on which the service is located has a scheduled downtime

While a host or service has a scheduled downtime:

  • No notifications will be sent.
  • Problems will not be shown in the tactical overview.

Additionally, when you wish to later document statistics on the availability of hosts and services it is a good idea to include scheduled downtimes. These can be factored into later availability evaluations.

7. Timeperiods

Timeperiods define regular, weekly recurring time ranges that are used in various places in the monitoring's configuration. A typical timeperiod could be called workhours and could contain the time from 8:00 to 17:00 on all days of the week except Saturday and Sunday. The period 24X7 simply includes all times and is predefined. Timeperiods can also include exceptions for particular calendar days - e.g. Bavarian public holidays.

Some important situations which use time periods are:

  • Limiting the time during which notifications will be made (notification period)
  • Limiting the time during which checks are to be performed (check period)
  • Service times for the evaluation of availability (service period)
  • Times during which the event console applies defined rules
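The workhours example can be expressed as a small lookup table. This is a sketch only: the structure `WORKHOURS` and the function `is_active` are hypothetical, exceptions for calendar days are left out, and Check_MK stores timeperiods in its own format.

```python
from datetime import datetime, time

# Weekday numbers follow datetime.weekday(): Monday = 0 ... Sunday = 6.
# workhours: 8:00 to 17:00 on Monday to Friday, nothing on Saturday and Sunday.
WORKHOURS = {day: [(time(8, 0), time(17, 0))] for day in range(5)}


def is_active(period, moment):
    """Return True if `moment` falls into one of the period's time ranges."""
    ranges = period.get(moment.weekday(), [])
    return any(start <= moment.time() < end for start, end in ranges)


print(is_active(WORKHOURS, datetime(2016, 3, 18, 9, 30)))  # a Friday morning: True
print(is_active(WORKHOURS, datetime(2016, 3, 19, 9, 30)))  # a Saturday: False
```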

8. Check interval, check attempts and check period

The execution of checks occurs at fixed intervals in status-based monitoring. Check_MK uses one minute as its standard. Every check is therefore performed once per minute. This can be altered in the configuration:

  • To a longer interval in order to save CPU resources on the server and target systems
  • To a shorter interval in order to receive alerts more quickly and to collect performance data at a higher resolution.

By defining a check period other than 24X7, the execution of checks can be suspended during specified time frames. During these times the service's status will no longer be updated, and the service will be flagged as stale, which is indicated by its own symbol.

In combination with a long check interval one can ensure that a check is performed once per day at a specified time. If you set an interval of e.g. 24 hours and the check period at 02:00 - 02:01 on every day (only one minute per day), then Check_MK will ensure that the check really will be executed in this short time frame.

With the aid of max check attempts you can avoid alerts in the case of sporadic errors. In this way you are effectively making a check less sensitive. If the check attempts are set to e.g. 3, and the corresponding service becomes CRIT, then initially no notification will be generated. If the next two checks also produce a result other than OK, the number of current attempts will increase to 3 and a notification will be sent.

A service in this intermediate state - one that is not OK, but has not yet reached its maximum number of attempts - has a soft state.
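The effect of max check attempts can be traced with a short simulation. This is a simplified sketch, not the core's state machine: the function `track_attempts` is hypothetical, and recovery notifications as well as changes between different non-OK states are ignored.

```python
def track_attempts(results, max_attempts=3):
    """Return (state, state_type, notify) for each check result in order."""
    attempt = 0
    timeline = []
    for state in results:
        if state == "OK":
            attempt = 0
            timeline.append((state, "hard", False))
            continue
        attempt += 1
        if attempt < max_attempts:
            # Not OK, but the limit is not reached yet: soft state, no notification.
            timeline.append((state, "soft", False))
        else:
            # Notify only on the check that reaches the limit.
            timeline.append((state, "hard", attempt == max_attempts))
    return timeline


for entry in track_attempts(["OK", "CRIT", "CRIT", "CRIT", "CRIT", "OK"]):
    print(entry)
```

With three attempts configured, the first two CRIT results are soft, the third one becomes hard and triggers the notification, and the fourth stays hard without notifying again.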

9. Passive Checks

If you look at the Check_MK interface you can see that a green double arrow is shown for some services, but a grey arrow for most others. The services with the green arrow are active checks. These are executed by Check_MK directly. Services with a grey arrow are passive checks: their results are determined by another active check, namely the one named Check_MK. This is done for performance reasons and illustrates a special feature of Check_MK:

So that the target system (server, network device, etc.) does not have to be contacted separately for every single service, Check_MK collects all important data in one pass, once per interval. From this data it calculates new results for all passive checks in a single action. This conserves CPU resources on both systems and is an important factor in Check_MK's high performance and scalability.
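The principle of one data collection feeding many passive results can be illustrated as follows. Everything in this sketch is hypothetical: the function names, the hard-coded sample values standing in for agent output, and the thresholds.

```python
def fetch_agent_data(host):
    """Stand-in for the single active check that contacts the host once per interval."""
    return {"cpu_load": 0.7, "disk_used_percent": 93, "mem_used_percent": 55}


def evaluate(value, warn, crit):
    """Turn a measured value into a service state using two thresholds."""
    if value >= crit:
        return "CRIT"
    if value >= warn:
        return "WARN"
    return "OK"


# Hypothetical (warn, crit) thresholds per service.
THRESHOLDS = {
    "cpu_load": (5, 10),
    "disk_used_percent": (80, 90),
    "mem_used_percent": (80, 90),
}


def run_checks(host):
    data = fetch_agent_data(host)  # the only contact with the target system
    # All service results are derived from the data collected in that one pass.
    return {name: evaluate(data[name], *limits) for name, limits in THRESHOLDS.items()}


print(run_checks("srv-linux-01"))
```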

10. Overview of the most important host and service icons

The following list provides a short overview of the meanings of the most important status icons appearing beside hosts and services:

  • A click on this icon executes an immediate check of this service
  • A click executes the check indirectly by triggering the host's Check_MK service
  • This host/service currently has a scheduled downtime
  • This service's host currently has a scheduled downtime
  • This host/service is currently outside its notification period
  • Notifications for this host/service are currently deactivated
  • Checks for this service are currently deactivated
  • This host/service has a status of stale
  • This host/service has a status of flapping
  • This host/service has a confirmed problem
  • There is a comment for this host/service
  • This host/service is a part of a BI aggregation
  • Here you can directly access the settings for the check parameters
  • Only for logwatch services: here you can access stored log files
  • Here you can access a timegraph of the performance data
  • Here you can access an overview of the predictive monitoring
  • This host/service has inventory data. A click on it shows the related view.
  • This host/service is one of your favorites