Notifications

On this page


1. Introduction

Notifications mean that Check_MK actively informs users in the case of problems or other important events. This is most commonly done by email, but there are numerous other methods – such as sending an SMS or forwarding to a ticket system. Check_MK provides a simple interface enabling you to script your own notification method.

The starting point for every notification is an Event. This is always related to a particular host or service. Possible event types are:

  • a change of state (e.g., OK → WARN)
  • a change from a steady to an unsteady (flapping) state
  • the start or end of a planned downtime
  • the Confirmation of a problem (Acknowledgement) by a user
  • an event arising from a manually-triggered notification Command
  • the execution of an Alert handler (from CEE Version 1.4.0i2)
  • an event passed for notification from the Event Console

Check_MK utilises a rules-based system for configuring the notifications – with which it can implement very demanding requirements. A simple notification by email – which is entirely satisfactory in many cases – is nonetheless quick to set up.

2. To notify, or not (yet) to notify?

Notifications are basically optional, and Check_MK can still be used efficiently without them. Some large organisations have a sort of control panel in which an operations team has Check_MK's console constantly under observation, and thus additional emails are unnecessary.

If your Check_MK-environment is still under construction, bear in mind that notifications will only help your colleagues if no – or only occasional – false alarms are produced. One first needs to come to grips with the threshold values and all the other settings, so that under normal circumstances everything is in the “green range”. Acceptance of the new monitoring installation will quickly fade if the inbox is flooded with hundreds of useless emails every day.

The following procedure has proven to effectively tune notifications:

Step 1: Tune the monitoring and eliminate false error messages. Rectify newly-identified, genuine problems via Check_MK. Do this until everything is “normally” OK / UP.

Step 2: Next switch the notifications to be active only for yourself. Reduce the “static” caused by sporadic, short duration problems. Adjust further threshold values, make use of Predictive Monitoring, increase the Maximum number of check attempts, or utilise Delay ... notifications as needed. If genuine problems are responsible, attempt to get them under control.

Step 3: Once your own inbox is tolerably peaceful, activate the notifications for your colleagues. Create efficient contact groups so that each contact only receives notifications relevant to them.

This will result in a system which provides relevant information that assists in reducing outages.

3. Simple notifications by email

3.1. Prerequisites

In the default Check_MK configuration, a user will receive notifications per email when the following prerequisites have been satisfied:

  • The Check_MK-server has a functioning setup for sending emails.
  • An email address has been configured for the user.
  • The user is a member of a contact group, and is therefore a contact.
  • An event occurs on a host or service that is assigned to this contact group.

Check_MK sends HTML-emails that also include the current values for the affected service:

3.2. Setting up mail dispatching in Linux

For the successful sending of emails, your Check_MK-server must have a functioning SMTP-server configuration. Depending on your Linux distribution, this could utilise, for example, Postfix, Qmail, Exim, Sendmail or Nullmailer. The configuration will be implemented with your Linux distribution's resources.

The configuration is generally confined to registering a “smart host” (also known as an SMTP-relay-server) to which all emails will be directed. This will usually be your firm's internal SMTP-mail server. As a rule smart hosts in a LAN don't require authentication – which makes matters simple. In some distributions the smart host will be queried during the installation. With the Check_MK-Appliance one can configure the smart host conveniently via the Web-GUI.

You can test the sending of emails easily with the mail command on the command line. Because there are numerous different implementations for this command under Linux, for standardisation Check_MK provides the version from the Heirloom mailx project directly in the instance user's search path (as ~/bin/mail). The best way to test is as an instance user, since the notification scripts will later run with the same permissions.

The email's content is read from the standard input, the subject specified with -s, and the recipient's address simply appended as an argument to the end of the command line:

OMD[mysite]:~$ echo "content" | mail -s test-subject harri.hirsch@example.com

The email should be delivered without delay. If this doesn't work, information can be found in the SMTP-server's log file in the /var/log directory (see Files and directories).
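
If the email does not arrive, a first check is whether it is stuck in the local mail queue, and what the mail system has logged. A minimal sketch, assuming a Postfix setup on Debian or Ubuntu – the exact commands and log paths vary by distribution:

OMD[mysite]:~$ mailq
OMD[mysite]:~$ tail -n 20 /var/log/mail.log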

3.3. Email address and contact group

A user's email address and contact group are defined in the User management:

In a freshly-generated Check_MK-instance, initially there is only the Everything contact group. Members of this group are automatically responsible for all hosts and services, and will be notified of every relevant monitoring event by email.

Tip: if your Check_MK-installation has been generated with an older version, this group could also possibly be called Everybody. This is however illogical, as this group doesn't contain every user, rather it contains all hosts! Apart from the differing names the function is the same.

3.4. Testing

To test the notification you could simply set an OK-service to CRIT manually. This is done with the Fake check results command. This should immediately trigger an email. At the next regular check the service should revert to OK, thereby triggering a new notification (this time a Recovery).

Please note that during these tests, frequent changes of state will after a while put the service into the flapping state. Subsequent state changes will then no longer trigger notifications. In the Master control you can temporarily deactivate the detection of flapping (Flap detection).

Alternatively, you can also send a custom notification – which will not alter the status of the affected service. This generated notification is however of a slightly different type and – depending on your notification rules – it can behave differently.

4. Controlling notifications with rules

4.1. The basic principle

Check_MK is configured 'by default' so that when an event occurs an email is sent to every contact of the relevant host or service. This is certainly initially sensible, but in practice many further desires arise, for example:

  • The suppression of specific less useful messages
  • The ‘subscription’ to messages from services for which one is not a contact
  • The notification by email, SMS or pager depending on the time of day
  • The escalation of problems when no acknowledgement has been received beyond a certain time
  • The option of NO notification for the WARN or UNKNOWN states
  • and much more...

Via a rules-based mechanism Check_MK provides maximum flexibility for satisfying such demands. Using the Notifications WATO module one can manage a chain of notification rules which defines who should be notified, as well as when and how.

When any monitoring event occurs this rule chain will be run through from top to bottom. As always, every rule firstly has a condition that decides whether the rule actually applies to the situation in question. If the condition is satisfied for this specific event the rule determines two things:

  • A selection from the contacts (Who should be notified?)
  • The notification method (e.g. HTML-email), and optional additional parameters

In contrast to the Rules for host and service parameters, here the evaluation also continues after an applicable rule has been satisfied! Subsequent rules can add further notifications; notifications generated by preceding rules can also be deleted. The end result of the rule evaluation will be a table with a structure something like this:

Who (Contact)      How (Method)   Parameters
Harri Hirsch       Email          Reply-To: linux.group@example.com
Bruno Weizenkeim   Email          Reply-To: linux.group@example.com
Bruno Weizenkeim   SMS            –

Now, for each entry in this table, the notification script appropriate to the method is invoked – this script performs the actual delivery of the notification.

4.2. Predefined rules

If you have freshly installed Check_MK, precisely one rule will have been predefined:

This rule defines the above-described default behaviour. It is structured as follows:

Conditions: none – applies to all events
Method: sends an email in HTML format (with embedded metric graphs)
Contacts: all contacts for the affected host/service

As usual, the rule can be edited, copied or deleted, and a new rule can be created. Once you have more than a single rule, their processing sequence can be altered with the corresponding icon.

Note: Changes to notification rules do not require an Activate Changes, rather they take effect immediately!

4.3. Structure of the notification rules

General characteristics

As with all rules in Check_MK, here you can include a description and a comment for the rule, or even temporarily deactivate the rule. The allow users to deactivate this notification option is activated by default. This allows users to “unsubscribe” from notifications generated by this rule. How this works is described later.

Notification methods

The notification method specifies the technique to be used for sending the notification (e.g., HTML email). Each method is realised by a script. Check_MK already includes a number of scripts. You can also quite easily write your own scripts in any programming language you choose in order to implement special notifications (for example, to forward a notification to your own ticket system).

A method can offer parameters – the method for sending ASCII and HTML-emails, for example, allows the sender's address (From:) to be set explicitly.

Before making settings directly in the rule, you should know that parameters for the notification methods can also be specified via host and service rules: as with the host and service parameters, under Monitoring Configuration ➳ Notifications a rule set can be found for every notification method with which the same settings can be defined – and as usual, even depending on the host or service.

Parameter definitions in notification rules enable these settings to be varied in individual cases. So you can, for example, define a global ‘Subject’ for your email, but also with an individual notification rule define an alternative ‘Subject’.

Instead of parameters you can also select Cancel all previous notifications – with which all of this method's notifications from prior rules will be deleted. More on this later.

Contact selection

If the preconditions for a rule have been met, the contact selection will come next. The most common procedure is for notifications to be sent to all users who have been registered as contacts for the respective host/service. This is the “normal” and logical procedure, since it is also via the contacts that it is defined which objects each user receives in their GUI display – in effect those objects for which the user is responsible.

In the Contact Selection submenu you can check multiple option boxes and thus extend the notification to further contacts. Check_MK will automatically remove duplicate contacts. For the rule to make sense, at least one selection must be made.

The two Restrict by... options function somewhat differently: they further restrict the contacts selected by the other options. With these you can also create an AND-operation between contact groups, for example to send notifications only to contacts who are members of both the Linux and the Datacenter groups.

By entering explicit email addresses you can notify persons who are not in fact nominated as users in Check_MK. This of course only makes sense when used in the notification method that actually sends the emails.

If, in the method, you have selected Cancel all previous notifications, the notifications will only be deleted for the contacts selected here!

Conditions

Conditions determine when a rule will be used. If no conditions have been defined, the rule takes effect for every event. Details regarding the effects of the various conditions can be found in the online help.

For comprehension it is important to remember that the source is always an event on a concrete host or service. The conditions can address the object's static attributes (e.g., whether the service name contains the text /tmp), its current state (e.g., whether the service has just changed from OK to CRIT), or other things (e.g., whether the working time time period is currently active).

If even one of the configured conditions is not met by the event, the rule will not be applied. A special case here are the Match host event type and Match service event type conditions:

Should you select only Match host event type, the rule will not match any service alarm, and vice versa. Should you activate both conditions however, the rule will match if the event type is activated in either of the two check boxes. In this exceptional case the conditions are thus not linked with a logical AND, but rather with an OR. In this way you can easily administer host and service alarms with a single rule.

A further tip regarding the Match contacts and Match contact groups conditions: here, as a condition, it is tested whether the relevant host or service has a specific contact assignment – so that one can implement requirements such as “Never send SMS notifications for hosts in the Linux contact group”. This has nothing to do with the contact selection described above:

4.4. Cancelling notifications

When selecting a method you will also find the Cancel all previous notifications option. In order to be able to understand the function of such a rule, it is best to imagine the table of notifications as a graphic. Assuming the processing of the rules for a concrete event is partly complete, and that due to a number of rules the following three notifications have been triggered:

Who (Contact)      How (Method)
Harri Hirsch       Email
Bruno Weizenkeim   Email
Bruno Weizenkeim   SMS

Now comes a rule with the SMS method and the Cancel all previous notifications selection. The contact selection chooses the Windows group, in which Bruno Weizenkeim is a member – and so the line “Bruno Weizenkeim / SMS” will be deleted from the table. Once the rule has been processed the table will look like this:

Who (Contact)      How (Method)
Harri Hirsch       Email
Bruno Weizenkeim   Email

Should a subsequent rule again define an SMS notification for Bruno, then this rule will have priority and the SMS will be added anew to the table. To summarise:

  • Rules can suppress (delete) specific notifications.
  • Deletion rules must come after the rules that create the notifications.
  • A deletion rule does not 'cancel' a preceding rule, rather it suppresses the notifications that were generated by (possibly multiple) preceding rules.
  • Subsequent rules can reinstate the previously suppressed notifications.

4.5. What happens if no rule is applicable?

Anyone who configures something can also make errors. One possible error with notifications would be that a critical monitoring problem occurs, but not a single notification rule takes effect.

To avoid such situations, in the Global settings Check_MK provides the Notifications ➳ Fallback email address for rule based notifications setting. Enter an email address here. This email address will then receive notifications for which no notification rule applies.

The fallback address will however only be used if no rule applies at all – not when no notification is triggered! The explicit suppression of notifications is intentional and is not a configuration error.

From Check_MK-Version 1.4.0i1 the entry of a fallback address will be ‘recommended’ with an onscreen warning:

If you don't want emails to be sent to this address, simply add, as the very first rule, a rule that deletes all previous notifications. This rule has no effect on the notifications, since at this point none have yet been generated. But it ensures that at least one rule always applies, and the warning is thus eliminated.

5. User-defined notifications

A useful feature of Check_MK's notification system is that users – even without administrator authority – can customise their notifications themselves. You can:

  • Add notifications that you wouldn't normally receive (“subscribe”)
  • Delete notifications that you would normally receive (if not restricted)
  • Customise notification parameters
  • Completely deactivate your alarms temporarily

User-defined rules

For the user, access is via the personal settings, where a button for notification rules can be found, with which one can create one's own rules.

Apart from one small difference, user-defined rules are almost the same as the normal rules: they (naturally) contain no contact selection. The user is automatically the selected contact. In this way a user can only add or delete their own notifications.

The user can only delete notifications if the allow users to deactivate this notification option has been activated in the rule that generates them:

Concerning the sequencing of rules: the user rules always follow the global rules and can modify the notification table already created by them. Apart from the prohibition on deletions just described, the global rules thus act as the default setting, which users can then customise for themselves.

If you wish to completely prohibit such customisation, you can revoke the user's General Permissions ➳ Edit personal notification settings permission.

As the administrator you can display all user rules using the corresponding button:

You can also edit these rules there.

Disabling notifications temporarily

The complete disabling of notifications by a user is controlled by the Disable all personal notifications permission, which is off by default. Only if you add this right to the user's role will the relevant check box be available in their personal settings:

As an administrator with access to the users' personal settings, you can carry out this disabling on a user's behalf – even if the permission described above is not present. It can be found in the user profile's attributes. With this, for example, you can very quickly silence a holidaying colleague's notifications – without needing to alter the actual configuration.

6. When exactly notifications are generated

6.1. Introduction

A large part of the Check_MK notification system's complexity comes from its numerous tuning options, with which unimportant notifications can be avoided. Most of these involve situations in which notifications are delayed or suppressed at the point where the events arise. Additionally, the monitoring core has a built-in intelligence that suppresses certain notifications by default. We will address all of these aspects in this chapter.

6.2. Planned downtimes

When a host or service is in a scheduled downtime the object's notifications will be suppressed. This is – alongside a correct evaluation of availabilities – the most important reason for the actual provision of downtimes in monitoring. The following details are relevant to this:

  • If a host is flagged as being in a planned downtime, then all of its services will automatically be in a planned downtime as well – without an explicit entry needing to be made for them.
  • Should an object enter a problem state during a planned downtime, this problem will be notified retroactively, precisely at the end of the downtime.
  • The beginning and the end of a planned downtime are themselves events which will be notified.

Objects in a scheduled downtime will be flagged with a light blue crescent moon icon. The services of hosts in scheduled downtimes will be marked with a dark blue crescent moon icon.

6.3. Notification periods

You can define a notification period for each host and service during configuration. This is a time period that defines the time frame to which notifications are to be restricted.

The configuration is performed using the Monitoring Configuration ➳ Notification period for hosts, or respectively, the ... services rule set. An object that is not currently within its notification period will be flagged with a white moon icon.

Events on an object that is not currently within its notification period will not be notified. Such notifications will be 'reissued' when the notification period is next active – provided the host/service is still in a problem state. Even if multiple changes to the object's state have occurred during the time outside the notification period, only the latest state will be notified.

Incidentally, in the notification rules it is also possible to restrict a notification to a specific time period. In this way you can additionally restrict the time ranges. However, notifications that have been discarded due to a rule with time conditions will not automatically be repeated later!

6.4. The state of the host on which a service is running

If a host has completely failed, or is at least inaccessible to the monitoring, its services can obviously no longer be monitored. Active checks will then as a rule register CRIT or UNKNOWN, since these actively attempt to access the host and will thereby run into an error. In such a situation all other checks – thus the great majority – will be omitted and will remain in their old state. They will be flagged as stale with the corresponding icon.

It would naturally be very cumbersome if all active checks in such a state were to notify their problems. For example, if a webserver is not reachable – and this has already been notified – it would not be very helpful to additionally generate an email for every single one of its dependent HTTP-services.

To minimise such situations, as a basic principle the monitoring core only generates notifications for services if the host is in the UP state. This is also the reason why host accessibility is separately verified. If not otherwise configured, this verification will be achieved with a Ping.

If you are using the Raw Edition (or the Enterprise Edition with a Nagios core), in isolated cases it can nonetheless occur that a host problem generates a notification for an active service. The reason is that Nagios regards the results of host checks as still being valid for a short time into the future. If only a few seconds have elapsed between the last successful PING of the host and the next active check, Nagios can still assess the host as UP even though it is in fact DOWN. The CMC, in contrast, holds the service notification in a 'waiting' state until the host state has been verified, thus reliably minimising undesired notifications.

6.5. Parent hosts

Imagine that an important network router to a company location with hundreds of hosts fails. All of its hosts will then be unavailable to the monitoring and become DOWN. Hundreds of notifications will therefore be triggered. Not good.

In order to avoid such problems, the router can be defined as a parent host for its hosts. Where redundancy exists, multiple parents can also be defined. As soon as all parents enter a DOWN state, the now inaccessible hosts will be flagged with the UNREACH state and their notifications suppressed. The problem with the router itself will of course still be notified.

Incidentally, the CMC operates internally in a slightly different manner to Nagios. In order to reduce false alarms while still processing genuine alarms, it pays very close attention to the exact timing of the relevant host checks. If a host check fails, the core will wait for the result of the host check on the parent host before generating a notification. This wait is asynchronous and has no effect on the general monitoring. Notifications from hosts can thereby be subject to minimal delays.

6.6. Disabling notifications using rules

With the Monitoring configuration ➳ Enable/disable notifications for hosts, or respectively, the ... for services rule sets you can specify hosts and services for which no notifications at all are to be issued. Here the core itself already suppresses the notifications. A subsequent notification rule that “subscribes” to notifications for such services will thus be ineffective!

6.7. Manually suppressing notifications

It is also possible to temporarily disable notifications for individual hosts or services manually:


Such hosts or services will then be marked with a corresponding icon. Since commands – in contrast to rules – require neither configuration permissions nor an Activate changes, they are a quick workaround with which the operating team can react to a situation.

Important: In contrast to scheduled downtimes, disabled notifications have no influence on the availability evaluations. If during an unplanned outage you really only want to disable the notifications, without distorting the availability statistics, you should therefore not register a scheduled downtime!

6.8. Disabling notifications globally

A master switch for notifications can be found in the Master control:

This switch is incredibly useful if you plan bigger system changes, during which, under some circumstances, an error could force many services into a critical state. With the switch you avoid upsetting your colleagues with a flood of useless emails. Remember to re-enable the notifications when you are finished.

Each instance in a distributed monitoring has one of these switches. Switching off the master instance's notifications does not prevent slaves from triggering notifications – even though these are directed centrally to the master and displayed there.

Notifications that would have been triggered during the time when notifications were disabled will not be repeated later when the notifications are re-enabled.

6.9. Delaying notifications

You may have services that occasionally enter a problem state for short periods, where the outages are very brief and not critical for you. In such cases notifications are very annoying, but they are easily suppressed. The Monitoring configuration ➳ Delay host notifications and Delay service notifications rule sets serve exactly this situation.

Here you specify a time in minutes, and a notification will be delayed until this time has expired. Should the OK / UP state be reached again in the meantime, no notification at all will be triggered. Naturally this also means that the notification of a genuine problem will likewise be delayed.

Obviously even better than delaying notifications would be the elimination of the actual cause of the sporadic problems – but that is of course another story...

6.10. Repeated check attempts

Another very similar method for delaying notifications is to allow multiple check attempts when a service enters a problem state. This is achieved with the Monitoring configuration ➳ Maximum number of check attempts for hosts, or respectively, the ... services rule sets.

If you set a value of 3 here, for example, a check with a CRIT result will at first not trigger an alarm. This is referred to as a soft CRIT-state. The hard-state remains OK. Only if three successive attempts return a not-OK-state will the service switch to the hard state, and an alarm be triggered.

In contrast to delayed notifications, here you have the option of defining views so that such problems are not displayed. BI aggregations can also be constructed so that only hard states are included – not soft ones.

6.11. Flapping hosts and services

When a host or service frequently changes its state within a short time, it is regarded as flapping. This is a state in its own right. The idea is to reduce excessive notifications during phases in which a service is not (quite) running stably. Such phases can also be evaluated separately in the Availability statistics.

Flapping objects are marked with a corresponding icon. As long as an object is flapping, further state changes trigger no additional notifications. A notification will however be triggered whenever the object enters or leaves the flapping state.

The system's recognition of flapping can be influenced in the following ways:

  • The Master control has a main switch for controlling the detection of flapping (Flap detection).
  • You can exclude objects from detection by using the Monitoring configuration ➳ Enable/disable flapping detection for hosts, or respectively, the ... services rule sets.
  • In the Check_MK Enterprise Edition, using the Monitoring core ➳ Tuning of flap detection global option, you can define the parameters for flapping detection and make it more or less sensitive.

Please see the online help for details about the values that can be set.

6.12. Periodically repeated notifications and escalation

For systems with a high service level it can make sense not to leave it at a single notification when a problem persists over a longer time frame. Check_MK can be set up so that successive notifications are issued at fixed intervals, until:

  • either the problem is acknowledged
  • or the problem is solved.

The setting for this can be found in the Monitoring configuration ➳ Periodic notifications during host problems, or respectively, the ... service problems rule sets:

Once this option is active, for a persistent problem Check_MK will trigger regular notifications at the configured intervals. These notifications will receive an incrementing number beginning with '1'.

Periodic notifications are not only useful for reminding about a problem (and for annoying the operator) – they also provide the basis for escalations, meaning that after a defined time a notification can be escalated to other recipients.

To set up an escalation, create a supplementary notification rule which uses the Restrict to nth to mth notification condition. Enter '3 ... 99999' as the range for the sequential number so that the rule takes effect after the third notification. The escalation can then be performed either by selecting another method, (e.g., SMS), or it can notify other persons (contact selection).

With the Throttle periodic notifications option, after a given time the rate of notification repetition can be reduced so that, for example, on the first day an email can be sent every two hours, and later this can be reduced to one email per day.

7. The course of a notification from beginning to end

7.1. Overview

To help in correctly understanding how the various setting options and basic conditions relate to each other, and to enable an accurate diagnosis when a notification appears – or does not appear – as expected, here we describe the whole process of a notification in detail.

The following components are involved:

Nagios: The monitoring core in the Check_MK Raw Edition. The core detects events and generates raw notifications. Log files: var/log/nagios.log, var/nagios/debug.log

CMC: The Check_MK Micro Core is the core of the Check_MK Enterprise Edition and performs the same function as Nagios in the CRE. Log file: var/log/cmc.log

Notification module: Processes the notification rules in order to create real notifications from a raw notification, and calls the notification scripts. Log file: var/log/notify.log

Notification spooler: Provides asynchronous delivery of notifications and centralised notification in distributed environments (only in the Check_MK Enterprise Edition). Log file: var/log/mknotifyd.log

Notification scripts: For every notification method there is a script which performs the actual delivery (e.g., generating and sending an HTML email). Log file: var/log/notify.log

7.2. The monitoring core

Raw notifications

As described above, every notification begins with an event in the monitoring core. If all conditions have been satisfied and the 'green light' for a notification can be given, the core generates a raw notification to the internal auxiliary contact check-mk-notify. The raw notification doesn't yet contain details of the actual contacts or of the notification method.

The raw notification looks like this in the service's monitoring history:

  • The symbol is a light-grey loudspeaker
  • check-mk-notify is given as the contact.
  • check-mk-notify is given as the notification command.

The raw notification then passes to the Check_MK notification module, which processes the notification rules. This module is called up as an external program by Nagios (cmk --notify). The CMC on the other hand keeps the module on standby as a permanent auxiliary process (Notification helper), thus reducing process creation and saving machine time.

Error diagnosis in the Nagios monitoring core

The Nagios core used in the Check_MK Raw Edition logs all events to var/log/nagios.log. This file is simultaneously the location where the monitoring history is stored – which is also queried by the GUI if, for example, you wish to see a host's or service's notifications.

More interesting however are the messages you receive in the var/nagios/debug.log file when you set the debug_level variable to 32 in etc/nagios/nagios.d/logging.cfg.
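
The relevant entry might then look something like this – a sketch of the logging configuration; Nagios interprets debug_level as a bit mask, in which the value 32 enables the messages about notifications:

etc/nagios/nagios.d/logging.cfg
# 32 = debug messages about notifications (debug_level is a bit mask)
debug_level=32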

Following a core restart...

OMD[mysite]:~$ omd restart nagios

... you will find useful information on the reasons notifications were created or suppressed:

var/nagios/debug.log
[1479122384.210217] [032.0] [pid=17610] ** Service Notification Attempt ** Host: 'heute', Service: 'CPU utilization', Type: 0, Options: 0, Current State: 2, Last Notification: Th
[1479122384.210247] [032.0] [pid=17610] Notification viability test passed.
[1479122384.667768] [032.0] [pid=17610] 1 contacts were notified.  Next possible notification time: Mon Nov 14 12:19:44 2016
[1479122384.667785] [032.0] [pid=17610] 1 contacts were notified.

Error diagnosis in the CMC monitoring core

In the Check_MK Enterprise Edition you will find the monitoring core's log in the var/log/cmc.log file. In the standard installation this file contains no information regarding notifications. You can however activate very detailed logging with the Monitoring Core ➳ Logging of the notification mechanics global option. The core will then provide information on why an event prompts it to pass – or not (yet) pass – a notification to the notification system:

OMD[mysite]:~$ tail -f var/log/cmc.log
2015-05-20 10:01:10 [5] Hard state change on 10.1.1.199;Interface 00002 to CRITICAL
2015-05-20 10:01:10 [5] Setting up notification for 10.1.1.199;Interface 00002, problem id: 47, delay: 0sec
2015-05-20 10:01:10 [5] Checking notification for 10.1.1.199;Interface 00002
2015-05-20 10:01:10 [5]  Postponing: service is currently in downtime

Please note that this can sometimes generate a lot of messages. It is however useful when one later asks why a notification was not generated in a particular situation.

7.3. Rule evaluation using the notification module

Once the core has generated a raw notification, this runs through the chain of notification rules – resulting in a table of notifications. Alongside the data from the raw notification, every notification contains the following additional information:

  • The contact to be notified
  • The notification method
  • The parameters for this method

In a synchronous delivery, for every entry in the table an appropriate notification script will now be executed. In an asynchronous delivery a notification will be passed as a file to the notification spooler.

Analysis of the rule chain in WATO

When you create more complex rule regimes, the question of which rules apply to a specific notification will certainly come up. For this, Check_MK provides a built-in analysis function, which is accessed via the analysis button in the Notifications WATO module.

In the analysis mode the last ten raw notifications generated by the system and processed through the rules will be displayed:

For each of these ten raw notifications two actions will be available to you:

  • The first action tests the rule chain: every rule is checked as to whether all of its conditions are satisfied for the selected event. The resulting table of notifications is displayed together with the rules.
  • The second action repeats this raw notification as if it had just occurred. Otherwise the display is the same as in the analysis. With this you can not only check the rules' conditions, but also test how a notification actually looks.

The notification module's log file

A further important aid for diagnosis is the var/log/notify.log log file. During notification tests the popular tail -f command is very useful:

OMD[mysite]:~$ tail -f var/log/notify.log
2015-05-20 09:53:52 ----------------------------------------------------------------------
2015-05-20 09:53:52 Got raw notification (10.1.1.199;Interface 00012) context with 60 variables
2015-05-20 09:53:52 Global rule 'Notify all contacts of a host/service via HTML email'...
2015-05-20 09:53:52  -> matches!
2015-05-20 09:53:52    - adding notification of hh via mail
2015-05-20 09:53:52 Executing 1 notifications:
2015-05-20 09:53:52   * notifying hh via mail, parameters: (no parameters), bulk: no
2015-05-20 09:53:52      executing /omd/sites/mysite/share/check_mk/notifications/mail

The Notifications ➳ Notification log level global option controls this file's level of detail in two stages. Set it to Full dump of all variables and command, and in the log file you will find a complete listing of all the variables available to the notification script.

This will then appear, for example, as follows (extract):

var/log/notify.log
2016-11-14 15:02:23 ----------------------------------------------------------------------
2016-11-14 15:02:23 Got raw notification (myserver123;Check_MK) context with 69 variables
2016-11-14 15:02:23 Raw context:
                    CONTACTS=
                    HOSTACKAUTHOR=
                    HOSTACKCOMMENT=
                    HOSTADDRESS=127.0.0.1
                    HOSTALIAS=myserver123
                    HOSTATTEMPT=1
                    HOSTCHECKCOMMAND=check-mk-host-smart

7.4. Asynchronous delivery via the notification spooler

Synchronous or asynchronous

A powerful supplementary CEE function is the Notification spooler. This enables an asynchronous delivery of notifications. What does asynchronous mean in this context?


Synchronous delivery: The notification module waits until the notification script has finished processing. Should this require a longer execution time, subsequent notifications will queue up. If the monitoring is stopped, these notifications will be lost. It is also possible for many notifications within a short time span to build up a queue back to the core, causing the monitoring to falter.

Asynchronous delivery: Every notification is saved to a spool file under var/check_mk/notify/spool. No jam can build up. If the monitoring is stopped, the spool files are retained and the notifications can later be delivered correctly. The notification spooler takes over the processing of the spool files.

Synchronous delivery is feasible if the notification script runs quickly and, above all, cannot run into some sort of timeout. That is a given with notification methods that hand over to existing spoolers – spool services from the system can be used in particular for email and SMS. The notification script simply passes a file to the spooler, and with this procedure no wait state can occur.

With the traceable delivery per SMTP, or with other scripts that establish network connections, you should always employ asynchronous delivery. This also applies to scripts that send text messages (SMS) over the Internet via HTTP. Timeouts when establishing a connection to a network service can take up to several minutes, causing a jam as described above.

Configuring asynchronous delivery

First, verify that the notification spooler (mknotifyd) is active. This should be displayed in omd status:

OMD[mysite]:~$ omd status
mkeventd:       running
liveproxyd:     running
mknotifyd:      running
rrdcached:      running
cmc:            running
apache:         running
crontab:        running
-----------------------
Overall state:  running

If the mknotifyd is missing, it can be activated with:

OMD[mysite]:~$ omd -f config set MKNOTIFYD on

The second step is to activate the asynchronous delivery. For this use the global setting Notifications ➳ Notification spooling with the option Asynchronous local delivery by notification spooler:

From Version 1.4.0i3 the notification spooler is always active and can never be switched off. Asynchronous delivery is thus the default setting for newly-created instances.

Error diagnosis

The notification spooler maintains its own log file: var/log/mknotifyd.log. This possesses three log levels which can be set in the Notifications ➳ Notification spooler configuration ➳ Verbosity of logging global option. By default only 'start', 'end' and error messages are logged. In the middle level, the processing of the spool files can be seen:

var/log/mknotifyd.log
2016-11-14 15:25:53 [5] -----------------------------------------------------------------
2016-11-14 15:25:53 [5] Check_MK Notification Spooler version 1.2.8p14 starting
2016-11-14 15:25:53 [5] Log verbosity: 2
2016-11-14 15:25:53 [5] Daemonized with PID 9815.
2016-11-14 15:26:20 [6] Detected updated configuration by WATO.
2016-11-14 15:26:20 [7] Reading configuration file /omd/sites/mysite/etc/check_mk/mknotifyd.d/wato/global.mk
2016-11-14 15:27:17 [6] Processing spoolfile: /omd/sites/mysite/var/check_mk/notify/spool/8db5dfd8-3f93-474a-9e48-22945af71fd4
2016-11-14 15:27:17 [7] process result <-1> of file /omd/sites/mysite/var/check_mk/notify/spool/8db5dfd8-3f93-474a-9e48-22945af71fd4
2016-11-14 15:27:44 [6] Processing spoolfile: /omd/sites/mysite/var/check_mk/notify/spool/f58df405-0011-46f8-a981-73a607d11705
2016-11-14 15:27:44 [7] process result <-1> of file /omd/sites/mysite/var/check_mk/notify/spool/f58df405-0011-46f8-a981-73a607d11705

8. Bulk notifications

Everyone who works with monitoring has experienced an isolated problem setting off a veritable flood of (successive) notifications. The principle of the parent hosts is a way of reducing these under specific circumstances, but unfortunately it doesn't help in all cases.

We can take an example from the Check_MK project itself: once each day we build Check_MK installation packages for every supported Linux distribution. Our own Check_MK monitoring is set up so that we have a service that is only OK if the correct number of packages has been correctly built. It can occasionally happen that a general error in the software hampers the packaging, causing 43 services to go into a CRIT state simultaneously.

Our bulk notification is configured so that in such a case only a single email is sent, listing all 43 notifications in sequence. This is naturally clearer than 43 individual emails, and it also reduces the risk that 'in the heat of the battle' one overlooks a 44th email belonging to quite a different problem.

The mode of operation of the bulk notification is very simple. When a notification occurs, it will at first be held back for a short time. Subsequent notifications that occur during this time will simply be added into the same email. This collecting can be defined for each rule – so, for example, during the day you can operate with individual emails, but overnight with bulk notifications. If bulk notification is activated you will generally be offered the following options:

The waiting time can be configured as desired. In many cases one minute suffices, as by then at the latest all related problems should have appeared. You can of course set a longer time, but that will fundamentally delay the notifications.

Since it naturally makes no sense to throw everything into a single pot, you can specify which groups of problems should be notified collectively. The Host option is very commonly used – this ensures that only notifications from the same host are bundled.

Here are a few additional facts about bulk notifications:

  • If bundling is activated in a rule, it can be deactivated by a subsequent rule – and vice versa.
  • The bulk notification always takes place per contact. Each has their own private collection pot in effect.
  • You can limit the size of the pot. Once the set number is reached the bulk notification will immediately be sent.
  • The notification method must support bulk notifications. This is currently only the case for ASCII email and HTML email.

Bulk notifications and time periods

What happens when a notification is within the notification period, but the bulk notification that contains it – and which comes somewhat later – is outside the notification period? The reverse situation is also possible...

Here a very simple principle applies: all configurations that restrict notifications to time periods are valid only for the actual notification. The subsequent bulk notification will always be delivered independently of all time periods.

9. Traceable delivery per SMTP

9.1. Email is not reliable

Monitoring is only useful when one can rely on it. This requires that notifications are received reliably and promptly. Unfortunately, email delivery is not ideal in this respect. Dispatch is usually processed by passing the email to the local SMTP-server, which attempts to deliver it autonomously and asynchronously.

With a temporary error (e.g., the receiving SMTP-server is not reachable) the email will be put into a queue and a new attempt will be made later. This 'later' will as a rule be after 15-30 minutes. By then the notification could be far too late!

If the mail really can't be delivered, the SMTP-server creates a nice error message in its log file and attempts to send an error mail to the 'sender'. But the monitoring system is not a real sender and cannot receive emails. It follows that such errors simply disappear and notifications go missing.

9.2. Using SMTP on a direct connection enables error analysis

From Check_MK Enterprise Edition Version 1.4.0i2, Check_MK provides the possibility of traceable delivery via SMTP. This it does quite intentionally without the help of the local mail server: instead, Check_MK itself sends the email directly to your smart host via SMTP, and then evaluates the SMTP response itself.

In this way, not only are SMTP-errors treated intelligently, but a correct delivery is also precisely documented. It is a bit like a registered letter: Check_MK receives a receipt from the SMTP-smart host (receiving server) verifying that the email has been accepted – including a Mail-ID.

You can see this exactly documented in the affected service's history. Here is an example in which a service – for testing purposes – was manually set to CRIT. The screenshot below shows the view:

Three separate steps can be seen:

  1. The monitoring core generates a raw notification.
  2. The rules evaluation results in a notification to user hh with the mail method.
  3. The email was successfully received by the smart host – which answered with 250 - 2.0.0 Ok: queued as 4A2B180676.

The execution of the notification's script and the response from the SMTP-server can also be seen in the notify.log:

var/log/notify.log
2016-11-07 13:51:13 Got spool file c8c1f33a (myserver123;CPU utilization) for local delivery via mail
2016-11-07 13:51:13      executing /omd/sites/mysite/share/check_mk/notifications/mail
2016-11-07 13:51:14      Output: success 250 - 2.0.0 Ok: queued as ECB7A82019

The Message-ID 4A2B180676 will appear in the smart host's log file. There – should the need arise – you can investigate where the email has got to. In any case you can prove that, and when, it was correctly sent from Check_MK.

Let us repeat the test from above, but this time with a falsely-configured password for the SMTP-transfer to the smart host. Here the SMTP-error message from the smart host can clearly be seen: (535, '5.7.8 Error: authentication failed:')

What can be done about failed notifications? Notifying about them by email is obviously not a good solution. Instead, Check_MK displays an explicit warning in the Tactical Overview:

Here you can:

  • Click on the text ... failed notifications for a list of the failed deliveries.
  • Click on the corresponding button to acknowledge these messages and delete the notices.

Configuring asynchronous delivery

Please note that in error situations direct delivery per SMTP can lead to a notification script running for a very long time and eventually running into a timeout. For this reason you are strongly advised to use the notification spooler and to select asynchronous delivery of notifications.

The handling of repeatable errors (such as an SMTP timeout) can be defined per notification method in the global settings under Notifications ➳ Notification spooler configuration:

Alongside an optional timeout (the default is 60 seconds) and a maximum number of retries, it can also be defined whether the script is permitted to run in multiple parallel instances and thus send multiple notifications simultaneously (Maximum concurrent executions). If the script is very slow, parallel execution can make sense – however the script must be programmed so that multiple instances run cleanly alongside each other (and, for example, do not lock the same files exclusively).

Multiple parallel delivery over SMTP is unproblematic, since the target server can manage multiple parallel connections. This is certainly not the case when sending SMS directly via a modem without an additional spooler – here one should stick with a setting of 1.

SMS and other notification methods

Delivery with error evaluation and traceability has to date only been implemented for HTML emails. How an error status can be returned in a self-written notification script is explained in the section on writing your own scripts.

10. Notification in distributed systems

In distributed environments – i.e. those with more than a single Check_MK instance – the question arises of what should happen with notifications that are generated on remote instances.

In such a situation there are basically two possibilities:

  1. Local delivery
  2. Central delivery on the master system (only CEE)

Detailed information on this subject can be found in the article on distributed monitoring.

11. Notification scripts

11.1. Basic principle

Notifications can be delivered in many varied and individual ways. Typical cases are:

  • Forwarding of notifications to a ticket system or an external notification system
  • The sending of an SMS over various internet services
  • Automated telephone calls
  • Forwarding to a higher (master) monitoring system

For this reason Check_MK provides a very simple interface which enables you to write your own notification scripts. These can be written in any Linux-supported programming language – even though Shell, Perl and Python together have 95% of the “market”.

The standard scripts included with Check_MK can be found in share/check_mk/notifications. This directory is part of the software and is not intended to be changed. Instead, save your own scripts in local/share/check_mk/notifications. Ensure that your scripts are executable (chmod +x ...). They will then be found automatically and offered for selection in the notification rules.

Should you wish to customise a standard script, simply copy it from share/check_mk/notifications to local/share/check_mk/notifications and make your changes in the copy. If you retain the original name, your script will automatically substitute for the standard version, and no changes will need to be made to the existing notification rules.
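
For example, to customise the HTML email script mail – the same script that appears in the notify.log extracts earlier in this article:

OMD[mysite]:~$ cp share/check_mk/notifications/mail local/share/check_mk/notifications/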

A number of example scripts are included with the software in share/doc/check_mk/treasures/notifications. You can use these as templates for customisation. The configuration generally takes place directly in the script – tips covering this can be found there in the comments.

When a notification occurs, your script will be called with the instance user's permissions. In environment variables (those that begin with NOTIFY_) it will receive all of the information about the affected host/service, the event, the contacts to be notified, and the parameters specified in the notification rule.

Texts that the script writes to the standard output (with print, echo, etc.) will appear in var/log/notify.log.

11.2. Traceable notifications

From Version 1.4.0i2, notification scripts have the option of using their exit code to communicate whether a repeatable (temporary) or a final error has occurred:

Exit code      Function
0              The script was executed successfully.
1              A temporary error has occurred. The execution will be reattempted after a short wait, up until the configured maximum number of attempts has been reached. Example: an HTTP-connection to an SMS-service cannot be established.
2 and higher   A final error has occurred. The notification will not be reattempted. A notification error will be displayed in the GUI, and the error will appear in the host's/service's history. Example: the SMS-service reports “invalid authentication”.

Additionally, in all cases the notification script's standard output, together with the exit status, will be entered in the host's/service's monitoring history and will therefore be visible in the GUI.

The treatment of notification errors from the user's point of view will be explained in the chapter on traceable delivery per SMTP.
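
As an illustration of these exit codes, here is a minimal sketch of a script for SMS delivery via an HTTP API. The gateway URL, its parameters and its "OK" answer are hypothetical stand-ins for whatever your provider actually offers:

local/share/check_mk/notifications/mysms
#!/bin/bash
# My SMS via HTTP

# Hypothetical gateway - replace URL and parameters with your provider's API.
GATEWAY="https://sms.example.com/send"

# A connection failure or timeout is a temporary error: exit code 1
# lets Check_MK retry the notification later.
response=$(curl --silent --show-error --max-time 10 \
    --data-urlencode "to=$NOTIFY_CONTACTPAGER" \
    --data-urlencode "text=$NOTIFY_HOSTNAME $NOTIFY_SERVICESTATE" \
    "$GATEWAY") || {
    echo "Could not reach the SMS gateway - will be retried"
    exit 1
}

# Anything other than the (assumed) answer "OK" is treated as final:
# exit code 2 shows a notification error in the GUI, with no retry.
if [ "$response" != "OK" ]; then
    echo "Gateway rejected the message: $response"
    exit 2
fi

echo "SMS accepted by the gateway"
exit 0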

11.3. A simple example

As an example we will write a script that writes all of the information for an alarm to a file. As the coding language here we will use the Linux shell (BASH):

local/share/check_mk/notifications/foobar
#!/bin/bash
# Foobar Teleprompter

env | grep NOTIFY_ | sort > $OMD_ROOT/tmp/foobar.out
echo "Successfully written $OMD_ROOT/tmp/foobar.out"
exit 0

We then make the script executable:

OMD[mysite]:~$ chmod +x local/share/check_mk/notifications/foobar

Here are a couple of explanations concerning the script:

  • In the first line is a #! and the path to the script language's interpreter (here /bin/bash).
  • In the second line, after the comment character #, is a title for the script. This will later be displayed in the rule when the notification method is selected.
  • The env command will output all environment variables received by the script.
  • With grep NOTIFY_ the Check_MK variables will be filtered out...
  • ... and sorted alphabetically with sort.
  • > $OMD_ROOT/tmp/foobar.out writes the result to the tmp/foobar.out file within the instance.
  • The exit 0 would actually be superfluous in this location since the shell always takes the exit code from the last command. Here this is echo and is always successful – but explicit is always better.

Test run

So that our script will be used, we must select it as the method in a notification rule. Self-written scripts have no parameter declaration, so all of the check boxes offered by, for example, HTML Email will be missing. Instead, the user can enter a list of texts as parameters, which will be available to the script as NOTIFY_PARAMETER_1, etc. For our test we will provide the parameters Fröhn, Klabuster and Feinbein:

Now to test, we will set the service CPU load on the host myserver123 to CRIT. In notify.log we will see the script executing, and also its single line of output: “Successfully written...”:

var/log/notify.log
2016-11-15 12:30:49 executing /omd/sites/mysite/local/share/check_mk/notifications/foobar
2016-11-15 12:30:49 Output: Successfully written /omd/sites/mysite/tmp/foobar.out

The file tmp/foobar.out will now contain an alphabetically sorted list of all the Check_MK environment variables containing information on the notification. With this you can get an overview of which values are available to your script. Here are the first ten lines:

OMD[mysite]:~$ head tmp/foobar.out
NOTIFY_CONTACTALIAS=Harri Hirsch
NOTIFY_CONTACTEMAIL=mk@mathias-kettner.de
NOTIFY_CONTACTNAME=hh
NOTIFY_CONTACTPAGER=
NOTIFY_CONTACTS=hh
NOTIFY_DATE=2016-11-15
NOTIFY_HOSTACKAUTHOR=
NOTIFY_HOSTACKCOMMENT=
NOTIFY_HOSTADDRESS=127.0.0.1
NOTIFY_HOSTALIAS=myserver123

Our parameters will also be provided:

OMD[mysite]:~$ grep PARAMETER tmp/foobar.out
NOTIFY_PARAMETERS=Fröhn Klabuster Feinbein
NOTIFY_PARAMETER_1=Fröhn
NOTIFY_PARAMETER_2=Klabuster
NOTIFY_PARAMETER_3=Feinbein

11.4. Environment variables

In the above example we have seen a number of the environment variables that are passed to the script. Precisely which variables are available depends on the notification and also on the Check_MK version and edition being used. Alongside the trick with env there are two further ways of getting a complete list of all variables:

  • Raising the log level for notify.log in the global settings
  • For notifications per HTML email there is a check box Information to be displayed in the email body with the option Complete variable list (for testing).

Below is a list of the most important variables:

OMD_ROOT Home directory for the instance, e.g., /omd/sites/mysite
OMD_SITE The instance name, e.g., mysite
NOTIFY_WHAT For host notifications the word HOST, otherwise SERVICE. With this you can make your script intelligent enough to log useful information in both cases.
NOTIFY_CONTACTNAME User name (Login) for the contact to be notified.
NOTIFY_CONTACTEMAIL Email address of the contact to be notified.
NOTIFY_CONTACTPAGER Entry in the Pager field in the contact's user profile. Since the field is not generally reserved for a specific purpose, you can simply use it for each user in order to save information required for notifications.
NOTIFY_DATE Date of the notification in ISO-8601-Format, e.g., 2016-11-15.
NOTIFY_LONGDATETIME Date and time in the non-localised Linux system's default display, e.g., Tue Nov 15 12:31:06 CET 2016.
NOTIFY_SHORTDATETIME Date and time in ISO-Format, e.g., 2016-11-15 12:31:06.
NOTIFY_HOSTNAME The name of the affected host in the monitoring.
NOTIFY_HOSTOUTPUT Output from the host check plug-in (e.g., “Packet received via smart PING”). This output is only relevant for host notifications, but is also present in service notifications.
NOTIFY_HOSTSTATE One of the words: UP, DOWN or UNREACH
NOTIFY_NOTIFICATIONTYPE The notification type (see the introduction to this article). This will be expressed by one of the following words:

PROBLEM - Normal host or service problem
RECOVERY - Host/Service is again UP / OK
ACKNOWLEDGEMENT (...) - acknowledgement of a problem
FLAPPINGSTART - A Host/Service has begun flapping
FLAPPINGSTOP - Flapping has ended
DOWNTIMESTART - Start of a planned maintenance.
DOWNTIMEEND - Normal end of a maintenance
DOWNTIMECANCELLED - Premature interruption of a maintenance
CUSTOM - A notification issued by a manual command
ALERTHANDLER (...) - An alert handler execution (CEE from 1.4.0i2)

For types with (...), the brackets contain additional information on the notification's type.

NOTIFY_PARAMETERS All of the script's parameters separated by blanks.
NOTIFY_PARAMETER_1 The script's first parameter.
NOTIFY_PARAMETER_2 The script's second parameter, etc.
NOTIFY_SERVICEDESC The name of the service being notified. This variable is not present in host notifications.
NOTIFY_SERVICEOUTPUT The service check's check plug-in's output (not for host notifications)
NOTIFY_SERVICESTATE One of the words: OK, WARN, CRIT or UNKNOWN
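
Since NOTIFY_WHAT distinguishes host from service notifications, a script can branch on it to build a sensible message in both cases. A minimal sketch using only the variables listed above:

local/share/check_mk/notifications/mysummary
#!/bin/bash
# My Summary Line

# Host notifications carry the HOST* variables, service notifications
# additionally carry the SERVICE* variables.
if [ "$NOTIFY_WHAT" = "HOST" ]; then
    state=$NOTIFY_HOSTSTATE
    output=$NOTIFY_HOSTOUTPUT
    subject="Host $NOTIFY_HOSTNAME is $state"
else
    state=$NOTIFY_SERVICESTATE
    output=$NOTIFY_SERVICEOUTPUT
    subject="Service $NOTIFY_SERVICEDESC on $NOTIFY_HOSTNAME is $state"
fi

# This line ends up in var/log/notify.log.
echo "$NOTIFY_NOTIFICATIONTYPE: $subject - $output"
exit 0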

11.5. Bulk notifications

If your script is to support bulk notifications, it will need to be specially prepared, since it must then deliver multiple notifications simultaneously. For this reason delivery via environment variables is also not practicable here.

Flag your script with a third header line as shown below – the notification module will then send the notifications through the standard input:

local/share/check_mk/notifications/mybulk
#!/bin/bash
# My Bulk Notification
# Bulk: yes

Through the standard input the script will receive blocks of variables. Each line has the form: NAME=VALUE. Blocks are separated by blank lines. The ASCII-character with the code 1 (\a) is used to represent newlines within the text.

The first block contains a list of general variables (e.g., the call parameters). Each subsequent block assembles the variables for one notification.

The best recommendation is to try it out yourself with a simple test script that writes the complete data to a file, so that you can see how the data is sent. This can be done as below:

local/share/check_mk/notifications/mybulk
#!/bin/bash
# My Bulk Notification
# Bulk: yes

cat > $OMD_ROOT/tmp/mybulktest.out
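
Building on this, here is a minimal parsing sketch. It assumes the block format described above, with variable names as in the raw context shown earlier (i.e. without the NOTIFY_ prefix) – verify this against your own test output first:

local/share/check_mk/notifications/mybulk2
#!/bin/bash
# My Bulk Notification 2
# Bulk: yes

# Split the input at blank lines and log one summary line per block.
out=$OMD_ROOT/tmp/mybulk.out
host=""; svc=""

flush () {
    # The first block (general variables) contains no HOSTNAME and is
    # therefore skipped.
    [ -n "$host" ] && echo "notification: $host / ${svc:-<host>}" >>"$out"
    host=""; svc=""
}

while IFS= read -r line; do
    case "$line" in
        "")            flush ;;
        HOSTNAME=*)    host=${line#HOSTNAME=} ;;
        SERVICEDESC=*) svc=${line#SERVICEDESC=} ;;
    esac
done
flush   # the last block may not be followed by a blank line
exit 0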

12. Files and directories

12.1. Paths from Check_MK

Path Function
var/log/cmc.log The CMC log file. If notification debugging is activated, here you will find precise information as to why notifications were, or were not, generated.
var/log/notify.log The notification module's log file.
var/log/mknotifyd.log The notification spooler's log file.
var/log/mknotifyd.state The current status of the notification spooler. This is primarily relevant for distributed notifications.
var/nagios/debug.log The Nagios debug log file. Switch on the debug messages in the variable debug_level in etc/nagios/nagios.d/logging.cfg.
var/check_mk/notify/spool/ Storage location for the spool files to be processed by the notification spooler.
var/check_mk/notify/deferred/ With temporary errors the notification spooler moves the files to here and retries after a couple of minutes.
var/check_mk/notify/corrupted/ Defective spool files will be moved to here.
share/check_mk/notifications Notification scripts supplied as standard with Check_MK. Make no changes here.
local/share/check_mk/notifications Storage location for your own notification scripts. If you wish to customise a standard script, copy it from share/check_mk/notifications to here, and retain the original file name.
share/doc/check_mk/treasures/notifications Here are a number of notification scripts which you can slightly customise and use.

12.2. The SMTP-service's log files

The SMTP-service's log files lie outside of Check_MK and therefore have absolute paths. Precisely where they are stored depends on your distribution.

Path Function
/var/log/mail.log The SMTP-server's log file under Debian and Ubuntu
/var/log/mail The SMTP-server's log file under SUSE LINUX (SLES)
/var/log/maillog The SMTP-server's log file under Red Hat