Distributed Monitoring

Last updated: January 15, 2017


1. Introduction

Probably not everybody has the same understanding of the term ‘Distributed Monitoring’. In fact, monitoring is always distributed over multiple computers – unless the monitoring system only monitors itself, which wouldn't be very useful.

In this handbook we therefore speak of distributed monitoring whenever the monitoring system as a whole consists of more than a single Check_MK-instance. There are a number of good reasons for splitting monitoring over multiple instances:

  • Performance: The processor load should, or must, be shared over multiple machines.
  • Organisation: Various different groups should be able to administer their own instances independently.
  • Availability: The monitoring at one location should function independently of other locations.
  • Security: Data streams between two security domains should be separately and precisely controlled (DMZ, etc.)
  • Network: Locations that have only narrow band or unreliable connections cannot be reliably remote-monitored.

Check_MK supports various procedures for implementing a distributed monitoring. Some of these it inherits by being largely compatible with, or based on, Nagios (if Nagios has been installed as the core). These include the old NSCA procedure and the somewhat more modern mod_gearman. Compared to Check_MK's own system they offer no advantages and are also more cumbersome to implement. For these reasons we don't recommend them.

The procedure preferred by Check_MK is based on Livestatus and a division of the configuration using WATO. For situations with strictly separated networks, or even a strict one-way data transfer from the periphery to the centre, there is a method using Livedump or, respectively, CMCDump. Both methods can be combined.

2. Distributed monitoring with livestatus

2.1. Basic principle

Central status

Livestatus is an interface integrated into the monitoring core which enables external programs to query status data and execute commands. Livestatus can be made available over the network so that it can be accessed by a remote Check_MK-instance. Check_MK's user interface uses Livestatus to combine all tethered instances into a general overview. The result then feels like one ‘large’ monitoring system.

The following diagram schematically shows the structure of a monitoring distributed over three locations using Livestatus. The Check_MK-instance Master Site is located at the central site. From here the central systems are monitored directly. In addition, there are the Slave Site 1 and Slave Site 2 instances, which are located in other networks and monitor their local systems:

What makes this method special is that the monitoring status of the slaves is not sent continuously to the master. The GUI always only retrieves data live from the remote instances when it is required by a user in the control centre. The data is then compiled into a centralised view. There is thus no central data holding, which means it offers huge advantages for scaling up!

Here are some of the advantages of this method:

  • Scalability: The monitoring itself generates no network traffic at all between master and slaves. In this way hundreds of locations, or more, can be connected.
  • Reliability: If a network connection to a slave fails the local monitoring nonetheless continues operating normally. There is no ‘hole’ in the data recording and also no data ‘jam’. A local notification will still function.
  • Simplicity: Instances can be very easily incorporated and removed.
  • Flexibility: The slave instances are still self-contained and can be used for operations at their respective location. This is particularly interesting if the ‘location’ should never be permitted to access the rest of the monitoring.

Central configuration

In a system distributed using Livestatus as described above, it is quite possible that the individual instances can be independently maintained by different teams, and the master only has the task of providing a centralised dashboard.

If multiple or even all instances are to be administered by the same team, a central configuration is much easier to handle. Check_MK supports this and refers to such a configuration as a ‘distributed WATO’. With this, all hosts and services, users and permissions, time periods, notifications, etc., are maintained centrally on the master using WATO and then automatically distributed to the slaves according to their respective responsibilities.

Such a system not only has a common status overview but also a common configuration, and effectively ‘feels like a large system’.

2.2. Installing a distributed monitoring

Installing a distributed monitoring using livestatus/distributed WATO is achieved in the following steps:

  1. First install the master instance as is usually done for a single instance
  2. Install slave instances, and enable livestatus via the network
  3. Integrate the slave instances into the master using the Distributed monitoring WATO-module
  4. For the hosts and services, specify from which instance they are to be monitored
  5. Execute a service discovery for the migrated hosts, and then activate the fresh changes

Installing a master instance

No special requirements are placed on the master. This means that a long-established instance can be expanded into a distributed monitoring without requiring additional modifications.

Installing slave instances and enabling livestatus via the network

The slave instances are then generated as new instances in the usual way with omd create. This will naturally take place on the (remote) server intended for the respective slave instance.

Special notes:

  • For the slave instances, use IDs unique to your distributed monitoring.
  • The slave's Check_MK-version is permitted to diverge from the master's version to a maximum of one patch level (denoted by the numeral following the ‘p’ for stable versions). Other versions may be compatible, but not necessarily. Information on the Check_MK-version numbering system can be found in its own article.
  • In the same way as Check_MK supports multiple instances on a server, slave instances can also run on the same server.

Here is an example for creating a slave instance with the name slave1:

root@linux# omd create slave1
Adding /opt/omd/sites/slave1/tmp to /etc/fstab.
Creating temporary filesystem /omd/sites/slave1/tmp...OK
Restarting Apache...OK
Created new site slave1 with version 1.2.8p12.

  The site can be started with omd start slave1.
  The default web UI is available at http://Klappfisch/slave1/
  The admin user for the web applications is omdadmin with password omd.
  Please do a su - slave1 for administration of this site.

The most important step is now to enable Livestatus via TCP on the network. Please note that Livestatus is not per se a secure protocol and should only be used within a secure network (secured LAN, VPN, etc.). Enabling is done with omd config as the instance user while the site is stopped:

root@linux# su - slave1
OMD[slave1]:~$ omd config

Now select Distributed Monitoring:

Set LIVESTATUS_TCP to 'on' and for LIVESTATUS_TCP_PORT enter an available port number that is unique on this server. The default is 6557:
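
The same two settings can also be made non-interactively, which is handy for scripted site setups – a minimal sketch using omd config set (the port is only an example; the site must be stopped):

OMD[slave1]:~$ omd config set LIVESTATUS_TCP on
OMD[slave1]:~$ omd config set LIVESTATUS_TCP_PORT 6557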

After saving, start the instance as normal with omd start:

OMD[slave1]:~$ omd start
Starting mkeventd...OK
Starting Livestatus Proxy-Daemon...OK
Starting rrdcached...OK
Starting Check_MK Micro Core...OK
Starting dedicated Apache for site slave1...OK
Starting xinetd...OK
Initializing Crontab...OK

Retain the default password for omdadmin temporarily. Once the slave has been subordinated to the master, all users will likewise be replaced by those from the master.

The slave is now ready. A check with netstat should show that port 6557 is open. The connection to this port is handled by the auxiliary daemon xinetd, which runs directly within the site:

root@linux# netstat -lnp | grep 6557
tcp        0      0 0.0.0.0:6557            0.0.0.0:*     LISTEN      10719/xinetd

Assigning slave instances to the master

The configuration of the distributed monitoring takes place exclusively on the master. The required WATO-module is Distributed monitoring, and this serves to manage the connections to the individual instances. For this function the master itself counts as an instance and is already present in the list:

Now define the connection to the first slave:

In the Basic settings it is important to use the slave instance's EXACT name – as defined with omd create – as the Site-ID. As always, the alias can be chosen freely and also changed later.

The Livestatus settings determine how the central instance queries the status of the slave via Livestatus. The example in the screenshot shows a connection using the Connect via TCP method. This is optimal for stable connections with short latencies (such as, e.g., in a LAN). We will discuss the optimal settings for WAN connections later.

The URL prefix is required for integrating other applications (e.g., PNP4Nagios) – we will come to this subject separately later. Enter the HTTP-URL of the slave's web interface here (only the part preceding the check_mk/ component). If you generally access Check_MK via HTTPS, substitute the http here with https. Further information can be found in the online help.

The use of Distributed WATO is, as we discussed in the introduction, optional. Activate this if you wish to configure the slave with and from the master. In such a case select the exact settings as shown in the image above.

A correct setting for the Multisite-URL of the remote site is very important. The URL must always end with /check_mk/. A connection over HTTPS is recommended, provided that the slave instance's Apache supports HTTPS. This must be set up manually on the slave at the operating-system level. For the Check_MK Appliance, HTTPS can be set up using the web-based configuration interface. If you use a self-signed certificate, you will need the Ignore SSL certificate errors check box.
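
How HTTPS is enabled in the slave's system Apache depends on the distribution. As a rough sketch for Debian or Ubuntu, using the distribution's self-signed ‘snakeoil’ certificate (replace it with a proper certificate for production use):

root@linux# a2enmod ssl
root@linux# a2ensite default-ssl
root@linux# systemctl restart apache2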

Once the mask has been saved a second instance will appear in the overview:

The (so far) empty slave's monitoring status is now correctly integrated. For the distributed WATO a login to the slave's WATO is still required. For this, the master exchanges a randomly-generated password with the slave via HTTP, over which all further communication will then take place. The omdadmin access on the slave will subsequently no longer be used.

To log in, use the credentials omdadmin and omd (or, respectively, those of an administrator account on the slave):

A successful login will be acknowledged accordingly:

Should an error occur with the login, this could be due to a number of reasons – for example:

  1. The slave instance is currently stopped.
  2. The Multisite-URL of the remote site has not been correctly set up.
  3. The slave is not reachable from the master under the host name specified in the URL.
  4. The Check_MK-versions of the master and the slave are (too) incompatible.
  5. An invalid user ID and/or password have been entered.

Points 1 and 2 can easily be tested by manually calling the slave's URL in your browser.

When everything has been successful, run Activate Changes. This will, as always, bring you to an overview of the changes not yet activated. At the same time it also shows the states of the Livestatus connections as well as the WATO-synchronisation states of the individual instances:

The Version column shows the Livestatus version of the respective site. When the CMC is used as the core (Check_MK Enterprise Edition), the core's version number (column 'Core') is identical to the Livestatus version. If you are using Nagios as the core (Check_MK Raw Edition), the Nagios version number will be shown here.

The following symbols show WATO's replication status:

  • This instance has outstanding changes. The configuration matches the master, but not all changes have been activated yet. With the Restart button a targeted activation for just this instance can be performed.
  • The WATO-configuration of this instance is not synchronous and must be transferred. A restart will then of course also be necessary to activate it. Both steps can be performed together with the Sync & Restart button.

In the Status column the state of the Livestatus connection to the respective instance can be seen. This is shown purely for information, since the configuration is not transmitted via Livestatus but rather over HTTP. The following values are possible:

  • The instance is reachable via Livestatus.
  • The instance is currently not reachable. Livestatus queries run into a timeout. This delays the loading of pages. Status data from this instance is not visible in the GUI.
  • The instance is currently not reachable, but this is known through a configured status host or through the Livestatus proxy (see below). The inaccessibility therefore does not lead to timeouts. Status data from this instance is not visible in the GUI.
  • The Livestatus connection to this instance has been temporarily deactivated by the (master's) administrator. This corresponds to the ‘Temporarily disable this connection’ check box in the settings for this connection.

Clicking on the activation button will now synchronise all instances and activate the changes. This happens in parallel, so that the overall time required equates to that of the slowest instance. Included in this time are the creation of a configuration snapshot for the respective instance, its transmission over HTTP, the unpacking of the snapshot on the slave, and the activation of the changes.

Important: Do not leave the page before the synchronisation has been completed on all instances – leaving the page will interrupt the synchronisation.

Specifying for hosts and folders which instance should monitor them

Once your distributed environment has been installed you can begin to use it. You actually only need to tell each host by which instance it should be monitored. The master is specified by default.

The required attribute for this is ‘Monitored on site’. You can set this individually for each host. This can naturally also be performed at the folder level:

Executing a fresh service discovery and activating changes for migrated hosts

Adding hosts works as usual. Apart from the fact that the monitoring as well as the service discovery are performed by the respective slave instance, there are no special considerations.

When migrating hosts from one instance to another there are a couple of points to be aware of. Neither current nor historic status data from the host will be carried over. Only the host's configuration is retained in the WATO. In effect it is as if the host has been removed from one instance and freshly-installed on the other instance:

  • Automatically discovered services will not be migrated. Run a Service discovery after the migration.
  • After the switch, hosts and services will initially be in the PEND state. Currently existing problems may as a result be notified anew.
  • Historic graphing data will be lost. This can be avoided by manually moving the relevant RRD-files (see the sketch following this list). The location of the files can be found in Files and directories.
  • Data for availability and from historic events will be lost. These are unfortunately not easy to migrate as the data consists of single lines in the monitoring log.
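
A minimal sketch of such an RRD move for the Raw Edition, where PNP4Nagios stores its round-robin databases per host under var/pnp4nagios/perfdata/ (paths differ when the CMC is the core; the host name myserver666 and the target site slave2@othersrv.mydomain are only examples):

OMD[slave1]:~$ scp -r var/pnp4nagios/perfdata/myserver666 slave2@othersrv.mydomain:var/pnp4nagios/perfdata/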

If the continuity of the history is important, when implementing the monitoring you should carefully plan which host is to be monitored, and from where.

2.3. Special features of a distributed setup

A monitoring distributed via livestatus behaves much like a single system, but has a couple of special characteristics:

Access to the monitored hosts

All accesses to a monitored host are consistently carried out from the instance to which the host is assigned. This applies not only to the actual monitoring, but also to the service discovery, the Diagnostics page, notifications, alert handlers and everything else. This is very important, since it is by no means guaranteed that the master even has access to this host.

Specifying the instance in views

Some of the standard views are grouped according to the instance from which the hosts are monitored – this applies, e.g., to All hosts:

The instance will likewise be shown in the host's or service's details:

This information is generally available as a column when creating your own views. There is also a filter with which a view can be restricted to the hosts of a specific site:

Site status element

There is a Site status snap-in for the sidebar, which can be added via the sidebar's snap-in selection. This displays the status of the individual instances and also provides the option of temporarily hiding or showing individual sites by clicking on the status. Hidden sites will be flagged accordingly. In this way you can also disable an instance that is generating timeouts, thus avoiding superfluous timeouts:

This is not the same as disabling the livestatus connection using the connection configuration in WATO. Here the ‘disabling’ only affects the currently logged-in user and has a purely visual function. Clicking on an instance's name will display a view of all of its hosts.

Master control element

In a distributed monitoring the Master control element has a different appearance. Each instance has its own global switch:

Check_MK Cluster hosts

If you monitor HA-clusters with Check_MK, the cluster's individual nodes must be assigned to the same instance as the cluster itself. This is because determining the state of the clustered services accesses cache files that are generated when monitoring the nodes. This data is held locally on the respective instance.

Piggyback data (e.g., ESX)

Some check plug-ins use ‘Piggyback’ data, for example, for allocating monitoring data retrieved from an ESX-host to the individual virtual machines. For the same reason as with cluster monitoring, in distributed monitoring the ‘piggy’ (carrying) host as well as its dependent hosts must be monitored from the same instance. In the case of ESX this means that the virtual machines must be assigned to the same site in Check_MK as the ESX-System from which the monitoring data is collected. This can mean that it is better to poll the ESX-Hostsystem directly rather than to poll a global vCenter. Details for this can be found in the documentation on ESX-monitoring.

Hardware/software inventory

The Check_MK-inventory also functions in distributed environments. In doing so the inventory data from the var/check_mk/inventory directory must be regularly transmitted from the slaves to the master. For performance reasons the user interface always accesses this directory locally.

In the  Check_MK Enterprise Edition the synchronisation is carried out automatically on all sites that are connected using the Livestatus proxy.

If you run inventories using the  Check_MK Raw Edition in distributed systems, the directory must be regularly mirrored to the master with your own tools (e.g., with rsync).
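
A minimal sketch of such a mirroring, executed regularly (e.g., via cron) as the site user on the master; it assumes SSH access from the master to the slave host, and slave1@slave1host.mydomain is only an example:

OMD[master]:~$ rsync -a slave1@slave1host.mydomain:var/check_mk/inventory/ var/check_mk/inventory/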

Changing a password

Even when all instances are administered centrally, logging in to the interface of an individual instance is quite possible and often also appropriate. For this reason WATO ensures that a user's password is always the same on all instances.

A password change made by the administrator will take effect automatically as soon as it is shared to all instances with Activate Changes.

A change made by a user themselves using the sidebar in their personal settings works somewhat differently. This cannot execute an Activate changes since the user of course has no general authority for this function. In such a case WATO will automatically share the changed password across all instances – directly after it has been saved in fact.

As we all know, networks are never 100% available. If an instance is unreachable at the time of a password change, it will not receive the new password. Until the administrator successfully runs an Activate changes, or respectively, the next successful password change, this instance will retain the old password for the user. A status symbol will inform the user of the status of the password sharing to the individual instances.

2.4. Tethering existing instances

As mentioned above, existing instances can also be retrospectively tethered to a distributed monitoring. As long as the preconditions described above are satisfied (compatible Check_MK versions), this is done exactly as when setting up a new slave: enable Livestatus via TCP, add the instance in the Distributed monitoring module – and you're done!

The second stage – the changeover to a centralised configuration – is somewhat trickier. Before integrating the instance into the distributed WATO as described above, you should be aware that in doing so the instance's entire local configuration will be overwritten!

Should you wish to take over existing hosts, and possibly rules as well, three steps will be required:

  1. Match the host tags' scheme
  2. Copy the WATO-directories
  3. Edit the characteristics in the parent folder once

Host tags

It is self-evident that the host tags used in the slave must also be known to the master in order that they can be carried over. Check these before the migration and add any missing tags to the master by hand. Here it is essential that the Tag-IDs match – the tag's title is irrelevant.

WATO-directories

Next, move the hosts and rules into the central WATO on the master. This only works for hosts and rules in sub-directories (i.e., not in the ‘Main directory’). Hosts in the main directory should therefore first be moved into a sub-directory on the slave using WATO.

The actual migration can then be achieved quite simply by copying the appropriate directories. Each WATO folder corresponds to a directory within etc/check_mk/conf.d/wato/. These can be copied with a tool of your choice (e.g., scp) from the instance being tethered to the same location on the master. If a directory with the same name already exists there, simply rename the one being copied. Please note that on the master the copied files must then belong to the master site's Linux user and group.
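
For example, a WATO folder named mydatacenter could be copied from the instance being tethered to the master like this (the folder name is only an example):

OMD[slave1]:~$ scp -r etc/check_mk/conf.d/wato/mydatacenter master@mymaster.mydomain:etc/check_mk/conf.d/wato/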

After copying, the hosts should appear in the master's central WATO – as should the rules you created in these folders. The folders' attributes are included in the copy as well; they are stored in the hidden .wato file within each folder.

One-time editing and saving

So that the inheritance of attributes from the master's parent folders functions correctly, as a final step following the migration open the properties of the parent folder once and save them – the hosts' attributes will thereby be freshly computed.

2.5. Instance-specific global settings

A centralised configuration over WATO means that first and foremost, all instances have a common and (apart from the hosts) the same configuration. What is the situation however, when individual instances require different global settings? An example could be the CMC setting Maximum concurrent Check_MK checks. It could be that a customised setting is required for a particularly small or a particularly large instance.

For such cases there is an instance-specific global setting. This is reached via the symbol in the Distributed monitoring WATO-module:

Behind this symbol you will find a selection of all global settings – anything you define here is effective only for the chosen instance. A value that diverges from the default is visually highlighted, and it applies only to this instance:

Note: Site-specific settings for the master are only indirectly possible – since it is of course the master that predefines the configuration. In a situation where ONLY the master's settings diverge, for every other site it will be necessary to make site-specific settings to ‘RETURN’ them to the ‘default’.

2.6. Distributed event console

The Event Console processes syslog-messages, SNMP traps and other types of events of an asynchronous nature.

Up to version 1.2.8, the recommended procedure in a distributed environment is to operate only a single instance of the Event Console – namely within the master instance. This then manages the events of all hosts.

This setup has the disadvantage that the hosts' events must be sent to an instance other than the one actively monitoring them. A consequence of this is that, when notifications are generated from the Event Console, information about the hosts is incomplete, since the local Check_MK does not know them. On the one hand this applies to the detection of the hosts' contact groups, and on the other hand to events in which the originating host is identified only by its IP address and a real host name is absent. In such a case notification rules with conditions on host names cannot function.

From version 1.4.0i1 Check_MK provides the option of running the Event Console distributed as well. Every instance then runs its own event processing, which handles the events of all hosts monitored by that instance. The events are thus not sent to the central system; rather they remain on the instances and are only retrieved centrally. This is done via Livestatus, similarly to the active states, and works with both the Check_MK Raw Edition and the Check_MK Enterprise Edition.

Converting to a distributed Event Console according to the new scheme requires the following steps:

  • Activate WATO-Replication to EC (Replicate Event Console configuration to this site) in the connection settings
  • Switch the Syslog location and SNMP-Trap-destinations for the affected hosts to the slave. This is the most laborious task.
  • If you use the Check event state in Event Console rule set, switch this back to Connect to the local Event Console.
  • If you use the Logwatch Event Console Forwarding rule set, switch this likewise to the local Event Console.
  • In the Event Console Settings, switch the Access to event status via TCP back to no access via TCP.

2.7. PNP4Nagios

In the Check_MK Raw Edition the open-source project PNP4Nagios is used for displaying performance data graphically. PNP4Nagios has its own web interface, which is integrated into Check_MK: in some places single graphs are embedded, in others a complete page including its own navigation is shown:

In distributed monitoring the performance databases (RRDs) are always located locally on the slave sites. This is very important because a continuous transmission of all performance data to the master – and the associated network traffic – is thus avoided. Furthermore all of the other advantages of a distributed monitoring through Livestatus are retained, as described at the outset.

Unfortunately PNP4Nagios has no interface comparable to Livestatus for accessing the graphs. Check_MK therefore simply retrieves the individual graphs, or respectively the complete web pages, from PNP4Nagios via HTTP over its standard URLs. Two methods are available for this:

  1. The PNP4Nagios-data is retrieved directly from the user's browser
  2. The PNP4Nagios-data is retrieved from the master and then forwarded to the user

Retrieval via the user's browser

The first method is very simple to implement. For the relevant sites, configure the URL-prefix in the connection's attributes, and set it to the URL used for accessing this instance – though without the /check_mk/:

Check_MK embeds the graphs in the GUI in such a way that the browser retrieves the graphs' PNG images, or respectively the pages' iframes, from PNP4Nagios over this URL. Therefore specify the URL so that it works from the user's browser. Access from the master to the slave is not necessary.

The URL method as just described is quick and easy to set up, but it has a few small disadvantages:

  • Since the browser retrieves the PNP4Nagios-data from a different host than the Check_MK-GUI, the Check_MK session cookie will not be sent. The user must thus log in anew for every slave instance. On first accessing a graph a login screen will appear.
  • The slave server may not in fact be reachable from the user's browser – rather only from the master. In such a case this method can't function.
  • The URL-prefix must be set to begin with either http:// or https:// – the user can then no longer choose between the two.

Retrieval via the master

The best solution to these problems is to have the master retrieve the PNP4Nagios data rather than the user's browser. To this end, set up a proxy rule on the master's Apache server which routes PNP4Nagios queries via HTTP or HTTPS to the correct slave server. Important: this must be done in the operating system's Apache, not in the one running within the instance. Root permissions are therefore required.

The prerequisite for this setup is that all Check_MK instance IDs in your network are unique, since Apache must use the slave ID to decide to which server it should forward.

Assuming the following example:

ID       IP address     Livestatus   Check_MK URL
master   10.15.18.223   local        http://10.15.18.223/master/check_mk/
slave1   10.1.1.133     Port 6557    http://10.1.1.133/slave1/check_mk/

In the connection settings, now simply set /slave1/ as the URL-prefix:

With this, queries to PNP4Nagios initially go to the master on the /slave1 URL. Should the slave1 instance coincidentally be running on the same server as the master, you will now be finished and no proxy rule will be required, since the data can be delivered directly.

In the general case, where the slave runs on another host, you will need root permissions and must create a configuration file for the system-wide Apache server. The path of this file depends on your Linux distribution:

Distribution           Path
RedHat, CentOS         /etc/httpd/conf.d/check_mk_proxy.conf
SLES, Debian, Ubuntu   /etc/apache2/conf.d/check_mk_proxy.conf

The file consists of five lines for each tethered slave instance. In the following example, substitute the instance name (here slave1) and the instance's URL (here http://10.1.1.133/slave1/). Please note that for Apache it is relevant whether a URL ends with a 'slash' or not:

/etc/apache2/conf.d/check_mk_proxy.conf
<Location /slave1>
    Options +FollowSymLinks
    RewriteEngine On
    RewriteRule ^/.+/slave1/(.*) http://10.1.1.133/slave1/$1 [P]
</Location>

This rule tells Apache that all URLs beginning with /slave1 are to be retrieved via reverse-proxy from the URL http://10.1.1.133/slave1.

Important: don't forget to activate the configuration. For SLES, Debian and Ubuntu, perform this with:

root@linux# /etc/init.d/apache2 reload

RedHat and CentOS require:

root@linux# /etc/init.d/httpd reload
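
On newer, systemd-based versions of these distributions the reload can alternatively be triggered with systemctl (apache2 on SLES, Debian and Ubuntu, httpd on RedHat and CentOS):

root@linux# systemctl reload apache2
root@linux# systemctl reload httpd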

If everything has been done correctly, PNP4Nagios must now be able to access the graphs.

2.8. Logwatch

Check_MK includes the mk_logwatch plug-in, with which you can monitor text log files under Linux and Windows – and especially the Windows event log. The GUI provides a special page on which the detected relevant messages can be viewed and acknowledged:

Up until Check_MK version 1.2.8 this page required local access to the stored log messages. These are located on the slave instance from which the respective host is monitored. In a distributed monitoring, however, the master has no direct access to these files. The solution is the same as with PNP4Nagios: the slaves' logwatch pages are embedded and retrieved from the slaves separately via HTTP.

The configuration required for this is identical to that used when setting up Check_MK for PNP4Nagios. If this has already been set up the Logwatch interface will automatically function correctly.

From Check_MK version 1.4.0i1 the logwatch page uses Livestatus exclusively for the data transfer and no longer requires HTTP. Setting up HTTP access or the proxy rule is then only needed by users of the Check_MK Raw Edition for PNP4Nagios.

2.9. NagVis

The NagVis open source program visualises status data from monitoring on self-produced maps, diagrams and other charts. NagVis is integrated in Check_MK and can be used immediately. The access is easiest over the NagVis Maps sidebar element. The integration of NagVis in Check_MK is described in its own article.

NagVis supports distributed monitoring via Livestatus in much the same way as Check_MK does. The connections to the individual sites are referred to as backends. The backends are set up correctly by Check_MK automatically, so that you can immediately begin creating NagVis charts – even in a distributed monitoring.

Select the correct backend for each object that you place on a chart – i.e., the Check_MK instance from which the object is to be monitored. NagVis cannot find the host or service automatically, above all for performance reasons. Therefore if you move hosts to a different slave you will need to update the NagVis-charts accordingly.

Details on backends can be found in the documentation here: NagVis.

3. Unstable or slow connections

The general status overview in the user interface provides live and reliable access to all of the connected instances. The one snag is that a view can only be displayed once all instances have responded. The procedure is always that a Livestatus query is first sent to all instances (for example, “List all services whose state is not OK.”), and the view can only be displayed once the last instance has responded.

It becomes annoying when an instance does not answer at all. To tolerate brief outages (e.g., due to a site restart or a lost TCP packet), the GUI waits a given time before an instance is declared dead and processing continues with the responses from the remaining sites. Until then the GUI appears to ‘hang’. The timeout is set to 10 seconds by default.

If this occasionally happens in your network you should set up either Status hosts or (even better) the Livestatus proxy.

3.1. Status hosts

Configuring status hosts is the recommended procedure with the Check_MK Raw Edition for recognising defective connections reliably. The idea is simple: the master instance actively monitors the connection to each individual slave – after all, that is what we have a monitoring system for! The GUI then knows about unreachable instances and can immediately exclude them and flag them accordingly. Timeouts are thus minimised.

Here is how to set up a status host for a connection:

  1. Add the host on which the slave instance is running to the monitoring on the master.
  2. Enter this as the status host in the connection to the slave:

A failed connection to a slave instance can now only lead to a brief hang of the GUI – namely until the monitoring has recognised it. By reducing the status host's check interval from the default of sixty seconds to, e.g., five seconds, you can minimise the duration of such a hang.

If you have set up a status host, there are further possible states for connections:

  • The computer on which the slave instance is running is currently unreachable for the monitoring because a router is down (the status host has the UNREACH state).
  • The status host that monitors the connection to the slave system has not yet been checked by the monitoring (it is still in the PEND state).
  • The status host's state has an invalid value (this should never occur).

In all three cases the connection to the instance will be excluded and timeouts thus avoided.

3.2. Persistent connections

With the Use persistent connections check box you can prompt the GUI to keep established Livestatus connections to slave instances permanently open and to reuse them for queries. Especially for connections with long round-trip times (e.g., intercontinental), this can make the GUI noticeably more responsive.

Because the GUI is distributed over multiple independent Apache processes, a connection is required for each Apache client process running concurrently. If you have many simultaneous users, make sure that a sufficient number of Livestatus connections is configured in the slave's Nagios core. These are configured in the etc/mk-livestatus/nagios.cfg file; the default is twenty (num_client_threads=20).
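
A minimal sketch of the relevant entry – the actual broker_module line in your site contains the complete module and socket paths, abbreviated here to ‘...’; restart the site afterwards (e.g., with omd restart):

etc/mk-livestatus/nagios.cfg
# Raise the number of parallel Livestatus client connections, e.g. to 100:
broker_module=... num_client_threads=100 ...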

By default, Check_MK's Apache is configured to permit up to 128 simultaneous user connections. This is set in the following section of the etc/apache/apache.conf file:

etc/apache/apache.conf
<IfModule prefork.c>
StartServers         1
MinSpareServers      1
MaxSpareServers      5
ServerLimit          128
MaxClients           128
MaxRequestsPerChild  4000
</IfModule>

This means that under high load up to 128 Apache processes can start which then also generate and sustain up to 128 Livestatus connections. Not setting the num_client_threads high enough can result in errors or a very slow response time in the GUI.

For connections within a LAN or over fast WAN networks we advise against using persistent connections.

3.3. The livestatus proxy

With the Livestatus proxy, the Check_MK Enterprise Edition features a sophisticated mechanism for detecting dead connections. In addition it specifically optimises the performance of connections with long round-trip times. The Livestatus proxy's advantages are:

  • Very fast, proactive detection of unresponsive instances
  • Local caching of queries that deliver static data
  • Standing TCP-connections – which require fewer round trips and consequently allow much faster responses from distant instances (e.g. USA ⇄ China)
  • Precise control of the maximum number of livestatus connections required
  • Enables Hardware/Software inventory in distributed environments

Installation

Installing the livestatus proxy is very simple. It is activated by default in the CEE – which can be seen when starting a site:

OMD[master]:~$ omd start
Starting mkeventd...OK
Starting Livestatus Proxy-Daemon...OK
Starting rrdcached...OK
Starting Check_MK Micro Core...OK
Starting dedicated Apache for site slave1...OK
Starting xinetd...OK
Initializing Crontab...OK

For the connection to the slave, select the setting ‘Use Livestatus Proxy-Daemon’ instead of ‘Connect via TCP’:

The details for host and port are the same as before. No changes need to be made on the slave. Under Number of channels to keep open, enter the number of parallel TCP connections that the proxy should establish and maintain to the target site.

The TCP-connections pool is shared by all GUI enquiries. The number of connections limits the maximum number of queries that can be processed concurrently. This indirectly limits the number of users. In situations in which all channels are reserved this will not immediately lead to an error. The GUI waits a given time for a free channel. Most queries actually require only a few milliseconds.

If the GUI must wait longer than Timeout waiting for a free channel for a free channel, it aborts with an error and the user receives an error message. In such a case the number of connections should be increased. Be aware, however, that on the remote side (the slave) sufficiently many parallel incoming connections must be allowed – this is set to 20 by default. The setting can be found in the global options under Monitoring core ➳ Maximum concurrent Livestatus connections.

The Regular heartbeat provides a constantly active monitoring of the connections directly at the protocol level. In the process the proxy regularly sends a simple Livestatus query which must be answered by the slave within the predetermined time (default: 2 seconds). With this method a situation where the target server and the TCP-port are actually reachable, but the monitoring core no longer responds, will also be detected.

If a response fails to appear, all connections will be declared ‘dead’, and following a ‘cooldown’ time (default: 4 seconds) will be newly established. All this takes place proactively – i.e. without a user needing to open a GUI-window. In this way outages can be quickly detected, and via a recovery the connections can be immediately reestablished and in the best case be available before a user even notices the outage.

Caching ensures that static queries need only be answered once by the slave, and from that point on can be answered directly and locally without delay. An example of this is the list of monitored hosts that the Quicksearch requires.

Error diagnosis

The Livestatus proxy has its own log file, which can be found under var/log/liveproxyd.log. On a correctly-configured slave with five channels (the default) it will look something like this:

var/log/liveproxyd.log
2016-09-19 14:08:53.310197 ----------------------------------------------------------
2016-09-19 14:08:53.310206 Livestatus Proxy-Daemon starting...
2016-09-19 14:08:53.310412 Configured 1 sites
2016-09-19 14:08:53.310469 Removing left-over unix socket /omd/sites/master/tmp/run/liveproxy/slave1
2016-09-19 14:08:53.310684 Channel slave1/5 successfully connected
2016-09-19 14:08:53.310874 Channel slave1/6 successfully connected
2016-09-19 14:08:53.310944 Channel slave1/7 successfully connected
2016-09-19 14:08:53.311009 Channel slave1/8 successfully connected
2016-09-19 14:08:53.311071 Channel slave1/9 successfully connected

The Livestatus proxy regularly records its state in the var/log/liveproxyd.state file:

var/log/liveproxyd.state
Current state:
[slave1]
  State:                   ready
  Last Reset:              2016-09-19 14:08:53 (125 secs ago)
  Site's last reload:      2016-09-19 14:08:45 (134 secs ago)
  Last failed connect:     1970-01-01 01:00:00 (1474287059 secs ago)
  Cached responses:        1
  Last inventory update:   1970-01-01 01:00:00 (1474287059 secs ago)
  PID of inventory update: None
  Channels:
      5 - ready             -  client: none - since: 2016-09-19 14:10:38 ( 20 secs ago)
      6 - ready             -  client: none - since: 2016-09-19 14:10:43 ( 15 secs ago)
      7 - ready             -  client: none - since: 2016-09-19 14:10:48 ( 10 secs ago)
      8 - ready             -  client: none - since: 2016-09-19 14:10:53 (  5 secs ago)
      9 - ready             -  client: none - since: 2016-09-19 14:10:33 ( 25 secs ago)
  Clients:
  Heartbeat:
    heartbeats received: 24
    next in 0.2s

And when an instance is currently stopped the state will look like this:

var/log/liveproxyd.state
----------------------------------------------
Current state:
[slave1]
  State:                   starting
  Last Reset:              2016-09-19 14:12:54 ( 10 secs ago)
  Site's last reload:      2016-09-19 14:12:54 ( 10 secs ago)
  Last failed connect:     2016-09-19 14:13:02 (  2 secs ago)
  Cached responses:        0
  Last inventory update:   1970-01-01 01:00:00 (1474287184 secs ago)
  PID of inventory update: None
  Channels:
  Clients:
  Heartbeat:
    heartbeats received: 0
    next in -5.2s

Here the state is 'starting'. The proxy is thus attempting to establish connections. There are no channels yet. During this state queries to the site will be answered with an error.

4. Livedump and CMCDump

4.1. Motivation

The concept for a distributed monitoring with Check_MK that has been described up until now is a good and simple solution in most cases. It does however require network access from the master to the slaves. There are situations in which access is either not possible or not desired, because, for example:

  • the slaves are in your customer's network for which you have no access
  • the slaves are in a security area to which access is strictly forbidden
  • the slaves have no permanent network connection and no fixed IP-addresses

Distributed monitoring with Livedump or, respectively, CMCDump takes quite a different approach. Here the slaves are set up so that they operate completely independently of the master and are administered decentrally. A distributed WATO is dispensed with.

All of the slaves' hosts and services are then additionally created as copies on the master. Livedump/CMCDump helps here by generating a copy of the slaves' configuration, which can then be loaded into the master.

Now during the monitoring, on every slave a copy of the current status will be written to a file at predetermined intervals (e.g., every minute). This will be transmitted to the master via a user-defined method and will be saved there as a status update. No particular protocol has been provided or specified for this data transfer. Any automatable transfer protocol could be used. It is not essential to use scp – even a transfer by email is conceivable!

Such a setup differs from a ‘normal’ distributed monitoring in the following ways:

  • Updating of the states and performance data on the master is delayed.
  • Calculation of availability on the master will give minimally different results from a calculation on the slave.
  • State changes that occur more quickly than the update interval are invisible to the master.
  • If a slave is ‘dead’, its states become obsolete on the master – the services are shown as ‘stale’, but nonetheless remain visible. Performance and availability data for this period are ‘lost’ (but remain available on the slave).
  • Commands on the master such as Downtimes and Acknowledgements cannot be transmitted to the slave.
  • The master can never access the slaves.
  • Access to logfile details by Logwatch is impossible.
  • The Event Console will not be supported by Livedump/CMCDump.

Since brief state changes – depending on the transfer interval chosen – may not be visible on the master, notifications via the master are not ideal. If however the master is used purely as a display instance – as a central overview of all customers, for example – this method definitely has its advantages.

Incidentally, Livedump/CMCDump can be used alongside distributed monitoring over Livestatus without problems. Some instances are then simply connected directly via Livestatus – others use Livedump. Livedump can also be added on one of the Livestatus slaves.

4.2. Installing Livedump

If you are using the Check_MK Raw Edition (or the CEE with a Nagios core), use the livedump tool. The name is derived from Livestatus and status dump. From Check_MK version 1.2.8p12, livedump is located directly in the search path and is thus available as a command. In older versions you can find it under ~/share/doc/check_mk/treasures/livedump/livedump.

We will make the following assumptions...

  • ... the slave instance has been fully set up and is actively monitoring hosts and services
  • ... the master instance has been started and is running
  • ... at least one host is being locally monitored on the master (because the master monitors itself).

Transferring the configuration

First, on the slave, create a copy of its host and service configuration in Nagios configuration format. To do this, redirect the output of livedump -TC to a file:

OMD[slave1]:~$ livedump -TC > config.cfg

The start of the file will look something like this:

config.cfg
define host {
    name                    livedump-host
    use                     check_mk_default
    register                0
    active_checks_enabled   0
    passive_checks_enabled  1

}

define service {
    name                    livedump-service
    register                0
    active_checks_enabled   0
    passive_checks_enabled  1
    check_period            0x0

}

Transmit the file to the master (e.g., with scp) and save it there in the ~/etc/nagios/conf.d/ directory – this is where Nagios expects the configuration data for hosts and services. Choose a file name ending in .cfg, for example ~/etc/nagios/conf.d/config-slave1.cfg. If SSH access from the slave to the master is possible, this can be done, for example, as follows:

OMD[slave1]:~$ scp config.cfg master@mymaster.mydomain:etc/nagios/conf.d/config-slave1.cfg
master@mymaster.mydomain's password:
config.cfg                                             100% 8071     7.9KB/s   00:00

Now log in to the master and activate the changes:

OMD[master]:~$ cmk -R
Generating configuration for core (type nagios)...OK
Validating Nagios configuration...OK
Precompiling host checks...OK
Restarting monitoring core...OK

Now all of the slave's hosts and services should appear in the master instance – initially with the PEND state, which they will retain for the time being:

Note:

  • The -T option causes livedump to also create the template definitions on which the configuration relies. Without them Nagios cannot start – but they may only be present once. If you import a configuration from a further slave, you must therefore omit the -T option!
  • A dump of the configuration is also possible from a CMC core, but importing it requires Nagios. If the CMC is running on your master, use CMCDump.
  • The copying and transferring of the configuration must be repeated for every change to hosts or services on the slave.

Transferring the status

Once the hosts are visible in the master, we need to set up a (regular) transmission of the slave's monitoring status. Again create a file with livedump, this time without any additional options:

OMD[slave1]:~$ livedump > state

This file contains the states of all hosts and services in a format that Nagios can read in directly as check results. The start of this file looks something like this:

state
host_name=myserver666
check_type=1
check_options=0
reschedule_check
latency=0.13
start_time=1475521257.2
finish_time=1475521257.2
return_code=0
output=OK - 10.1.5.44: rta 0.005ms, lost 0%|rta=0.005ms;200.000;500.000;0; pl=0%;80;100;; rtmax=0.019ms;;;; rtmin=0.001ms;;;;

Copy this file to the master into the ~/tmp/nagios/checkresults directory. Important: This file's name must begin with c and be seven characters long. With scp it will look something like this:

OMD[slave1]:~$ scp state master@mymaster.mydomain:tmp/nagios/checkresults/caabbcc
master@mymaster.mydomain's password:
state                                                  100%   12KB  12.5KB/s   00:00

Finally, create an empty file on the master with the same name and the .ok extension. With this Nagios will know that the status file has been transferred completely and can now be read in:

OMD[master]:~$ touch tmp/nagios/checkresults/caabbcc.ok

The status of the slaves' hosts/services will now be immediately updated on the master:

From now on, the status must be transmitted regularly. Livedump unfortunately provides no support for this task, so you will need to script it yourself. The ~/share/doc/check_mk/treasures/livedump directory contains the livedump-ssh-recv script, which you can use to receive Livedump updates (including the configuration) on the master via SSH. Details can be found in the script itself.
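
Such a script could look like the following sketch, run (e.g., every minute) from the slave site user's crontab. It assumes password-less SSH access from the slave to the master; the file name is generated randomly so that it begins with 'c' and is seven characters long, as required above:

#!/bin/bash
# Sketch: push the current Livedump status from this slave to the master.
MASTER=master@mymaster.mydomain                       # assumed SSH destination
NAME="c$(tr -dc 'a-z0-9' < /dev/urandom | head -c 6)" # random name: 'c' plus 6 characters
livedump > /tmp/livedump_state
scp -q /tmp/livedump_state "$MASTER:tmp/nagios/checkresults/$NAME"
ssh "$MASTER" "touch tmp/nagios/checkresults/$NAME.ok"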

The configuration and status dump can also be restricted by using Livestatus filters. For example, you could limit the hosts to the members of the mygroup host group:

OMD[slave]:~$ livedump -H "Filter: host_groups > mygroup" > state

Further information on Livedump – in particular how to transfer the data via encrypted email – can be found in the README file in the ~/share/doc/check_mk/treasures/livedump directory.

4.3. Implementing CMCDump

CMCDump is for the Check_MK Micro Core what Livedump is for Nagios – and it is thus the tool of choice for the  Check_MK Enterprise Edition. In contrast to Livedump, CMCDump can replicate the complete status of hosts and services (Nagios doesn't have the required interfaces for this task).

To compare: Livedump transfers the following data:

  • The current states – i.e. PEND, OK, WARN, CRIT, UNKNOWN, UP, DOWN or UNREACH
  • The output from Check plug-ins
  • The performance data

CMCDump additionally synchronises:

  • The long output from the plug-in
  • Whether the object is currently flapping
  • The time stamps for the last check execution and the last state change
  • The duration of the check execution
  • The latency of the check execution
  • The sequence number of the current check attempt and whether the current state is ‘hard’ or ‘soft’
  • Whether an acknowledgement of a problem is present
  • Whether the object is currently in a planned maintenance period

This provides a much more precise reflection of the monitoring. When importing the status the CMC doesn't just simulate a check execution, rather by using an interface designed for this task it transmits an accurate status. Among other things, this means that at any time the operations centre can see whether problems have been acknowledged or if maintenance times have been entered.

The setup is almost identical to that for Livedump, but somewhat simpler, since there is no need to worry about possibly duplicated templates or the like.

The copy of the configuration is made with cmcdump -C. Store this file on the master in etc/check_mk/conf.d/. The .mk file extension must be used:

OMD[slave1]:~$ cmcdump -C > config.mk
OMD[slave1]:~$ scp config.mk master@mymaster.mydomain:etc/check_mk/conf.d/slave1.mk

Activate the configuration on the master:

OMD[master]:~$ cmk -O

As with Livedump the hosts and services will now appear on the master in the PEND state. You will however see by the symbol that we are dealing with a shadow object. In this way it can be distinguished from an object being monitored directly on the master or on a ‘normal’ slave instance:

The regular generation of the status is achieved with cmcdump without additional arguments:

OMD[slave1]:~$ cmcdump > state
OMD[slave1]:~$ scp state master@mymaster.mydomain:tmp/state_slave1

To import the status to the master the file content must be written into the tmp/run/live UNIX-Socket with the help of the unixcat tool.

OMD[master]:~$ unixcat tmp/run/live < tmp/state_slave1

If you have a connection from the slave to the master via SSH without a password all three commands can be combined into a single one – and when so doing not even a temporary file is created:

OMD[slave1]:~$ cmcdump | ssh master@mymaster.mydomain "unixcat tmp/run/live"

It really is that simple! But, as already mentioned, ssh/scp is not the only method for transferring files – a configuration or status can be transferred just as well by email or any other desired protocol.
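
For the regular transfer this one-liner can, for example, be placed in a cron job of the slave's site user – a sketch assuming password-less SSH and a run every minute (the file name etc/cron.d/cmcdump is only an example; reinitialise the site crontab afterwards, e.g., with omd restart crontab):

etc/cron.d/cmcdump
* * * * * $HOME/bin/cmcdump | ssh master@mymaster.mydomain "unixcat tmp/run/live"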

5. Notifications in distributed environments

5.1. Centralised or decentralised?

In a distributed environment the question arises – from which instance should the notifications (e.g., emails) be sent: from the individual slaves or from the master? There are arguments in favour of both procedures.

Arguments for sending from the slaves:

  • Simpler to set up
  • A local notification is still possible if the link to the master is not available
  • Also works with the  Check_MK Raw Edition

Arguments for sending from the master:

  • Notifications can be further processed at a central location (e.g., be forwarded to a ticket system)
  • Slave instances require no setting up for email or SMS
  • Hardware for sending SMS messages is only required once – on the master

5.2. Decentralised notification

No special steps are required for a decentralised notification since this is the standard setting. Every notification that is generated on a slave instance runs through the chain of notifications rules there. If you implement a distributed WATO these rules are the same on all instances. Notifications resulting from these rules will be delivered as usual, for which the appropriate notification scripts will have been run locally.

You must simply ensure that the required services are correctly set up on the instances – for example, that a smarthost for emails has been configured – in other words, the same procedure as when setting up an individual Check_MK-instance.

5.3. Centralised notifications

Fundamentals

The  Check_MK Enterprise Edition provides a built-in mechanism for centralised notifications which can be individually activated for each slave instance. Such slaves then route all notifications to the master for further processing. The centralised notification is thereby independent of whether the distributed monitoring has been set up in the standard way, or with CMCDump, or by using a blend of these procedures. Technically speaking, the central notification server does not even need to be the ‘master’. This task can be taken on by any Check_MK-instance.

If a slave instance is set to forwarding, it forwards all notifications – just as they come from the core, in a raw format so to speak – directly to the master. There the notification rules are evaluated, which actually decide who is to be notified and how. The required notification scripts are invoked on the master.

Activating the notification spooler

The first step for implementing centralised notifications is to activate the notification spooler (mknotifyd) on all participating instances. This is an auxiliary process that is required on the master as well as on the slaves. In newer Check_MK versions the notification spooler is activated automatically. Please verify this with omd config and activate it if needed. The setting can be found under Distributed Monitoring ➳ MKNOTIFYD.

An omd status must show the mknotifyd process:

OMD[master]:~$ omd status
mkeventd:       running
liveproxyd:     running
mknotifyd:      running
rrdcached:      running
cmc:            running
apache:         running
crontab:        running
-----------------------
Overall state:  running

Only when the notification spooler is active will the entry Notifications ➳ Notification spooling appear under the global settings in WATO.

Setting up the TCP-connections

The notification spoolers on the slaves and on the (notification) master communicate with each other via TCP. Notifications are sent from the slave to the master. The master acknowledges receipt of each notification to the slave, so that notifications cannot be lost even if the TCP-connection is broken.
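Whether notifications are currently waiting for delivery can be checked on a slave by looking into the spool directory used by the spooler (the same directory appears in the log example further below); while forwarding is interrupted, notifications should accumulate there as individual spool files:

OMD[slave1]:~$ ls var/check_mk/notify/spool/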

There are two alternatives for the construction of a TCP-connection:

  1. A TCP-connection is configured from master to slave. Here the slave is the TCP-server.
  2. A TCP-connection is configured from slave to master. Here the master is the TCP-server.

This means that nothing stands in the way of forwarding notifications even if, for network reasons, connections can only be established in one particular direction. The TCP-connections are supervised by the spooler with a heartbeat signal and are reestablished immediately when necessary – not only when a notification is due.

Since slave and master require different global settings, you must make site-specific settings for all slaves. The master is configured via the normal global settings, since Check_MK currently does not support specific settings for the local instance (= master instance). Please note that these settings are automatically inherited by all slaves for which no specific settings have been defined.

Let's look first at an example where the master establishes the TCP-connections to the slaves.

Step 1: On the slave, edit the instance-specific global setting Notifications ➳ Notification Spooler Configuration and activate Accept incoming TCP connections. TCP port 6555 is suggested for incoming connections. If nothing speaks against it, adopt this setting.

Step 2: Now, likewise only on the slave, select the option Forward to remote site by notification spooler under Notification Spooling.

Step 3: Now, on the master – i.e. in the normal global settings – configure the connection to the slave (and to additional slaves as needed).

Step 4: Set the global setting Notification Spooling to Asynchronous local delivery by notification spooler, so that the master's own notifications are likewise processed by the central spooler.

Step 5: Activate the changes.
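After activating the changes you can verify on the command line that everything fits: on the slave, that the spooler is actually listening on port 6555, and on the master, that the connection state is established. A sketch, assuming the netstat tool is available (the state file is described in more detail below):

OMD[slave1]:~$ netstat -tln | grep 6555
OMD[master]:~$ grep "State:" var/log/mknotifyd.state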

Establishing connections from a slave

If the TCP-connection is to be established from the slave outwards, the procedure is the same as described above – simply with the roles of master and slave exchanged.

A blend of the two procedures is also possible. In that case the master must be configured so that it both listens for incoming connections and itself establishes connections to slave instances. However, in each master/slave relationship only one of the two may establish the connection!

Test and diagnose

The notification spooler logs to the var/log/mknotifyd.log file. In the spooler configuration the log level can be raised so that more messages are written. With the standard log level you should see something like this on the master:

var/log/mknotifyd.log
2016-10-04 17:19:28 [5] -----------------------------------------------------------------
2016-10-04 17:19:28 [5] Check_MK Notification Spooler version 1.2.8p12 starting
2016-10-04 17:19:28 [5] Log verbosity: 0
2016-10-04 17:19:28 [5] Daemonized with PID 31081.
2016-10-04 17:19:28 [5] Successfully connected to 10.1.8.44:6555
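While testing it can be helpful to follow the spooler's activity live, for example during connection setup or when sending a test notification:

OMD[master]:~$ tail -f var/log/mknotifyd.log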

At all times the var/log/mknotifyd.state file contains the current status of the spooler and all of its connections:

master:var/log/mknotifyd.state (excerpt)
Connection:               10.1.8.44:6555
Type:                     outgoing
State:                    established
Status Message:           Successfully connected to 10.1.8.44:6555
Since:                    1475594368 (2016-10-04 17:19:28, 140 sec ago)
Connect Time:             0.000 sec

A version of the same file is also present on the slave. There the connection will look something like this:

slave:var/log/mknotifyd.state (excerpt)
Connection:               10.22.4.12:56546
Type:                     incoming
State:                    established
Since:                    1475594368 (2016-10-04 17:19:28, 330 sec ago)

To test, select any service monitored on a slave and manually set it to CRIT with the Fake check results command.
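If you prefer the command line, the same thing can be achieved by injecting a faked check result via Livestatus on the slave. This is only a sketch – the host myserver123 and the service Check_MK are taken from the log examples below, and state 2 corresponds to CRIT:

OMD[slave1]:~$ echo "COMMAND [$(date +%s)] PROCESS_SERVICE_CHECK_RESULT;myserver123;Check_MK;2;Faked for testing" | unixcat tmp/run/live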

Now an incoming notification should appear on the master in the notification log file (notify.log):

master:var/log/notify.log
2016-10-04 17:27:57 ----------------------------------------------------------------------
2016-10-04 17:27:57 Got spool file 68c30b35 (myserver123;Check_MK) from remote host for local delivery.

The same event will look like this on the slave:

slave:var/log/notify.log
2016-10-04 17:27:23 ----------------------------------------------------------------------
2016-10-04 17:27:23 Got raw notification (myserver123;Check_MK) context with 71 variables
2016-10-04 17:27:23 Creating spoolfile: /omd/sites/slave1/var/check_mk/notify/spool/f3c7dea9-0e61-4292-a190-785b4aa46a64

In the global settings you can also switch the normal notification log (notify.log), as well as the notification spooler's log, to a higher log level.

Monitoring the spooling

Once you have set up everything as described, you will notice a new service on the master and on each slave that should definitely be included in the monitoring. It monitors the notification spooler and its TCP-connections. Each connection is thereby monitored twice: once by the master and once by the slave.

6. Files and directories

6.1. Configuration files

etc/check_mk/multisite.d/sites.mk
    Here WATO stores the configuration of the connections to the individual instances. If the user interface ‘hangs’ because of an error in this configuration and becomes inoperable, you can edit the offending entry directly in this file. If the Livestatus proxy is activated, you must afterwards edit and save at least one connection in WATO, since only this action generates a suitable configuration for that daemon.
etc/check_mk/liveproxyd.mk
    Configuration of the Livestatus proxy. This file is regenerated by WATO with every change to the configuration of the distributed monitoring.
etc/check_mk/mknotifyd.d/wato/global.mk
    Configuration of the notification spooler. This file is generated by WATO when the global settings are saved.
etc/check_mk/conf.d/distributed_wato.mk
    Generated on the slaves by the distributed WATO; it ensures that a slave only monitors its own hosts.
etc/nagios/conf.d/
    Storage location for manually-created Nagios configuration files with hosts and services. These are needed on the master when using Livedump.
etc/mk-livestatus/nagios.cfg
    Configuration of Livestatus when Nagios is used as the core. Here you can configure the maximum number of simultaneous connections allowed.
etc/check_mk/conf.d/
    Configuration of hosts and rules for Check_MK. Store configuration files generated by CMCDump here. Only the wato/ subdirectory is managed by, and visible in, WATO.
var/check_mk/autochecks/
    Services found by the service discovery. These are always stored locally on the slave.
var/check_mk/rrds/
    Location of the round-robin databases for archiving performance data when the Check_MK RRD format is used (the default in the Check_MK Enterprise Edition).
var/pnp4nagios/perfdata/
    Location of the round-robin databases in the PNP4Nagios format (Check_MK Raw Edition).
var/log/liveproxyd.log
    Log file of the Livestatus proxy.
var/log/liveproxyd.state
    The current state of the Livestatus proxy in a readable form. This file is updated every 5 seconds.
var/log/notify.log
    Log file of the Check_MK notification system.
var/log/mknotifyd.log
    Log file of the notification spooler.
var/log/mknotifyd.state
    The current state of the notification spooler in a readable form. This file is updated every 20 seconds.