1. The Checkmk micro core's special features
In comparison with Nagios, the CMC's most significant advantages are its higher performance and faster reaction times. It has further interesting advantages of which you should be aware - the most important of which are:
- Smart-Ping - Intelligent host checks
- Check-Helper and Checkmk-Helper#checkhelper
- Initial Scheduling
- Processing of performance data
- A couple of Nagios' functions are not used, or are achieved using a different procedure.
2. Smart-Ping - Intelligent host checks
With Nagios the availability of hosts is usually verified using a "Ping". Additionally, for every host a plug-in such as check_Ping or check_icmp is executed once per interval (generally once per minute). This sends, for example, five ping packets and waits for their return. Creating the processes for executing the plug-ins consumes valuable CPU resources. Furthermore, these are jammed for quite a long time if a host is not reachable, and long timeouts must be endured.
By contrast, CMC runs checks - unless otherwise defined - using a process called Smart-Ping. With this a single ping packet is sent per host per minute. This is performed by an auxiliary process named icmpsender - directly and without a process creation. For this purpose the icmpsender has its own ping implementation, which we have developed and which is markedly efficient. In benchmarking using only Ping we could generate network traffic of up to 45 MBit/sec! The interval for pings per host is set to six seconds by default, but you can redefine this via the Normal check interval for host checks rule set.
The responses to the pings are not explicitely waited-for. Incoming ping packets are simply noted as successful checks. The auxiliary process icmpreceiver is responsible for this. If no packet is received from a host within a defined time, this host will be flagged as DOWN. This timeout time is predefined at 15 seconds (2,5 intervals) and can also be altered per Host in WATO via the Settings for host checks via Smart-Ping rule set.
2.1. No on-demand host checks
Host checks not only serve to trigger alarms in the case of a total host failure, but also to suppress alarms for service problems during this down time. It is therefore important to quickly and accurately determine a host's condition in the event of a service problem. Even if it has not been long since the last host check, this result can already be out of date. It is additionally sufficient to first run a service check directly after a random host failure. The host will then still be viewed as UP and an alarm for the service will be - spuriously - sent.
The CMC solves this problem very simply: if during a service problem the host is in an UP state, then the next host check will simply be waited-for. Due to the interval being very short at only six seconds, there is only a negligible delay to an alarm if the host is UP. The CMC uses an additional trick - the icmpreceiver evaluates not only incoming ping responses, but also packets resulting from TCP connection establishments (SYN and RST).
As an example, let's take the case of a check_http Plug-in delivering a CRIT status, due to a queried webserver being unavailable. In this situation, following the start of the service check a TCP-RST-packet will be received from the server (Connection Refused). The CMC therefore knows for certain that the host itself is UP and it can thus send an alarm without delay.
2.2. The advantages
This procedure yields a number of advantages:
- Virtually insignificant CPU consumption by host checks: even without particularly powerful hardware umpteen thousand hosts can be monitored.
- No thwarting of monitoring by on-demand host check jams if hosts are DOWN.
- No false alarms from services when a host status is not current.
One disadvantage should not be hushed-up however: the Smart-Ping host checks generate no performance data (packet runtimes). On hosts where these are required you can simply set up a Ping-Check (via the Active Checks ➳ Check hosts with PING rule set).
2.3. Unpingable hosts
Not all hosts are checkable in practice. For these cases other methods in CMC can also be used for the host check, e.g., a TCP connection. Because these are generally exceptions they have no negative impact on the overall performance. The rule set here is Monitoring Configuration ➳ Host Check Command.
2.4. Problems with firewalls
There are firewalls that respond with TCP RST to TCP connection packets on inaccessible hosts. The trick is that the firewall is not permitted to register itself as the sender of this packet, rather the target host's IP address must be specified. Smart-Ping will view this packet as a sign of life and incorrectly assume that the target host is accessible.
In such a (rare) situation, via the global setting Monitoring Core ➳ Tuning of Smart-Ping you have the possibility of activating the Ignore TCP Reset Packets option. Or, with check_icmp you can select a conventional ping as a host check for the affected hosts.
3. Check-Helper and Checkmk-Helper
A lesson from the poor performance of Nagios in larger environments is that creating processes is an expensive operation. The size of the parent process is primarily decisive here. For every execution of an active check the complete Nagios process must first be duplicated (forked) before it is replaced by the new process - the check plug-in. The more hosts and services to be monitored, the larger this process and the fork respectively takes longer. In the meantime the core's other tasks must wait - and here 24 CPU cores are not much help.
In order to avoid forking the core, during the program start the CMC creates a fixed number of very lean auxiliary processes whose task is to start the active check plug-ins: the Check-Helpers. Not only do these fork much more quickly, but the forking also scales-up to cover all available cores because the core itself is no longer blocked. In this way, the execution of active checks (e.g. check_http) - whose runtimes are actually quite short - is greatly accelerated.
The CMC goes a significant step further however - because in a Checkmk environment active checks are rather an exception. Here Checkmk-based checks - in which only a single fork per host and interval is required - are primarily used. Checkmk calculates the results from all services without further process-creation. The Checkmk plug-in itself does still require some CPU time, but this is mostly needed when starting the plug-in.
To reduce this the CMC starts a second class of auxiliary processes: die Checkmk-Helper. Every helper can execute a parallel Checkmk-Helper for a host. Because the process runs continuously and repeatedly, the overhead arising from a fork and loading the program is eliminated. In comparison to the combination of Nagios + Checkmk this has two pleasing consequences: the CPU time for a check execution is reduced by a factor of 15! Furthermore, the host check precompilation is dispensed with, which in turn accelerates configuration changes.
3.1. Correctly setting the number of helpers
As standard, 20 check helpers and 5 Checkmk helpers are started. It therefore follows that a maximum of 5 Checkmk-Checks can run simultaneously. Should you wish to monitor a larger number of hosts it is possible that that these are not sufficient. This will be primarily the case if you have hosts with slower response times (e.g. monitoring VMWare ESX or slow-responding SNMP devices). The Micro Core Statistics sidebar element shows the helper's utilisation percentage averaged over the last 10-20 seconds:
The numbers can be easily set in the Monitoring Core's global settings.
4. Initial scheduling
During scheduling it is defined which checks should be run at what times. Nagios has implemented numerous procedures that should ensure that the checks are regularly distributed over the interval. It will likewise attempt to distribute the queries to be run on an individual target system uniformly over the interval.
The CMC has its own, simpler procedure for this purpose. This takes into account that Checkmk already contacts a host once per interval. Furthermore, the CMC ensures that new checks are immediately executed and not distributed over several minutes. This is very convenient for the user since a new host will be queried as soon as the configuration is activated. In order to avoid a large number of new checks causing a load spike, new checks whose number exceeds a definable limit can be distributed over the entire interval. The relevant point Initial Scheduling can be found in the global settings.
5. Processing of performance data
An important Checkmk function is the Processing of performance data -
such as, e.g. CPU utilisation, and its retention for a long time period.
The software performs two functions:
- 1. the creation and maintenance of the round-robin-databases
- 2. the graphic display of the data in the GUI
In a Nagios core operation the function mentioned in point 1. above is quite a long process. Depending on the method, spool data, Perlscripts and an auxiliary process written in C (npcd) can be used. Finally, slightly converted data is written to the RRD-Cache-Daemon's Unixsocket.
The CMC shortens this chain by writing directly to the RRD-Cache-Daemon - all intermediate steps are dispensed with. Parsing, and converting the data to the RRD-tool's format is performed directly in C++. This method is possible and sensible nowadays as the RRD-Cache-Daemon has already implemented its own very efficient spooling, and with the aid of journal files this means that no data is lost in the case of a system crash.
- Reduced Disk-I/O and CPU load
- Simpler implementation with markedly more stability
The installation of new RRDs is performed by the CMC with a further helper, activated by cmk --create-rrd. This creates files optionally compatible with PNP, or with the new Checkmk format (only for new installations). A switch from Nagios to CMC has no effect on existing RRD files - these will be seamlessly carried-over and will continue to be maintained.
The graphic display of the data in the GUI is handled directly by Checkmk's GUI