Performance data and graphing


1. Introduction

Alongside the detection of problems, Check_MK is an excellent tool for recording and analysing the diverse performance data that accumulates in IT environments. This can include, for example:

  • Operating system performance (disk IO, CPU and storage utilisation, ...)
  • Network statistics (actual bandwidth, packet roundtrips, error rates, ...)
  • Environment sensors (temperature, humidity, air pressure, ...)
  • Usage statistics (logged-in users, page requests, sessions, ...)
  • Quality statistics from applications (e.g. website response times)
  • Electricity consumption and quality in a data centre (currents, voltages, power, battery capacities, ...)
  • Application-specific data (e.g. length of mail queues from MS Exchange)
  • and much more...

Check_MK records all measured values produced by the monitoring for an (adjustable) period of four years, so that not only the current data but also historic data can be accessed. To keep disk space usage under control, older data is compressed. The actual performance data is produced by the individual check plugins, which also determine exactly which values are provided.

2. Access via the GUI

A service's performance data is presented in the GUI in three different forms. The so-called Perf-O-Meter appears directly in host and service tables and provides a quick overview and a visual comparison. For reasons of space it is usually limited to a single selected metric - for file systems, for example, the percentage of space used:

You can view all of a service's metrics over time either by hovering the cursor over the graph icon or by clicking on it. The same graphs can also easily be found in a host's or service's details:

A table with the precise current performance data for all metrics can also be found in the host's or service's details:

Exactly how these time series are displayed depends on your Check_MK edition:

3. The enterprise edition's graphing

Since version 1.2.8 the Check_MK Enterprise Edition includes a completely newly-developed, self-contained interface for visualising historic performance data, based on interactive HTML5. In addition the graphs can be rendered natively in PDF - as true vector graphics, so that no pixels are visible when printed.

3.1. Interaction with the graphs

You can interactively influence the displaying of graphs in various ways:

  • By panning - click and hold the left mouse button and drag the time range (left/right), or scale it vertically (up/down)
  • Using the mouse-wheel to zoom in or out of the time range
  • By dragging the graph's lower right corner to change its size
  • Clicking on a position in a graph sets a pin. In this way you can identify a point's exact time and all of the precise performance data for this moment. The pin's time is saved for each user and displayed in all graphs:

If a page contains several graphs, all of them follow changes made to the time range and the pin, so that the values can always be compared. Scaling likewise affects all of the graphs. These adjustments only take effect when the page is refreshed (anything else would at times cause chaos on the display...)

4. Graph collections

With the menu symbol you can embed graphs in various displays, e.g. in reports or dashboards. Graph Collections are very useful here. In such a graph collection you can pack as many graphs as desired and later compare or export them as PDFs. By default every user has a graph collection named My Graphs. You can very easily add new ones and even make them visible to other users. The procedure is exactly the same as that for Views. You access your graph collection via the Views element in the sidebar:

The button takes you to the table listing all of your graph collections and enables you to add new ones, to modify, etc.

5. Custom graphs

Version 1.2.8 of the Check_MK Enterprise Edition introduced for the first time a graphical editor with which you can create complete graphs of your own, including your own calculation formulae. With this it is now also possible to combine values from different hosts and services in a single graph. You access the custom graphs e.g. via Views ➳ EDIT and then the corresponding button.

An alternative method is via a service's metric table. Here a symbol is available for every metric allowing you to add the metric to a custom graph:

The following image shows a list of the custom graphs (here with only a single entry):

There are four possible operations for every existing graph:

  • Creates a copy of this graph
  • Deletes this graph
  • Opens this graph's general characteristics. Here, as well as the graph's title, you can also define its visibility for other users. This works exactly as with views. Consult the online help if you have questions regarding these settings.
  • Opens the actual graph designer, in which the graph's content can be modified.

Note that every custom graph - analogous to the views - has a unique ID. It is this ID that reports and dashboards reference; if you later change a graph's ID those links will be broken. All graphs that are not hidden are displayed under Views ➳ Metrics in the sidebar.

5.1. The graph designer

The graph designer is divided into four components:

5.2. Graph preview

Here you can see the graph exactly as it will be seen live. You can also use all of its interactive functions.

5.3. Metrics list

The curves contained in the graph can be edited directly here. A change to a curve's title in this field is confirmed with the Enter key. The Style defines how the values are displayed in the graph. These are the possible options:

Line           The value is drawn as a line.
Area           The value is drawn as an area. Be aware that curves positioned higher in the list are drawn over, and can thus cover, later ones. If you wish to combine lines and areas, the areas should always be positioned below the lines.
Stacked Area   All curves with this style are drawn as areas and stacked according to their values (in effect added together). The upper edge of this stack therefore represents the sum of all such curves in the graph.

The three further possibilities - Mirrored Line, Mirrored Area and Mirrored Stacked - are analogous, except that the curves are drawn downwards from the zero line. This produces the style of graph that Check_MK generally uses for input/output graphs such as the following:

In the last column of the metrics table you can edit existing metrics - for example, to simply clone a curve and then just substitute the host name. The meanings of the individual fields are explained in the next section.

5.4. Adding a metric with formula

With the Metrics formula you can add metrics to the graph. As soon as you enter a valid host name in the first field, the second field is filled with a list of that host's services. Making a selection in this list fills the third field with a list of the service's metrics. In the fourth and last field you select the consolidation function; the options here are Minimum, Maximum and Average. These functions come into play when the data stored in the RRDs for the specified time range has already been compressed. In a range where, for example, only one value per half hour is available, you can choose whether to plot the minimum, the maximum or the average of the original measured values for that interval.

You can also add a constant to the graph. This will at first be drawn as a horizontal line. Constants are sometimes required to generate calculation formulae. More on this later.

5.5. Graph options

Here you will find options that affect the graph as a whole. Unit influences the labelling of the axes and the legend. It is set automatically when the first metric is added. Note that it is possible, but not advisable, to add two metrics with differing units to a single graph.

With Explicit vertical range you can fix a graph's vertical axis in advance. Normally the Y-axis is scaled so that all values in the chosen time range fit exactly into the graph. If you create a graph for, e.g. percentage values, you can choose to always display the range from 0 to 100. Note though that users (including yourself) can still rescale the graph with the mouse, which overrides this setting.

5.6. Calculating with formulae

The graph designer makes it possible for you to combine the individual curves using calculations. The following example shows a graph with two curves: CPU utilisation, for User and System.

Let's assume that for this graph you are only interested in the sum of the two curves. To achieve this, check the selection boxes for both curves. As soon as you do so, a new line with a row of buttons (Operation on selected metrics) appears in the Metrics box:

Clicking on Sum combines both curves into a new one. The new curve's colour is automatically a mixture of the input curves' colours, and it is titled Sum of User, System. The formula used for the calculation is shown in the Formula column. In addition a new symbol appears:

Clicking on this symbol works as a kind of 'undo', through which the original individual curves can be displayed again. Further tips on the calculation operations:

  • It is sometimes useful to include constants - e.g. in order to subtract a curve's value from the number 100.
  • The operations can be nested in any order.

6. The PNP4Nagios graphic interface

The Check_MK Raw Edition uses PNP4Nagios by Jörg Linge as the basis for recording and visualising performance data. This is an independent project, written in PHP, which can also be used without Check_MK, and it is popular mainly with users of conventional Nagios-based monitoring systems. PNP4Nagios is integrated into the Check_MK display via a frame and its colours have been matched to Check_MK as well:

6.1. Selecting a time range

You have various possibilities for choosing the time range to be displayed:

  • Using the mouse you can directly select a range in a graph
  • The magnifying glass opens a dialogue with buttons for scrolling and zooming
  • The calendar enables the entry of dates and times
  • In the Timeranges box you can select from five standard time ranges (e.g. One Month)

6.2. The basket

With the basket icon you can 'collect' multiple graphs and then view them all at once via My basket. In this way you can also view graphs from different hosts side by side and compare them more easily.

6.3. PDF export

The button allows the current display to be easily exported as a PDF.

7. Graphite, Grafana and InfluxDB

If you use the Check_MK Enterprise Edition, in addition to Check_MK's built-in graphing you can also connect external metrics databases in parallel. The Check_MK Micro Core can send all performance data to a database (from version 1.2.8 to multiple databases) that supports the Graphite protocol. Apart from Graphite itself, InfluxDB, for example, offers such an interface.

The connection is configured in Global Settings under Send metrics to Graphite / InfluxDB:

Alongside the obvious network settings, you can optionally configure a prefix here that is placed in front of every host name - for example, to force unique names. HOST.SERVICE.METRIC is used as the naming scheme for the metric export.
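If you are unsure whether the Carbon port accepts data at all, you can send a single test value by hand using Graphite's plaintext protocol (one line per value: metric path, value, Unix timestamp). This is only a sketch - the address 10.0.0.5, the prefix and the metric path are purely illustrative and do not claim to match the exact names Check_MK would generate; it also assumes the netcat utility (nc) is installed:

OMD[mysite]:~$ echo "myprefix.myserver012.CPU_load.load15 0.42 $(date +%s)" | nc -w 1 10.0.0.5 2003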

If a connection is not working, diagnostic information can be found in ~/var/log/cmc.log. The following example shows the messages for a failed connection to a Graphite server:

/omd/sites/mysite/var/log/cmc.log
2016-02-24 16:30:48 [5] Successfully initiated connection to Carbon/Graphite at 10.0.0.5:2003.
2016-02-24 16:32:57 [4] Connection to Carbon/Graphite at 10.0.0.5:2003 failed: Connection timed out
2016-02-24 16:32:57 [5] Closing connection to Carbon/Graphite at 10.0.0.5:2003

In such situations the core automatically makes repeated attempts to re-establish the connection. Data that accumulates while a connection is down is not buffered and is therefore lost to the external database (it is then only available in Check_MK's own RRDs).
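A quick manual check of whether the Carbon port is reachable at all can be combined with following the core's connection attempts live in the log file mentioned above. The address is again only an example, and the -z option depends on the netcat variant installed:

OMD[mysite]:~$ nc -z -w 3 10.0.0.5 2003 && echo "port reachable" || echo "port NOT reachable"
OMD[mysite]:~$ tail -f var/log/cmc.log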

8. Background, tuning, fault diagnosis

Check_MK stores all performance data in special databases, the so-called RRDs (Round Robin Databases). For this it uses the RRDtool by Tobi Oetiker, which is very popular and widely used in open-source projects.

The RRDs offer important advantages for data storage in comparison to classic SQL data bases:

  • RRDs store data very compactly and efficiently.
  • The space used per metric on the drive is static. RRDs can neither grow nor shrink. The required space can be easily planned.
  • The CPU and disk time per update is always the same. RRDs are (virtually) real-time capable, so that reorganisations can't cause data jams.
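If you are curious about what such an RRD looks like internally, the rrdtool command line utility contained in every instance can display an RRD file's step, data sources and archives, and a simple ls shows its fixed size. The file path below is only an example in the old PNP format - the actual names on your system will differ:

OMD[mysite]:~$ rrdtool info var/pnp4nagios/perfdata/myserver012/CPU_load_load15.rrd | grep -E '^(step|ds\[|rra\[)' | head
OMD[mysite]:~$ ls -l var/pnp4nagios/perfdata/myserver012/CPU_load_load15.rrd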

8.1. Organisation of data in RRDs

By default Check_MK is configured so that every metric is recorded over a time range of four years. The base resolution used is one minute. This is sensible, as the check interval is preset to one minute, so that new data arrives from every service precisely once a minute.

Obviously, storing one value per minute over a four-year period would require an enormous amount of disk space (even though the RRDs need only 8 bytes per measured value). For this reason the data is compressed over time. The first compression takes place after 48 hours; from then on only one value is stored every five minutes. Further stages follow after 10 days and 90 days:

Phase   Duration   Resolution   Measuring points
1       2 days     1 minute     2880
2       10 days    5 minutes    2880
3       90 days    30 minutes   4320
4       4 years    6 hours      5840

The obvious question now is: how are several values best consolidated meaningfully into one? For this the consolidation functions maximum, minimum and average are available. Which one is meaningful in practice depends on the application or point of view. If, for example, you wish to monitor the temperature changes in a data centre over a four-year period, the maximum temperature recorded is probably of most interest. For an application's access rates an average could be more interesting.
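When reading data back out of an RRD you therefore always have to state which consolidation function you want. You can try this out directly on the command line with rrdtool fetch - the file path is again purely illustrative:

OMD[mysite]:~$ rrdtool fetch var/pnp4nagios/perfdata/myserver012/Temperature_temp.rrd MAX --resolution 1800 --start now-90days --end now | head
OMD[mysite]:~$ rrdtool fetch var/pnp4nagios/perfdata/myserver012/Temperature_temp.rrd AVERAGE --resolution 1800 --start now-90days --end now | head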

For maximum flexibility, Check_MK's RRDs are simply preset to store all three values at once - minimum, maximum and average. For each compression level and consolidation function the RRD contains a 'ring' style of storage - a so-called RRA (Round Robin Archive). In the standard structure there are therefore 12 RRAs. The standard structure for Check_MK thus requires 384,952 bytes per metric. This number is derived as follows: 2880 + 2880 + 4320 + 5840 measurement points, times three consolidation functions, times 8 bytes per measured value, gives a total of exactly 382,080 bytes. Adding the data header of 2,872 bytes gives the final size of 384,952 bytes quoted above.

An interesting alternative schema would be, for example, to store one value per minute for an entire year. This would have a nice advantage: the RRDs would always have the full resolution and could therefore dispense with consolidation - for example, only average values would need to be kept. With 365 x 24 x 60 measurement points at 8 bytes each, the result is almost exactly 4 MB per metric. Although the RRDs would then need roughly ten times the storage space, the disk IO would actually be reduced! The reason: it is no longer necessary to store and update twelve separate RRAs - a single one suffices.
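Both layouts can be reproduced outside of Check_MK with plain rrdtool commands, purely as a sketch for comparing the resulting file sizes - the data source definition is arbitrary and is not the one Check_MK uses internally:

OMD[mysite]:~$ rrdtool create default_layout.rrd --step 60 DS:value:GAUGE:120:U:U \
    RRA:MIN:0.5:1:2880   RRA:MAX:0.5:1:2880   RRA:AVERAGE:0.5:1:2880 \
    RRA:MIN:0.5:5:2880   RRA:MAX:0.5:5:2880   RRA:AVERAGE:0.5:5:2880 \
    RRA:MIN:0.5:30:4320  RRA:MAX:0.5:30:4320  RRA:AVERAGE:0.5:30:4320 \
    RRA:MIN:0.5:360:5840 RRA:MAX:0.5:360:5840 RRA:AVERAGE:0.5:360:5840
OMD[mysite]:~$ rrdtool create one_year_layout.rrd --step 60 DS:value:GAUGE:120:U:U RRA:AVERAGE:0.5:1:525600
OMD[mysite]:~$ ls -l default_layout.rrd one_year_layout.rrd

The first file should come out close to the 384,952 bytes calculated above, the second close to 4 MB.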

8.2. Customising the RRD structure

If the preset storage schema does not suit you, it can be changed via configuration rules (different settings per host or service are even possible). The required rule set is most easily found via the rule search - WATO ➳ Host & Service Parameters ➳ Search for rule sets - where you simply enter RRD. There you will find the rule Configuration of RRD databases of services. There is also a rule Configuration of RRD databases of hosts, but hosts only have performance data in exceptional cases. The image below shows the rule with the default settings (from version 1.2.8 these are generated automatically when a new instance is created):

Under Consolidation Functions and RRA Configuration you define which consolidation functions are used as well as the number and size of the compression phases. The Step field defines the resolution in seconds and is, as a rule, 60 (one minute). For services with a check interval of less than a minute it can make sense to set this value lower. Note however that the value in the Number of steps aggregated into one data point field then no longer refers to minutes but to the interval set in Step.

Every change to the RRD structure initially affects only newly-created RRDs - that is, hosts or services newly added to the monitoring. You can however also have Check_MK restructure existing RRDs. This is done with the cmk --convert-rrds command, for which, as always, the -v (verbose) option is available. Check_MK then inspects all existing RRDs and restructures them as needed into the defined target format:

OMD[mysite]:~$ cmk -v --convert-rrds
myserver012:
   Uptime (CMC).....converted, 376 KB -> 159 KB
   Filesystem / (CMC).....converted, 1873 KB -> 792 KB
   OMD slave apache (CMC).....converted, 14599 KB -> 6171 KB
   Memory (CMC).....converted, 14225 KB -> 6012 KB
   Filesystem /home/mk (CMC).....converted, 1873 KB -> 792 KB
   Interface 2 (CMC).....converted, 4119 KB -> 1741 KB
   CPU load (CMC).....converted, 1125 KB -> 475 KB

The command is intelligent enough to recognise RRDs that already have the desired structure:

OMD[mysite]:~$ cmk -v --convert-rrds
myserver345:
   Uptime (CMC).....uptodate
   Filesystem / (CMC).....uptodate
   OMD slave apache (CMC).....uptodate
   Memory (CMC).....uptodate
   Filesystem /home/mk (CMC).....uptodate
   Interface 2 (CMC).....uptodate
   CPU load (CMC).....uptodate

If the new format has a higher resolution or additional consolidation functions, the existing data is interpolated as well as possible so that the RRDs contain the most meaningful values possible. It is of course obvious that if, for example, you now want values at one-minute intervals for 5 days instead of 2, the accuracy of the existing data cannot be increased retroactively.

8.3. The RRD storage format

The rule described above has a further setting: RRD storage format. With this you can choose between two methods that Check_MK can use when creating RRDs. This setting has existed since version 1.2.8, which introduced the new One RRD per host/service format (Check_MK format, or CMK format for short). With it, all of a host's or service's metrics are packed into a single RRD. This allows data to be written to disk more efficiently, since a complete set of metrics can always be written in a single operation. The metrics then lie in neighbouring storage blocks, which reduces the number of blocks that must be written to the disk.

Please note that the One RRD per host/service format is not supported by PNP4Nagios. Check_MK instances created with version 1.2.8 or later of the Check_MK Enterprise Edition automatically use the new format. Existing instances from earlier versions keep the old PNP format. By creating a rule in the rule set shown above you can convert them to the Check_MK format. For this you will subsequently also need the cmk --convert-rrds command:

OMD[mysite]:~$ cmk -v --convert-rrds
myhost123:
    Uptime PNP -> CMC..converted.
   WARNING: Duplicate RRDs for stable/Uptime. Use --delete-rrds for cleanup.
    OMD heute apache PNP -> CMC..converted.
   WARNING: Duplicate RRDs for stable/OMD heute apache. Use --delete-rrds for cleanup.
    fs_/home/mk PNP -> CMC..converted.
   WARNING: Duplicate RRDs for stable/fs_/home/mk. Use --delete-rrds for cleanup.
    OMD slave apache PNP -> CMC..converted.
   WARNING: Duplicate RRDs for stable/OMD slave apache. Use --delete-rrds for cleanup.
    Memory PNP -> CMC..converted.
...

The warnings show that Check_MK initially leaves the existing files untouched. This allows you, if in doubt, to return to the old data format, since a conversion in the reverse direction is not possible. The --delete-rrds option ensures that these copies are not created, or are deleted afterwards. You can also easily perform the deletion later by running the command again:

OMD[mysite]:~$ cmk -v --convert-rrds --delete-rrds
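Whether duplicate RRDs are still present for a host can be seen at a glance by listing the two directories from the overview at the end of this article - the host name here is of course only an example:

OMD[mysite]:~$ ls var/pnp4nagios/perfdata/myhost123/
OMD[mysite]:~$ ls var/check_mk/rrd/myhost123/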

8.4. The RRD cache daemon (rrdcached)

In order to (drastically) reduce the number of write accesses to a disk drive, an auxiliary service can be used: the RRD cache daemon (rrdcached). This is one of the services started when an instance is started:

OMD[mysite]:~$ omd start
Starting mkeventd (builtin: syslog-udp)...OK
Starting Livestatus Proxy-Daemon...OK
Starting mknotifyd...OK
Starting rrdcached...OK
Starting Check_MK Micro Core...OK
Starting dedicated Apache for site stable...OK
Initializing Crontab...OK

All new performance data for the RRDs is sent to the rrdcached - by the core in the Check_MK Enterprise Edition, or by the NPCD in the Check_MK Raw Edition. The daemon does not write the data directly into the RRDs, but holds it in main memory and later writes it to the respective RRD as a batch. In this way the number of write accesses to the disk drive (or to the SAN!) is noticeably reduced.

So that no data is lost in the case of a restart, the updates are additionally written to journal files. These are also write accesses, but since the data is written strictly sequentially they generate very little IO.
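How much update data is currently sitting in the journal can be checked at any time; the journal directory is listed in the directory overview at the end of this article:

OMD[mysite]:~$ ls -lh var/rrdcached/
OMD[mysite]:~$ du -sh var/rrdcached/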

For the RRD cache daemon to be able to work efficiently, it needs a lot of main memory. The amount required depends on the number of RRDs and on how long the data is to be cached. The caching period is defined in the etc/rrdcached.conf file: in the standard setting data is held for 3600 seconds (one hour, the TIMEOUT setting) plus a random delay of 0 to 1800 seconds per RRD (RANDOM_DELAY). This randomised delay averts 'pulsed' writing and ensures that the IO is spread evenly over time:

# Data is written to disk every TIMEOUT seconds. If this option is
# not specified the default interval of 300 seconds will be used.
TIMEOUT=3600

# rrdcached will delay writing of each RRD for a random
# number of seconds in the range [0,delay).  This will avoid too many
# writes being queued simultaneously.  This value should be no
# greater than the value specified in TIMEOUT.
RANDOM_DELAY=1800

# Every FLUSH_TIMEOUT seconds the entire cache is searched for old values
# which are written to disk. This only concerns files to which
# updates have stopped, so setting this to a high value, such as
# 3600 seconds, is acceptable in most cases.
FLUSH_TIMEOUT=7200

Changes to the settings in this file are activated with:

OMD[mysite]:~$ omd restart rrdcached
Stopping rrdcached...waiting for termination....OK
Starting rrdcached...OK

8.5. Directories

Here is an overview of the most important files and directories associated with performance data and RRDs (all paths are relative to the instance's home directory):

var/check_mk/rrd          RRDs in the Check_MK format
var/pnp4nagios/perfdata   RRDs in the old format (PNP)
var/rrdcached             Journal files of the RRD cache daemon
var/log/rrdcached.log     Log file of the RRD cache daemon
var/log/cmc.log           Log file of the Check_MK core (error messages concerning RRDs)
etc/pnp4nagios            Settings for PNP4Nagios (Check_MK Raw Edition)
etc/rrdcached.conf        Settings for the RRD cache daemon
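
A quick impression of how much disk space the performance data actually occupies in an instance can be obtained with du - only the directories that exist in your installation will be shown:

OMD[mysite]:~$ du -sh var/check_mk/rrd var/pnp4nagios/perfdata var/rrdcached 2>/dev/null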