MK Livestatus


This article is obsolete and may no longer be valid!

1. How to access Nagios status data

1.1. Accessing status data today

The classical way of accessing the current status of your hosts and services is by reading and parsing the file status.dat, which is created by Nagios on a regular basis. The update interval is configured via status_update_interval in nagios.cfg. A typical value is 10 seconds. If your installation is getting larger, you might have to increase this value in order to minimize CPU usage and disk IO. The Nagios web interface uses status.dat for displaying its data.

Parsing status.dat is not very popular amongst developers of addons. So many of them use another approach: NDO. This is a NEB module that is loaded directly into the Nagios process and sends out all status updates via a UNIX socket to a helper process. The helper process creates SQL statements and updates various tables in a MySQL or PostgreSQL database. This approach has several advantages over status.dat:

  • The data is updated immediately, not just every 10 or 20 seconds.
  • Applications have easy access to the data via SQL. No parser for status.dat is needed.
  • In large installations addons can access the data faster than by reading status.dat.

Unfortunately, however, NDO has also some severe shortcomings:

  • It has a complex setup.
  • It needs a (rapidly growing) database to be administered.
  • It eats up a significant portion of your CPU resources, just in order to keep the database up-to-date.
  • Regular housekeeping of the database can hang your Nagios for minutes or even an hour once a day.

1.2. The Future

Since version 1.1.0, Check_MK offers a completely new approach for accessing status and also historic data: Livestatus. Just like NDO, Livestatus makes use of the Nagios Event Broker API and loads a binary module into your Nagios process. But unlike NDO, Livestatus does not actively write out data. Instead, it opens a socket over which data can be retrieved on demand.

The socket allows you to send a request for hosts, services or other pieces of data and get an immediate answer. The data is read directly from Nagios' internal data structures. Livestatus does not create its own copy of that data. Beginning with version 1.1.2 you are also able to retrieve historic data from the Nagios log files via Livestatus.

This is not only a stunningly simple approach, but also an extremely fast one. Some advantages are:

  • Unlike NDO, using Livestatus imposes no measurable burden on your CPU at all. Only while processing queries is a small amount of CPU time needed, and even that does not block Nagios.
  • Livestatus produces zero disk IO when querying status data.
  • Accessing the data is much faster than parsing status.dat or querying an SQL database.
  • No configuration and no database are needed. No administration is necessary.
  • Livestatus scales well to large installations, even beyond 50,000 services.
  • Livestatus gives you access to Nagios-specific data that no other status access method offers - for example, whether a host is currently within its notification period.

At the same time, Livestatus provides its own query language that is simple to understand, offers most of the flexibility of SQL and in some cases even more. Its protocol is fast, light-weight and does not need a binary client. You can even access it from the shell without any helper software.

1.3. The Present

Livestatus is still a young technology, but already many addons support Livestatus as data source or even propose it as their default. Here is an (incomplete) list of addons with Livestatus support:

Please mail us if you think this list is incomplete.

2. Setting up and using Livestatus

2.1. Automatic setup

The typical way to set up Livestatus is just to answer yes when asked by the Check_MK setup. You need to have all tools installed that are needed to compile C++ programs. These are at least:

  • The GNU C++ compiler (packaged as g++ in Debian)
  • The utility make (packaged as make)
  • The development files for the libc (libc6-dev)
  • The development files for the C++ standard library (libstdc++6-dev)

The script setup.sh compiles a module called livestatus.o and copies it into /usr/lib/check_mk (if you didn't change that path). It also adds two lines to your nagios.cfg, which are needed for loading the module. After that you just need to restart Nagios, and a Unix socket with the name live should appear in the same directory as your Nagios command pipe.

2.2. Manual setup

There are several situations in which a manual setup is preferable, for example:

  • If you do not want to use Check_MK, but just Livestatus
  • If the automatic setup does not work correctly (which is unlikely but not impossible).
  • If you want to make changes to the source code of Livestatus.

For manually setting up Livestatus, you can download the source code independent of Check_MK at the download page. Unpack the tarball at a convenient place and change to the newly created directory:

root@linux# wget 'http://www.mathias-kettner.de/download/mk-livestatus-1.2.8.tar.gz'
root@linux# tar xzf mk-livestatus-1.2.8.tar.gz
root@linux# cd mk-livestatus-1.2.8

Now let's compile the module. Livestatus uses a standard configure-script and is thus compiled with ./configure && make.

user@host:~$ ./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for g++... g++
checking for C++ compiler default output file name... a.out
checking whether the C++ compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes

... and so on, until:

configure: creating ./config.status
config.status: creating Makefile
config.status: creating src/Makefile
config.status: creating config.h
config.status: config.h is unchanged
config.status: executing depfiles commands

If you are running on a multicore CPU you can speed up compilation by adding -j 4 or -j 8 to make:

user@host:~$ make -j 8
g++ -DHAVE_CONFIG_H -I. -I..    -I../nagios -fPIC -g -O2 -MT livestatus_so-AndingFil...
g++ -DHAVE_CONFIG_H -I. -I..    -I../nagios -fPIC -g -O2 -MT livestatus_so-ClientQue...
g++ -DHAVE_CONFIG_H -I. -I..    -I../nagios -fPIC -g -O2 -MT livestatus_so-Column.o ...
g++ -DHAVE_CONFIG_H -I. -I..    -I../nagios -fPIC -g -O2 -MT livestatus_so-ColumnsCo...
g++ -DHAVE_CONFIG_H -I. -I..    -I../nagios -fPIC -g -O2 -MT livestatus_so-ContactsC...
g++ -DHAVE_CONFIG_H -I. -I..    -I../nagios -fPIC -g -O2 -MT livestatus_so-CustomVar...
g++ -DHAVE_CONFIG_H -I. -I..    -I../nagios -fPIC -g -O2 -MT livestatus_so-CustomVar...


... and so on. After successful compilation, make install will install a single file named livestatus.o into /usr/local/lib/mk-livestatus and the small program unixcat into /usr/local/bin (as usual, you can change these paths with standard configure options):

root@linux# make install
Making install in src
make[1]: Entering directory `/d/nagvis-dev/src/mk-livestatus-1.1.6p1/src'
make[2]: Entering directory `/d/nagvis-dev/src/mk-livestatus-1.1.6p1/src'
test -z "/usr/local/bin" || /bin/mkdir -p "/usr/local/bin"
  /usr/bin/install -c 'unixcat' '/usr/local/bin/unixcat'
test -z "/usr/local/lib/mk-livestatus" || /bin/mkdir -p "/usr/local/lib/mk-livestatus"
 /usr/bin/install -c -m 644 'livestatus.so' '/usr/local/lib/mk-livestatus/livestatus.so'
 ranlib '/usr/local/lib/mk-livestatus/livestatus.so'
/bin/sh /d/nagvis-dev/src/mk-livestatus-1.1.6p1/install-sh -d /usr/local/lib/mk-livestatus
/usr/bin/install -c livestatus.o /usr/local/lib/mk-livestatus
rm -f /usr/local/lib/mk-livestatus/livestatus.so
make[2]: Leaving directory `/d/nagvis-dev/src/mk-livestatus-1.1.6p1/src'
make[1]: Leaving directory `/d/nagvis-dev/src/mk-livestatus-1.1.6p1/src'
make[1]: Entering directory `/d/nagvis-dev/src/mk-livestatus-1.1.6p1'
make[2]: Entering directory `/d/nagvis-dev/src/mk-livestatus-1.1.6p1'
make[2]: Nothing to be done for `install-exec-am'.
make[2]: Nothing to be done for `install-data-am'.
make[2]: Leaving directory `/d/nagvis-dev/src/mk-livestatus-1.1.6p1'
make[1]: Leaving directory `/d/nagvis-dev/src/mk-livestatus-1.1.6p1'

Your last task is to load livestatus.o into Nagios. The following two lines in nagios.cfg tell Nagios to load the module and to send all status update events to it:

nagios.cfg
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /var/lib/nagios/rw/live
event_broker_options=-1

The only mandatory argument is the complete path to the UNIX socket that Livestatus shall create (/var/lib/nagios/rw/live in our example). Please change that if needed. It is probably best to put it into the same directory as the Nagios command pipe. Just as Nagios does with its pipe, Livestatus creates the socket with the permissions 0660. If the directory containing the socket has the SGID bit set for the group (chmod g+s), then the socket will be owned by the same group as the directory.

After setting up Livestatus - either by setup.sh or manually - restart Nagios. Two things should now happen:

  1. The socket file is created.
  2. The logfile of Nagios shows that the module has been loaded:
nagios.log
[1256144866] livestatus: Version 1.1.6p1 initializing. Socket path: '/var/lib
/nagios/rw/live'
[1256144866] livestatus: Created UNIX control socket at /var/lib/nagios/rw/
live
[1256144866] livestatus: Opened UNIX socket /var/lib/nagios/rw/live
[1256144866] livestatus: successfully finished initialization
[1256144866] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' initializ
ed successfully.
[1256144866] Finished daemonizing... (New PID=5363)
[1256144866] livestatus: Starting 10 client threads
[1256144866] livestatus: Entering main loop, listening on UNIX socket
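
If you prefer to verify the setup from a script rather than by reading the log, you can query the status table for the program_version column. The following is a minimal sketch in Python 3 (the helper name livestatus_ping is illustrative, and the socket path is the one from the example above):

```python
import socket

def livestatus_ping(path="/var/lib/nagios/rw/live"):
    """Send a minimal LQL query and return the raw answer as bytes,
    or None if the socket cannot be reached."""
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(path)
    except OSError:
        return None
    s.sendall(b"GET status\nColumns: program_version\n")
    s.shutdown(socket.SHUT_WR)  # tell Livestatus the query is complete
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:            # connection closed: answer is complete
            break
        chunks.append(data)
    s.close()
    return b"".join(chunks).strip()
```

If the function returns a version string, the module is loaded and answering queries; if it returns None, check the socket path and the Nagios log.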

2.3. Options for nagios.cfg

Livestatus understands several options, which can be added to the line beginning with broker_module:

  • debug (default: 0) - Set this to 1 in order to make Livestatus log each query it executes in nagios.log.
  • max_cached_messages (default: 500000) - Livestatus' access to the Nagios logfiles caches messages in memory. This sets the maximum number of cached messages. Each message takes about 250 bytes (in the current implementation).
  • max_response_size (default: 104857600) - Livestatus constructs each response in memory before sending it to the client. In order to avoid a crash on excessive queries, the maximum response size is limited. The default limit is 100 MB.
  • num_client_threads (default: 10) - Livestatus needs one thread for each concurrent client connection. A fixed number of threads is created when Nagios starts.
  • thread_stack_size (default: 65536) - The size of the stack of each client thread. In versions before 1.1.4 the stack size was 8 MB (the pthread default). The new default value is 64 KB. A small stack reduces virtual memory usage and also saves CPU resources. A value that is too small will probably crash your Nagios process, though. You have been warned...
  • query_timeout (default: 10000) - This value is in ms. In order to avoid being hung by broken clients, Livestatus imposes a limit on the time for reading the query from the client. A value of 0 disables the timeout.
  • idle_timeout (default: 300000) - This value is in ms. Livestatus waits at most this long for the next query. A value of 0 disables the timeout.
  • pnp_path (default: empty) - The base directory where PNP4Nagios stores its round robin databases. If you add this parameter, Livestatus will provide the new column pnpgraph_present, which tells whether a host or service actually has performance data stored in an RRD. By default the path is empty and pnpgraph_present is set to -1, which means unknown.
  • inventory_path (default: empty; since 1.2.5i7) - The base directory where Check_MK stores its files for hardware/software inventory. If you add this parameter, Livestatus will provide information about which host has inventory data and also gives access to that data in a couple of additional columns. By default the path is empty and inventory_present is set to 0, which means no.
  • data_encoding (default: utf8; since 1.1.11i2) - Specifies the input encoding of the configuration files. Possible values are utf8, latin1 and mixed.
  • log_file (default: livestatus.log in the same directory as nagios.log; since 1.1.12b1) - The path to livestatus.log, where log entries after initialization are placed.

Here is an example of how to add parameters:

nagios.cfg
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /var/run/nagios/rw/live debug=1

3. Using Livestatus

Once your Livestatus module is set up and running, you can use its Unix socket for retrieving live status data. Every relevant programming language on Linux has a way to open such a socket. We will show how to access the socket with the shell and with Python. Other programming languages are left as an exercise to the reader.

3.1. Accessing Livestatus with the shell

A unix socket is very similar to a named pipe, but has two important differences:

  • You can both read from and write to it (while a pipe is unidirectional).
  • You cannot access it with echo or cat.

Livestatus ships with a small utility called unixcat which can communicate over a Unix socket. It sends all data it reads from stdin into the socket and writes all data coming from the socket to stdout.

The following command shows how to send a command to the socket and retrieve the answer - in this case a table of all of your hosts:

root@linux# echo 'GET hosts' | unixcat /var/lib/nagios/rw/live
acknowledged;action_url;address;alias;check_command;check_period;checks_ena
bled;contacts;in_check_period;in_notification_period;is_flapping;last_check
;last_state_change;name;notes;notes_url;notification_period;scheduled_downt
ime_depth;state;total_services
0;/nagios/pnp/index.php?host=$HOSTNAME$;127.0.0.1;Acht;check-mk-ping;;1;che
ck_mk,hh;1;1;0;1256194120;1255301430;Acht;;;24X7;0;0;7
0;/nagios/pnp/index.php?host=$HOSTNAME$;127.0.0.1;DREI;check-mk-ping;;1;che
ck_mk,hh;1;1;0;1256194120;1255301431;DREI;;;24X7;0;0;1
0;/nagios/pnp/index.php?host=$HOSTNAME$;127.0.0.1;Drei;check-mk-ping;;1;che
ck_mk,hh;1;1;0;1256194120;1255301435;Drei;;;24X7;0;0;4

If you get that output, everything is working fine and you might want to continue reading with the chapter The Livestatus Query Language.

3.2. Accessing Livestatus with Python

Access from within Python does not need an external tool. The following example shows how to send a query, retrieve the answer and parse it into a Python table. After installing check_mk you find this program in the directory /usr/share/doc/check_mk:

live.py
#!/usr/bin/python
#
# Sample program for accessing the Livestatus Module
# from a python program
socket_path = "/var/lib/nagios/rw/live"

import socket
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect(socket_path)

# Write command to socket
s.send("GET hosts\n")

# Important: Close sending direction. That way
# the other side knows we are finished.
s.shutdown(socket.SHUT_WR)

# Now read the answer
answer = s.recv(100000000)

# Parse the answer into a table (a list of lists)
table = [ line.split(';') for line in answer.split('\n')[:-1] ]

print table
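
Note that a single recv() call may return only part of a large answer. A more robust variant (written for Python 3, with illustrative helper names) loops until Livestatus closes the connection:

```python
import socket

def lql_query(path, query):
    """Send one LQL query and return the complete raw answer.
    Loops over recv() until Livestatus closes the connection,
    so large answers are not truncated."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    s.sendall(query.encode("utf-8"))
    s.shutdown(socket.SHUT_WR)   # signal the end of the query
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:             # connection closed: answer is complete
            break
        chunks.append(data)
    s.close()
    return b"".join(chunks).decode("utf-8")

def parse_csv(answer):
    """Parse a CSV answer into a table (a list of lists)."""
    return [line.split(";") for line in answer.split("\n")[:-1]]
```

Usage: table = parse_csv(lql_query("/var/lib/nagios/rw/live", "GET hosts\n")).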

4. LQL - The Livestatus Query Language

LQL - pronounced "Liquel" as in "liquid" - is a simple language for telling Livestatus what data you want and how it should be formatted. It does much the same as SQL, but in another, simpler way. Its syntax reflects (but is not compatible with) HTTP.

Each query consists of:

  • A command consisting of the word GET and the name of a table.
  • An arbitrary number of header lines consisting of a keyword, a colon and arguments.
  • An empty line or the end of transmission (i.e. the client closes the sending direction of the socket)

All keywords including GET are case sensitive. Lines are terminated by single linefeeds (no <CR>). The current version of Livestatus implements the following tables:

  • hosts - your Nagios hosts
  • services - your Nagios services, joined with all data from hosts
  • hostgroups - your Nagios hostgroups
  • servicegroups - your Nagios servicegroups
  • contactgroups - your Nagios contact groups
  • servicesbygroup - all services grouped by service groups
  • servicesbyhostgroup - all services grouped by host groups
  • hostsbygroup - all hosts grouped by host groups
  • contacts - your Nagios contacts
  • commands - your defined Nagios commands
  • timeperiods - time period definitions (currently only name and alias)
  • downtimes - all scheduled host and service downtimes, joined with data from hosts and services.
  • comments - all host and service comments
  • log - transparent access to the Nagios logfiles (including archived ones)
  • status - general performance and status information. This table contains exactly one dataset.
  • columns - a complete list of all tables and columns available via Livestatus, including descriptions!
  • statehist - 1.2.1i2 SLA statistics for hosts and services, joined with data from hosts, services and log.

Like in an SQL database all tables consist of a number of columns. If you query the table without any parameters, you retrieve all available columns in alphabetical order. The first line of the answer contains the names of the columns. Please note that the available columns will change from version to version. Thus you should not depend on a certain order of the columns!

Example: Retrieve all contacts:

query_1
GET contacts

4.1. Selecting which columns to retrieve

When you write an application using Livestatus, you probably need the information just from selected columns. Add the header Columns to select which columns to retrieve. This also defines the order of the columns in the answer. The following example retrieves just the columns name and alias:

query_2
GET contacts
Columns: name alias

If you want to test this with unixcat, a simple way is to put your query into a text file query and read that in using <:

root@linux# unixcat < query /var/lib/nagios/rw/live
check_mk;check_mk dummy contact
hh;Harri Hirsch

As you might have noticed in this example: if you use Columns:, then no column headers will be output. You do not need them - you have specified the columns yourself. That makes parsing simpler.
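
Since a query with a Columns: header returns no header line, a client pairs each answer line with the column names it requested. A minimal Python sketch for the default CSV output (the helper name rows_to_dicts is illustrative):

```python
def rows_to_dicts(answer, columns):
    """Combine each line of a Columns: query answer with the
    column names that were requested. Queries with an explicit
    Columns: header omit the header line, so the caller must
    supply the names itself."""
    rows = []
    for line in answer.splitlines():
        if line:
            rows.append(dict(zip(columns, line.split(";"))))
    return rows
```

For the contacts example above, rows_to_dicts(answer, ["name", "alias"]) yields one dict per contact.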

4.2. Filters

An important concept of Livestatus is its ability to filter data for you. This is not only more convenient than retrieving all data and selecting the relevant lines yourself, it is also much faster. Remember that Livestatus has direct access to all of Nagios' internal data structures and can access them with the speed of native C.

Filters are added by using Filter: headers. Such a header has three arguments: a column name, an operator and a reference value - all separated by spaces. The reference value - being the last one in the line - may contain spaces. Example:

query_3
GET services
Columns: host_name description state
Filter: state = 2

This query gets all services with the current state 2 (critical). If you add more Filter: headers, you will see only data passing all of your filters. The next example outputs all critical services which are currently within their notification period:

query_4
GET services
Columns: host_name description state
Filter: state = 2
Filter: in_notification_period = 1

The following eight operators are available:

symbol  on numbers                            on strings
=       equality                              equality
~       superset (see below)                  substring match via regular expression
=~      subset (see below)                    case-insensitive equality
~~      contains at least one of (see below)  case-insensitive substring match via regular expression
<       less than                             lexicographically less than
>       greater than                          lexicographically greater than
<=      less or equal                         lexicographically less or equal
>=      greater or equal                      lexicographically greater or equal

A few notes:

  • The operators ~, =~, and ~~ interpret numbers as bit sets, which comes in handy when dealing with attribute lists, see below.
  • All operators can be negated by prefixing a !, so even funny looking operators like !< are allowed. The latter should better be written as >=, but always allowing negation in a uniform way makes some scripts easier.

4.3. Regular expression matching

The operators ~ and ~~ on strings match using POSIX extended regular expressions, as used by egrep. Some Linux distributions ship a manpage for those (man 7 regex). Livestatus always does a substring match, meaning that the pattern may match anywhere within the text. You can use the anchors ^ and $ for matching the beginning or the end of the text. The following filter finds all services beginning with fs:

Filter: description ~ ^fs

4.4. Matching lists

Some columns do not contain numbers or texts, but lists of objects. An example of this is the column contacts of hosts or services, which contains all contacts assigned to the data object. The available operators on list-valued columns are:

symbol  meaning
=       test for empty list
!=      test for non-empty list
>=      test if an element is contained in the list (using equality)
<       test if an element is not contained in the list (using equality)
<=      test if an element is contained in the list (using case-insensitive equality)
>       test if an element is not contained in the list (using case-insensitive equality)
~       test if an element is contained in the list (using substring regex match)
!~      test if an element is not contained in the list (using substring regex match)
~~      test if an element is contained in the list (using case-insensitive substring regex match)
!~~     test if an element is not contained in the list (using case-insensitive substring regex match)

Example: Return some information about services where "harri" is one of the assigned contacts:

query_5
GET services
Columns: host_name description state contacts
Filter: contacts >= harri

Another example: Return the name of all hosts that do not have parents:

GET hosts
Columns: name
Filter: parents =

There is a special case when filtering is done on the members or members_with_state columns of the servicegroups table: The value to match must have the form hostname|servicedescription.

4.5. Matching attribute lists

Version 1.1.4 of Livestatus gives you access to the list of modified attributes of hosts, services, and contacts. This way you can query which attributes have been changed dynamically by the user and thus differ from the attributes configured in the Nagios object files.

These new columns come in two variants: modified_attributes and modified_attributes_list. The first variant outputs an integer representing a bitwise combination of Nagios' internal numbers. The second variant outputs a list of attribute names, such as notifications_enabled or active_checks_enabled. When you define a Filter, both column variants are handled in exactly the same way, and both allow using the number or the comma-separated list representation.
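
A client can also decode the integer variant itself. The sketch below models the bitwise combination; the bit values are modeled on Nagios' MODATTR_* constants, but treat the exact mapping as an assumption and verify it against the headers of your Nagios version before relying on it:

```python
# Assumed mapping of bit values to attribute names (MODATTR_*-style);
# verify against your Nagios sources before relying on it.
MODIFIED_ATTRIBUTE_BITS = {
    1: "notifications_enabled",
    2: "active_checks_enabled",
    4: "passive_checks_enabled",
    8: "event_handler_enabled",
    16: "flap_detection_enabled",
}

def decode_modified_attributes(mask):
    """Translate the integer from the modified_attributes column
    into the list of attribute names whose bits are set."""
    return [name for bit, name in sorted(MODIFIED_ATTRIBUTE_BITS.items())
            if mask & bit]
```

For example, a mask of 3 decodes to notifications_enabled plus active_checks_enabled.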

Example 1: Find all hosts with modified attributes:

GET hosts
Columns: host_name modified_attributes_list
Filter: modified_attributes != 0

Example 2: Find hosts where notifications have been actively disabled:

GET hosts
Columns: host_name modified_attributes_list
Filter: modified_attributes ~ notifications_enabled
Filter: notifications_enabled = 0

Example 3: Find hosts where active or passive checks have been tweaked:

GET hosts
Columns: host_name modified_attributes_list
Filter: modified_attributes ~~ active_checks_enabled,passive_checks_enabled

4.6. Combining Filters with And, Or and Negate

By default, a dataset must pass all filters to be displayed. Alternatively, you can combine a number of filters with a logical "or" operation by using the header Or:. This header takes an integer number X as argument and combines the last X filters into a new filter using an "or" operation. The following example selects all services which are in state 1 or in state 3:

GET services
Filter: state = 1
Filter: state = 3
Or: 2

The next example shows all non-OK services which are within a scheduled downtime or which are on a host with a scheduled downtime:

GET services
Filter: scheduled_downtime_depth > 0
Filter: host_scheduled_downtime_depth > 0
Or: 2

It is also possible to combine filters with an And operation. This is only necessary if you want to group filters together before "or"-ing them. If, for example, you want to get all services that are either critical and acknowledged or OK, this is how to do it:

GET services
Filter: state = 2
Filter: acknowledged = 1
And: 2
Filter: state = 0
Or: 2

The And: 2 header combines the first two filters into one new filter, which is then "or"ed with the third filter.

In version 1.1.11i2 the new header Negate: has been introduced. This logically negates the most recent filter. The following example displays all hosts that have neither an a nor an o in their name:

GET hosts
Filter: name ~ a
Filter: name ~ o
Or: 2
Negate:
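
The stack behaviour of Filter:, And:, Or: and Negate: can be modeled in a few lines of Python. This is an illustration of the semantics described above, not Livestatus code; the lambda predicates stand in for Filter: lines:

```python
def evaluate_filter_stack(headers, row):
    """Model of how Livestatus combines filter headers:
    each Filter: pushes its result onto a stack, And: n / Or: n
    pop the last n entries and push their logical combination,
    Negate: inverts the top entry. A row passes if every entry
    left on the stack is true."""
    stack = []
    for kind, arg in headers:
        if kind == "filter":
            stack.append(bool(arg(row)))
        elif kind == "and":
            parts = [stack.pop() for _ in range(arg)]
            stack.append(all(parts))
        elif kind == "or":
            parts = [stack.pop() for _ in range(arg)]
            stack.append(any(parts))
        elif kind == "negate":
            stack.append(not stack.pop())
    return all(stack)
```

With this model, the "critical and acknowledged, or OK" query above becomes two filters, an ("and", 2), a third filter and an ("or", 2).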

5. Stats and Counts

5.1. Why counting?

SQL has a statement "SELECT COUNT(*) FROM ..." which counts the number of rows matching certain criteria. LQL's Stats:-Header allows something similar. In addition it can retrieve several counts at once.

The Stats: header has the same syntax as Filter:, but a different meaning: instead of filtering the objects, it counts them. As soon as at least one Stats: header is used, no data is displayed anymore. Instead, a single row of data is output, with one column for each Stats: header showing the number of rows matching its criteria.

The following example outputs the numbers of services which are OK, WARN, CRIT or UNKNOWN:

query_6
GET services
Stats: state = 0
Stats: state = 1
Stats: state = 2
Stats: state = 3

An example output looks like this:

user@host:~$ unixcat /var/lib/nagios/rw/live < query_6
4297;13;9;0

You want to restrict the output to services to which the contact harri is assigned? No problem, just add a Filter: header:

query_7
GET services
Stats: state = 0
Stats: state = 1
Stats: state = 2
Stats: state = 3
Filter: contacts >= harri

5.2. Combining with and/or

Just like the Filter: headers, the Stats: headers can be combined with "and" and/or "or" operations. It is important to know that they form their own stack. You combine them with StatsAnd and StatsOr. Here is a somewhat more complex query that scans all services on hosts in the host group windows which are within their notification period and are not within a host or service downtime. It computes seven counts:

  1. The number of services with the hard state OK
  2. The number of unacknowledged services in hard state WARNING
  3. The number of acknowledged services in hard state WARNING
  4. The number of unacknowledged services in hard state CRITICAL
  5. The number of acknowledged services in hard state CRITICAL
  6. The number of unacknowledged services in hard state UNKNOWN
  7. The number of acknowledged services in hard state UNKNOWN
GET services
Filter: host_groups >= windows
Filter: scheduled_downtime_depth = 0
Filter: host_scheduled_downtime_depth = 0
Filter: in_notification_period = 1
Stats: last_hard_state = 0
Stats: last_hard_state = 1
Stats: acknowledged = 0
StatsAnd: 2
Stats: last_hard_state = 1
Stats: acknowledged = 1
StatsAnd: 2
Stats: last_hard_state = 2
Stats: acknowledged = 0
StatsAnd: 2
Stats: last_hard_state = 2
Stats: acknowledged = 1
StatsAnd: 2
Stats: last_hard_state = 3
Stats: acknowledged = 0
StatsAnd: 2
Stats: last_hard_state = 3
Stats: acknowledged = 1
StatsAnd: 2

In version 1.1.11i2 the new header StatsNegate: has been introduced. It takes no arguments and logically negates the most recent Stats: filter.

5.3. Grouping

Letting Livestatus count items is nice and fast. But in our examples so far the answer was restricted to one line of numbers for a predefined set of filters. In some situations you want to get statistics for each object from a certain set. You might want to display a list of hosts, and for each of these hosts the number of services which are OK, WARN, CRIT or UNKNOWN.

In such situations you can add the Columns: header to your query. There is a simple yet powerful notion behind it: you specify a list of columns of your table. The stats are then computed and displayed separately for each distinct combination of values of these columns.

The following query counts the number of services in the various states for each host in the host group windows:

GET services
Filter: host_groups >= windows
Stats: state = 0
Stats: state = 1
Stats: state = 2
Stats: state = 3
Columns: host_name

The output looks like this:

winhost01;7;0;0;0
winhost02;7;0;1;0
srvabc44;7;0;1;0
srvabc45;2;0;1;0
termsv1;7;0;1;0
termsv2;3;0;1;1

As you can see, an additional column was prepended to the output holding the value of the group column. Here is another example that counts the total number of services grouped by the check command (the dummy filter expression is always true, so each service is counted).

query
GET services
Stats: state != 9999
Columns: check_command

Here is an example output of that query:

root@linux# unixcat < query /var/lib/nagios/rw/live
check-mk;14
check-mk-dummy;12
check-mk-inventory;14
check-mk-ping;2
check_mk-cpu.loads;2
check_mk-cpu.threads;2
check_mk-df;11
check_mk-diskstat;24
check_mk-ifoperstatus;7
check_mk-kernel.util;12
check_mk-local;6
check_mk-logwatch;4
check_mk-mem.used;13
check_mk-netctr.combined;12
check_mk-netif.link;2
check_mk-netif.params;2

A third example shows another way of counting the total number of services grouped by their states, without an explicit Stats: header for each state:

query
GET services
Stats: state != 9999
Columns: state

And the output:

root@linux# unixcat < query /var/lib/nagios/rw/live
0;113
1;2
2;28

In that example none of the services was in the state UNKNOWN. Hence no count for that state was displayed.

One last note about grouping: the current implementation allows only columns of the types string or int to be used for grouping. Also, you are limited to one group column.

Note: prior to version 1.1.10 there was the header StatsGroupBy: instead of Columns:. That header is deprecated, though still working.
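
Answers of such grouped queries are straightforward to parse. The following is a minimal Python sketch for one group column, with an illustrative helper name (counts are converted to int; use float instead when aggregating avg/min/max stats):

```python
def parse_grouped_stats(answer):
    """Parse the answer of a Stats: query grouped via Columns::
    the first field of each line is the group value, the
    remaining fields are the counts."""
    result = {}
    for line in answer.splitlines():
        if line:
            fields = line.split(";")
            result[fields[0]] = [int(x) for x in fields[1:]]
    return result
```

Applied to the host group example above, the function returns a dict mapping each host name to its four state counts.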

6. Sum, Minimum, Maximum, Average, Standard Deviation

Starting with version 1.1.2, Livestatus supports some basic statistical operations. They allow you, for example, to query the average check execution time or the standard deviation of the check latency of all checks.

These operations use one of the keywords sum, min, max, avg, std, suminv or avginv. The following query displays the minimum, maximum and average check execution time of all service checks in state OK:

query
GET services
Filter: state = 0
Stats: min execution_time
Stats: max execution_time
Stats: avg execution_time

As with the "normal" stats-headers, the output can be grouped by one column, for example by the host_name:

query
GET services
Filter: state = 0
Stats: min execution_time
Stats: max execution_time
Stats: avg execution_time
Columns: host_name

In version 1.1.13i1 we introduced the aggregation functions suminv and avginv. They compute the sum or the average of the inverse of the values. For example the inverse of the check_interval of a service is the number of times it is checked per minute. The suminv over all services is the total number of checks that should be executed per minute, if no checks are being delayed.
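
The arithmetic is easy to reproduce client-side: three services with check_interval values of 1, 2 and 4 minutes are checked 1 + 0.5 + 0.25 = 1.75 times per minute in total. A small sketch (the helper names are illustrative, not part of Livestatus):

```python
def suminv(values):
    """Sum of the inverses of the values, e.g. the total number
    of checks per minute for a list of check intervals."""
    return sum(1.0 / v for v in values)

def avginv(values):
    """Average of the inverses of the values."""
    return suminv(values) / len(values)
```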

6.1. Performance Data

As of version 1.1.11i2, MK Livestatus now supports aggregation of Nagios performance data. Performance data is additional information output by checks, formatted as a string like user=6.934;;;; system=6.244;;;; wait=0.890;;;;. If you create a Stats-query using sum, min, max, avg or std on several services with compatible performance data, Livestatus will now aggregate these values into a new performance data string. Look at the following examples. First, a query of two services without aggregation:

query
GET services
Filter: description ~ CPU utilization
Columns: perf_data

Let's assume it produces the following output:

user=7.594;;;; system=5.814;;;; wait=0.923;;;;
user=6.934;;;; system=6.244;;;; wait=0.890;;;;

Here is the same query, but aggregating the data using the average:

query
GET services
Filter: description ~ CPU util
Stats: avg perf_data

This is the result:

system=6.02900000 user=7.26400000 wait=0.90650000
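The aggregated string above can be reproduced by hand. The following plain-Python sketch (with a hypothetical parse_perf helper, not part of Livestatus) averages the two perf_data strings from the first example:

```python
def parse_perf(perf_data):
    """Parse a perf_data string like 'user=7.594;;;;' into a name -> value dict."""
    values = {}
    for item in perf_data.split():
        name, rest = item.split("=", 1)
        values[name] = float(rest.split(";")[0])  # value before warn/crit/min/max
    return values

rows = [
    "user=7.594;;;; system=5.814;;;; wait=0.923;;;;",
    "user=6.934;;;; system=6.244;;;; wait=0.890;;;;",
]
parsed = [parse_perf(r) for r in rows]
avg = {name: sum(p[name] for p in parsed) / len(parsed) for name in parsed[0]}
# avg is {'user': 7.264, 'system': 6.029, 'wait': 0.9065}, matching the result above
```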

7. Output formatting and character encoding

Livestatus supports the output formats CSV, JSON and Python, with CSV being the default.

7.1. CSV output

CSV output comes in two flavors: csv (lowercase) and CSV (uppercase). For backwards compatibility reasons, the lowercase variant is the default, but it has quite a few quirks. The recommendation is to use the uppercase variant. And when you really need more structure in your data, you are much better off with JSON or Python.

csv output (broken)

Datasets are separated by linefeeds (ASCII 10), fields are separated by semicolons (ASCII 59), list elements (such as in contacts) are separated by commas (ASCII 44) and combinations of host name and service description are separated by a pipe symbol (ASCII 124).

In order to avoid problems with the default field separator semicolon appearing in values (such as performance data), it is possible to replace the separator characters with other symbols. This is done by specifying four integer numbers after the Separators: header. Each of those is the ASCII code of a separator in decimal. The four numbers mean:

  1. The dataset separator (default is 10: linefeed)
  2. The column separator (default is 59: semicolon)
  3. The separator for lists such as contacts in the hosts table (default is 44: comma)
  4. The separator for hosts and services in service lists (default is 124: vertical bar)

It is even possible to use non-printable characters as separators. The following example uses bytes with the values 0, 1, 2 and 3 as separators:

GET hosts
Columns: name address state
Separators: 0 1 2 3
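Splitting such a response back apart is straightforward. A small plain-Python sketch (the raw bytes are hypothetical sample data):

```python
DATASET_SEP, FIELD_SEP = b"\x00", b"\x01"  # as requested by "Separators: 0 1 2 3"

raw = b"alpha\x0110.0.0.1\x010\x00beta\x0110.0.0.2\x011\x00"  # hypothetical response
rows = [line.split(FIELD_SEP) for line in raw.split(DATASET_SEP) if line]
# rows == [[b'alpha', b'10.0.0.1', b'0'], [b'beta', b'10.0.0.2', b'1']]
```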

CSV output

This is the "real" CSV format (see RFC 4180) which is similar to the lowercase variant above, but with correct quoting and CR/LF as the dataset separator. Because of the quoting, there is no need for the Separators: header, so it is ignored for this format.

GET hosts
Columns: name address state
OutputFormat: CSV

7.2. JSON output

You can get your output in JSON format if you add the header OutputFormat: json, as in the following example:

GET hosts
Columns: name address state
OutputFormat: json

Like CSV, JSON is a text-based format, and the output is valid JavaScript code. In order to avoid redundancy and keep the overhead as low as possible, the output is not formatted as a list of objects (with key/value pairs), but as a list of lists (arrays in JSON speak). This is the recommended format in general, as it makes it extremely easy to handle structured data, and JSON parsers are available for basically every programming language out there.
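If you still want key/value access, you can zip the rows with the column names yourself. A small sketch with hypothetical sample data (it assumes the query was sent with ColumnHeaders: on, so that the first row holds the column names):

```python
import json

# Hypothetical JSON response for a hosts query, with ColumnHeaders: on
response = '[["name","address","state"],["alpha","10.0.0.1",0],["beta","10.0.0.2",1]]'

table = json.loads(response)
headers, data = table[0], table[1:]
hosts = [dict(zip(headers, row)) for row in data]
# hosts[0] == {'name': 'alpha', 'address': '10.0.0.1', 'state': 0}
```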

7.3. Python output

The Python format is very similar to the JSON format, but not 100% compatible. There are tiny differences in string prefixes and in how characters are escaped, and these even differ between Python 2 and Python 3. Therefore, two Pythonic formats are offered: python for Python 2 and python3 for, well, Python 3. You can directly eval() the Python output, but be aware of the potential security issues then. When in doubt, use JSON and json.loads from the standard json module.

GET hosts
Columns: name address state
OutputFormat: python
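If you do not want to risk eval(), the standard library offers ast.literal_eval, which only accepts Python literals and executes no code. A sketch with hypothetical sample data:

```python
import ast

# Hypothetical response in OutputFormat: python
response = "[['alpha', '10.0.0.1', 0], ['beta', '10.0.0.2', 1]]"

rows = ast.literal_eval(response)  # parses literals only; no code execution
# rows == [['alpha', '10.0.0.1', 0], ['beta', '10.0.0.2', 1]]
```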

7.4. Character encoding

Livestatus output in most cases originates from configuration files of Nagios (the object configuration). Nagios does not impose any restrictions on how these files have to be encoded (UTF-8, Latin-1, etc). If you select CSV output, then Livestatus simply returns the data as it is contained in the configuration files - with the same encoding.

When using JSON or Python, however, non-ASCII characters need to be escaped and properly encoded. Up to version 1.1.11i1, Livestatus automatically detects 2-byte UTF-8 sequences and assumes all other non-ASCII characters to be Latin-1 encoded. While this works well for western languages and to a certain degree "auto-detects" the encoding, it does not support languages using characters other than those in Latin-1. Even the € symbol does not work.

As of version 1.1.11i2, Livestatus' behaviour is configurable with the option data_encoding and now defaults to UTF-8 encoding. Three different settings are valid:

  • utf8 - All characters are assumed to be UTF-8 encoded. Sequences up to 4 bytes are correctly recognized and escaped. This is the new default behaviour.
  • latin1 - All characters are assumed to be Latin-1 encoded. Use this mode only if you are sure that your input files are Latin-1 encoded. You can also convert them to UTF-8 with recode latin1..utf8 somefile.cfg.
  • mixed - This option selects the legacy behaviour of versions before 1.1.11i2, which has been described above.

7.5. Column headers

Per default, if there is no Columns-header in your query, MK Livestatus displays the names of all columns as the first line of the output. With the header ColumnHeaders you can explicitly switch column headers on or off. The output of the following query will include column headers:

GET hosts
Columns: name alias address state
ColumnHeaders: on

7.6. Limiting the number of datasets

The Limit-header allows you to limit the number of datasets being displayed. Since MK Livestatus currently does not support sorting, you'll have to live with the Nagios-internal natural sorting of objects. Hosts, for example, are sorted according to their host names - just as in the standard CGIs. The following example will output just the first 10 hosts:

GET hosts
Limit: 10

Please note that the Limit-header is also applied when doing Stats. I'm not sure if there is any use for that, but that's the way MK Livestatus behaves. The following example will count how many of the first 10 hosts are up:

GET hosts
Stats: state = 0
Limit: 10

If using filters, the Limit-header limits the number of datasets actually being output. The following query outputs the first 10 hosts which are down:

GET hosts
Filter: state = 1
Limit: 10

8. Authorization

Since version 1.1.3, Livestatus supports addon developers by helping to implement authorization. You can let Livestatus decide whether a certain contact may see data or not. This is very simple to use. All you need to do is to add an AuthUser header to your query with the name of a Nagios contact as its single argument. If you do that, Livestatus will only display data that this contact is authorized for - either directly or via a contact group. Example:

GET services
Columns: host_name description contacts
AuthUser: harri

In certain cases it would be possible to replace AuthUser with a Filter header. But that does not work (precisely) in all situations.

8.1. Configuration

If your addon uses AuthUser, the administrator has a way to configure authorization details via nagios.cfg - and thus can do this uniformly across all addons using Livestatus. Currently two configuration options are available. Both can be set either to strict or loose:

  • service_authorization (default: loose) - Nagios automatically regards a contact for a host also as a contact for all services of that host. We call this method loose. By setting it to strict, one must be an explicit contact of a service in order to see it. Please note that Nagios makes all services that do not have any contact at all inherit all contacts of the host - regardless of whether this option is set to strict or loose.
  • group_authorization (default: strict) - Nagios lets a contact see a host group or service group only if he is a contact for all members of that group. We call that method strict. By setting it to loose it is sufficient to be a contact for at least one member of the group in order to see the group itself.

8.2. Tables supporting AuthUser

The following tables support the AuthUser header (others simply ignore it): hosts, services, hostgroups, servicegroups and log. The log-table applies AuthUser only to entries of the log classes 1 (host and service alerts), 3 (notifications) and 4 (passive checks). All other classes are not affected.

8.3. Limitations

Currently the AuthUser-header only controls which rows of data are output and has no impact on list columns, such as the groups column in the table services. This means that this column also lists service groups the contact might not be a contact for. This might be changed in a future version of Livestatus.

9. Waiting

Starting with version 1.1.3 Livestatus has a new and still experimental feature: Waiting. Waiting allows developers of addons to delay the execution of a query until a certain condition becomes true or a Nagios event happens. This allows the implementation of a new class of features in addons, for example:

  • An immediate update of a status display as soon as the status of any or one specific Nagios object changes.
  • A logfile ticker showing new log messages immediately.
  • An action button for rescheduling the next check of a service, which does not redisplay the service until it has actually been checked.

All that can be implemented without polling - and in a very simple way. All you have to do is use a few new query headers:

  • WaitObject (object name) - This header specifies an object within the table the query is about. For the tables hosts, hostgroups, servicegroups, contacts and contactgroups this is simply the name of the object. For the table services it is the host name, followed by a space, followed by the service description. Other tables do not support this header since they only contain immutable data sets.
    Note: as of version 1.1.11i3 it is allowed to separate the host name and the service description with a semicolon. This allows host names to contain spaces.
  • WaitCondition (filter condition) - This header specifies a filter on the object specified with WaitObject. The syntax and functionality are identical to the Filter-header. By setting a wait condition you delay the execution of the query until the conditional expression is true for the object specified with WaitObject. Specifying multiple condition headers is allowed: all conditions are combined with a boolean and (just as with the Filter-header).
  • WaitConditionOr (integer number) - This combines the previous N wait conditions into one new wait condition by applying a boolean or operation. It works exactly like Or.
  • WaitConditionAnd (integer number) - This combines the previous N wait conditions into one new wait condition by applying a boolean and operation. It works exactly like And.
  • WaitConditionNegate (no argument) - New in version 1.1.11i2: This logically negates the most recent wait condition.
  • WaitTrigger (keyword) - Livestatus uses neither polling nor a busy wait loop while waiting for the condition to become true. It rather makes use of Nagios' event broker messages. Each time a certain type of event happens, the condition is rechecked. By specifying a specific trigger, you help Livestatus decide which events are relevant and thus avoid internal overhead. It is also possible to wait for a trigger alone - without a WaitCondition. You will find a table of allowed triggers later in this document.
  • WaitTimeout (integer number) - This optional header specifies a timeout in milliseconds which imposes an upper limit on the time being waited. After the timeout the query is executed regardless of any wait condition. A value of 0 (which is the default) turns off the timeout.

The following triggers are available for the WaitTrigger-Header:

  • check - a service or host check has been executed
  • state - the state of a host or service has changed
  • log - a new message has been logged into nagios.log
  • downtime - a downtime has been set or removed
  • comment - a comment has been set or removed
  • command - an external command has been executed
  • program - a change in a global program setting, like enable_notifications
  • all - any of the above events happens (this is the default)

9.1. Examples

Retrieve log messages since a certain timestamp, but wait until at least one new log message appears:

GET log
Filter: time >= 1265062900
WaitTrigger: log

The same, but do not wait longer than 2 seconds:

GET log
Filter: time >= 1265062900
WaitTrigger: log
WaitTimeout: 2000

Retrieve the complete data about the host xabc123, but wait until its state is critical:

GET hosts
WaitObject: xabc123
WaitCondition: state = 2
WaitTrigger: state
Filter: host_name = xabc123

Get data about the service Memory used on host xabc123 as soon as it has been checked some time after 1265062900:

GET services
WaitObject: xabc123 Memory used
WaitCondition: last_check > 1265062900
WaitTrigger: check
Filter: host_name = xabc123
Filter: description = Memory used

10. Compensating timezone differences

When doing multi-national distributed monitoring with Livestatus you might have to deal with situations where your monitoring servers are running in different time zones. In an ordinary setup all servers will have the same system time but different configured time zones. You can check this by calling on each monitoring server:

user@host:~$ date +%s

This command should output the same value on all servers. If not, one of your systems is probably set to a wrong time zone. MK Livestatus can help to compensate the time difference in such situations. If you add the header

Localtime: 1269886384

to your query with your current local time (the output of date +%s) as an argument, Livestatus will compare its local time against that of the caller and convert all timestamps accordingly.

Please note that Livestatus assumes that a difference in time is not due to clock inaccuracy but due to timezone differences. The delta time computed for compensating will be rounded to the nearest half hour.
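The described compensation can be sketched as follows (an illustration of the documented rounding behaviour in plain Python, not the actual Livestatus implementation):

```python
HALF_HOUR = 1800  # seconds

def timezone_offset(server_now, client_localtime):
    """Round the raw clock difference to the nearest half hour, as described above."""
    delta = client_localtime - server_now
    return int(round(delta / HALF_HOUR)) * HALF_HOUR

# A client whose clock reads 5 hours 29 minutes ahead is treated as 5.5 hours ahead:
offset = timezone_offset(1269886384, 1269886384 + 5 * 3600 + 29 * 60)
# offset == 19800 seconds, i.e. 5.5 hours
```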

11. Response Header

If your request is not valid or some other error occurs, a message is printed to the logfile of Nagios. If you want to write an API that displays error messages to the user, you need information about errors as part of the response.

You can get such behaviour by using the header ResponseHeader. It can be set to off (default) or to fixed16:

GET hosts
ResponseHeader: fixed16

Other types of response headers might be implemented in future versions. The fixed16-header has the advantage that it is exactly 16 bytes long. This makes it easy to program an API: you can simply read in 16 bytes and need not scan for a newline or the like. Here is a complete example session with response headers activated:

user@host:~$ unixcat /var/lib/nagios/rw/live
GET hirni
ResponseHeader: fixed16

404          43
Invalid GET request, no such table 'hirni'

The fixed16 response header has the following format:

  • Bytes 1-3: status code
  • Byte 4: a single space
  • Byte 5-15: the length of the response as an ASCII coded integer number, padded with spaces
  • Byte 16: a linefeed character (ASCII 10)
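Decoding this header is straightforward in any language; a Python sketch:

```python
def parse_fixed16(header):
    """Split a 16-byte fixed16 response header into (status_code, length)."""
    assert len(header) == 16 and header[15:16] == b"\n"
    status = int(header[0:3])    # bytes 1-3: status code
    length = int(header[4:15])   # bytes 5-15: length, padded with spaces
    return status, length

header = b"404" + b" " * 10 + b"43\n"  # the example header from the session above
status, length = parse_fixed16(header)
# status == 404, length == 43
```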

These are the possible values of the status code:

  • 200 - OK. Response contains the queried data.
  • 400 - The request contains an invalid header.
  • 403 - The user is not authorized (see AuthUser).
  • 404 - The target of the GET has not been found (e.g. the table).
  • 450 - A non-existing column was referred to.
  • 451 - The request is incomplete.
  • 452 - The request is completely invalid.

The response contains the queried data only if the status code is 200. In all other cases the response contains the error message. In that case the length field gives the length of the error message including the trailing linefeed. The error message is not JSON-encoded, even if you set that in the OutputFormat-header.

12. Keep alive (persistent connections)

MK Livestatus allows you to keep a connection open and reuse it for several requests. In order to do that you need to add the following header:

KeepAlive: on

Livestatus will then keep the connection open after sending its response and wait for a new query. You will probably also want to activate a response header in that case, since only that allows you to determine the exact length of the response (without KeepAlive you can simply read until end of file).

KeepAlive: on
ResponseHeader: fixed16

Please note that keeping up a connection permanently occupies resources within the Nagios process. In the current version Livestatus is limited to ten parallel persistent connections. This is different from the way persistent database connections are handled.

The proposed way to use persistent connections in web applications is to keep the connection open only during the current request and close it after the complete result page has been rendered. The reason is that bringing up a database connection is a much more costly operation than connecting to MK Livestatus.

13. Access to Logfiles

Since version 1.1.1 Livestatus provides transparent access to your Nagios logfiles, i.e. nagios.log and the rotated files in archives (you might have defined an alternative directory in nagios.cfg). Livestatus keeps an index over all log files and remembers which period of time is kept in which log file. Please note that Livestatus does not depend on the name of the log files (while Nagios does). This way Livestatus has no problem if the log file rotation interval is changed.

The Livestatus table log is your access to the logfiles. Every log message is represented by one row in that table.

13.1. Performance issues

If your monitoring system is running for a couple of years, the number of log files and entries can get very large. Each Livestatus query to the table log has the potential of scanning all historic files (although an in-memory cache tries to avoid reading files again and again). It is thus crucial that you use Filter: in order to restrict:

  • The time interval
  • The log classes in question

If you set no filter on the column time, then all logfiles will be loaded - regardless of other filters you might have set.

Setting a filter on the column class restricts the types of messages loaded from disk. The following classes are available:

  • 0 - All messages not in any other class
  • 1 - host and service alerts
  • 2 - important program events (program start, etc.)
  • 3 - notifications
  • 4 - passive checks
  • 5 - external commands
  • 6 - initial or current state entries
  • 7 - program state change

Werk #0336: Limit the number of lines read from a single logfile.

14. RRD Files of PNP4Nagios

New in 1.1.9i3: In order to improve the integration between Multisite and PNP4Nagios, Livestatus introduces the new column pnpgraph_present in the tables hosts and services (and all other tables containing host_ or service_ columns). That column can have three possible values:

  • -1 - Information about PNP graphs is not available. The reason is that either the module option pnp_path is not specified or the directory it points to is not readable.
  •  0 - No PNP graph is available for this host or service.
  •  1 - A PNP graph is available.

Livestatus cannot detect the base directory to your RRD files automatically, so you need to configure it with the module option pnp_path:

nagios.cfg
broker_module=/usr/local/lib/mk-livestatus/livestatus.o \
  pnp_path=/var/lib/pnp4nagios/perfdata /var/lib/nagios/rw/live
event_broker_options=-1

In order to determine the availability of a PNP graph, Livestatus checks for the existence of PNP's XML file.

A note for OMD users: OMD automatically configures this option correctly in etc/mk-livestatus/nagios.cfg. You need at least a daily snapshot of 2010-12-17 or later for using the new feature.

15. Expansion of macros

Nagios allows you to embed macros within your configuration. For example, it is common to embed $HOSTNAME$ and $SERVICEDESC$ into your action_url or notes_url when configuring links to a graphing tool.

As of version 1.1.1 Livestatus supports expansion of macros in several columns of the table hosts and services. Those columns - for example notes_url_expanded - bear the same name as the unexpanded columns but with _expanded suffixed.

Macro expansion in Nagios is very complex, and unfortunately the Nagios code for it is not thread-safe. Livestatus therefore has its own implementation of macro expansion, which does not support all features of Nagios, but (nearly) all that are needed for visualization addons. Livestatus supports the following macros:

  • for hosts and services: HOSTNAME, HOSTDISPLAYNAME, HOSTALIAS, HOSTADDRESS, HOSTOUTPUT, LONGHOSTOUTPUT, HOSTPERFDATA, HOSTCHECKCOMMAND
  • for services: SERVICEDESC, SERVICEDISPLAYNAME, SERVICEOUTPUT, LONGSERVICEOUTPUT, SERVICEPERFDATA, SERVICECHECKCOMMAND
  • all custom macros on hosts and services (beginning with _HOST or _SERVICE)
  • all $USER...$ macros

16. Remote access to Livestatus via SSH or xinetd

16.1. Livestatus via SSH

Livestatus currently does not provide a TCP socket. Another (and more secure) way of remotely accessing the unix socket is via SSH. The following example sends a query via SSH. The only privilege the remote user needs is write access to the unix socket:

user@host:~$ ssh < query nagios@10.0.0.14 "unixcat /var/lib/nagios/rw/live"
ZWEI;NIC eth0 link;2
ZWEI;NIC eth0 parameter;2
Zwei;NIC eth0 link;2
Zwei;NIC eth0 parameter;2
laptop;Check_MK;2
laptop;Interface eth5;2
laptop;Interface eth6;2
laptop;Interface eth7;2
localhost;FILES_in_/bin;2

16.2. Livestatus via xinetd

Using xinetd and unixcat you can bind the socket of Livestatus to a TCP socket. Here is an example configuration for xinetd:

/etc/xinetd.d/livestatus
service livestatus
{
	type		= UNLISTED
	port		= 6557
	socket_type	= stream
	protocol	= tcp
	wait		= no
# limit to 100 connections per second. Disable 3 secs if above.
	cps             = 100 3
# set the number of maximum allowed parallel instances of unixcat.
# Please make sure that this value is at least as high as
# the number of threads defined with num_client_threads in
# etc/mk-livestatus/nagios.cfg
        instances       = 500
# limit the maximum number of simultaneous connections from
# one source IP address
        per_source      = 250
# Disable TCP delay, makes connection more responsive
	flags           = NODELAY
	user		= nagios
	server		= /usr/bin/unixcat
	server_args     = /var/lib/nagios/rw/live
# configure the IP address(es) of your Nagios server here:
#	only_from       = 127.0.0.1 10.0.20.1 10.0.20.2
	disable		= no
}

You can access your socket for example with netcat:

user@host:~$ netcat 10.10.0.141 6557 < query_6
4297;13;9;0

17. Timeouts

In version 1.1.7i3 the handling of timeouts has changed. There are now two configurable timeouts which protect Livestatus from broken clients hanging on the line forever (remember that the maximum number of parallel connections is configurable but limited):

  • idle_timeout - Limits the time Livestatus waits for a (the next) query
  • query_timeout - Limits the time a query needs to be read

A Livestatus connection has two states: either Livestatus is waiting for a query, or it is reading one. Waiting is the case just after the client has connected, but also in KeepAlive-mode after the response has been sent. The client then has at most idle_timeout ms to start the next query. The default is 300000 (300 seconds, i.e. 5 minutes). If a client is idle for more than that, Livestatus simply closes the connection.

As soon as the first byte of a query has been read, Livestatus enters the state "reading query" and uses a much shorter timeout: the query_timeout. Its default value is 10000 (10 secs). If the client does not complete the query within this time, the client is regarded dead and the connection is closed.

Both timeout values can be configured via Nagios module options in nagios.cfg. A timeout can be disabled by setting its value to 0. But be warned: broken clients can then hang connections forever and thus block Livestatus threads.

18. Sending commands via Livestatus

MK Livestatus supports sending Nagios commands. This is very similar to the Nagios command pipe, but very useful for accessing a Nagios instance via a remote connection.

You send commands via the basic request COMMAND followed by a space and the command line in exactly the same syntax as needed for the Nagios pipe. No further header fields are required nor allowed.

Livestatus keeps the connection open after a command and waits for further commands or GET-requests. It behaves like GET with KeepAlive: set to on. That way you can send a bunch of commands in one connection - just as with the pipe. Here is an example of sending a command from the shell via unixcat:

root@linux# echo "COMMAND [$(date +%s)] START_EXECUTING_SVC_CHECKS" \
     | unixcat /var/lib/nagios/rw/live
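From a program, the same command can be composed and sent like this (Python sketch; the socket path is the usual example path and an assumption):

```python
import socket
import time

def format_command(command):
    """Build a Livestatus COMMAND line with the required timestamp."""
    return "COMMAND [%d] %s\n" % (int(time.time()), command)

def send_command(path, command):
    """Send a single external command over the Livestatus unix socket."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        sock.sendall(format_command(command).encode("utf-8"))
    finally:
        sock.close()

# send_command("/var/lib/nagios/rw/live", "START_EXECUTING_SVC_CHECKS")
line = format_command("START_EXECUTING_SVC_CHECKS")
```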

Just as with GET, a query is terminated either by closing the connection or by sending a newline. COMMAND automatically implies keep alive and behaves like GET with KeepAlive set to on. That way you can mix GET and COMMAND queries in one connection.

19. Stability and Performance

19.1. Stability

While early versions of MK Livestatus experienced some stability issues - not unusual for evolving software - nowadays it can be considered rock solid. There are no known problems with performance, crashes or a hanging Nagios, as long as two important requirements are fulfilled:

  • Environment macros have to be disabled in nagios.cfg. This is done with:

    nagios.cfg
    enable_environment_macros=0
    

  • The debug level of Livestatus is set to 0 (which is the default).

19.2. Performance

Livestatus is considerate of your CPU and disk resources. In fact, it does not do any disk IO at all - as long as the table log is not accessed, which needs read access to the Nagios log files. CPU is only consumed during actual queries, and even for large queries we rather speak of microseconds than of milliseconds of CPU usage. Furthermore, Livestatus does not block Nagios during the execution of a query but runs completely in parallel - and scales to all available CPU cores if necessary.

20. 1.1.9i3 Timeperiod transitions

Version 1.1.9i3 introduces a little new feature that does not really have anything to do with status queries, but is very helpful for creating availability reports and was easy to implement in Livestatus (due to its timeperiod cache).

Each time a timeperiod changes from active to not active or vice versa, an entry in the Nagios logfile is created. At the start of Nagios the initial states of all timeperiods are also logged. This looks like this:

nagios.log
[1293030953] TIMEPERIOD TRANSITION: 24x7;-1;1
[1293030953] TIMEPERIOD TRANSITION: none;-1;0
[1293030953] TIMEPERIOD TRANSITION: workhours;-1;1

When a transition occurs, one line is logged (here the state changed from 1 (in) to 0 (out)):

[1293066460] TIMEPERIOD TRANSITION: workhours;1;0
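Such lines are easy to parse programmatically; a small Python sketch:

```python
def parse_transition(line):
    """Split a 'TIMEPERIOD TRANSITION' log line into (timestamp, name, old, new)."""
    timestamp = int(line[line.index("[") + 1 : line.index("]")])
    payload = line.split("TIMEPERIOD TRANSITION: ", 1)[1]
    name, old_state, new_state = payload.strip().split(";")
    return timestamp, name, int(old_state), int(new_state)

parse_transition("[1293066460] TIMEPERIOD TRANSITION: workhours;1;0")
# -> (1293066460, 'workhours', 1, 0)
```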

With that information it is later possible to determine which timeperiods were active when an alert happened. That way you can make availability reports reflect only certain time periods.

21. Host and Service Availability

21.1. Introduction

Version 1.2.1i2 introduces the new table statehist which supports availability queries - providing statistical information for hosts and services. Besides the state information, this table returns the duration of each state. In addition, the percentage of this duration with respect to the query timeframe can be returned.

Each state change creates an output line with the respective duration. Additional columns show the part (percentage) of this duration in comparison to the queried timeframe. To get the overall percentage of a specific state you can use the Stats: header to accumulate the percentage fields of multiple lines.

21.2. Absence of hosts and services

To correctly identify the absence of hosts and services within the queried timeframe, it is necessary to set the following parameter:

nagios.cfg
log_initial_states=1

Setting this parameter to 1 results in the initial state of each host and service being logged during program startup. By evaluating each startup it is possible to detect if a host or service is no longer monitored by the system. It is even possible to detect if a host or service was temporarily removed from monitoring for a specific time. The absence of a host or service is reflected in the output line within the state column as -1 (UNMONITORED).

The setting log_initial_states=1 is the default as of version 1.2.1i2.
Disabling this parameter leads to fewer logfile entries at program startup, but limits the correct detection of the UNMONITORED state.

21.3. Table statehist

column - description
host_name Host name
service_description Service description
state The state of the host or service in question
time Time of the log event (seconds since 1/1/1970)
from Start time of state (seconds since 1/1/1970)
until End time of state (seconds since 1/1/1970)
duration Duration of state (until - from)
duration_ok Duration of state OK (until - from)
duration_warning Duration of state WARNING (until - from)
duration_critical Duration of state CRITICAL (until - from)
duration_unknown Duration of state UNKNOWN (until - from)
duration_unmonitored Duration of state UNMONITORED (until - from)
duration_part Duration part compared to the query timeframe
duration_part_ok Duration part OK compared to the query timeframe
duration_part_warning Duration part WARNING compared to the query timeframe
duration_part_critical Duration part CRITICAL compared to the query timeframe
duration_part_unknown Duration part UNKNOWN compared to the query timeframe
duration_part_unmonitored Duration part UNMONITORED compared to the query timeframe
in_downtime Shows if the host or service is in downtime
in_host_downtime Shows if the host of this service is in downtime.
host_down Shows if the host of this service is down.
is_flapping Shows if the host or service is flapping
in_notification_period Shows if the host or service is within the notification period
notification_period Shows host or service notification period
log_output Logfile output relevant for this line
current_host_* Joined data from host, only if host still exists
current_service_* Joined data from service, only if service still exists


Querying the table statehist results in output which shows a host's/service's states as mentioned above and, in addition, how long the host/service resided in each state.

Important:
A query always requires a filter for the start time. Otherwise Livestatus would parse all available logfiles from the beginning, which might add up to several hundred megabytes.

query
GET statehist
Columns: host_name service_description state duration duration_part
Filter: host_name = klappkiste
Filter: service_description = CPU load
Filter: time >= 1348657741
Filter: time < 1348658033

This outputs one line for each state change of this service, joined with the duration information:

klappkiste;CPU load;0;76;2.6116838488e-01
klappkiste;CPU load;2;47;1.6151202749e-01
klappkiste;CPU load;0;20;6.8728522337e-02
klappkiste;CPU load;1;11;3.7800687285e-02
klappkiste;CPU load;2;29;9.9656357388e-02
klappkiste;CPU load;0;108;3.7113402062e-01


By using the Stats: header these lines can be accumulated, which allows outputting, for each distinct state, its total duration and its duration_part with respect to the queried timeframe.

query
GET statehist
Columns: host_name service_description state
Filter: host_name = klappkiste
Filter: service_description = CPU load
Filter: time >= 1348657741
Filter: time < 1348658033
Stats: sum duration
Stats: sum duration_part

Results in:

klappkiste;CPU load;0;204;7.0103092784e-01
klappkiste;CPU load;1;11;3.7800687285e-02
klappkiste;CPU load;2;76;2.6116838488e-01


Using the columns duration_part_ok, duration_part_warning and duration_part_critical allows you to output the entire state information within a single line:

query
GET statehist
Columns: host_name service_description
Filter: host_name = klappkiste
Filter: service_description = CPU load
Filter: time >= 1348657741
Filter: time < 1348658033
Stats: sum duration_ok
Stats: sum duration_warning
Stats: sum duration_critical
Stats: sum duration_part_ok
Stats: sum duration_part_warning
Stats: sum duration_part_critical

Results in:

klappkiste;CPU load;204;11;76;7.0103092784e-01;3.7800687285e-02;2.6116838488e-01

Converting the part values into percentages, the SLA information for this service is:

  • 70.1% OK
  • 3.8% WARNING
  • 26.1% CRITICAL
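These percentages follow directly from the duration_part values of the previous result (a quick arithmetic check in Python):

```python
parts = {
    "OK": 7.0103092784e-01,
    "WARNING": 3.7800687285e-02,
    "CRITICAL": 2.6116838488e-01,
}

percent = {state: round(value * 100, 1) for state, value in parts.items()}
# percent == {'OK': 70.1, 'WARNING': 3.8, 'CRITICAL': 26.1}
# The three parts together cover (almost exactly) the whole timeframe.
```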