Guidelines for writing checks for the official distribution
Last updated: March 04, 2011
Every check plugin that is part of Check_MK must have:
A check man page
A WATO rule set if the check has parameters
A metric definition with graphs and Perf-O-Meter if the check outputs performance data
A configuration for the agent bakery if the check has an agent plugin (CEE only)
A complete example agent output (agent based) or a cmk --snmpwalk (SNMP based) in MK's internal archive
The check types should be named short and unique. They must consist only of lower case characters, digits and underscores and begin with a lower case character.
Checks where one item of the check represents one thing (e.g. fan, power supply) should be named in singular, e.g. casa_fan, if, oracle_tablespace. Checks where each item checks a quantity, e.g. number of logins, should be named in plural (e.g. user_logins, printer_pages). Note: for historical reasons, many existing check types are named contrary to this rule. That does not mean that new checks should be named inconsistently as well!
Vendor specific checks must be prefixed with a unique vendor specific abbreviation (which you choose yourself). Example: fsc_ for Fujitsu Siemens Computers.
Product specific checks must be prefixed with a product abbreviation, for example steelhead_status for a Steelhead appliance of Riverbed.
SNMP based checks: if the check makes use of a standardized MIB which is or might be implemented by more than one vendor, then the check should not be named after the vendor but after the MIB. Examples are the hr_* checks.
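The character rules for check type names can be verified mechanically. The following is a minimal sketch and not part of Check_MK: the function name is_valid_check_name is hypothetical, and it assumes that "lower case characters" means the ASCII letters a to z. It does not check uniqueness or the vendor and product prefixes.

```python
import re

# Derived from the naming rule: the first character is a lower case letter,
# all following characters are lower case letters, digits or underscores.
# Assumption: only ASCII letters count as lower case characters.
_CHECK_NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_check_name(name):
    """Return True if name satisfies the character rules for check types."""
    return _CHECK_NAME_PATTERN.fullmatch(name) is not None
```

Names such as oracle_tablespace or user_logins pass this test, while names containing upper case letters or hyphens, or starting with a digit, do not.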
3. Check Layout
All checks must follow the same layout specified below:
4. Coding style
4.1. Add an author
4.2. Readability, looks and indents.
Avoid long lines. Ideally, your lines shouldn't exceed 100 chars.
4.3. File Header
For checks which are supposed to be part of the official Check_MK project, the file header with the copyright information must be present. It is created automatically if you call 'make headers' in the main source directory.
4.4. Example agent output
Including example output of the agent is very helpful for understanding how the check parser works.
TCP-Agent based checks must include an output example of the agent. If the agent output can have different formats or output styles, then put an example for each kind of style the check supports (e.g.: the output of multipath -l has changed its layout between SLES 10 and SLES 11).
4.5. Use of lambda functions
When it comes to parse_function, inventory_function and check_function, the usage of lambda functions is only allowed in order to reuse existing functions while providing some additional argument. Example:
"inventory_function" : lambda info: inventory_foobar_generic(info, "temperature")
It is not allowed to implement the function itself as a lambda expression. Example:
# This is bad, ugly and unreadable code!!
'check_function' : lambda _no_item, _no_params, info: \
    (0, "Memory used: %s" % get_bytes_human_readable(int(info))),
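A minimal sketch of how the two cases might look when written according to this rule. The check names foobar_temp and foobar_mem, the layout of info, and the local get_bytes_human_readable stand-in are assumptions made so that the example runs on its own; they are not taken from Check_MK.

```python
def get_bytes_human_readable(num_bytes):
    # Stand-in for the Check_MK helper of the same name, defined here only
    # so that the sketch is self-contained. The real helper may format
    # its output differently.
    return "%.2f kB" % (num_bytes / 1024.0)


def inventory_foobar_generic(info, sensor_type):
    # Assumed layout: each line is [sensor_type, sensor_name, value].
    return [(name, None) for kind, name, _value in info if kind == sensor_type]


def check_foobar_mem(_no_item, _no_params, info):
    # Assumed layout: one line with one column holding the used bytes.
    used_bytes = int(info[0][0])
    return 0, "Memory used: %s" % get_bytes_human_readable(used_bytes)


check_info = {}

# Allowed: the lambda only passes an additional argument to a named function.
check_info["foobar_temp"] = {
    "inventory_function": lambda info: inventory_foobar_generic(info, "temperature"),
}

# Preferred over a lambda body: the logic lives in a named function.
check_info["foobar_mem"] = {
    "check_function": check_foobar_mem,
}
```

The named function can carry comments, be tested on its own, and shows up with a meaningful name in tracebacks.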
4.6. Loops over SNMP agent output
For SNMP checks that loop over the agent output, do not write:
for line in info:
    if line != '' and line ...
Always unpack into usefully named variables:
for sensor_id, sensor_state, foo, bar in info:
    if sensor_state != 1 and sensor_id ...
5. Configuration variables
Configuration variables for main.mk should be named after the check if they are only used by this check. This does not hold for variables that are used by several checks (e.g. filesystem_default_levels is used by df, hr_fs, df_netapp, ...).
The variable that is used for the check's default parameters and entered in the inventory function must be named CHECKTYPE_default_levels (if not used by more than one check, see above). Example: check foo_bar has the configuration variable foo_bar_default_levels.
If a check does not use check parameters, the inventory function must return None as parameter and the check function must name the parameter argument _no_params.
6. Other details / required practices
6.1. Setting default values for configuration variables
Default values for check parameters (e.g. switch_cpu_default_levels) must be chosen in a way that they make sense for everybody, not just for your special case. In case you are unsure, rather choose too loose than too tight levels. This helps avoid false alarms.
If you set default values, add a short comment about how you came to choose said values. If they are merely a rough estimate, document that. If you got them from a specific source, document where you got them.
6.2. Reuse of configuration variables
6.3. Error handling
Your check should assume that the agent is always producing valid data. It should not try to handle cases when the agent output is broken. Reason: broken agent output is already handled by Check_MK via Python exceptions. Intercepting these exceptions in your check code makes debugging of broken outputs much more difficult.
6.4. int() vs. saveint() and float() vs. savefloat()
int() will throw an exception if the argument is not a valid number string (or if it is empty). Check_MK will catch the exception and make the check result "UNKNOWN" with an appropriate error message. saveint(), however, will assume 0 if the argument cannot be converted to a valid integer.
Use saveint() in all cases when you know or suspect that your device may supply invalid data, but the check should work with the rest of the data and produce useful results. Disadvantage: you may never find out that the device has supplied invalid data, because the check won't tell you!
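The difference can be illustrated with a small sketch. The saveint() below is a local stand-in written to match the behaviour described in this section (fall back to 0); it is an assumption and not a copy of the Check_MK implementation.

```python
def saveint(value):
    # Local stand-in: return 0 whenever the value cannot be converted.
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def parse_strict(raw):
    # Raises ValueError for "" or "n/a", so broken data becomes visible.
    return int(raw)


def parse_lenient(raw):
    # Silently maps "" or "n/a" to 0, so the rest of the check keeps working.
    return saveint(raw)
```

With the strict variant, an empty column surfaces as an exception. With the lenient variant, it is indistinguishable from a genuine reading of 0, which is exactly the disadvantage noted above.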
6.5. Interpretation of levels
Many checks have parameters defining warning and critical levels which are compared to an actual value. Please observe the following important rules and conventions if you are writing such checks.
Warning and critical levels should always be checked with >= and <=. Example: a check monitors the length of a mail queue. The critical upper level is at 100. This means that if the length is exactly 100, the check should already be critical. There might be a few exceptions to this where this wouldn't make sense.
If there are just upper or just lower levels, the input fields of the WATO ruleset definitions for such levels must be labelled Warning at ______, and Critical at ______.
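The inclusive comparison can be sketched as follows. The function names are hypothetical, and the state numbers 0, 1 and 2 for OK, WARN and CRIT are assumed here.

```python
def state_for_upper_levels(value, warn, crit):
    """Return 0 (OK), 1 (WARN) or 2 (CRIT) for upper levels."""
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


def state_for_lower_levels(value, warn, crit):
    """Return 0 (OK), 1 (WARN) or 2 (CRIT) for lower levels."""
    if value <= crit:
        return 2
    if value <= warn:
        return 1
    return 0
```

For the mail queue example, state_for_upper_levels(100, 80, 100) is already critical, because the level itself counts as reached.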
6.6. return versus yield
A check function producing several subresults (e.g. current usage and growth) must use the yield statement for returning these results. On the other hand, a check generating exactly one result must use return.
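A minimal sketch of both styles. The check names and the layout of info are hypothetical and chosen only for illustration.

```python
def check_foo_queue(_no_item, _no_params, info):
    # Exactly one result: use return.
    # Assumed layout: one line with one column, the queue length.
    queue_length = int(info[0][0])
    return 0, "Queue length: %d" % queue_length


def check_foo_volume(item, _no_params, info):
    # Several subresults: use yield, one per subresult.
    # Assumed layout: each line is [name, used_percent, growth_mb].
    for name, used_percent, growth_mb in info:
        if name == item:
            yield 0, "Usage: %.1f%%" % float(used_percent)
            yield 0, "Growth: %.1f MB" % float(growth_mb)
```

Calling check_foo_volume returns a generator, so the subresults are obtained by iterating over it.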
6.7. check_info[...] keys
7. Plugin output
Each check returns one line of text: the plugin output (sometimes also called check output). In order to unify things, the output must be formatted according to the following rules:
When returning measurement values, place exactly one space between the value and the unit (e.g. 17.3 V). Only exception: put no space before a percent sign (correct: 89.4%).
When returning measurement values, capitalize the names of the quantities, then add the value separated by a colon. Examples: Voltage: 24.5 V, Phase: negative, Flux-Capacitor: operational
Do not directly use return codes or cryptic return strings internal to the device. Instead, try to translate them to human readable messages. Example: Instead of routeMonitorFail use route monitor has failed
If you have an additional item specific description (e.g. interface checks: Interface 2 - [eth0]) then this description should be the first part of the plugin output in the following format: [eth0] (up) MAC: xx:xx:xx:xx:xx:xx, 1 Gbit/s, in: 1.27 kB/s(0.0%), out: 582.79 B/s(0.0%)
Standard enumeration format in the English language: it is very common to capitalize the first word in each sentence. Outputs like Swap used: ..., Total virtual memory used: ... are close to being two sentences, thus we always capitalize the first word.
8. Performance data
8.1. Format of Performance data
Always send int or float data as performance data. Do not attach a unit. Write temp instead of "%0.2fC" % temp!
If you need to omit fields in the middle of the data list (e.g. warn or crit), add a None instead, for example [("usage", usage, None, None, 0, size)].
If you need to omit fields at the end, simply omit them. Do not add trailing Nones.
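A minimal sketch of these rules. The field order (name, value, warn, crit, min, max) is inferred from the usage example above and is an assumption, as are the function names.

```python
def perfdata_usage(usage, size):
    # warn and crit are omitted in the middle, so None keeps the
    # positions of min (0) and max (size) intact.
    return [("usage", usage, None, None, 0, size)]


def perfdata_temp(temp, warn, crit):
    # The value is passed as a plain number, without a unit attached.
    # min and max are not known, so they are simply left out at the end.
    return [("temp", temp, warn, crit)]


def strip_trailing_none(entry):
    """Drop None fields at the end of an entry, keeping name and value."""
    fields = list(entry)
    while len(fields) > 2 and fields[-1] is None:
        fields.pop()
    return tuple(fields)
```

strip_trailing_none leaves None values in the middle untouched and only shortens the entry from the end.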
Naming of performance data variables:
8.2. Performance data flag
8.3. PNP Graph definition
Each check returning performance data must have a dedicated PNP graph definition in pnp-templates. If the check has warning and critical levels, the graph must display these levels as yellow and red lines.
PNP graphs should always use the consolidation function MAX (there are some rare exceptions where only MIN makes sense).
However: the Average value which is printed in the labelling of the graph must use the consolidation function AVERAGE. Using MAX would compute the average of the maximum values - which is totally useless.
Each check returning performance data should have a Perf-O-Meter. For checks which are part of Check_MK the Perf-O-Meter must be defined in web/plugins/perfometer/check_mk.py. For third-party checks it should be defined in a separate file in web/plugins/perfometer.
8.5. SNMP based checks
Only use numeric OIDs in your checks. Name-based OIDs rely on MIB files and the check won't work when the MIB files are not in place. Always have your OIDs start with a root, for example: .1.3.6.1.4.1
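Whether an OID is written in purely numeric form can be tested with a short pattern. This is a sketch under the assumption that "start with a root" means a leading dot followed by numeric arcs; is_numeric_oid is a hypothetical helper.

```python
import re

# A leading dot, then one or more numeric arcs separated by dots.
_NUMERIC_OID = re.compile(r"\.\d+(\.\d+)*")


def is_numeric_oid(oid):
    """Return True for OIDs such as '.1.3.6.1.4.1', False for MIB names."""
    return _NUMERIC_OID.fullmatch(oid) is not None
```

Name-based forms such as ENTITY-MIB::entPhysicalName and OIDs without the leading dot are rejected.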
Each check must have a check man page. This should be:
Information that must be contained in the check description:
10. Service Descriptions
Checks doing the same should always have the same (consistent) service description. Examples:
Here are some frequent errors and further mixed guidelines:
If your check is accompanied by an agent plugin, you should observe the following rules:
A check which does not get the information needed to decide whether or not the check is OK must simply return None. This can be the case when a check with an item cannot find the data matching this item in the agent output or SNMP data. Another possible situation is when the data provided by the agent or SNMP is completely empty.
When a check returns None, Check_MK will produce an UNKNOWN state with a state output which tells the user that this thing could not be found.
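A minimal sketch of a check function that follows this rule. The check name and the layout of info are hypothetical.

```python
def check_foo_fan(item, _no_params, info):
    # Assumed layout: each line is [fan_name, fan_status].
    for fan_name, fan_status in info:
        if fan_name == item:
            return 0, "Status: %s" % fan_status
    # Item not found, or info is empty: no result can be computed.
    return None
```

The function does not try to invent a state or a message of its own when the item is missing.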
The state markers (!) and (!!) must only be used in checks which can go warning or critical for several different reasons, like sub-checks.
Your check must also work with Nagios as Core. If you use functions or variables from *.include files then you must declare them in check_info in the key "includes" and you must then test your check with Nagios as the core.
SNMP based checks should always contain information about used MIBs and textual descriptions, e.g.
'snmp_info' : [(".1.3.6.1.2.1.47.1.1.1.1", [
    OID_END,
    "2", # ENTITY-MIB::entPhysicalDescr
    "5", # ENTITY-MIB::entPhysicalClass
    "7", # ENTITY-MIB::entPhysicalName
]),
12. Forbidden Things
Never use a global import statement in a check file.
Do not use datetime for date/time parsing. Use time. It can do all you need, really!!!
The arguments params and info that the check / discovery function is being called with must not be modified!
Do not use any other modules, except: sys, os, time, socket
If you really need regular expressions, use the function regex(). Do not use re directly.
Neither the check function nor the inventory function may use the print command, or otherwise output any data to stdout or stderr, or communicate with the outside world in any other way. A rare exception to this are checks which need a dedicated data storage (such as logwatch: it keeps unread log messages in files).
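The rule about not modifying params and info can be followed by working on copies. The sketch below assumes that params is a dictionary and that info is a list of lines; the helper names and the default levels are hypothetical.

```python
def effective_params(params):
    # Build a new dictionary instead of calling params.update() or
    # params.setdefault(), which would change the caller's object.
    merged = {"levels": (80.0, 90.0)}  # hypothetical defaults
    merged.update(params)
    return merged


def sorted_lines(info):
    # sorted() returns a new list, whereas info.sort() would reorder
    # the original list in place.
    return sorted(info)
```

Both helpers leave the objects passed in by Check_MK exactly as they were received.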
13. Groups of checks checking the same quantity on various devices
13.1. Temperature checks
The item name should reflect the kind of temperature being monitored. Please refer to the following table to make sure that the same kinds of temperatures get the same item.
To ensure that all temperature checks work in the same way, use the check_temperature function in temperature.include.
The check group should be temperature.
check_temperature can handle device levels and status in various ways configurable in the temperature WATO rule. Do not pass both device status and device levels to check_temperature - if a device provides levels, pass those and not the status.
Some devices can output temperature in various units, and specify which unit it is. In those cases, pass the temperature in the unit the device states, along with the unit as the dev_unit parameter to check_temperature.
Some devices have a very large number of similar temperature sensors, where one item per sensor would be unreasonable. (Dozens of ambient temperature sensors in a small device do not really provide more information than a single one.) In those cases, use the check_temperature_list function defined in temperature.include. Use the temperature check group just as you would for regular temperature checks.
13.2. Simple memory checks
Many devices report memory usage in a simple way: used and total memory in absolute terms, or, equivalently, used and free memory in absolute terms.
To ensure uniform behaviour, all these checks should use the check_memory function defined in memory.include.
The check group should be memory_simple. Note that this requires that the check has an item. For devices with no modules, (i.e. only one memory value) the item should be the empty string.
14. Check list
Some things to check before you release your check into the public (check your check ;-))
Make sure that all possible conditions under which the check goes warn or crit really work. You can do this by faking the agent output.
Are default levels (factory settings) documented? Why have you chosen these levels and not other ones?