Monitoring VMWare ESX with Check_MK
Last updated: April 30, 2013
Monitoring VMWare ESX host systems and virtual machines has always been difficult, but with the introduction of ESXi and the strangulation of the command line and the /proc filesystem it is now impossible to install any (useful) software on a host system.
Some people have tried SNMP, and there is even an SNMP agent that can be activated using the command line. But SNMP is deprecated, unsupported and - worst of all - provides only little and mostly useless data.
What remains is the "vSphere API". This is in fact an HTTP-based protocol for querying and managing either a standalone host system or a vCenter. For use with Linux there is a Perl API and a Python API, both of which make the underlying protocol available to scripts. Traditional Nagios plugins like check_esx3.pl or check_vmware_api.pl make use of the Perl API. They all have one major drawback: they consume immense CPU resources on your monitoring host. Some people even set up a second monitoring server just to offload the ESX checks! The problem is rooted in the Perl API not being very fast and - most importantly - in the fact that each single metric has to be queried with a separate call of the check plugin.
As of version 1.2.3i1 Check_MK now overcomes these problems by implementing a completely new monitoring plugin for VMWare ESX and ESXi: the Check_MK vSphere Agent. Of course it is not only much faster but also supports automatic service discovery (inventory). As usual with Check_MK one call of the plugin per check interval per host system (or per vCenter) is sufficient.
Currently the agent supports the following metrics:
2. Monitoring Host Systems
2.1. Prerequisites
Before you can start you need to make sure that...
Note1: Currently the agent does not support ESX version 4.1 or earlier. Those versions lack some information. It's probably not a big deal to account for that in the agent. Patches are welcome ;-)
Note2: You can get PySphere from http://pysphere.googlecode.com. Please read the enclosed README file for how to install PySphere. If you are using OMD then the good news is: everything is already in place, do not download PySphere, do not install it.
3. Doing the configuration with WATO
Now you can add your ESX hosts (not the VMs for now) to Check_MK. If you are using WATO then please specify Check_MK Agent (Server) as Agent type, even if there is no Check_MK agent on your ESX host. Command line users simply add the host to all_hosts. Let's assume that its name is esxhost01.
Then you need to configure Check_MK to use the vSphere special agent as a datasource program instead of the normal Check_MK agent access. This "special agent" is a Python program that runs locally on your Check_MK server. In WATO this is done with the rule Datasource Programs / Check state of VMWare ESX via vSphere.
Besides the obvious user name and password for your VMware user you can select an alternative TCP port (rarely used), a timeout for connecting to the ESX host and - most important - the list of data sources that you want to monitor. Please note that even if the new Check_MK ESX agent might be the fastest way to monitor ESX with Nagios, the check will take some time anyway - especially if your ESX host is busy and has many VMs running on it. In that case you could decide to remove some of the information in order to speed up the monitoring.
Also important is Type of query. Here you have the following choices:
Queried host is a host system: Most common case: you directly query an ESX host system.
Queried host is the vCenter: You query a vCenter. You'll get information about all host systems and VMs in this case - at the price of a longer check execution time.
Queried host is the vCenter with Check_MK Agent installed: Same, but also the Check_MK agent on the host is being queried. Note: the vCenter most probably is running on a Windows machine. If you want to monitor that machine with Check_MK as well, then select this option.
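The three query types correspond to options of the agent_vsphere program, as listed in its --help output: --direct for a directly queried host system and --agent for additionally querying the normal Check_MK agent. The following is only a minimal sketch of that mapping. The helper build_agent_command and the keys of QUERY_TYPE_FLAGS are hypothetical names chosen for illustration and are not part of Check_MK.

```python
# Hypothetical helper, not part of Check_MK: maps the three
# "Type of query" choices to agent_vsphere command line options.
QUERY_TYPE_FLAGS = {
    "host_system": ["--direct"],         # directly queried ESX host system
    "vcenter": [],                       # vCenter, no extra option
    "vcenter_with_agent": ["--agent"],   # vCenter plus normal Check_MK agent
}


def build_agent_command(host, user, secret, query_type, timeout=60):
    """Return the agent_vsphere argument list for one queried host."""
    flags = QUERY_TYPE_FLAGS[query_type]
    return (["agent_vsphere", "-u", user, "-s", secret,
             "--timeout", str(timeout)] + flags + [host])


# Example values taken from the command line example in this article.
print(" ".join(build_agent_command("10.1.1.111", "harri", "EnIgMa",
                                   "host_system", timeout=5)))
```

The default timeout of 60 seconds follows the --help text. All other options (modules, hostname) are left out to keep the sketch short.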
4. Setup without WATO
If you do not like WATO you can also set things up on the command line. Create a rule in datasource_programs in that case. The program to call is agent_vsphere and you'll find it in /usr/share/check_mk/agents/special (default path for manual installations) or share/check_mk/agents/special (OMD users).
You can call this program manually with --help, if you want:
OMD[mysite]:~$ share/check_mk/agents/special/agent_vsphere --help
Check_MK vSphere Agent

USAGE: agent_vsphere [OPTIONS] HOST
       agent_vsphere -h

ARGUMENTS:
  HOST                          Host name or IP address of vCenter or VMWare HostSystem

OPTIONS:
  -h, --help                    Show this help message and exit
  -u USER, --user USER          Username for vSphere login
  -s SECRET, --secret SECRET    Secret/Password for vSphere login
  -D, --direct                  Assume a directly queried host system (no vCenter). In
                                This we expect data about only one HostSystem to be
                                Found and do not create piggy host data for that host.
  -H, --hostname                Specify a hostname. This is neccessary if this is
                                different from HOST. It is being used in --direct
                                mode as the name of the host system when outputting
                                its power state.
  -a, --agent                   Also retrieve data from the normal Check_MK Agent.
                                This makes sense if you query a vCenter that is
                                Installed on a Windows host that you also want to
                                Monitor with Check_MK.
  -t, --timeout SECS            Set the network timeout to vSphere to SECS seconds.
                                This is also used when connecting the agent (option -a).
                                Default is 60 seconds. Note: the timeout is not only
                                applied to the connection, but also to each individual
                                subquery.
  --debug                       Debug mode: let Python exceptions come through
  -i MODULES, --modules MODULES
                                Modules to query. This is a comma separated list of
                                hostsystem, virtualmachine and storage. Default is to
                                query all modules.

Here is an example of how to call this program:

OMD[mysite]:~$ share/check_mk/agents/special/agent_vsphere -u 'harri' -s 'EnIgMa' \
    -i hostsystem,virtualmachine,datastore,counters --direct \
    --hostname 'esxhost01' --timeout 5 10.1.1.111
<<<check_mk>>>
Version: 5.0
AgentOs: VMware ESXi
<<<esx_vsphere_datastores:sep(9)>>>
[esxabc01-lds]
accessible True
capacity 578478407680
freeSpace 388398841856
type VMFS
uncommitted 51973812224
If you've got your command line right you can add this to main.mk:
datasource_programs.append((
    "share/check_mk/agents/special/agent_vsphere -u 'harri' -s 'EnIgMa' "
    "-i hostsystem,virtualmachine,datastore,counters --direct "
    "--hostname '<HOST>' --timeout 5 <IP>",
    [ "esxhost01" ]
))
After that you should be able to do an inventory as usual:
OMD[mysite]:~$ cmk -I esxhost01
Check_mk version 2013.04.25
Calling external program /omd/sites/esx/share/check_mk/agents/special/agent_vsph
CPU utilization      OK - 1.5% used, 15min average: 0.7%, 0.48GHz/32.78GHz, 2 so
Disk IO SUMMARY      OK - 12.00kB/sec read, 45.00kB/sec write, IOs: 79.00/sec
Hardware Sensors     CRIT - VMware Rollup Health State: Red (Sensor is operating
HostSystem esx       OK - power state: poweredOn
Interface 0          OK - [vmnic0] (up) speed unknown, in: 8.2KBit/s(0.0%/1GBit/
Interface 1          OK - [vmnic1] (up) speed unknown, in: 0Bit/s(0.0%/1GBit/s),
Interface 2          OK - [vmnic2] (up) speed unknown, in: 0Bit/s(0.0%/1GBit/s),
Interface 3          OK - [vmnic3] (up) speed unknown, in: 0Bit/s(0.0%/1GBit/s),
Memory used          OK - 59% used - 14.21GB/23.99GB
Overall state        OK - Enity state: green, Power state: poweredOn
Uptime               OK - up since Mon Apr 8 09:10:00 2013 (22d 05:37:24)
VM LinuxI            WARN - power state: poweredOff, running on [esx.mathias-ket
VM LinuxII.foobar.de WARN - power state: poweredOff, running on [esx.mathias-ket
VM LinuxIII          WARN - power state: poweredOff, running on [esx.mathias-ket
VM LinuxIV           WARN - power state: poweredOff, running on [esx.mathias-ket
VM LinuxV            WARN - power state: poweredOff, running on [esx.mathias-ket
VM OpenSUSE_I        OK - power state: poweredOn, running on [esx.mathias-kettne
VM OpenSUSE_II       OK - power state: poweredOn, running on [esx.mathias-kettne
VM OpenSUSE_III      OK - power state: poweredOn, running on [esx.mathias-kettne
VM OpenSUSE_IV       OK - power state: poweredOn, running on [esx.mathias-kettne
VM OpenSUSE_V        OK - power state: poweredOn, running on [esx.mathias-kettne
VM WindowsXP         OK - power state: poweredOn, running on [esx.mathias-kettne
fs_zmucvm99-lds      OK - 32.9% used (177.03 of 538.8 GB), (levels at 80.0/90.0%
OK - Agent version 5.0, execution time 5.0 sec|execution_time=5.034
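To illustrate how the raw agent output relates to a filesystem check result, here is a small sketch that splits the output into its <<<...>>> sections and computes the used percentage of each datastore from capacity and freeSpace. It assumes that keys and values are separated by a tab, as the sep(9) in the section header suggests. The function names parse_sections and datastore_usage are hypothetical and this is not the code of the actual Check_MK check.

```python
def parse_sections(agent_output):
    """Split agent output into {section_name: [lines]}."""
    sections = {}
    current = None
    for line in agent_output.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            # Drop options such as ":sep(9)" from the section header.
            current = line[3:-3].split(":")[0]
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line)
    return sections


def datastore_usage(lines):
    """Return {datastore_name: percent_used} for esx_vsphere_datastores lines."""
    usage = {}
    name, values = None, {}
    for line in lines + ["[end]"]:  # sentinel flushes the last datastore
        if line.startswith("["):
            if name is not None:
                capacity = int(values["capacity"])
                free = int(values["freeSpace"])
                usage[name] = 100.0 * (capacity - free) / capacity
            name, values = line.strip("[]"), {}
        else:
            key, value = line.split(None, 1)  # tab or blank as separator
            values[key] = value
    return usage


# Sample taken from the example call of agent_vsphere in this article.
SAMPLE = """<<<check_mk>>>
Version: 5.0
AgentOs: VMware ESXi
<<<esx_vsphere_datastores:sep(9)>>>
[esxabc01-lds]
accessible\tTrue
capacity\t578478407680
freeSpace\t388398841856
type\tVMFS
uncommitted\t51973812224
"""

usage = datastore_usage(parse_sections(SAMPLE)["esx_vsphere_datastores"])
print(round(usage["esxabc01-lds"], 1))
```

For the sample values this yields about 32.9 percent used.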
5. Monitoring Virtual Machines
So far we've just monitored the physical host systems. But if you have set up the Check_MK vSphere agent as in our example then you're just a small step away from monitoring the VMs. You just need to know the names of the VMs (as configured in vCenter, not the DNS names) and then add hosts with exactly those names to the monitoring.
Note: if no Check_MK agent is installed in the virtual machines then you need to set the agent type to No Agent. main.mk users do this by adding the tag |ping to the host.
Then you just do an inventory on the VMs and you are done. Check_MK will use the data that has come piggyback from the ESX host. But the services themselves will be attached to the corresponding VM hosts - just as you most probably want it. Here is an example from the command line:
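For main.mk users, a host list along these lines is one way this could look. It is only a sketch: the host names are the example names used in this article, and all_hosts is initialised here merely so that the fragment runs on its own. In a real main.mk the variable already exists.

```python
# Sketch of a main.mk fragment. In a real main.mk, all_hosts is already
# defined by Check_MK; it is set here only so the snippet runs stand-alone.
all_hosts = []

all_hosts += [
    "esxhost01",        # ESX host system, queried via the vSphere special agent
    "vm_guest01|ping",  # VM without a Check_MK agent, hence the tag "ping"
]
```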
OMD[mysite]:~$ cmk -I vm_guest01
esx_vsphere_vm.cpu         1 new checks
esx_vsphere_vm.heartbeat   1 new checks
esx_vsphere_vm.mem_usage   1 new checks
esx_vsphere_vm.name        1 new checks
OMD[mysite]:~$ cmk -v vm_guest01
Check_mk version 2013.04.25
ESX CPU              OK - demand is 0.009 Ghz, 1 virtual CPUs
ESX Heartbeat        OK - Heartbeat status is green
ESX Memory           OK - Host: 2.21GB, Guest: 0.00B, Ballooned: 0.00B, Private:
ESX Name             OK - OpenSUSE_V
OK - execution time 0.0 sec|execution_time=0.001
5.1. Piggy back translation
Sometimes the naming scheme of your virtual machines does not really match that of your hosts in Check_MK. For that case Check_MK provides a rule called Hostname translation for piggybacked hosts. In WATO simply search for piggy in the rule search box. The rule translates ESX names into other names in a flexible way. Configure this rule for the vCenter/ESX host that is queried - not for the VMs!
The translation is done in four optional steps:
5.2. Debugging piggy back
If you wonder why some host is missing in your monitoring you can have a look at the directory where the piggybacked data is stored. In OMD this is tmp/check_mk/piggyback. In manual installations it should be located next to the cache directory of Check_MK. Below piggyback there should be one directory for each VM that resides on one of your monitored ESX hosts.
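A quick way to run this check is to compare the directory names below piggyback with the host names you expect. The following is a minimal sketch under the assumption that each piggybacked host appears as one subdirectory, as described above. The function name missing_piggyback_hosts is hypothetical and the path is the OMD path relative to the site directory.

```python
import os


def missing_piggyback_hosts(piggyback_dir, expected_hosts):
    """Return the expected host names that have no directory below piggyback_dir."""
    try:
        present = {
            entry for entry in os.listdir(piggyback_dir)
            if os.path.isdir(os.path.join(piggyback_dir, entry))
        }
    except OSError:
        # Directory does not exist (yet): no piggyback data has been stored.
        present = set()
    return sorted(set(expected_hosts) - present)


# Example: run from the OMD site directory; vm_guest01 is the example VM name.
print(missing_piggyback_hosts("tmp/check_mk/piggyback", ["vm_guest01"]))
```

Any name in the returned list is a host for which the queried ESX host or vCenter has not delivered data, for example because the VM name in Check_MK does not match the name configured in vCenter.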