Event Console - HA Replication
Last updated: November 15, 2012
1. High Availability and the Event Console
If you are using the Check_MK Event Console at an enterprise level then you will probably want it to be highly available. A common way to do this is to set up an HA cluster with Heartbeat or Pacemaker, introduce a common filesystem using DRBD or an external storage, and let the cluster software automatically manage processes, a service IP address and storage.
For those who do not like that approach the Event Console now offers an alternative HA setup, which has some advantages:
It also has the following disadvantages (and probably more), which I do not want to conceal from you:
2.1. Steps on the master server
2.2. Steps on the slave server
2.3. Steps on your snmptrap / syslogd
If everything is working then you should always see the same list of events in the Multisite GUI of the slave server as on the master - with a short delay of up to a couple of seconds.
3. How it works
3.1. Regular Synchronization
As soon as you enable replication in the global settings of an Event Console you make it a replication slave. It will now pull the current event state as well as the configured rules and actions from its master at regular intervals. Other global settings will not be replicated. If you are working with service levels then please make sure that the same levels are defined on the master and the slave.
Replication is done via TCP. The slave builds up a new connection for each synchronization. The reasons for this approach are:
If the replication is working correctly then you will always see an (almost) up-to-date event status in the Multisite GUI of the slave - albeit with some restrictions:
3.2. Manual and automatic takeover
What happens if the slave cannot reach and successfully synchronize with the master? Well, that's up to you. You can either let the slave take over automatically if the master is not reachable for a certain time span, or you can do a manual takeover with a button on the new page Server Status in the WATO module for the Event Console:
This brings the slave from the sync mode into the takeover mode. In this mode there are two differences:
Please note: by taking over, the slave does not become a master! It is still using the configured rules and actions that it fetched from the master. The master does not become a slave when it is back again. The slave console's task is to be an intermediate solution until the master is available again - not to be the new master.
3.3. Manual and automatic fall back
Once your master console is back and running again you might want to switch back to it. Well - switching to it might not be the correct term. The master does not even know that there is a slave console. As soon as it's back it will continue processing messages anyway - not knowing anything about messages that it has missed in the meantime. Switching is something only the slave has to do.
Switching back from takeover to sync mode can be done manually at the same place as switching forth. The slave will then lock actions and updates of events, stop processing new events and try to synchronize with the master again. I'll show later how an automatic fall back can be configured.
Note: When switching back to sync mode any new events and updates of existing events that happened on the slave while being in takeover mode will be lost. If this is not an option for you then you can manually transfer back the updated event status from the slave to the master:
4. Further Configuration
As promised I'll say a few words on the further possible configuration options now:
4.1. Replication Interval
Here you can configure how often the slave synchronizes with the master. Please note that the synchronization is not incremental: the complete event status is sent every time (i.e. all open, acknowledged, counting and delayed events). In addition, after each reload of the master all rules and actions will be transferred once. If your typical number of current events is low then you can afford a faster synchronization rate, which leads to a more up-to-date slave in case of a crash. How long a synchronization takes in your setup (including network and processing time) is displayed in the sidebar snapin Event Console Performance.
4.2. Connection Timeout
4.3. Automatic takeover
If you enable this option then the slave will automatically take over and enable event processing if the master is unreachable for the configured number of seconds. The idea is to set this time to a value where a short restart of the master will not yet lead to a takeover.
Note: The resolution of this setting is defined by the synchronization interval. After each failed synchronization the slave simply checks how long ago the last successful synchronization was and compares that with this setting.
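To illustrate what this resolution means in practice, here is a minimal sketch of the check described in the note. It is not taken from the Event Console source: the function names, the parameter names and the use of ">=" for the comparison are assumptions made for illustration only.

```python
def should_take_over(last_success, now, takeover_after):
    """Check done after a FAILED synchronization.

    last_success   -- time of the last successful synchronization (seconds)
    now            -- time of the failed attempt (seconds)
    takeover_after -- configured automatic takeover time (seconds)

    Hypothetical helper; whether the real daemon compares with ">=" or ">"
    is an assumption.
    """
    return (now - last_success) >= takeover_after


def first_takeover_time(sync_interval, takeover_after):
    """Simulate a master that dies right after a successful sync at t=0.

    The slave only checks when a synchronization attempt fails, i.e. at
    sync_interval, 2 * sync_interval, ... Returns the time of the attempt
    at which the slave would take over.
    """
    if sync_interval <= 0:
        raise ValueError("sync_interval must be positive")
    now = 0
    while True:
        now += sync_interval  # next (failing) synchronization attempt
        if should_take_over(0, now, takeover_after):
            return now


# With a 30 second interval and a takeover time of 45 seconds the check
# only runs at t=30 and t=60, so the takeover happens at t=60, not t=45.
print(first_takeover_time(30, 45))
```

So with a long replication interval the effective takeover time can be noticeably longer than the configured one. A shorter interval brings the two closer together.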
4.4. Automatic fallback
If the slave is in takeover mode and this option is enabled, then it will still try to synchronize with the master. If this is successful within the configured time span then the slave switches automatically back to sync mode. The idea behind this is to have the slave tolerate short downtimes of the master automatically while avoiding an automatic fall back (and possible loss of event status data) after a longer downtime.
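The decision described above can be sketched roughly as follows. Again this is only an illustration, not the actual implementation: the function name, the mode strings "sync" and "takeover", and the assumption that the time span is measured from the moment of the takeover are all mine.

```python
def mode_after_successful_sync(mode, takeover_started, now, fallback_within):
    """Decide the slave's mode after a synchronization attempt SUCCEEDED.

    mode             -- current mode, "sync" or "takeover" (hypothetical names)
    takeover_started -- time at which the slave took over (seconds)
    now              -- time of the successful synchronization (seconds)
    fallback_within  -- configured fallback time span in seconds,
                        or None if automatic fallback is disabled

    Assumption: the time span is counted from the takeover itself.
    """
    if mode != "takeover":
        return mode  # already in sync mode, nothing to decide
    if fallback_within is None:
        return "takeover"  # option disabled: only a manual fall back is possible
    if now - takeover_started <= fallback_within:
        return "sync"  # short outage: fall back automatically
    return "takeover"  # long outage: keep the events collected on the slave


# Master came back 60 seconds after the takeover, limit is 120 seconds:
print(mode_after_successful_sync("takeover", 0, 60, 120))
# Master came back after 10 minutes: the slave stays in takeover mode.
print(mode_after_successful_sync("takeover", 0, 600, 120))
```

The second case is the one where the slave has probably collected a relevant number of events, so it waits for you to decide what to do with them.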
4.5. Currently disable replication
4.6. Log replication events
With this option enabled every replication action (not only problems and mode switches) will be logged into the daemon log file (OMD: var/log/mkeventd.log). If you have a short replication interval then this can produce lots of messages. That's why it's off by default. We propose enabling the extended logging and watching the log file while you set up the replication and during a test phase. In production it is better to switch it off.