Troubleshooting Zabbix
All of the previous Q&As cover some of the most common issues new users might encounter. There are a lot of other issues you might run into, and with new versions of Zabbix, new issues will appear. While it's good to have quick solutions to common problems, let's look at some details that could be helpful when debugging Zabbix problems.
The Zabbix log file format
One of the first places we should check when there's an unexplained issue is log files. This is not just a Zabbix-specific thing; log files are great. Sometimes. Other times, they do not help, but we will discuss some other options for when log files do not provide the answer. To be able to find the answer, though, it is helpful to know some basics about the log file format. The Zabbix log format is as follows:
PPPPPP:YYYYMMDD:HHMMSS.mmm
Here, PPPPPP is the process ID, space-padded to six characters, YYYYMMDD is the current date, HHMMSS is the current time, and mmm is milliseconds for the timestamp. The colons and the dot are literal symbols. This prefix is followed by a space and then by the actual log message. Here's an example log entry:
10372:20151223:134406.865 database is down: reconnecting in 10 seconds
If there's a line in the log file without this prefix, it is most likely coming from an external source, such as a script, or maybe from some library, such as Net-SNMP.
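Knowing this prefix layout also makes it easy to slice the log with standard tools. As a minimal sketch (the log path, PID, and date are purely illustrative; adjust them to your setup), the following extracts every message a given process logged on a given day:
# awk -F: '$1+0 == 10372 && $2 == "20151223"' /tmp/zabbix_server.log
The numeric comparison on the first field takes care of the space padding in the PID column, and lines without the standard prefix simply will not match.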
During startup, output similar to the following will be logged:
3737:20181208:111546.489 Starting Zabbix Server. Zabbix 4.0.2 (revision 87228).
3737:20181208:111546.489 ****** Enabled features ******
3737:20181208:111546.489 SNMP monitoring: YES
3737:20181208:111546.489 IPMI monitoring: YES
3737:20181208:111546.489 Web monitoring: YES
3737:20181208:111546.489 VMware monitoring: YES
3737:20181208:111546.489 SMTP authentication: YES
3737:20181208:111546.489 Jabber notifications: YES
3737:20181208:111546.489 Ez Texting notifications: YES
3737:20181208:111546.489 ODBC: YES
3737:20181208:111546.489 SSH2 support: YES
3737:20181208:111546.489 IPv6 support: YES
3737:20181208:111546.489 TLS support: YES
3737:20181208:111546.489 ******************************
3737:20181208:111546.489 using configuration file: /etc/zabbix/zabbix_server.conf
3737:20181208:111546.500 current database version (mandatory/optional): 04000000/04000003
3737:20181208:111546.500 required mandatory version: 04000000
The first line prints out the daemon type and version. Depending on how it was compiled, it might also include the current SVN revision number. A list of the compiled-in features follows. This is very useful to know whether you should expect SNMP, IPMI, or VMware monitoring to work at all. Then, the path to the currently-used configuration file is shown—helpful when we want to figure out whether the file we changed was the correct one. In the server and proxy log files, both the current and the required database versions are present—we discussed those in Chapter 20, Zabbix Maintenance.
After the database versions, the internal process startup messages can be found:
3737:20181208:111546.507 server #0 started [main process]
3747:20181208:111546.517 server #6 started [timer #1]
3748:20181208:111546.518 server #7 started [http poller #1]
3743:20181208:111546.518 server #2 started [alerter #1]
3744:20181208:111546.518 server #3 started [alerter #2]
3745:20181208:111546.518 server #4 started [alerter #3]
3749:20181208:111546.519 server #8 started [discoverer #1]
3750:20181208:111546.529 server #9 started [history syncer #1]
3746:20181208:111546.529 server #5 started [housekeeper #1]
3742:20181208:111546.529 server #1 started [configuration syncer #1]
3769:20181208:111546.529 server #28 started [trapper #5]
3771:20181208:111546.531 server #30 started [alert manager #1]
3754:20181208:111546.532 server #13 started [escalator #1]
3756:20181208:111546.533 server #15 started [proxy poller #1]
3757:20181208:111546.535 server #16 started [self-monitoring #1]
3758:20181208:111546.535 server #17 started [task manager #1]
3761:20181208:111546.535 server #20 started [poller #3]
3764:20181208:111546.546 server #23 started [unreachable poller #1]
3765:20181208:111546.556 server #24 started [trapper #1]
3755:20181208:111546.558 server #14 started [snmp trapper #1]
3763:20181208:111546.558 server #22 started [poller #5]
3772:20181208:111546.570 server #31 started [preprocessing manager #1]
3766:20181208:111546.570 server #25 started [trapper #2]
3751:20181208:111546.572 server #10 started [history syncer #2]
3753:20181208:111546.572 server #12 started [history syncer #4]
3759:20181208:111546.572 server #18 started [poller #1]
3762:20181208:111546.584 server #21 started [poller #4]
3767:20181208:111546.594 server #26 started [trapper #3]
3768:20181208:111546.596 server #27 started [trapper #4]
3770:20181208:111546.598 server #29 started [icmp pinger #1]
3752:20181208:111546.599 server #11 started [history syncer #3]
3760:20181208:111546.599 server #19 started [poller #2]
3774:20181208:111547.136 server #33 started [preprocessing worker #2]
3773:20181208:111547.162 server #32 started [preprocessing worker #1]
3775:20181208:111547.162 server #34 started [preprocessing worker #3]
There will be many more lines like these; the output here is trimmed. This might help verify that the expected number of processes of some type has been started. When looking at log file contents, it is not always obvious which process logged a specific line, and this is where the startup messages can help. If we see a line such as the following, we can find out which process logged it:
21974:20151231:184520.117 Zabbix agent item "vfs.fs.size[/,free]" on host "A test host" failed: another network error, wait for 15 seconds
We can do that by looking for the startup message with the same PID:
# grep 21974 zabbix_server.log | grep started
21974:20151231:184352.921 server #8 started [unreachable poller #1]
Note
If more than one line is returned, apply common sense to determine which one is the actual startup message.
This demonstrates that hosts are deferred to the unreachable poller after the first network failure.
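If you find yourself doing this lookup often, a tiny helper can map any PID straight to its startup message. This is only a sketch: the function name is made up, and the log path should match your LogFile setting:
# whichproc() { grep "^ *$1:" /tmp/zabbix_server.log | grep started; }
# whichproc 21974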
But what if the log file has been rotated and the original startup messages are lost? Besides more advanced detective work, there's a simple method, provided that the daemon is still running. We will look at that method a bit later in this chapter, in the Runtime process status section.
Reloading the configuration cache
We met the configuration cache in Chapter 2, Getting Your First Notification, and we discussed ways to monitor it in Chapter 20, Zabbix Maintenance. While it helps a lot performance-wise, it can be a bit of a problem if we are trying to quickly test something. It is possible to force the Zabbix server to reload the configuration cache.
Run the following to display the Zabbix server options:
# zabbix_server --help
Note
We briefly discussed Zabbix proxy configuration cache-reloading in Chapter 17, Using Proxies to Monitor Remote Locations.
In the output, look for the runtime control options section:
-R --runtime-control runtime-option   Perform administrative functions

Runtime control options:
  config_cache_reload             Reload configuration cache
Thus, reloading the server configuration cache can be initiated by the following:
# zabbix_server --runtime-control config_cache_reload
zabbix_server [2682]: command sent successfully
Examining the server log file will reveal that it has received the signal:
forced reloading of the configuration cache
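To confirm the reload from the command line, we can simply search the server log for that message, for example like this (the log file location is an assumption; check the LogFile parameter of your server):
# grep "forced reloading of the configuration cache" /tmp/zabbix_server.log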
In the background, the sending of the signal happens like this:
- The server binary looks up the default configuration file
- It then looks for the file specified in the PidFile option
- It sends the signal to the process with that ID
As discussed in Chapter 17, Using Proxies to Monitor Remote Locations, the great thing about this feature is that it's also supported for active Zabbix proxies. Even better, when an active proxy is instructed to reload its configuration cache, it connects to the Zabbix server, gets all the latest configuration, and then reloads the local configuration cache. If such a signal is sent to a passive proxy, it ignores the signal.
What if you have several proxies running on the same system—how can you tell the binary which exact instance should reload the configuration cache? Looking back at the steps that were taken to deliver the signal to the process, all that is needed is to specify the correct configuration file. If running several proxies on the same system, each must have its own configuration file already, specifying different PID files, log files, listening ports, and so on. Instructing a proxy that used a specific configuration file to reload the configuration cache would be this simple:
# zabbix_proxy -c /path/to/zabbix_proxy.conf --runtime-control config_cache_reload
Note
The full, absolute path must be provided for the configuration file; a relative path is not supported. The same principle applies to servers and proxies alike, but it is even less common to run several Zabbix servers on the same system.
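As a hypothetical illustration, with two proxies whose configuration files are /etc/zabbix/zabbix_proxy1.conf and /etc/zabbix/zabbix_proxy2.conf (these paths are made up for this example), each instance is addressed through its own configuration file:
# zabbix_proxy -c /etc/zabbix/zabbix_proxy1.conf --runtime-control config_cache_reload
# zabbix_proxy -c /etc/zabbix/zabbix_proxy2.conf --runtime-control config_cache_reload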
Manually reloading the configuration cache is useful if we have a large Zabbix server instance and have significantly increased the CacheUpdateFrequency parameter.
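For example, with a line like the following in zabbix_server.conf (the value is purely illustrative), the server would refresh its configuration cache on its own only every five minutes, and the manual reload covers the cases where we cannot wait that long:
CacheUpdateFrequency=300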
Controlling running daemons
A configuration cache reload is only one of the available runtime control options. Let's look at the remaining ones:
housekeeper_execute             Execute the housekeeper
log_level_increase=target       Increase log level, affects all processes if target is not specified
log_level_decrease=target       Decrease log level, affects all processes if target is not specified

Log level control targets:
  pid                           Process identifier
  process-type                  All processes of specified type (for example, poller)
  process-type,N                Process type and number (e.g., poller,3)
As discussed in Chapter 20, Zabbix Maintenance, the internal housekeeper is first run 30 minutes after the server or proxy startup. The housekeeper_execute runtime option allows us to run it at will:
# zabbix_server --runtime-control housekeeper_execute
Even more interesting is the ability to change the log level for a running process. This feature first appeared in Zabbix 2.4, and it made debugging much, much easier. Zabbix daemons are usually started and just work—until we have to change something. While we cannot tell any of the daemons to reread their configuration file, there are a few more options that allow us to control some aspects of a running daemon. As briefly mentioned in Chapter 20, Zabbix Maintenance, the DebugLevel parameter allows us to set the log level when the daemon starts, with the default being 3. Log level 4 adds all the SQL queries, and log level 5 also adds the received content from web monitoring and VMware monitoring.
For the uninitiated, anything above level 3 can be very surprising and intimidating. Even a very small Zabbix server can easily log dozens of megabytes in a few minutes at log level 4. As some problems might not appear immediately, you might have to run it for hours or days at log level 4 or 5. Imagine dealing with gigabytes of logs you are not familiar with. The ability to set the log level for a running process allows us to increase the log level during a problem situation and lower it later, without requiring a daemon restart.
Even better, when using the runtime log level feature, we can select which exact components should have their log level changed. Individual processes can be identified by either their system PID or by the process number inside Zabbix. Specifying processes by the system PID could be done like this:
# zabbix_server --runtime-control log_level_increase=1313
Specifying an individual Zabbix process is done by choosing the process type and then passing the process number:
# zabbix_server --runtime-control log_level_increase=trapper,3
A fairly useful and common approach is changing the log level for all processes of a certain type—for example, we don't know which trapper will receive the connection that causes the problem, so we could easily increase the log level for all trappers by omitting the process number:
# zabbix_server --runtime-control log_level_increase=trapper
And if no parameter is passed to this runtime option, it will affect all Zabbix processes:
# zabbix_server --runtime-control log_level_increase
When processes are told to change their log level, they log an entry about it and then change the log level:
21975:20151231:190556.881 log level has been increased to 4 (debug)
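Since every affected process logs such a line, counting these entries is a quick sanity check that the command reached the processes you expected (the log path is an assumption; adjust it to your installation):
# grep -c "log level has been increased" /tmp/zabbix_server.log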
Note that there is no way to query the current log level or set a specific level. If you are not sure about the current log level of all the processes, there are two ways to sort it out:

- Restart the daemon
- Decrease or increase the log level 5 times so that it's guaranteed to be at 0 or 5, then set the desired level

As a simple test of the options we just explored, increase the log level for all pollers:
# zabbix_server --runtime-control log_level_increase=poller
Open a tail on the Zabbix server log file:
# tail -f /tmp/zabbix_server.log
Notice the amount of data that just 5 poller processes on a tiny Zabbix server can generate. Then decrease the log level:
# zabbix_server --runtime-control log_level_decrease=poller
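Building on the reset trick mentioned earlier, a small shell sketch like the following (the process type and target level are assumptions) first decreases the level five times to force it down to 0 and then increases it to the desired value, here 4:
# for i in 1 2 3 4 5; do zabbix_server --runtime-control log_level_decrease=poller; done
# for i in 1 2 3 4; do zabbix_server --runtime-control log_level_increase=poller; done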
Runtime process status
Zabbix has another small trick to help with debugging. Run top and see which sorting mode gives you a more stable and longer list of Zabbix processes: sorting by processor usage (hit Shift + P) or by memory usage (hit Shift + M).
Note
Alternatively, hit o and type COMMAND=zabbix_server.
Press C and notice how the Zabbix processes have updated their command lines to show which exact internal process each one is and what it is doing, as we can see here:
zabbix_server: poller #1 [got 0 values in 0.000005 sec, idle 1 sec]
zabbix_server: poller #4 [got 1 values in 0.000089 sec, idle 1 sec]
zabbix_server: poller #5 [got 0 values in 0.000004 sec, idle 1 sec]
Follow their status and see how the task and the time it takes change for some of the processes. We can also get output that can be redirected or filtered through other commands:
# top -c -b | grep zabbix_server
The -c option tells top to show the command line, the same thing we achieved by hitting C before. The -b option tells top to run in batch mode, without accepting input and just outputting the results. We could also specify -n 1 to run it only once, or specify any other number as needed.
It might be more convenient to use ps:
# ps -f -C zabbix_server
The -f flag enables full output, which includes the command line. The -C flag filters by the executable name:
zabbix 21969 21962 0 18:43 ? 00:00:00 zabbix_server: poller #1 [got 0 values in 0.000006 sec, idle 1 sec]
zabbix 21970 21962 0 18:43 ? 00:00:00 zabbix_server: poller #2 [got 0 values in 0.000008 sec, idle 1 sec]
zabbix 21971 21962 0 18:43 ? 00:00:00 zabbix_server: poller #3 [got 0 values in 0.000004 sec, idle 1 sec]
The full format prints out some extra columns—if all we needed was the PID and the command line, we could limit the columns in the output with the -o flag, like this:
# ps -o pid=,command= -C zabbix_server
21975 zabbix_server: trapper #1 [processed data in 0.000150 sec, waiting for connection]
21976 zabbix_server: trapper #2 [processed data in 0.001312 sec, waiting for connection]
Note
The equals sign after pid and command tells ps not to use any header for these columns.
And to see a dynamic list that shows the current status, we can use the watch command:
# watch -n 1 'ps -o pid=,command= -C zabbix_server'
This list will be updated every second. Note that the interval parameter, -n, also accepts decimals, so to update twice every second, we could use -n 0.5.
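If only one process type is of interest, the ps output can be filtered further; this sketch (the process type is illustrative) watches just the pollers:
# watch -n 1 'ps -o pid=,command= -C zabbix_server | grep poller'
Note that a simple pattern like this also matches the http, proxy, and unreachable pollers; tighten the pattern if that is not what you want.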
This is also the method to find out which PID corresponds to which process type if startup messages are not available in the log file—we can see the process type and PID in the output of top or ps.
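For example, a one-liner like this (a sketch; the PID is illustrative) maps a PID straight to its process type without consulting the log file at all:
# ps -o pid=,command= -C zabbix_server | awk '$1 == 21974'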