Troubleshooting Zabbix
All of the previous Q&As cover some of the most common issues new users might encounter. There are a lot of other issues you might run into, and with new versions of Zabbix, new issues will appear. While it's good to have quick solutions to common problems, let's look at some details that could be helpful when debugging Zabbix problems.
The Zabbix log file format
One of the first places we should check when there's an unexplained issue is log files. This is not just a Zabbix-specific thing; log files are great. Sometimes. Other times, they do not help, but we will discuss some other options for when log files do not provide the answer. To be able to find the answer, though, it is helpful to know some basics about the log file format. The Zabbix log format is as follows:
PPPPPP:YYYYMMDD:HHMMSS.mmm
Here, PPPPPP is the process ID, space-padded to six characters, YYYYMMDD is the current date, HHMMSS is the current time, and mmm is milliseconds for the timestamp. The colons and the dot are literal symbols. This prefix is followed by a space and then by the actual log message. Here's an example log entry:
10372:20151223:134406.865 database is down: reconnecting in 10 seconds
If there's a line in the log file without this prefix, it is most likely coming from an external source, such as a script, or maybe from some library, such as Net-SNMP.
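Knowing this prefix layout also makes it easy to slice the log with standard tools. As a minimal sketch (the log path, PID, and date are purely illustrative; adjust them to your setup), the following extracts every message a given process logged on a given day:
# awk -F: '$1+0 == 10372 && $2 == "20151223"' /tmp/zabbix_server.log
The numeric comparison on the first field takes care of the space padding in the PID column, and lines without the standard prefix simply will not match.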
During startup, output similar to the following will be logged:
3737:20181208:111546.489 Starting Zabbix Server. Zabbix 4.0.2 (revision 87228).
3737:20181208:111546.489 ****** Enabled features ******
3737:20181208:111546.489 SNMP monitoring: YES
3737:20181208:111546.489 IPMI monitoring: YES
3737:20181208:111546.489 Web monitoring: YES
3737:20181208:111546.489 VMware monitoring: YES
3737:20181208:111546.489 SMTP authentication: YES
3737:20181208:111546.489 Jabber notifications: YES
3737:20181208:111546.489 Ez Texting notifications: YES
3737:20181208:111546.489 ODBC: YES
3737:20181208:111546.489 SSH2 support: YES
3737:20181208:111546.489 IPv6 support: YES
3737:20181208:111546.489 TLS support: YES
3737:20181208:111546.489 ******************************
3737:20181208:111546.489 using configuration file: /etc/zabbix/zabbix_server.conf
3737:20181208:111546.500 current database version (mandatory/optional): 04000000/04000003
3737:20181208:111546.500 required mandatory version: 04000000
The first line prints out the daemon type and version. Depending on how it was compiled, it might also include the current SVN revision number. A list of the compiled-in features follows. This is very useful to know whether you should expect SNMP, IPMI, or VMware monitoring to work at all. Then, the path to the currently-used configuration file is shown—helpful when we want to figure out whether the file we changed was the correct one. In the server and proxy log files, both the current and the required database versions are present—we discussed those in Chapter 20, Zabbix Maintenance.
After the database versions, the internal process startup messages can be found:
3737:20181208:111546.507 server #0 started [main process]
3747:20181208:111546.517 server #6 started [timer #1]
3748:20181208:111546.518 server #7 started [http poller #1]
3743:20181208:111546.518 server #2 started [alerter #1]
3744:20181208:111546.518 server #3 started [alerter #2]
3745:20181208:111546.518 server #4 started [alerter #3]
3749:20181208:111546.519 server #8 started [discoverer #1]
3750:20181208:111546.529 server #9 started [history syncer #1]
3746:20181208:111546.529 server #5 started [housekeeper #1]
3742:20181208:111546.529 server #1 started [configuration syncer #1]
3769:20181208:111546.529 server #28 started [trapper #5]
3771:20181208:111546.531 server #30 started [alert manager #1]
3754:20181208:111546.532 server #13 started [escalator #1]
3756:20181208:111546.533 server #15 started [proxy poller #1]
3757:20181208:111546.535 server #16 started [self-monitoring #1]
3758:20181208:111546.535 server #17 started [task manager #1]
3761:20181208:111546.535 server #20 started [poller #3]
3764:20181208:111546.546 server #23 started [unreachable poller #1]
3765:20181208:111546.556 server #24 started [trapper #1]
3755:20181208:111546.558 server #14 started [snmp trapper #1]
3763:20181208:111546.558 server #22 started [poller #5]
3772:20181208:111546.570 server #31 started [preprocessing manager #1]
3766:20181208:111546.570 server #25 started [trapper #2]
3751:20181208:111546.572 server #10 started [history syncer #2]
3753:20181208:111546.572 server #12 started [history syncer #4]
3759:20181208:111546.572 server #18 started [poller #1]
3762:20181208:111546.584 server #21 started [poller #4]
3767:20181208:111546.594 server #26 started [trapper #3]
3768:20181208:111546.596 server #27 started [trapper #4]
3770:20181208:111546.598 server #29 started [icmp pinger #1]
3752:20181208:111546.599 server #11 started [history syncer #3]
3760:20181208:111546.599 server #19 started [poller #2]
3774:20181208:111547.136 server #33 started [preprocessing worker #2]
3773:20181208:111547.162 server #32 started [preprocessing worker #1]
3775:20181208:111547.162 server #34 started [preprocessing worker #3]
There will be many more lines like these; the output here is trimmed. This might help verify that the expected number of processes of some type has been started. When looking at log file contents, it is not always obvious which process logged a specific line, and this is where the startup messages can help. If we see a line such as the following, we can find out which process logged it:
21974:20151231:184520.117 Zabbix agent item "vfs.fs.size[/,free]" on host "A test host" failed: another network error, wait for 15 seconds
We can do that by looking for the startup message with the same PID:
# grep 21974 zabbix_server.log | grep started
21974:20151231:184352.921 server #8 started [unreachable poller #1]
Note
If more than one line is returned, apply common sense to determine which one is the actual startup message.
This demonstrates that hosts are deferred to the unreachable poller after the first network failure.
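If you find yourself doing this lookup often, a tiny helper can map any PID straight to its startup message. This is only a sketch: the function name is made up, and the log path should match your LogFile setting:
# whichproc() { grep "^ *$1:" /tmp/zabbix_server.log | grep started; }
# whichproc 21974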
But what if the log file has been rotated and the original startup messages are lost? Besides more advanced detective work, there's a simple method, provided that the daemon is still running. We will look at that method a bit later in this chapter, in the Runtime process status section.
Reloading the configuration cache
We met the configuration cache in Chapter 2, Getting Your First Notification, and we discussed ways to monitor it in Chapter 20, Zabbix Maintenance. While it helps a lot performance-wise, it can be a bit of a problem if we are trying to quickly test something. It is possible to force the Zabbix server to reload the configuration cache.
Run the following to display the Zabbix server options:
# zabbix_server --help
Note
We briefly discussed Zabbix proxy configuration cache-reloading in Chapter 17, Using Proxies to Monitor Remote Locations.
In the output, look for the runtime control options section:
-R --runtime-control runtime-option   Perform administrative functions

Runtime control options:
  config_cache_reload             Reload configuration cache
Thus, reloading the server configuration cache can be initiated by the following:
# zabbix_server --runtime-control config_cache_reload
zabbix_server [2682]: command sent successfully
Examining the server log file will reveal that it has received the signal:
forced reloading of the configuration cache
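To confirm the reload from the command line, we can simply search the server log for that message, for example like this (the log file location is an assumption; check the LogFile parameter of your server):
# grep "forced reloading of the configuration cache" /tmp/zabbix_server.log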
In the background, the sending of the signal happens like this:
- The server binary looks up the default configuration file
- It then looks for the file specified in the PidFile option
- It sends the signal to the process with that ID
As discussed in Chapter 17, Using Proxies to Monitor Remote Locations, the great thing about this feature is that it's also supported for active Zabbix proxies. Even better, when an active proxy is instructed to reload its configuration cache, it connects to the Zabbix server, gets all the latest configuration, and then reloads the local configuration cache. If such a signal is sent to a passive proxy, it ignores the signal.
What if you have several proxies running on the same system—how can you tell the binary which exact instance should reload the configuration cache? Looking back at the steps that were taken to deliver the signal to the process, all that is needed is to specify the correct configuration file. If running several proxies on the same system, each must have its own configuration file already, specifying different PID files, log files, listening ports, and so on. Instructing a proxy that used a specific configuration file to reload the configuration cache would be this simple:
# zabbix_proxy -c /path/to/zabbix_proxy.conf --runtime-control config_cache_reload
Note
The full, absolute path must be provided for the configuration file; a relative path is not supported. The same principle applies to servers and proxies alike, but it is even less common to run several Zabbix servers on the same system.
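As a hypothetical illustration, with two proxies whose configuration files are /etc/zabbix/zabbix_proxy1.conf and /etc/zabbix/zabbix_proxy2.conf (these paths are made up for this example), each instance is addressed through its own configuration file:
# zabbix_proxy -c /etc/zabbix/zabbix_proxy1.conf --runtime-control config_cache_reload
# zabbix_proxy -c /etc/zabbix/zabbix_proxy2.conf --runtime-control config_cache_reload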
Manually reloading the configuration cache is useful if we have a large Zabbix server instance and have significantly increased the CacheUpdateFrequency parameter.
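For example, with a line like the following in zabbix_server.conf (the value is purely illustrative), the server would refresh its configuration cache on its own only every five minutes, and the manual reload covers the cases where we cannot wait that long:
CacheUpdateFrequency=300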
Controlling running daemons
A configuration cache reload is only one of the available runtime control options. Let's look at the remaining ones:
housekeeper_execute             Execute the housekeeper
log_level_increase=target       Increase log level, affects all processes if target is not specified
log_level_decrease=target       Decrease log level, affects all processes if target is not specified

Log level control targets:
  pid                           Process identifier
  process-type                  All processes of specified type (for example, poller)
  process-type,N                Process type and number (e.g., poller,3)
As discussed in Chapter 20, Zabbix Maintenance, the internal housekeeper is first run 30 minutes after the server or proxy startup. The housekeeper_execute runtime option allows us to run it at will:
# zabbix_server --runtime-control housekeeper_execute
Even more interesting is the ability to change the log level for a running process. This feature first appeared in Zabbix 2.4, and it made debugging much, much easier. Zabbix daemons are usually started and just work—until we have to change something. While we cannot tell any of the daemons to reread their configuration file, there are a few more options that allow us to control some aspects of a running daemon. As briefly mentioned in Chapter 20, Zabbix Maintenance, the DebugLevel parameter allows us to set the log level when the daemon starts, with the default being 3. Log level 4 adds all the SQL queries, and log level 5 also adds the received content from web monitoring and VMware monitoring.
For the uninitiated, anything above level 3 can be very surprising and intimidating. Even a very small Zabbix server can easily log dozens of megabytes in a few minutes at log level 4. As some problems might not appear immediately, you might have to run it for hours or days at log level 4 or 5. Imagine dealing with gigabytes of logs you are not familiar with. The ability to set the log level for a running process allows us to increase the log level during a problem situation and lower it later, without requiring a daemon restart.
Even better, when using the runtime log level feature, we can select which exact components should have their log level changed. Individual processes can be identified by either their system PID or by the process number inside Zabbix. Specifying processes by the system PID could be done like this:
# zabbix_server --runtime-control log_level_increase=1313
Specifying an individual Zabbix process is done by choosing the process type and then passing the process number:
# zabbix_server --runtime-control log_level_increase=trapper,3
A fairly useful and common approach is changing the log level for all processes of a certain type—for example, we don't know which trapper will receive the connection that causes the problem, so we could easily increase the log level for all trappers by omitting the process number:
# zabbix_server --runtime-control log_level_increase=trapper
And if no parameter is passed to this runtime option, it will affect all Zabbix processes:
# zabbix_server --runtime-control log_level_increase
When processes are told to change their log level, they log an entry about it and then change the log level:
21975:20151231:190556.881 log level has been increased to 4 (debug)
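Since every affected process logs such a line, counting these entries is a quick sanity check that the command reached the processes you expected (the log path is an assumption; adjust it to your installation):
# grep -c "log level has been increased" /tmp/zabbix_server.log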
Note that there is no way to query the current log level or set a specific level. If you are not sure about the current log level of all the processes, there are two ways to sort it out:

- Restart the daemon
- Decrease or increase the log level 5 times so that it's guaranteed to be at 0 or 5, then set the desired level

As a simple test of the options we just explored, increase the log level for all pollers:
# zabbix_server --runtime-control log_level_increase=poller
Open a tail on the Zabbix server log file:
# tail -f /tmp/zabbix_server.log
Notice the amount of data that just 5 poller processes on a tiny Zabbix server can generate. Then decrease the log level:
# zabbix_server --runtime-control log_level_decrease=poller
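Building on the reset trick mentioned earlier, a small shell sketch like the following (the process type and target level are assumptions) first decreases the level five times to force it down to 0 and then increases it to the desired value, here 4:
# for i in 1 2 3 4 5; do zabbix_server --runtime-control log_level_decrease=poller; done
# for i in 1 2 3 4; do zabbix_server --runtime-control log_level_increase=poller; done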
Runtime process status
Zabbix has another small trick to help with debugging. Run top and see which sorting mode gives you a more stable and longer list of Zabbix processes: sorting by processor usage (hit Shift + P) or by memory usage (hit Shift + M).
Note
Alternatively, hit o and type COMMAND=zabbix_server.
Press C and notice how the Zabbix processes have updated their command lines to show which exact internal process each one is and what it is doing, as we can see here:
zabbix_server: poller #1 [got 0 values in 0.000005 sec, idle 1 sec]
zabbix_server: poller #4 [got 1 values in 0.000089 sec, idle 1 sec]
zabbix_server: poller #5 [got 0 values in 0.000004 sec, idle 1 sec]
Follow their status and see how the task and the time it takes change for some of the processes. We can also get output that can be redirected or filtered through other commands:
# top -c -b | grep zabbix_server
The -c option tells top to show the command line, the same thing we achieved by hitting C before. The -b option tells top to run in batch mode, without accepting input and just outputting the results. We could also specify -n 1 to run it only once, or specify any other number as needed.
It might be more convenient to use ps:
# ps -f -C zabbix_server
The -f flag enables full output, which includes the command line. The -C flag filters by the executable name:
zabbix 21969 21962 0 18:43 ? 00:00:00 zabbix_server: poller #1 [got 0 values in 0.000006 sec, idle 1 sec]
zabbix 21970 21962 0 18:43 ? 00:00:00 zabbix_server: poller #2 [got 0 values in 0.000008 sec, idle 1 sec]
zabbix 21971 21962 0 18:43 ? 00:00:00 zabbix_server: poller #3 [got 0 values in 0.000004 sec, idle 1 sec]
The full format prints out some extra columns—if all we needed was the PID and the command line, we could limit the columns in the output with the -o flag, like this:
# ps -o pid=,command= -C zabbix_server
21975 zabbix_server: trapper #1 [processed data in 0.000150 sec, waiting for connection]
21976 zabbix_server: trapper #2 [processed data in 0.001312 sec, waiting for connection]
Note
The equals sign after pid and command tells ps not to use any header for these columns.
And to see a dynamic list that shows the current status, we can use the watch command:
# watch -n 1 'ps -o pid=,command= -C zabbix_server'
This list will be updated every second. Note that the interval parameter, -n, also accepts decimals, so to update twice every second, we could use -n 0.5.
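If only one process type is of interest, the ps output can be filtered further; this sketch (the process type is illustrative) watches just the pollers:
# watch -n 1 'ps -o pid=,command= -C zabbix_server | grep poller'
Note that a simple pattern like this also matches the http, proxy, and unreachable pollers; tighten the pattern if that is not what you want.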
This is also the method to find out which PID corresponds to which process type if startup messages are not available in the log file—we can see the process type and PID in the output of top or ps.
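For example, a one-liner like this (a sketch; the PID is illustrative) maps a PID straight to its process type without consulting the log file at all:
# ps -o pid=,command= -C zabbix_server | awk '$1 == 21974'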