





















































In this article by Travis Marlette, author of Splunk Best Practices, will cover the following topics:
(For more resources related to this topic, see here.)
Within the working world of technology, there are hundreds of thousands of different applications, all (usually) logging in different formats. As Splunk experts, our job is make all those logs speak human, which is often the impossible task. With third-party applications that provide support, sometimes log formatting is out of our control. Take for instance, Cisco or Juniper, or any other leading application manufacturer. We won't be discussing these kinds of logs in this article, but instead the logs that we do have some control over.
The logs I am referencing belong to proprietary in-house (also known as "home grown") applications that are often part of the middle-ware, and usually they control some of the most mission-critical services an organization can provide.
Proprietary applications can be written in any language, however logging is usually left up to the developers for troubleshooting and up until now the process of manually scraping log files to troubleshoot quality assurance issues and system outages has been very specific. I mean that usually, the developer(s) are the only people that truly understand what those log messages mean.
That being said, oftentimes developers write their logs in a way that they can understand them, because ultimately it will be them doing the troubleshooting / code fixing when something breaks severely.
As an IT community, we haven't really started taking a look at the way we log things, but instead we have tried to limit the confusion to developers, and then have them help other SME's that provide operational support to understand what is actually happening.
This method is successful, however it is slow, and the true value of any SME is reducing any system’s MTTR, and increasing uptime.
With any system, the more transactions processed means the larger the scale of a system, which means that, after about 20 machines, troubleshooting begins to get more complex and time consuming with a manual process.
This is where something like Splunk can be extremely valuable, however Splunk is only as good as the information that is coming into it.
I will say this phrase for the people who haven't heard it yet; "garbage in… garbage out".
There are some ways to turn proprietary logging into a powerful tool, and I have personally seen the value of these kinds of logs, after formatting them for Splunk, they turn into a huge asset in an organization’s software life cycle.
I'm not here to tell you this is easy, but I am here to give you some good practices about how to format proprietary logs.
To do that I'll start by helping you appreciate a very silent and critical piece of the application stack.
To developers, a logging mechanism is a very important part of the stack, and the log itself is mission critical. What we haven't spent much time thinking about before log analyzers, is how to make log events/messages/exceptions more machine friendly so that we can socialize the information in a system like Splunk, and start to bridge the knowledge gap between development and operations.
The nicer we format the logs, the faster Splunk can reveal the information about our systems, saving everyone time and from headaches.
Here I'm giving some very high level information on loggers. My intention is not to recommend logging tools, but simply to raise awareness of their existence for those that are not in development, and allow for independent research into what they do. With the right developer, and the right Splunker, the logger turns into something immensely valuable to an organization.
There is an array of different loggers in the IT universe, and I'm only going to touch on a couple of them here. Keep in mind that I'm only referencing these due to the ease of development I've seen from personal experience, and experiences do vary.
I'm only going to touch on three loggers and then move on to formatting, as there are tons of logging mechanisms and the preference truly depends on the developer.
I'm going to be taking some very broad strokes with the following explanations in order to familiarize you, the Splunker, with the logger. If you would like to learn more information, please either seek out a developer to help you understand the logic better or acquire some education how to develop and log in independent study.
There are some pretty basic components to logging that we need to understand to learn which type of data we are looking at. I'll start with the four most common ones:
"2/19/2011 6:17:46 AM Using 'xplog70.dll' version '2009.100.1600' to execute extended store procedure 'xp_common_1' operation failed to connect to 'DB_XCUTE_STOR'"
Exceptions: These are the uncommon, but very important pieces of the log. They are usually only written when something went wrong, and offer developer insight into the root cause at the application layer. They are usually only printed when an error occurs, and used for debugging.
These exceptions can print a huge amount of information into the log depending on the developer and the framework. The format itself is not easy and in some cases not even possible for a developer to manage.
This is an open source logger that is often used in middleware applications.
This is a logger popularly used for Linux, and popular for its performance and multi-threaded handling of logging. Commonly, Pantheios is used for C/C++ applications, but it works with a multitude of frameworks.
This is a logger specifically for Python, and since Python is becoming more and more popular, this is a very common package used to log Python scripts and applications.
Each one of these loggers has their own way of logging, and the value is determined by the application developer. If there is no standardized logging, then one can imagine the confusion this can bring to troubleshooting.
This is an example of a Java exception in a structured log format:
When Java prints this exception, it will come in this format and a developer doesn't control what that format is. They can control some aspects about what is included within an exception, though the arrangement of the characters and how it's written is done by the Java framework itself.
I mention this last part in order to help operational people understand where the control of a developer sometimes ends. My own personal experience has taught me that attempting to change a format that is handled within the framework itself is an attempt at futility. Pick your battles right? As a Splunker, you can save yourself a headache on this kind of thing.
While I say that, I will add an addendum by saying that Splunk, mixed with a Splunk expert and the right development resources, can also make the data I just mentioned extremely valuable. It will likely not happen as fast as they make it out to be at a presentation, and it will take more resources than you may have thought, however at the end of your Splunk journey, you will be happy. This article was to help you understand the importance of logs formatting, and how logs are written. We often don't think about our logs proactively, and I encourage you to do so.
Further resources on this subject: