![]() |
VOOZH | about |
We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.
Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.
Follow TNS on your favorite social media networks.
Become a TNS follower on LinkedIn.
Check out the latest featured and trending stories while you wait for your first TNS newsletter.
Editor’s Note: This article is an excerpt from Chapter 1 of The Manning book “Logging Best Practices: A Practical Guide to Cloud Native Logging.” Download the full book to finish reading Chapter 1 to learn about a few best practices to maximize log value.
Understanding any log event requires context. When we’re developing and using trace and debug logs, the context will, to an extent, be known to us, perhaps implicitly, as the position in the code will be part of the context or the test scenario being run will be the context. But when we come to production, the context is not likely to be implicit, so we need to make it explicit.
The key to the context is how well we’re answering the following:
Let’s explore these points in a bit more detail.
The log event’s “what” is partially addressed by the log level being included in the event. For trace log events, the fact that the event is logged is probably enough when combined with “where.” For info log level and above, we’re going to provide some additional detail: Is the info record for audit purposes? What kind of error has occurred? What does the warning relate to (such as a shortage of storage)?
The “what” is best supported with details that allow a transaction to be identified, including the type of transaction. The transactional data or a proxy, such as a unique ID for a transaction (so we can look up the actual transaction data), should provide sufficient insight; for example, if the transaction is missing a reference to associated data, we need to see that the value isn’t set.
Logging frameworks address most of this for you without any effort, but are likely to go back to the server’s system clock. The implications of time zones, clock skew and so on can catch you out if you’re looking for a solution that is running in a time zone that applies daylight saving time (because someone is looking at the timestamp and it appears to be out because they’re applying daylight saving time, but the log isn’t) or the solution is globally distributed. So you need to know which time zone the server is in. One option is to configure the logging framework to include the time zone in the log, but better still, align all servers to UTC (Coordinated Universal Time).
When trying to align a log analysis with a user error report, you will need to be clear about which time zone the user is working in and whether the error report is recorded against their time or system time.
Naming the code location requires some awareness of how the code is handled. This is especially important when the code is deployed for a commercial solution where obfuscation and minifier tools are likely to be used, particularly on script-based solutions such as JavaScript. As a result, relying on reflection to get details of where the code is can be unhelpful. Although there are tools, some obscurification providers will include mapping information to identify the original code, given the correct information.
Note: For more information on code minifying and obscuring, check out the following resources:
Applications are typically either multithreaded (such as Java) or context-switched on an I/O wait (Node.js), and they are single-threaded. Context switching means we don’t handle one transaction at a time, so understanding whether log events preceding or following the event of interest are related can become challenging. This can be overcome by incorporating transaction IDs or session IDs, or by using open tracing or open telemetry IDs as part of the log event.
Some logging frameworks will help you capture a thread or process ID in their configuration. For example, in Fluentd, we can use the WorkerId in the log file output.
“Where” can also be influenced by software versions. We can have multiple versions of the same logic in production at one time to support activities such as:
Here’s another way to look at it: You spot an image that has not been rendered very well in this book. You contact Manning. To help you, we need to know which figure is faulty. What if the issue has been seen before and been fixed in a new edition of the book? This isn’t saying that every log event needs to publish every aspect of version information, but we do need to make it easy to supply sufficient information.
Perhaps when we log errors or worse, this information is written into the log. This is an area where injecting into log events can be helpful. If a log event is identified as reposting something abnormal in the software, such as an error, then Fluentd could retrieve the version of the software running and inject it into the logs for future reference.
This comes down to why an event has occurred — is it an error or just a signal to show where the code is (trace) or the application’s current state (debug)? As we move to the higher levels (warning, error and fatal in our classification), “why” becomes more important and less evident from just the log level. The information-level log events could be an audit or a periodic snapshot of the system’s current state, regardless of whether things are good or bad — for example, logging how deep a message queue is. However, the log event consumer needs to be able to understand why the event is being generated. With a bit of thought, this is easily solved.
A simple attribute, such as “current status” or “audit action,” could be included along with the shared data. We are, in effect, providing a secondary classification in many cases for the log event. Given that we provide additional metadata, we might structure it as long as we are consistent within the development organization.
When it comes to reporting warnings and errors, the “why” comes back to “what” triggered the warning or error. Is it the primary error, or has something occurred as a by-product of a previous issue? Trying to indicate whether an error is cause or effect is difficult. If we can be certain, we should be clear; if it isn’t, we can perhaps give the log event consumer some hint about the possibility. Coding such information can be complicated and hard to test. But it is easy to link to an error code and provide steps to confirm cause or effect.
The record we generate with the event needs to clearly provide the information to help perform a diagnosis, not just operationally, but also whether something in the code may be needed, such as a more defensive code or better data validation. As the solution is now on an unhappy path, we should not be afraid to be generous with the information — as long as it doesn’t raise sensitivity issues, which we’ll look at shortly. For error-handling paths, we’re in a place where performance should not be a consideration, as this part of the codebase should only execute infrequently. Generally, too little information is a lot worse than too much.
Logging of “who” can be tricky. Logging information that is identifiable to an individual will make our log processing subject to legislative, contractual and commercial requirements.
The important thing is to consider when the “who” is necessary and whether we can safely use other data as a proxy for the true identity. For example, perhaps the “who” is relevant only during the logged-in session, so we just need to carry the session ID and use that. If we need to attribute the actions in a session back to a specific individual, we record that separately in a secure way. That session ID could equally be a transaction or order ID, and so on.
When recording events such as failed logins or application interactions that do not require a specific individual, we may still need a value for “who,” such as an originating IP address. For example, a single server ping may be harmless (alive service reporting is likely to just do this), but a really rapid repeated occurrence from the same location is not good. However, having that IP means it is possible to determine that it was the same system calling and therefore who to block.
Working with a client’s DevOps team, we discovered the client’s security team employed a third-party organization to regularly run probes across all their internet-facing servers. We figured out what was happening, as we’d see our API gateway servers reporting illegal requests originating from one of several IPs on a regular frequency. Once we identified the pattern and the logged details like the IP origin, time and the HTTP request, we raised our suspicions with the security team, who confirmed the use of a third party.
Don’t forget that the “who” could be a system or application component. For example, when processing payroll, that activity is triggered by a scheduler. So it is helpful to know which schedule or scheduler triggered the process.
Addressing what, when, where, why and who can be a little abstract. Personally, I try to address it using the following questions:
Time zones cause confusion due to daylight saving shifts, server location differences and user vs. system time mismatches. A best practice is to align all servers to UTC and clearly note whether timestamps are user-reported or system-generated.
It’s often unclear whether an error is the cause or a side effect, so logs should include hints or references (like error codes) for clarification. Providing rich context like correlation IDs, classification attributes and extra details (without exposing sensitive data) is better than too little information.
This article is an excerpt from the Manning book, “Logging Best Practices.” Even with a health check in place, you need to know why a check might fail. This is where Fluent Bit’s error codes come in, offering a window into the exact issues affecting your inputs and outputs. You can read “Logging Best Practices” in its entirety here.