Editor’s note: This article is an excerpt from the Manning MEAP (Manning Early Access Program) book, “Effective Platform Engineering.” In MEAP, you read a book chapter by chapter while it’s being written and get the final eBook as soon as it’s finished.
When an issue reaches us, we need to know what happened or, better, be able to engage with it as it is happening. The issue might be a serious system failure or a question about how something or someone was or is interacting with our system(s). To address the issue, we must have this information available and a tool that fits our needs. Fluent Bit is such a tool.
In this excerpt of “Effective Platform Engineering” (the book’s first chapter), we’ll take a moment to understand what Fluent Bit is and answer some important questions about it, such as why it is important enough to merit a book and how it fits into the IT ecosystem.
Fluent Bit is, at its heart, a specialized event capture and distribution tool. Let’s break that statement down a bit. Why is it specialized? Fluent Bit focuses on log events, metrics, and traces (sometimes called signals).
It’s important to note that trace identifiers are carried through the different parts of the application. Traces have become more significant with Kubernetes and the adoption of microservice strategies because, when used properly, they can make following what is happening across distributed solutions far easier.
The book “Effective Platform Engineering” explores event data in greater depth. The ability to handle various events within a single tool isn’t unique, but it does distinguish Fluent Bit from technologies it’s sometimes compared with, such as Logstash.
Because Fluent Bit reacts to and processes events, typically in near real time as they’re received or tracked from sources such as a file, it’s described as event-driven.
Why do we need Fluent Bit to be event-driven? After all, we look at the data when something isn’t right. Although we may adopt the traditional approach of looking at logs when someone has declared there to be an issue, people still like to see stats and metrics closer to real time.
We should also remember that we can derive meaningful time-sensitive metrics from log events. In our code, we log events when our software has done something that may be of interest.
Even when a scheduler triggers the monitored solution, we want the logs, metrics and traces to be provided while they are still meaningful.
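To make the idea of deriving time-sensitive metrics from log events concrete, here is a minimal Python sketch. It is an illustration only, not how Fluent Bit works internally. The log format (`<ISO timestamp> <LEVEL> <message>`), the sample lines, and the function name `errors_per_minute` are all assumptions made for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines in an assumed "<ISO timestamp> <LEVEL> <message>" format.
SAMPLE_LOGS = [
    "2024-05-01T10:00:05 INFO order accepted",
    "2024-05-01T10:00:40 ERROR payment gateway timeout",
    "2024-05-01T10:01:10 ERROR payment gateway timeout",
    "2024-05-01T10:01:15 INFO order accepted",
    "2024-05-01T10:01:50 ERROR database unavailable",
]


def errors_per_minute(lines):
    """Turn raw log lines into a metric: the number of ERROR events per minute."""
    counts = Counter()
    for line in lines:
        timestamp, level, _message = line.split(" ", 2)
        if level == "ERROR":
            # Truncate the timestamp to the minute so events share a bucket.
            minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
            counts[minute.isoformat(timespec="minutes")] += 1
    return dict(counts)


if __name__ == "__main__":
    for minute, count in sorted(errors_per_minute(SAMPLE_LOGS).items()):
        print(minute, count)
```

The metric is only useful if the log events arrive while the minute they describe still matters, which is the argument for processing events as they occur.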
Clever words, then, for something mundane? It would be easy to think that. Unfortunately, this thinking can lead us to miss a wealth of possibilities and opportunities that Fluent Bit offers to make our lives a lot easier.
If we consider a log event as just a block of text from our code, for example, we may overlook that we can derive meaning from it and determine whether something else needs to occur there and then.
If the event is a health check indicating everything is fine, we could send the data to the operations dashboards and do no more. But if the event reports the receipt of a large, malformed payload, it could indicate a more serious problem that needs immediate intervention before users start calling to complain.
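The decision described above can be sketched as a small routing function. This is a hedged illustration in Python rather than Fluent Bit configuration. The event fields (`type`, `status`, `size_bytes`), the destination names (`dashboard`, `on_call_alert`), and the size threshold are hypothetical choices for the example.

```python
def route_event(event, max_payload_bytes=1_000_000):
    """Return the destinations that one event (a dict) should be sent to.

    Assumed policy: every event goes to the operations dashboard, and a
    large malformed payload is also escalated for immediate attention.
    """
    destinations = ["dashboard"]

    # A healthy check needs no further action.
    if event.get("type") == "health_check" and event.get("status") == "ok":
        return destinations

    # A large, malformed payload may signal a more serious problem.
    is_malformed = event.get("type") == "malformed_payload"
    is_large = event.get("size_bytes", 0) > max_payload_bytes
    if is_malformed and is_large:
        destinations.append("on_call_alert")

    return destinations


if __name__ == "__main__":
    print(route_event({"type": "health_check", "status": "ok"}))
    print(route_event({"type": "malformed_payload", "size_bytes": 5_000_000}))
```

The point is that the event is treated as structured data with meaning, so the choice of what happens next can be made at the moment the event is seen.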
Tackling the pain of identifying (and possibly needing to resolve) an issue with a system benefits us all individually, whatever our role.
The information we need to address an issue could be as simple as the complete log message. Often, we need to understand what happened before, during and after the event of concern to establish cause and effect. (For example, a database may be producing errors because we’ve run out of storage. Did we run out of storage because the housekeeping process failed, or did we overlook the need to monitor our storage capacity?)
We need to capture and aggregate data from many different sources. Logs, metrics and traces are the building blocks of observability, and this monitoring data is generally transient. Using Fluent Bit, and tools like it, enables us to gather data from all sources and put it somewhere secure. It’s been my experience that when things go seriously wrong, people aren’t worrying about preserving state information, logs and the like. Their concern is returning to an operational status, which can mean that logs and stored metrics in the production environment may easily be trashed.
Aggregating log events doesn’t just mitigate the risk of data loss; it also helps us see the complete picture. COBOL solutions, for example, were usually made up of multiple programs run in sequence. Processes were sequential, but distributed processing was already possible. As technology advanced, we adopted two- or three-tier solutions running concurrently (application and database servers, usually with separate UIs).
Even if we’re operating monolithic application servers, work can be spread across multiple virtualized load-balanced servers, and microservices have led to a further explosion of distribution. To make sense of what is happening, we need to bring together all the events spread across all these distribution points to get an accurate picture of what is happening.
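As a rough sketch of what bringing events together can look like, the following Python example merges events from several sources into one timeline. It assumes each source already emits events in time order and that every event is a `(timestamp, source, message)` tuple with an ISO 8601 timestamp. The server names and messages are invented for illustration.

```python
import heapq

# Hypothetical events from three tiers, each list already sorted by timestamp.
WEB_EVENTS = [
    ("2024-05-01T10:00:01", "web-1", "request received"),
    ("2024-05-01T10:00:04", "web-1", "response 500"),
]
APP_EVENTS = [("2024-05-01T10:00:02", "app-1", "calling database")]
DB_EVENTS = [("2024-05-01T10:00:03", "db-1", "disk full")]


def merge_by_time(*sources):
    """Merge per-source event lists into a single time-ordered list.

    ISO 8601 timestamps in the same format and time zone sort correctly
    as plain strings, so no parsing is needed for this sketch.
    """
    return list(heapq.merge(*sources, key=lambda event: event[0]))


if __name__ == "__main__":
    for timestamp, source, message in merge_by_time(WEB_EVENTS, APP_EVENTS, DB_EVENTS):
        print(timestamp, source, message)
```

Read in order, the merged timeline shows the web server’s error response following the database’s disk-full event, a connection that is hard to see when each server’s log is read on its own.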
Aside from being able to preserve information that can help us diagnose an issue, we can easily overlook one challenge: The more time we take to get from issue to diagnosis, the more damage can occur, and therefore, the more painful the recovery process becomes.
Whether we’re fixing failed transactions or working out the scale of a security breach, by processing the metrics and logs as they occur, we can automate the evaluation of whether they indicate an issue occurring now or, better, an imminent problem. Thus, we can reduce the amount of pain because we’ve avoided the issue or kept its effect as small as possible.
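One simple way to automate that kind of evaluation is a sliding-window check that flags a problem when too many errors arrive in a short period. The Python sketch below is an assumption-laden illustration: the class name `ErrorRateWatch`, the 60-second window, and the threshold of three errors are arbitrary, and timestamps are assumed to be seconds that never decrease.

```python
from collections import deque


class ErrorRateWatch:
    """Flag a likely issue when errors cluster within a time window."""

    def __init__(self, window_seconds=60, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._errors = deque()  # timestamps of recent error events

    def observe(self, timestamp, is_error):
        """Record one event; return True if the threshold is reached."""
        if is_error:
            self._errors.append(timestamp)
        # Discard errors that have aged out of the window.
        while self._errors and self._errors[0] <= timestamp - self.window_seconds:
            self._errors.popleft()
        return len(self._errors) >= self.threshold


if __name__ == "__main__":
    watch = ErrorRateWatch()
    for second, is_error in [(0, True), (10, True), (20, True), (100, False)]:
        print(second, watch.observe(second, is_error))
```

Because the check runs as each event arrives, the alert can be raised while the problem is still small rather than after someone goes looking through the logs.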
The ability to distribute data easily also allows us to adopt different tools for different tasks. If the data is difficult to distribute, we end up with the lowest common denominator or with tools that support the most vocal team using the data rather than ones that address different needs. PagerDuty, for example, is ideal for notifying the right person depending on the identified system and the time and day of the week.
The Fluent tools, Fluentd and Fluent Bit, are key players in the Cloud Native Computing Foundation (CNCF) ecosystem, helping us gather, secure, and, ideally, analyze logs and metrics. These solutions allow us to get the observability data (logs, traces and metrics) in a form that another tool can render in an easily digestible format. Fluent Bit is having a greater effect than Fluentd in terms of adoption and support for the latest observability standards and tools, as we’ll see.
Within the CNCF, projects are classified to reflect their process, quality, maturity, support and adoption. Graduated projects such as Fluentd and Fluent Bit need contributors from multiple organizations and must demonstrate good project governance and development processes. Most importantly, these projects need several public adopters, so the wider community can be confident that it is not adopting something that could be abandoned overnight.
This article is part of a series.