VOOZH about

URL: https://thenewstack.io/hybrid-data-collection-from-the-iot-edge-with-mqtt-and-kafka/

⇱ Hybrid Data Collection from the IoT Edge with MQTT and Kafka - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2022-12-16 13:25:46
Hybrid Data Collection from the IoT Edge with MQTT and Kafka
sponsor-influxdata,sponsored-post-contributed,
Data / Open Source / Software Development / Storage

Hybrid Data Collection from the IoT Edge with MQTT and Kafka

MQTT makes it easy to get data from distributed edge devices. MQTT accounts for connectivity issues, which is critical for distributed devices.
Dec 16th, 2022 1:25pm by Jason Myers
👁 Featued image for: Hybrid Data Collection from the IoT Edge with MQTT and Kafka
Image via Pixabay.
InfluxData sponsored this post.
Data collection is one of the biggest challenges when working with databases. If you can’t get data into the database, then nothing else works. While this is true for any use case, one area that can be particularly challenging is IoT. These use cases often involve a large number of systems, sensors and data sources, and they don’t always output data in the same formats or use the same protocols. When it comes to time series data, this challenge becomes even greater because of the sheer number of data sources, and the rates at which they generate data mean you might need to ingest millions of data points every second. Therefore, architecting reliable data ingestion is a critical step for IoT and industrial IoT use cases. A couple of options to consider for that are MQTT and Kafka.

MQTT

MQTT is a popular protocol used in many IoT settings, which relies on a publish/subscribe model. MQTT can generate topics on demand during payload delivery. If a topic already exists, MQTT sends the data to that topic. If the topic doesn’t exist, MQTT creates it. MQTT payloads are really flexible, which means you don’t need to define a strict schema for your data, but it also means that your subscribers need to be able to handle topics that fall outside the norm. In short, MQTT is easy to configure, reliable and extremely flexible.

Kafka

Kafka is an event-streaming platform also based on the publish/subscribe model with the added benefit of data persistence. For IoT use cases, Kafka also offers high throughput and availability, and it integrates well with many third-party systems. While Kafka is similar to MQTT in some key ways, it isn’t a cure-all for IoT architecture. This is because Kafka is built for stable networks that deploy good infrastructure. IoT devices typically run the gamut; a single system may have devices with significant resources and great connectivity, while other devices may have limited footprints and intermittent connectivity. Kafka also doesn’t deploy key data delivery features such as Keep-Alive and Last Will.
InfluxData is the creator of InfluxDB, the leading time series platform. More than 1,900 customers use InfluxDB to collect, store, and analyze all time series data at any scale. Developers can query and analyze their time-stamped data to predict, respond, and adapt in real-time.
Learn More
The latest from InfluxData

Hybrid Data Collection with MQTT and Kafka

The goal here is to create a data pipeline that leverages the benefits of each protocol to help mitigate the drawbacks of the other. If you want to play with the code for this project, you can grab it from the InfluxData Community GitHub repo. In this example we’ll simulate monitoring emergency fuel generators. We want to collect data from the generators, store it in InfluxDB, process and analyze that data, and then send that processed data on for use elsewhere. We could connect MQTT and Kafka directly to the generators, but having already highlighted the pros and cons of each, we want a more durable solution. From an architectural perspective, we can visualize the differences between these two protocols in this simulation. 👁 Image
Instead of connecting everything in this way, a more logical approach would be along the lines of the following: 👁 Image
Here, we collect data directly from our edge devices using a Kafka MQTT proxy, essentially a MQTT broker with Kafka functionality bolted on. This allows our MQTT clients to directly connect to it. Given this, we can amend the diagram above to better reflect what’s going on here. 👁 Image
The MQTT proxy sends the data to the Kafka cluster. We do this to take advantage of Kafka partitions, which provide consistent ordering of records and parallelism for performance purposes. This means we can create separate queues for data from individual or groups of devices. Suppose each generator produces 1,000 records per second and we have a microservice trying to consume those messages and perform some logic. We know the microservice can only handle 500 records a second so our consumer will never be able to achieve consistency with our producer. Using partitions allows us to spin up additional instances of our microservice, as needed, to consume a portion of those records in parallel to one another. This achieves a much higher throughput. We can also use partitions for topic replication. In this case, we can reserve partitions to hold copies of our records. Because these protocols can distribute our topics over many servers, this approach helps to prevent data loss due to outages. If one partition is unreachable, then our consumer will simply take records from the backup partition instead. Getting back to the example, we use the InfluxDB Sync connector to write the data from the Kafka partitions into InfluxDB, where we can use the Flux language to downsample and aggregate the data. Then we send the downsampled data back to Kafka using the InfluxDB Source connector for further distribution and consumption.

More than a Data Store

Let’s unpack that last part a bit. What we’ve got here isn’t just a simple data store. Instead, we’re using Flux to push data analysis work down into the database for greater efficiency. 👁 Image
As you can see in the diagram, we use InfluxDB Sync to write our raw data to a bucket called “`kafka_raw“` in InfluxDB. We’re going to use Flux to get data from the raw bucket, enrich it and write that data to the “`kafka_downsampled“` bucket. Our Flux script looks like this:
import "influxdata/influxdb/tasks"
 
option task = {name: "kafka_downsample", every: 1m, offset: 0s}
 
from(bucket: "kafka")
 |> range(start: tasks.lastSuccess(orTime: -1h))
 |> filter(fn: (r) => r["_measurement"] == "genData")
 |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
 |> group(columns: ["generatorID"], mode: "by")
 |> last(column: "fuel")
 |> map(fn: (r) => ({r with alarm: if r.fuel < 500 then "refuel" else "no action"}))
 |> set(key: "region", value: "US")
 |> to(
 bucket: "kafka_downsampled",
 tagColumns: ["generatorID", "region"],
 fieldFn: (r) =>
 ({
 "alarm": r.alarm,
 "fuel": r.fuel,
 "lat": r.lat,
 "lon": r.lon,
 "power": r.power,
 "load": r.load,
 "temperature": r.temperature,
 }),
 )
So what does all this mean? Here’s what’s happening.
  • The range() + tasks.lastSuccess function collects the data from the “`kafka“` bucket since the last time the task ran. It defaults to a static value of the past hour (-1h) if this is the first time the task runs.
  • Next, we filter the data so we only return values for the measurement genData.
  • We use Pivot() to shift vertically stored values into a horizontal format. This makes it more like working with SQL databases, and the horizontal format is important for the map function.
  • We group() by our generatorID column. This separates each generator’s data into its own table.
  • Last() selects and returns the last row of each table.
  • We then use map(), along with some conditional logic, to check our current fuel level and to create a new column called alarm. The function fills the alarms value based on the conditional logic.
  • set() allows us to create a new column and manually fill each row with the same value.
  • The to() function transfers our data to kafka_downsampled. We provide some mapping logic to define which columns are fields and tags.
To create this task in the InfluxDB UI, click the CREATE TASK button, name your task and set how often you want it to run. Then you can write your task script in the task window.

Benefits

This architecture design accomplishes several things. Leveraging MQTT makes it easy to get data from distributed edge devices. It’s lightweight, so it can go virtually anywhere, and it’s built to handle thousands of connections, so the volume and velocity of IoT data aren’t an issue. Plus, MQTT accounts for connectivity issues, which is critical for distributed devices. In other words, it’s flexible for bridging the initial gap between the physical and digital worlds. Leveraging Kafka allows us to organize that data as it moves through the data pipeline. Kafka provides high availability, so we know that whenever our edge devices have connectivity, they’ll be able to reach Kafka. Partitioning increases data throughput, so we can ensure that all that IoT data actually makes it to the data store, with the help of Kafka’s enterprise-level connectors. InfluxDB enhances the entire data pipeline by providing both data transformation capabilities and storage. To learn more about the code for this example and the journey to creating this hybrid architecture, check out this series.
InfluxData is the creator of InfluxDB, the leading time series platform. More than 1,900 customers use InfluxDB to collect, store, and analyze all time series data at any scale. Developers can query and analyze their time-stamped data to predict, respond, and adapt in real-time.
Learn More
The latest from InfluxData
TRENDING STORIES
Jason Myers earned a PhD in modern Irish history from Loyola University Chicago. Since then, he has used the writing skills he developed in his academic work to create content for a range of startup and technology companies. When he's...
Read more from Jason Myers
InfluxData sponsored this post.
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: Pragma.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
👁 Image
Join the millions of developers using InfluxDB to predict, respond, and adapt in real-time.