The ESP32-P4 is a behemoth of a chip, and the Elecrow CrowPanel Advance pairs it with an ESP32-C6 for wireless connectivity. I've been playing around with it over the last two months, and it seemed like a perfect fit for a project I found at Maker Faire Shenzhen last year: XiaoZhi. XiaoZhi is a full chatbot platform that can run on an ESP32, allowing you to have a conversation out loud with either a local LLM or a cloud-based model.

Unfortunately, in the time I've spent porting it, Elecrow has released its own port of the XiaoZhi platform for the ESP32-P4 as well. Elecrow publishes the project files for its port, but I'll be going over what XiaoZhi does, how it works, and why my display now shows emojis to communicate emotion alongside its responses.

What is XiaoZhi?

A voice assistant with "emotion"

XiaoZhi is an open-source AI chatbot platform designed to run on ESP32 microcontrollers. It's a Chinese-language project at its core, though it supports English and other languages depending on the LLM backend you connect it to. What makes it interesting isn't just that it's a chatbot, because there are plenty of those, but that it handles the entire pipeline on surprisingly modest hardware. That includes wake word detection, voice activity detection, speech-to-text, LLM processing, and text-to-speech, all coordinated through an ESP32.

The way it works is actually pretty clever. The ESP32 itself isn't doing the heavy lifting for the AI inference, since that would be asking a lot of a microcontroller. Instead, XiaoZhi acts as a client that captures audio from a microphone, streams it to a server for processing, and then plays back the response through a speaker. The server side handles the STT, LLM inference, and TTS, while the ESP32 manages the audio I/O, the display, and the interaction logic. You can point it at a self-hosted backend running something like Ollama, or you can use one of the supported cloud services like the official XiaoZhi site.
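To make that split concrete, here's a minimal Python sketch of a single conversational turn. It is not XiaoZhi's actual code: the `Backend` class, `device_turn`, and the stand-in stages are hypothetical names I've made up for illustration, and it hands the audio over in one batch where the real project streams it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Backend:
    """Hypothetical server side: three stages, each a plain callable here."""
    stt: Callable[[bytes], str]   # speech-to-text
    llm: Callable[[str], str]     # text in, reply text out
    tts: Callable[[str], bytes]   # text-to-speech

    def handle_turn(self, audio: bytes) -> bytes:
        text = self.stt(audio)
        reply = self.llm(text)
        return self.tts(reply)


def device_turn(mic_frames: Iterable[bytes], backend: Backend) -> bytes:
    """Device side: gather captured frames, hand them off, return audio to play."""
    captured = b"".join(mic_frames)  # simplification: batch instead of streaming
    return backend.handle_turn(captured)


# Stand-in stages so the sketch runs without any real models.
fake_backend = Backend(
    stt=lambda audio: f"<{len(audio)} bytes of speech>",
    llm=lambda text: f"You said: {text}",
    tts=lambda reply: reply.encode("utf-8"),
)

playback = device_turn([b"\x00\x01", b"\x02\x03"], fake_backend)
print(playback.decode("utf-8"))
```

The point of the shape is that the device never needs to know which STT, LLM, or TTS engine sits behind the server, which is why swapping a cloud backend for a self-hosted one doesn't change anything on the ESP32.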

What drew me to XiaoZhi at Maker Faire was seeing it running on a basic ESP32-S3 board with a tiny screen and a speaker, activated by motion detection, and being able to have a full back-and-forth conversation in real time. The latency was surprisingly low, and the project had clearly been built with actual usability in mind rather than being a proof-of-concept that falls apart the moment you try to do anything real with it.

The display side of things is where it gets fun, too. XiaoZhi supports theming and emoji expressions, so the device doesn't just talk back to you; it also reacts visually. On the CrowPanel Advance's 7-inch display, this means you get a large animated emoji face that shifts expression based on the emotional tone of the response. Happy replies get a smiling face, confused answers get a puzzled look, and so on. These emoji assets and themes are generated and managed through the XiaoZhi web platform, where you can customize the look of your device, pick different face styles, and push them to your board.

If you're running the whole thing self-hosted, the process looks a little different. You'll need to set up the XiaoZhi server stack yourself, which means running the STT engine (typically something like FunASR or Whisper), your LLM of choice through Ollama or a compatible API, and a TTS engine like CosyVoice or EdgeTTS for generating spoken responses. The emoji and theme assets still come from the XiaoZhi platform by default, as the device pulls them down during setup. However, since the project is open source, you can swap in your own assets or modify the display logic if you want full control over the visuals. The emotional tagging that drives which emoji gets displayed is handled by the LLM itself on the server side, and the ESP32 parses that tag to pick the right face. It's a simple but effective approach, and it means the quality of the emotional reactions is only as good as the model you're running behind it.
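The tag-to-face step is easy to picture in code. This is a rough Python sketch rather than the firmware's real logic: I'm assuming the server sends a JSON message with an `emotion` field, and the `FACES` table, `DEFAULT_FACE`, and `pick_face` are hypothetical names. The real firmware is not written in Python, and its asset names come from whichever theme you've pushed to the board.

```python
import json

# Hypothetical tag-to-face table; a real theme defines its own set of faces.
FACES = {"happy": "🙂", "confused": "😕", "sad": "😢"}
DEFAULT_FACE = "😐"


def pick_face(message: str) -> str:
    """Return the face for a server message, falling back to a neutral one."""
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        return DEFAULT_FACE
    if not isinstance(payload, dict):
        return DEFAULT_FACE
    emotion = payload.get("emotion")
    if not isinstance(emotion, str):
        return DEFAULT_FACE
    # Models are not always consistent about case or whitespace.
    return FACES.get(emotion.strip().lower(), DEFAULT_FACE)


print(pick_face('{"emotion": "Happy"}'))
print(pick_face('{"emotion": "ecstatic"}'))
```

The fallback matters more than it looks: a smaller model will occasionally produce a tag that isn't in the table, and a neutral face is a better failure mode than a blank screen.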

Elecrow's CrowPanel Advance is overkill for this, and that's the point

It looks a lot better than most other deployments

The CrowPanel Advance is a board I've already covered in detail, but it's worth reiterating why it works so well here. Most XiaoZhi builds you'll see online run on an ESP32-S3 with a small screen and a tiny speaker, and they work fine, since the platform was designed for that class of hardware. But the P4 gives you headroom that those boards simply don't have. With 32MB of PSRAM, dual 400MHz RISC-V cores, and a 1024x600 display, the CrowPanel runs XiaoZhi comfortably. The emoji faces aren't crammed into a tiny square, the audio doesn't crackle through a terrible speaker, and you're not constantly bumping up against memory limits when the LLM sends back a longer response.

The built-in speakers and microphone mean you don't need to wire up external audio hardware, either. On smaller boards, getting I2S audio working correctly is often half the battle, and even then, you're usually stuck with a speaker that sounds like it's talking to you from inside a tin can. Here, the audio quality is genuinely usable for conversation, which matters a lot when the entire point of the project is to talk to the thing. The ESP32-C6 handles Wi-Fi over SDIO, and while I've had my share of issues with that bus in the past, once it's configured as a 1-bit bus, it's completely stable.

Elecrow's own port handles all of the board-specific configuration for you, so if you just want XiaoZhi on this display without the porting headaches, that's the path of least resistance. My own process involved defining a new board configuration, adapting the display driver, sorting out the I2S audio pipeline for the P4's software-defined codec, and dealing with SDIO and the ESP32-C6 saving its own Wi-Fi configuration that conflicted with XiaoZhi's.

What can XiaoZhi actually do?

MCP support means the sky is the limit

It's easy to dismiss XiaoZhi as a chatbot running on an ESP32, but it's more versatile than that. The LLM backend is entirely configurable, so you're not locked into a single use case. You can define your own system prompt, swap models depending on what you need, and because it supports both cloud and self-hosted backends, you control where your data goes. I'm using it with GLM 4.7 Flash through the XiaoZhi cloud, but deploying a self-hosted server instance using Docker is just as straightforward, and I had one running in a few minutes.

XiaoZhi's real edge is MCP support, or Model Context Protocol. MCP is a standard that lets LLMs interact with external tools and services, and XiaoZhi supports it out of the box. That means your ESP32 chatbot isn't limited to answering questions. It can actually do things. Depending on the MCP servers you configure, you could ask it to check the weather, control smart home devices, search the web, or pull data from an API, all through voice. Even something like changing the backlight percentage works, thanks to the local MCP server it runs.

Even without MCP, the conversational loop is fast enough that you're not sitting around waiting for a response, especially on a cloud backend or with a reasonably powerful local model. You could set it up as a kitchen assistant, a storyteller for kids, or a desk companion you talk to while you work. The interaction feels weirdly natural, and I haven't used a voice assistant that was this conversational or this quick to respond before. Which is funny, because a natural but quick conversation is exactly what you want from a voice assistant.

The self-hosted route is where it gets interesting if you're already running a home server. Once you've got the XiaoZhi server stack running, namely speech-to-text, a local LLM, and text-to-speech, you've got a fully private voice assistant that never phones home. It doesn't even need to contact the outside world. If you've already gone down the self-hosting rabbit hole with Home Assistant or similar platforms, adding XiaoZhi is a natural extension of that same thinking.

XiaoZhi is one of the best open-source voice assistant projects I've used

It just works out of the box

XiaoZhi caught my eye at Maker Faire because it felt like a finished product running on hardware that had no business pulling it off, and after spending two weeks porting it to the CrowPanel Advance, that impression has only gotten stronger. The ESP32-P4's hardware paired with a well-designed software stack means you get a voice assistant that's responsive, expressive, and actually useful, rather than a simple tech demo.

I think what sets XiaoZhi apart from other ESP32 voice assistant projects I've worked with is how complete it feels. Wake word detection, voice activity detection, emotional expression, MCP tool support, and a self-hosted option that works well out of the box are all here, and everything just works seamlessly. Elecrow now has its own port too, so the barrier to entry on this specific board is even lower than when I started.

If you don't have this particular board, that's fine too. XiaoZhi runs on plenty of cheaper ESP32-S3 boards, and while you won't get the same display real estate or audio quality, the core experience is the same. It has an active community, and it's one of the more impressive open-source projects I've seen for an embedded device. If you're looking for a voice assistant project that goes beyond the basics, XiaoZhi is worth your time setting up and getting to grips with.

URL: https://www.xda-developers.com/turned-esp32-display-voice-assistant-emotions/

I turned my ESP32 display into a voice assistant that reacts with emotions