There are countless options out there to build your own smart speaker, unshackling yourself from the likes of Google and Amazon, but it's hard to beat the hardware these companies use. They can essentially sell their speakers as loss leaders: they can afford to lose money on each sale because most consumers will then spend more on other products within their respective ecosystems. A self-hosted solution can be cheap, yet it's hard to meet that level of polish. That's where the ReSpeaker XMOS XVF3800 with the XIAO ESP32-S3 from Seeed Studio comes in.
To be clear, this is not a board that's cheaper than the complete packages offered by major voice-assistant makers, but it's close. It packs four microphones (compared to the six of the Echo), an onboard audio processor, and the XIAO ESP32-S3, plus a mute switch, a reset button, and 12 LEDs on the back that show the direction of incoming audio. It's feature-packed, and at $60, it gives me everything I need to replace my Amazon Echo for good.
ReSpeaker XVF3800 with XIAO ESP32
The ReSpeaker XVF3800 with the XIAO ESP32 is one of the best ways to get started with building your own voice assistant. There's a ready-to-go ESPHome configuration, and a lot of documentation to get you started with building your own processing pipeline, too.
About this article: Seeed Studio sent us the ReSpeaker XVF3800 with the XIAO ESP32-S3 for the purposes of this article. The company had no input into its contents.
The ReSpeaker XVF3800 has incredible hardware
It's packed to the gills with good components
It may seem like a simple board, but it's not all too dissimilar to what you would find inside today's heavyweights of the voice assistant world. You have an audio processor, four microphones spread evenly across the edges, a Wi-Fi antenna (which the ESP32 connects to), and two options for audio out: a 3.5mm jack or a JST PH 2.0 connector for a speaker.
However, the star of the show is the ESP32-S3, and it's what enables true customization of this board. The standard variant of the XVF3800 (that is, without the ESP32-S3) can be used with a computer over USB as a standard microphone out of the box, but that's about it. Any customization or testing relies on either bringing your own microcontroller and wiring it up or processing the microphone using a computer that's connected to it via USB.
The ESP32-S3 comes pre-soldered, and the specific variant used here has a 240MHz Xtensa 32-bit LX7 dual-core processor, 8MB of PSRAM, and supports both Wi-Fi (2.4 GHz only) and BLE 5.0. It's one of the most powerful ESP32 variants out there, and with the exposed GPIO pins on the board, you could tack on additional sensors if you wanted.
When configuring the XVF3800 to use the onboard ESP32, you'll likely need to flash it using DFU. The firmware comes in two variants: one for USB and one for Inter-IC Sound, otherwise known as I2S. Only one can be active at a time, and out of the box, the board is flashed to use USB mode. Once I2S is enabled, data transfer over the board's USB mode is entirely disabled, so if you want to switch back to USB mode, you'll need to hold the mute button down while plugging it into your computer. This will put it in safe mode, where you can then flash it back to USB mode.
The XVF3800 chip has a bunch of audio-related features that make it a marked improvement over the ReSpeaker Lite, as good as that board is. It has echo cancellation, beamformers to track audio sources, noise suppression, and automatic gain control, all of which improve speech recognition and audio clarity.
In terms of hardware, this is one of the best turnkey solutions to a self-hosted voice assistant you can get. As we'll get into, with ESPHome, you can deploy a Home Assistant-based voice assistant in a matter of minutes, hook up a speaker to it using the 3.5mm jack, and it'll start working immediately. It's on the pricier side, but you control every aspect, making it a worthwhile investment for your smart home.
One thing I do have to wonder about the design is the fact that it's shaped in a way that would suggest you could drop the PCB into an existing Amazon Echo, but sadly, you can't. The screws don't line up, so it seems to be shaped this way to simply facilitate building your own, similar device. It's not even a criticism because it was never positioned as a drop-in replacement; it just would have been cool to see.
Setting up the XVF3800 with Home Assistant
And using PlatformIO
The documentation, like that of other Seeed Studio products, is quite well done. You'll find code examples and full, deployable projects, so that you can get a feel for how the hardware works. In this case, one of the fully developed projects you can deploy is a port of the voice assistant developed for the ReSpeaker Lite, which is itself a port of the Home Assistant Voice Preview Edition firmware. In just a few minutes, I had a fully working voice satellite in Home Assistant, using the XVF3800.
This works great, and with an audio output going to a speaker, it's an Assistant that works across the room, which I've tested and found to be quite consistent when paired with my own home voice pipeline using Whisper and a local LLM. Timers are exposed to Home Assistant so that they can be displayed on other devices, and everything just... works. It's fantastic.
I tried to deploy something more interesting, but I failed to get anything to work while developing in PlatformIO. I wanted to display an audio spectrum visualizer on my PC, using the XVF3800 as a microphone, but I was unable to get it working. In fact, I was also unable to get the MQTT example working, found in the Seeed Studio documentation. I had thought it was because I had used the firmware specifically built for ESPHome at first, but flashing the original I2S firmware did nothing, either.
With the sample code, while it would write to my MQTT server, it was filled with 0 bytes after the header. I am unsure what the cause of this is, as the pins match those found in the ESPHome example, which does work, and given I'm not familiar with the Seeed Studio implementation, I was at a loss when it came to debugging. I then tried to tweak the example to stream to a basic Python server I built on my PC, but it only sent FF values, so somewhere it can't make the connection between the ESP32 and the microphones to pull data correctly.
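For anyone chasing a similar problem, a small check on the receiving side can at least confirm whether a payload is all 0x00, all 0xFF, or real audio. The following is a minimal sketch, not part of the Seeed Studio example: the function name `summarize_pcm`, the `header_len` parameter, the near-silent threshold, and the assumption that the stream carries 16-bit little-endian PCM are all hypothetical and would need to be matched to the actual firmware.

```python
import struct


def summarize_pcm(payload: bytes, header_len: int = 0) -> dict:
    """Summarize bytes assumed to hold 16-bit little-endian PCM after a header.

    header_len is a hypothetical parameter: set it to however many bytes
    your own sender prepends before the audio samples.
    """
    body = payload[header_len:]
    # Drop a trailing odd byte so the body unpacks cleanly into 16-bit samples.
    body = body[: len(body) - (len(body) % 2)]
    if not body:
        return {"samples": 0, "peak": 0, "verdict": "empty"}

    samples = struct.unpack(f"<{len(body) // 2}h", body)
    peak = max(abs(s) for s in samples)
    distinct = set(body)  # the set of distinct byte values in the body

    if distinct == {0x00}:
        verdict = "all 0x00"
    elif distinct == {0xFF}:
        verdict = "all 0xFF"
    elif peak < 16:  # arbitrary threshold chosen for this sketch
        verdict = "near-silent"
    else:
        verdict = "has signal"

    return {"samples": len(samples), "peak": peak, "verdict": verdict}
```

Calling `summarize_pcm(data, header_len=n)` on each MQTT message or socket read, and logging the verdict, makes it quick to see whether a wiring or configuration change altered what the ESP32 is actually sending.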
I'm hopeful I can figure something out and cover it in a future article, as it's clearly possible to interface with it, given that the ESPHome example works. As for why the official example doesn't work, I'm still not sure, despite spending a lot of time trying to figure it out. I do have some concerns about whether the ESPHome configuration will be kept up to date, but for the more technically adept, it should be trivial to port over new features from the Home Assistant Voice Preview Edition as they arrive. I also like that you can control the LEDs if you wish to override their default behavior.
Moving to audio quality, as you can hear from the above clip, recorded using the XVF3800 in USB mode on my Windows PC, it sounds pretty decent. It's not an incredible microphone by any means, but I can be understood, and the "voice focus" feature that Windows surfaces for it doesn't actually change how the sound comes through.
You can imagine any decent speech-to-text transcription model will be more than capable of transcribing my speech, and when moving five and even ten meters away, it still sounds clear, just quieter. This explains why it works so well with ESPHome, as it can clearly and easily understand my speech.
The ReSpeaker XVF3800 is a great way to build your own voice assistant
Or just test your own audio pipelines
I've been playing around with a lot of hardware when it comes to building a custom voice assistant, and the ReSpeaker XVF3800 is one of the best that I've used. The microphones are great and work from across the room, and I've had no issues being heard and understood by my Whisper model when the audio is provided by this particular board.
The LEDs aren't the most useful feature for what I suspect is the majority of setups using this board, but they're cool and can show the direction the audio is coming from. I do find them pretty, and it's cool to see when it recognizes that I'm speaking. From my testing, the board seems to "learn" what the ambient sound is like, so only "new" noises (such as speaking) will cause it to light up and point in the direction the sound came from.
To give you an idea of this cancellation, when I first plug it in, the lights point in random directions as it appears to acclimatize to the environment. However, even now, as I type, they're pointing in the direction of my typing, but not in the direction of the fan that I have on the opposite side. I'm quite impressed by the accuracy, though I reckon it's not too difficult to get right. It's likely just a triangulation of the input volumes on each microphone, with the highest volume being the most likely to point to the source, but it's still pretty neat.
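To make that guess concrete, here is a toy sketch of a volume-based direction estimate. It is not the XVF3800's algorithm, which this article doesn't document. The microphone angles (four mics at 0, 90, 180, and 270 degrees), the mapping onto 12 LEDs, and every function name here are assumptions made purely for illustration.

```python
import math

# Assumed layout: four microphones spaced evenly around the board.
MIC_ANGLES_DEG = (0.0, 90.0, 180.0, 270.0)
LED_COUNT = 12  # the board has 12 LEDs; their angular mapping is assumed


def rms(samples):
    """Root-mean-square level of one channel's samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_direction(channels):
    """Return an angle in degrees pointing toward the loudest side, or None.

    Each microphone contributes a vector along its assumed angle, scaled by
    its RMS level. The angle of the summed vector is the estimate.
    """
    levels = [rms(ch) for ch in channels]
    total = sum(levels)
    x = sum(l * math.cos(math.radians(a)) for l, a in zip(levels, MIC_ANGLES_DEG))
    y = sum(l * math.sin(math.radians(a)) for l, a in zip(levels, MIC_ANGLES_DEG))
    # Silence, or equal levels on every mic, gives no usable direction.
    if total == 0 or math.hypot(x, y) / total < 1e-9:
        return None
    return math.degrees(math.atan2(y, x)) % 360.0


def led_for_angle(angle_deg):
    """Map an angle to the nearest of LED_COUNT evenly spaced LEDs."""
    step = 360.0 / LED_COUNT
    return int(round(angle_deg / step)) % LED_COUNT
```

Feeding it four short buffers where the second channel is much louder than the rest yields an angle close to 90 degrees, while equal levels on all four return `None`. A steady background source such as a fan would dominate this naive version, so ignoring it would need something extra, such as subtracting a running average of each microphone's level.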
As a result, if you're looking for the closest thing to an "Echo" killer that you'll find, this is likely it when paired with Home Assistant. There's an out-of-the-box-ready configuration you can deploy that will just work, and if that's all you want it for, you can get that set up in minutes. If you want to do your own development and testing, you can as well, and it's likely that I missed something when doing it myself. I'll likely build some weird project with this device in the future, so I'll update this article when I do to explain how I got around the problems I ran into with the example code.
If you want your own self-hostable voice assistant, and you want something with a bit more chops than the ReSpeaker Lite, this might just be what you're looking for.
