We've covered various Amazon Echo replacements and Google Nest alternatives built on a litany of ESP32-powered devices, and it's partly the relative simplicity of pairing a voice assistant with Home Assistant that makes them possible in the first place. Anything with a screen is a little more complicated, though, so I set out to build a Google Home Hub replacement using a 7-inch ESP32-powered display running ESPHome.
For this project, I'm using the Elecrow CrowPanel Advance 7-inch HMI ESP32 AI Display, and it has a number of features that make it perfect to use this way. It has an 800x480 IPS touchscreen panel, supports the addition of add-on modules for LoRa, Zigbee, Thread, Wi-Fi 6, or 2.4GHz radio, and it has a built-in microphone. There's a speaker output, a battery connector, a real-time clock, an SD card slot, and multiple UART pinouts for connecting your own modules. It's got a lot packed into one board, and the ESP32-S3 N16R8 is at the heart of it.
If you just want to see the results, you can check out my GitHub repository containing all of the code to make this work, which I ported from the ESP32-S3-Box-3. I now have a working voice assistant with a screen that I can build upon, adding features over time until it's capable of everything the Google Home Hub can do, and so much more.
Elecrow CrowPanel Advance 7.0" HMI ESP32 AI Display
The Elecrow CrowPanel Advance 7-inch is an ESP32-S3-powered display with an 800x480 resolution. The "AI" naming comes from the fact that it packs an on-board microphone and external speaker support, though it does not contain a built-in NPU or any AI-related hardware features.
About this article: Elecrow sent us the CrowPanel Advance 7-inch panel for the purposes of testing. The company had no input into the contents of this article.
Elecrow CrowPanel Advance 7-inch HMI ESP32 AI Display hardware
Making incredible use of the ESP32's limited pin availability
First and foremost, let's get something out of the way: this particular device does not have any "AI" features whatsoever, despite what the name would imply. I had initially expected it to have some kind of NPU on board to offload AI processing, but the "AI" naming appears to come from the fact that it has a built-in microphone and speaker output. "AI" isn't just a microphone and speaker, so I don't really understand why it's in the name.
With that out of the way, that's honestly nearly the end of everything negative that I have to say (more on that in a bit). The build quality is great, the screen is fantastic, and the complex solution to the limited number of pins offered by the ESP32 is about as good as it can be. Elecrow has published some fantastic documentation with code tutorials, datasheets, and manuals for every built-in module present. Those manuals ended up saving me quite a bit of trouble when it came to the microphone. It packs very good hardware for the price, and the way the company deals with pins is nothing short of miraculous.
That complex solution to the pin layout involves a pair of switches on the back that you can flip to enable specific functionality, as GPIO comes at a premium here. Your options are:
- Mic and speaker
- Wireless module
- TF card
- Microphone and TF card
Not being able to use all of these at once isn't Elecrow's fault, to be clear. The ESP32-S3 has 45 GPIO pins, but only 36 are actually usable on the ESP32-S3-WROOM-1-N16R8. Some pins control boot behavior, some (such as IO35, IO36, and IO37) are used for Octal SPI PSRAM, and others are used for the display and touchscreen, which are, obviously, essential components for this particular device.
Accounting for PSRAM, we're left with 33 usable pins, and the display's RGB565 interface alone uses 20 of them (16 data lines plus sync and clock signals). That takes us from 45 GPIO pins down to just 13 for everything else. Once you count the I2C bus, the touch controller, and the buzzer, there really isn't much left over.
If I had one criticism of the documentation, it's that it feels somewhat fragmented. While the GitHub repository from the company is well structured and thought out, it felt like every time I searched for something relating to this board, I found another page with even more information that wasn't linked on GitHub. For example, I discovered that the company has published code samples for working with the various add-on modules on this display, and they're linked in the description of a YouTube video published by Elecrow.
I have one more criticism of the documentation, though it also doubles as praise for its strength. Getting the microphone working was hard; I couldn't find code samples from Elecrow that demonstrate how to interface with it, but a combination of the GitHub repository, which provides the datasheet for the specific microphone module being used (the INMP441), and information on Elecrow's site gave me what I needed to make it work. The documentation doesn't appear to outline how the microphone works, unlike most of the other hardware features present, but it gives you everything you need to figure it out yourself.
Building a voice assistant with ESPHome
The ESP32-S3-Box-3 project proved invaluable
I've built plenty of voice assistants with ESPHome, and my thought process going into this was "how hard could it be to do it with a display?" It just needs to be basic to start with, and can be expanded over time. After all, the voice assistant is typically the most complex part, and there are plenty of code solutions out there that can be built upon and expanded.
As it turned out, though, it ended up being a bit more complicated than I had hoped. As already mentioned, the microphone was significantly harder to interface with than I had expected. I went through all of the documentation I could find, tried to dig up solutions that others might have implemented, and I got nothing. While people had used this display with ESPHome before (and Elecrow has published documentation for that, too), it seemed that nobody had even tried to use the microphone before.
I was facing two problems:
- How do I get the microphone working?
- How do I know if my voice pipeline is correct when I have no way of knowing if the microphone is working?
I could answer the second question fairly easily, and that was thanks to a udp_audio external component that I found on GitHub. This allowed me to stream the audio from the microphone to my PC using a basic Python server, and I could confirm that the audio was streaming to my PC, but microWakeWord still wasn't working. Even when I added a button to the screen that I could tap and hold to speak (as a simple way to test my pipeline), Whisper reported receiving empty audio. I had heard everything just fine when streaming to my PC, so I now knew that the microphone was working, and my voice pipeline was what didn't work.
That's when I had the idea: what if I retrieved the ESP32-S3-Box-3 ready-made project from ESPHome, which has a voice pipeline guaranteed to work, and I could instead focus on the hardware side of things and implement features as I needed? That's exactly what I did; I downloaded that project, made a new ESPHome project, and set to work porting all of my pins and hardware-specific features over. I used the same parameters I knew had worked when streaming with udp_audio, but I kept all of the basics in place from the ready-made project until I knew I had something functional.
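To give a sense of what the hardware-specific part of that port looks like, here is a minimal sketch of an I2S microphone definition in ESPHome. It is not the configuration from my repository: the `GPIOXX` pins are placeholders rather than the CrowPanel's real pin assignments, the IDs are hypothetical, and the channel and bit depth are assumptions that depend on how the microphone is wired.

```yaml
# Sketch only: GPIOXX values are placeholders, not the CrowPanel's real pins.
i2s_audio:
  - id: i2s_in                # hypothetical ID
    i2s_lrclk_pin: GPIOXX     # word select (WS)
    i2s_bclk_pin: GPIOXX      # bit clock (SCK)

microphone:
  - platform: i2s_audio
    id: panel_mic             # hypothetical ID
    i2s_audio_id: i2s_in
    i2s_din_pin: GPIOXX       # serial data (SD)
    adc_type: external        # a discrete I2S microphone, not the internal ADC
    pdm: false                # plain I2S rather than PDM
    channel: left             # assumption: depends on the mic's L/R pin wiring
    sample_rate: 16000        # assumption: a common rate for voice pipelines
    bits_per_sample: 32bit    # assumption: adjust to match the hardware
```

The useful part of this approach is that only a block like this needs to change between boards, while the voice pipeline from the ready-made project stays untouched.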
Lo and behold, it worked! While I still had no audio output (on account of my not having a speaker to test with), I could use microWakeWord, see that Whisper had transcribed my question, and the logs showed that the ESP32 could at least play back the audio. The display was incredibly laggy, though, and you could see the lines being drawn as images were rendered.
The next two parts of this project are where I've stopped for now, as the basics are pretty close to completion and are already in a usable state. The first part was to overhaul the rendering logic, as I had a feeling that ESPHome's graphics implementation was pretty intensive on top of driving a high-resolution panel like this, and the second was to set up a way to redirect playing audio to another device, so that I didn't need to wire up a speaker.
I started by replacing the entire graphics stack, removing all of the ESP32-S3-Box-3 display code and using the Light and Versatile Embedded Graphics Library, known as LVGL, instead. This meant removing all of the display rendering code, then carefully rewriting the same functionality using LVGL and updating all of the display update calls to show and hide LVGL widgets.
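As a rough illustration of the show-and-hide pattern, the fragment below toggles a single LVGL image widget from the voice assistant's triggers. It is a simplified sketch, not the code from my repository: every ID is hypothetical, `listening_image` is assumed to be defined elsewhere under ESPHome's `image:` component, and the rest of the `voice_assistant:` block is left out.

```yaml
# Sketch only: all IDs are hypothetical, and listening_image is assumed
# to be defined elsewhere under the image: component.
lvgl:
  pages:
    - id: assistant_page
      widgets:
        - image:
            id: listening_widget
            src: listening_image
            align: CENTER
            hidden: true        # start hidden; shown only while listening

voice_assistant:
  # microphone and other required options omitted for brevity
  on_listening:
    - lvgl.widget.show: listening_widget
  on_end:
    - lvgl.widget.hide: listening_widget
```

Instead of redrawing the whole screen for each state change, the widgets stay in memory and only their visibility changes.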
Once I had completed this, I flashed it to the panel, and much to my surprise (any programmers reading this, I'm sure you can relate), it just... worked. Display updates were near instantaneous, and while I figured that it would improve performance, the difference was genuinely night and day. There are some teething issues here to work out (timers still render strangely, responses are no longer drawn to the screen in text, and some images have the wrong background), but it's a much, much nicer experience.
The next part was to get around the fact that I didn't have a speaker. I still wanted to hear responses, and I remembered that the ReSpeaker Lite voice assistant project used an esphome.tts_uri event that contained the audio file of the response so that it could be redirected and played elsewhere. I have the Home Assistant Voice Preview Edition in my living room, which is connected to a speaker, so I can redirect the audio there.
Thankfully, adding this was very easy, and I was able to use the code from the ReSpeaker Lite to make this work on the Elecrow, too. With a basic script that sends the audio file details back to Home Assistant whenever it's about to play, we can pair it with an automation that reacts to the event and plays it on another device instead.
script:
  - id: send_tts_uri_event
    parameters:
      tts_uri: string
    then:
      - homeassistant.event:
          event: esphome.tts_uri
          data:
            uri: !lambda return tts_uri;
And here's the automation:
alias: Play TTS URI
description: ""
triggers:
  - trigger: event
    event_type: esphome.tts_uri
    event_data:
      device_id: edb7423f767d4c4b5706ed9cfa47a7d8
actions:
  - action: media_player.play_media
    data:
      announce: true
      media_content_type: music
      media_content_id: "{{ trigger.event.data.uri }}"
    target:
      device_id: 3b520dfbbac9b1bb6e05e51f8f8a2695
mode: single
Again, it just... worked. At this point, I had a fully visual voice assistant that could redirect audio playback to my speaker, all written in ESPHome, and capable of controlling the devices in my home just like a Google Home Hub can.
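For completeness, the `send_tts_uri_event` script has to be called from somewhere. One way to do that, sketched below, is the voice assistant's `on_tts_end` trigger, which exposes the URL of the generated audio as the variable `x`. Treat this as an assumption about the wiring rather than a copy of the ReSpeaker Lite code or of my repository; the rest of the `voice_assistant:` block is omitted.

```yaml
# Sketch only: one possible way to call the script defined earlier.
voice_assistant:
  # microphone and other required options omitted for brevity
  on_tts_end:
    - script.execute:
        id: send_tts_uri_event
        tts_uri: !lambda 'return x;'   # x holds the URL of the TTS audio
```

Because the event carries only a URL, the device that ends up playing it needs to be able to reach that address on the network.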
What's next?
Bug fixes and improvements
For this project, there are a few fairly simple tweaks and additions I can make, now that everything is working. With LVGL pages, it's trivial to add "idle" pages that contain basic controls for my home, show weather updates, or even show images. It can do a lot of what the Google Home Hub can do when paired with the many, many services that Home Assistant supports, and all you'll be limited by is RAM as you add more features.
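As an example of what such an idle page could look like, here is a minimal sketch of an LVGL page with one button that toggles a light through Home Assistant. The page and widget IDs and the `light.desk_lamp` entity are hypothetical, and it assumes a recent ESPHome release where `homeassistant.action` is available and the device has been allowed to perform Home Assistant actions.

```yaml
# Sketch only: IDs and the light.desk_lamp entity are hypothetical.
lvgl:
  pages:
    - id: idle_page
      widgets:
        - button:
            id: lamp_button
            align: CENTER
            width: 240
            height: 100
            widgets:
              - label:
                  align: CENTER
                  text: "Desk lamp"
            on_click:
              - homeassistant.action:
                  action: light.toggle
                  data:
                    entity_id: light.desk_lamp
```

Each additional page follows the same shape, so controls, weather, and images can be added one page at a time while keeping an eye on memory use.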
Right now, there are a few bugs and minor issues to iron out, like the images that have the wrong background and the timers that don't display correctly. Yet it's already a surprisingly decent experience, and I keep it perpetually plugged in at my desk, powered from my PC. I can use voice commands without needing to shout across the room to my Home Assistant Voice Preview Edition, but I can still hear the responses from the speaker that's over there. Of course, if you pick up one of these and just plug a speaker in, you could use that instead just fine.
Displays like these are incredibly powerful and versatile, and you don't need to turn them into voice assistants. You can add your own controls and sensors, and even build a display for a basic game, an RSS feed reader, a smart home dashboard, or just to play around with and learn LVGL.
