Deploying a 32B Deep Research Agent on an RTX 3090 Is Easier Than You Think
The AI landscape is buzzing with the term “agent.” All the major cloud platforms, from OpenAI to Google, are showcasing deep research agents that can browse the web, synthesize information, and answer complex questions. But as enthusiasts dedicated to local, on-premise AI, we often look at these cloud services and ask, “Can I build that myself?” I’m here to tell you that the answer is a resounding yes, and it doesn’t require a server farm in your garage.
In this guide, I’ll walk you through how I set up a powerful, 32-billion-parameter deep research agent on one of my favorite pieces of hardware for local LLM inference: the NVIDIA RTX 3090.
We will use Alibaba’s WebDancer framework, a capable open-source agent, to build a system that can perform genuine, multi-step research tasks right on your desktop.
Why the RTX 3090 is the Value Champion for Local AI Research
For those of us building systems on a budget, the price-to-performance ratio is king. In my experience, the second-hand market for the RTX 3090 currently presents the best value for local LLM work. With prices hovering around $750, you get 24 GB of GDDR6X VRAM.
This is the critical metric. That 24 GB capacity is the sweet spot that lets us load a 32-billion-parameter model quantized to 4-bits, which is precisely what we’ll be doing.
The memory bandwidth is also substantial enough to deliver a very comfortable token-per-second generation speed. It’s a workhorse card that opens the door to models that were, until recently, out of reach for most home setups.
Our Software Stack: WebDancer and Llama.cpp
To build our research agent, we need two key components. The first is the agent framework itself. We’ll be using Alibaba’s WebDancer, which comes with a specialized 32B model fine-tuned for tool use.
It’s a ReAct-based agent, meaning it can reason, act, and observe, using tools like a web search and a page scraper to gather information.
While other local agent frameworks exist, like QX Lab’s Agentic Deep Research or Jina AI’s DeepResearch, I’ve found WebDancer to be a well-integrated and effective starting point.
For the inference backend – the engine that actually runs the LLM – I’m using llama.cpp. I prefer it over alternatives like Ollama for a task like this because it offers granular control. We can specify exactly how to load the model, which GPU layers to use, and how to expose the server, which is perfect for this kind of project.
Step 1: Setting Up the Llama.cpp Inference Server
Before we can do anything, we need the LLM running and waiting for instructions. For this guide, my system is running Ubuntu 24.04.2 LTS with CUDA 12.6 and the NVIDIA 565.77 driver.
First, you need to clone the llama.cpp repository and build it with CUDA support. Open your terminal and run the following commands. Make sure you have the NVIDIA CUDA Toolkit installed for your system.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Now, we’ll compile the project. The -DGGML_CUDA=ON flag tells it to build with NVIDIA GPU support. I’m using 6 cores for the build (-j 6), but you can adjust this based on your CPU.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 6
With llama.cpp built, we need the model. I’m using the 4-bit quantized GGUF version of the WebDancer model, which is 20 GB. This size fits into the RTX 3090’s 24 GB of VRAM, leaving us about 3 GB of overhead for the context window, which can hold around 10,000 tokens in this case. Navigate to the models directory inside your llama.cpp folder and download it.
cd models
wget https://huggingface.co/DevQuasar/Alibaba-NLP.WebDancer-32B-GGUF/resolve/main/Alibaba-NLP.WebDancer-32B.Q4_K_M.gguf
Now, let’s start the server. Navigate into the build/bin directory where the executables are. We will launch the server, telling it to use our downloaded model, listen on port 8004 (the default for WebDancer), use a 10k context window, and offload all possible layers to the GPU (--n-gpu-layers 99).
cd ../build/bin
./llama-server \
--model ../../models/Alibaba-NLP.WebDancer-32B.Q4_K_M.gguf \
--port 8004 \
--ctx-size 10240 \
--n-gpu-layers 99
If successful, your terminal will indicate that the server is running and listening for requests on port 8004. Your LLM is now live.
Step 2: Installing and Configuring the WebDancer Agent
With the model’s “brain” running, we now need to set up the agent’s “body” that can interact with it and the web. In a new terminal window, clone the WebAgent repository from Alibaba.
git clone https://github.com/Alibaba-NLP/WebAgent
Navigate into the WebDancer directory and create a dedicated Python virtual environment. I’m using Python 3.12 here.
cd WebAgent/WebDancer
python3.12 -m venv webdancer
source webdancer/bin/activate
With the environment activated, install the required Python packages.
pip install -r requirements.txt
WebDancer needs API keys for its tools. For the search tool, I use Serper, which provides a simple and cheap (with free credits) Google Search API. For the “visit” tool, which scrapes web page content, I use Jina AI’s Reader API. You will need to register for a free account on both serper.dev and jina.ai to get your API keys.
👁 screenshot of webdancer run_demo script withthe api keys
Once you have your keys, you need to provide them to WebDancer. Edit the run_demo.sh script located in the /WebAgent/WebDancer/scripts/ directory.
nano /WebAgent/WebDancer/scripts/run_demo.sh
Find the following lines and replace the placeholders with your actual keys.
# GOOGLE_SEARCH_KEY
export GOOGLE_SEARCH_KEY='your_serper_api_key_here'
# JINA
export JINA_API_KEY='Authorization: Bearer your_jina_api_key_here'
Save and exit the editor. Now, you are ready to run the agent. From that same scripts directory, execute the shell script.
bash run_demo.sh
This will start the Gradio web interface and print a local URL to your console. Open that URL in your browser to start interacting with your own deep research agent.
Putting the Agent to the Test
To see how it performed, I gave it a task from a recent OSINT challenge I was working on. The query was:
“Determine the specific air base in the Middle East where the A-10 Thunderbolt II aircraft bearing tail number 20656 was stationed during Operation Inherent Resolve in the year 2017.“
The agent began its work, showing its thought process. It initiated a web search, analyzed the results, and performed a follow-up search to zero in on the answer.
WebDancer agent web interface output during inference
Within a couple of steps, it correctly identified the Incirlik Air Base in Turkey. This is a simple demonstration of an agent using tools to solve a multi-step problem.
The official authors report impressive scores of 61.1% on the GAIA benchmark and 54.6% on WebWalkerQA, indicating its proficiency on more complex tasks.
Hardware Alternatives for Your Research Rig
While I find the RTX 3090 to be the value sweet spot, you have other options.
Another compelling route, especially if your primary goal is maximizing VRAM, is a multi-GPU setup using two RTX 5060 Ti 16GB cards. This configuration gives you a combined total of 32 GB of VRAM, a significant step up from the 3090’s 24 GB. This extra capacity is ideal for expanding the context window of the WebDancer model significantly or for fitting larger quantized models that are just out of reach for a single 24 GB card. The crucial trade-off, however, is memory bandwidth.
The RTX 5060 Ti 16GB is slower, and this will translate to lower token generation speeds compared to the 3090. You are essentially trading raw inference speed for VRAM capacity, which can be a worthwhile compromise depending on your specific research needs and the complexity of the tasks you run.
On the other end of the spectrum is the RTX 4090. It has the same 24 GB of VRAM as the 3090 but boasts significantly higher memory bandwidth, leading to faster token generation. It is, however, substantially more expensive and might be overkill unless you absolutely need the highest possible speed.
Conclusion: Capable Local AI Research is Within Reach
Setting up a deep research agent that runs entirely on local hardware is no longer a pipe dream. As this guide demonstrates, a well-chosen piece of second-hand hardware like the RTX 3090, combined with powerful open-source software like Llama.cpp and WebDancer, creates an incredibly capable system. You get a private, customizable, and powerful research assistant without paying for cloud subscriptions. The ability to assemble these components, troubleshoot the setup, and watch the agent execute complex tasks on your own machine is, for a hardware enthusiast, a reward in itself.
Allan Witt
<p>Allan Witt is the co-founder and Editor-in-Chief of Hardware-Corner.net. Computers and the web have fascinated him since childhood. In 2011, he began training as an IT specialist at a mid-sized company while launching a tech blog on the side—quickly discovering a passion for writing about hardware and technology.</p> <p>After completing his training, Allan worked as a system administrator for two years. Alongside that, he started building and upgrading custom gaming PCs at a local hardware shop. What began as a part-time project grew into a full-time career. Today, his work also focuses on building and optimizing PC systems for local AI and LLM workloads, combining hands-on experience with a passion for making complex tech easy to understand.</p>0 Comments
Related
Desktops
Dell refurbished desktop computers
If you are looking to buy a certified refurbished Dell desktop computer, this article will help you …
Guides
Dell Outlet and Dell Refurbished Guide
For cheap refurbished desktops, laptops, and workstations made by Dell, you have the option to use …
Guides
Refurbished, Renewed, Off Lease
When you are looking for refurbished computer, you often see – certified, renewed, and off-lease placed in …
Laptops
Excelent Refurbished ZenBook Laptops
If you are looking for a compact ultrabook and a reasonable price, consider a refurbished Asus Zenbook …
