ElBruno.LocalLLMs.Rag 0.18.0

.NET 8.0

dotnet add package ElBruno.LocalLLMs.Rag --version 0.18.0

NuGet\Install-Package ElBruno.LocalLLMs.Rag -Version 0.18.0

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="ElBruno.LocalLLMs.Rag" Version="0.18.0" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package:

<PackageVersion Include="ElBruno.LocalLLMs.Rag" Version="0.18.0" />

Then reference the package, without a version, in the project file:

<PackageReference Include="ElBruno.LocalLLMs.Rag" />

paket add ElBruno.LocalLLMs.Rag --version 0.18.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: ElBruno.LocalLLMs.Rag, 0.18.0"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package ElBruno.LocalLLMs.Rag@0.18.0

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Install as a Cake Addin:

#addin nuget:?package=ElBruno.LocalLLMs.Rag&version=0.18.0

Install as a Cake Tool:

#tool nuget:?package=ElBruno.LocalLLMs.Rag&version=0.18.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

ElBruno.LocalLLMs


Run local LLMs in .NET through IChatClient 🧠

Run local LLMs in .NET through IChatClient — the same interface you'd use for Azure OpenAI, Ollama, or any other provider. Powered by ONNX Runtime GenAI and BitNet.

What's New

✅ Gemma 4 support path is now active (E2B, E4B, 12B Unified, 26B-A4B, 31B) via conversion workflows.
🔄 Gemma 4 status moved from pending to convert across the model tables and guides.
⬆️ ONNX Runtime GenAI upgraded to 0.14.1 across library, tests, samples, and benchmarks.

Features

🔌 IChatClient implementation — seamless integration with Microsoft.Extensions.AI
📦 Automatic model download — models are fetched from HuggingFace on first use
🚀 Zero friction — works out of the box with sensible defaults (Phi-3.5 mini)
🖥️ Multi-hardware — CPU, CUDA, and DirectML execution providers
💉 DI-friendly — register with AddLocalLLMs() or AddBitNetChatClient() in ASP.NET Core
🔄 Streaming — token-by-token streaming via GetStreamingResponseAsync
📊 Multi-model — switch between Phi-3.5, Phi-4, Qwen2.5, Llama 3.2, and more
🎯 Fine-tuned models: pre-trained Qwen2.5 variants for tool calling and RAG
⚡ BitNet support: run 1.58-bit ternary models via bitnet.cpp with extreme efficiency

Packages

Package	Description
`ElBruno.LocalLLMs`	Core library: ONNX Runtime GenAI models via IChatClient
`ElBruno.LocalLLMs.Rag`	RAG pipeline: document chunking, indexing, retrieval
`ElBruno.LocalLLMs.BitNet`	BitNet 1.58-bit models via bitnet.cpp + IChatClient

Installation

dotnet add package ElBruno.LocalLLMs

Then add one runtime package depending on your target hardware:

# 🖥️ CPU (works everywhere — required for CPU-only apps):
dotnet add package Microsoft.ML.OnnxRuntimeGenAI

# 🟢 NVIDIA GPU (CUDA):
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda

# 🔵 Any Windows GPU — AMD, Intel, NVIDIA (DirectML):
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML

⚠️ Add exactly one runtime package. Do not reference both Microsoft.ML.OnnxRuntimeGenAI and Microsoft.ML.OnnxRuntimeGenAI.Cuda simultaneously — the native binaries conflict and GPU support will silently fail.

🚀 The library defaults to ExecutionProvider.Auto — it tries GPU first and falls back to CPU automatically. No code changes needed.

Quick Start

using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

// Create a local chat client (downloads Phi-3.5 mini on first run)
using var client = await LocalChatClient.CreateAsync();

var response = await client.GetResponseAsync([
    new(ChatRole.User, "What is the capital of France?")
]);

Console.WriteLine(response.Text);

First Run

The first time you create a LocalChatClient, the model is downloaded from HuggingFace to your local cache directory (~2-4 GB). This typically takes 30-60 seconds on a fast connection and longer on slower ones.

Track download progress:

using var client = await LocalChatClient.CreateAsync(
    new LocalLLMsOptions { Model = KnownModels.Phi35MiniInstruct },
    progress: new Progress<ModelDownloadProgress>(p =>
    {
        // Multiply by 100.0 so the division is floating-point, and guard against a zero total size
        var percent = p.TotalBytes > 0 ? p.BytesDownloaded * 100.0 / p.TotalBytes : 0.0;
        Console.WriteLine($"{p.FileName}: {percent:F1}%");
    })
);

Subsequent runs skip the download and load the model from the local cache (%LOCALAPPDATA%/ElBruno/LocalLLMs/models).
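To check whether a model is already cached, or how much disk space the cache uses, a small helper along these lines may be useful. This is a sketch only: `ModelCache`, `DefaultDirectory`, and `SizeInBytes` are hypothetical names, not part of the library, and it assumes the cache sits under the platform's local application data folder, mirroring the path quoted above.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of ElBruno.LocalLLMs.
public static class ModelCache
{
    // Assumption: the cache lives under LocalApplicationData, matching the
    // %LOCALAPPDATA%/ElBruno/LocalLLMs/models path mentioned in the text.
    public static string DefaultDirectory() =>
        Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
            "ElBruno", "LocalLLMs", "models");

    // Total size in bytes of all files under the directory; 0 if it does not exist.
    public static long SizeInBytes(string directory) =>
        Directory.Exists(directory)
            ? Directory.EnumerateFiles(directory, "*", SearchOption.AllDirectories)
                       .Sum(f => new FileInfo(f).Length)
            : 0L;
}
```

Calling `ModelCache.SizeInBytes(ModelCache.DefaultDirectory())` before creating the client gives a rough indication of whether a first-run download is still pending.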

Skip auto-download if using a pre-downloaded model:

var options = new LocalLLMsOptions
{
    Model = KnownModels.Phi35MiniInstruct,
    ModelPath = "/path/to/local/model",
    EnsureModelDownloaded = false
};
using var client = await LocalChatClient.CreateAsync(options);

Streaming

using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

using var client = await LocalChatClient.CreateAsync(new LocalLLMsOptions
{
    Model = KnownModels.Phi35MiniInstruct
});

await foreach (var update in client.GetStreamingResponseAsync([
    new(ChatRole.System, "You are a helpful assistant."),
    new(ChatRole.User, "Explain quantum computing in simple terms.")
]))
{
    Console.Write(update.Text);
}

GPU Acceleration

By default, ExecutionProvider.Auto tries GPU first (CUDA → DirectML) and falls back to CPU automatically. To force a specific provider or device, set it explicitly:

// Use an explicit GPU provider (fails if CUDA is not installed; use Auto to fall back to CPU)
var options = new LocalLLMsOptions
{
    ExecutionProvider = ExecutionProvider.Cuda
};

// Multi-GPU systems: select the device ID
var options2 = new LocalLLMsOptions
{
    ExecutionProvider = ExecutionProvider.Cuda,
    GpuDeviceId = 1 // Use the second GPU
};

Auto fallback behavior:

CUDA available → uses NVIDIA GPU
CUDA unavailable, DirectML available → uses AMD/Intel Arc GPU
GPU unavailable → falls back to CPU (no errors, just slower)
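The fallback order listed above can be read as a simple priority chain. The sketch below only illustrates that ordering; `ResolvedProvider` and `ProviderFallback` are hypothetical names, and the availability flags stand in for whatever detection the library performs internally, which is not shown in this README.

```csharp
using System;

// Hypothetical names for illustration; not the library's ExecutionProvider enum.
public enum ResolvedProvider { Cuda, DirectML, Cpu }

public static class ProviderFallback
{
    // Mirrors the documented order: CUDA first, then DirectML, then CPU.
    public static ResolvedProvider Resolve(bool cudaAvailable, bool directMlAvailable) =>
        cudaAvailable ? ResolvedProvider.Cuda
        : directMlAvailable ? ResolvedProvider.DirectML
        : ResolvedProvider.Cpu;
}
```

Note that CUDA wins whenever it is available, even if DirectML is also present.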

See the project documentation for debugging GPU issues.

Model Metadata

Inspect model capabilities at runtime — context window size, model name, and vocabulary:

using var client = await LocalChatClient.CreateAsync();

var metadata = client.ModelInfo;
Console.WriteLine($"Model: {metadata?.ModelName}");
Console.WriteLine($"Context window: {metadata?.MaxSequenceLength}");
Console.WriteLine($"Vocab size: {metadata?.VocabSize}");

This is useful for prompt-length validation, adaptive chunking, and model selection logic.
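As an illustration of prompt-length validation, the sketch below checks a prompt against a context window before sending it. `PromptBudget` is a hypothetical helper, and the 4-characters-per-token estimate is a rough assumption rather than the model's real tokenizer, so treat the result as a conservative pre-check only.

```csharp
using System;

// Hypothetical helper, not part of ElBruno.LocalLLMs.
public static class PromptBudget
{
    // Rough heuristic (an assumption, not the model's tokenizer): about 4 characters per token, rounded up.
    public static int EstimateTokens(string text) => (text.Length + 3) / 4;

    // True if the estimated prompt tokens plus the tokens reserved for the reply fit in the context window.
    public static bool Fits(string prompt, int maxSequenceLength, int reservedOutputTokens)
    {
        if (maxSequenceLength <= 0) throw new ArgumentOutOfRangeException(nameof(maxSequenceLength));
        if (reservedOutputTokens < 0) throw new ArgumentOutOfRangeException(nameof(reservedOutputTokens));
        return EstimateTokens(prompt) + reservedOutputTokens <= maxSequenceLength;
    }
}
```

With the metadata shown above, a call might look like `PromptBudget.Fits(prompt, contextWindow, 256)`, where `contextWindow` is the value read from `MaxSequenceLength` (assumed here to be an integer token count).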

Dependency Injection

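builder.Services.AddLocalLLMs(options =>
{
    options.Model = KnownModels.Phi35MiniInstruct;
    options.ExecutionProvider = ExecutionProvider.DirectML;
});

// Inject IChatClient anywhere
public class MyService(IChatClient chatClient)
{
    // Sends a single user message and returns the model's text reply
    public async Task<string> AskAsync(string question)
    {
        var response = await chatClient.GetResponseAsync(question);
        return response.Text;
    }
}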

Error Handling

The library provides structured exception types for graceful error handling:

using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

try
{
    using var client = await LocalChatClient.CreateAsync();
    var response = await client.GetResponseAsync([
        new(ChatRole.User, "Your question here")
    ]);
}
catch (ExecutionProviderException ex)
{
    // GPU/provider-specific error (no CUDA, DirectML not available, etc.)
    Console.WriteLine($"Provider error: {ex.Message}");
}
catch (ModelCapacityExceededException ex)
{
    // Prompt/response too long for model's context window
    Console.WriteLine($"Capacity error: {ex.Message}");
    // Solution: use a larger model or truncate the prompt
}
catch (InvalidOperationException ex)
{
    // General operation error (model not found, download failed, etc.)
    Console.WriteLine($"Operation error: {ex.Message}");
}

Troubleshooting

GPU not working? Use ExecutionProvider.Cpu explicitly.

Out of memory? Try a smaller model:

var options = new LocalLLMsOptions
{
    Model = KnownModels.Qwen25_05BInstruct // 0.5B instead of 3.8B
};

Model download fails?

Check your internet connection
For private HuggingFace models, set the HF_TOKEN environment variable

For detailed troubleshooting, see the project documentation.

Supported Models

Tier	Model	Parameters	ONNX	ID
⚪ Tiny	TinyLlama-1.1B-Chat	1.1B	✅ Native	`tinyllama-1.1b-chat`
⚪ Tiny	SmolLM2-1.7B-Instruct	1.7B	✅ Native	`smollm2-1.7b-instruct`
⚪ Tiny	Qwen2.5-0.5B-Instruct	0.5B	✅ Native	`qwen2.5-0.5b-instruct`
⚪ Tiny	Qwen2.5-1.5B-Instruct	1.5B	✅ Native	`qwen2.5-1.5b-instruct`
⚪ Tiny	Gemma-2B-IT	2B	✅ Native	`gemma-2b-it`
⚪ Tiny	Gemma-4-E2B-IT	5.1B (2B active)	🔄 Convert	`gemma-4-e2b-it`
⚪ Tiny	StableLM-2-1.6B-Chat	1.6B	🔄 Convert	`stablelm-2-1.6b-chat`
🟢 Small	Phi-3.5 mini instruct	3.8B	✅ Native	`phi-3.5-mini-instruct`
🟢 Small	Qwen2.5-3B-Instruct	3B	✅ Native	`qwen2.5-3b-instruct`
🟢 Small	Llama-3.2-3B-Instruct	3B	✅ Native	`llama-3.2-3b-instruct`
🟢 Small	Gemma-2-2B-IT	2B	✅ Native	`gemma-2-2b-it`
🟢 Small	Gemma-4-E4B-IT	8B (4B active)	🔄 Convert	`gemma-4-e4b-it`
🟡 Medium	Qwen2.5-7B-Instruct	7B	✅ Native	`qwen2.5-7b-instruct`
🟡 Medium	Llama-3.1-8B-Instruct	8B	✅ Native	`llama-3.1-8b-instruct`
🟡 Medium	Mistral-7B-Instruct-v0.3	7B	✅ Native	`mistral-7b-instruct-v0.3`
🟡 Medium	Gemma-2-9B-IT	9B	✅ Native	`gemma-2-9b-it`
🟡 Medium	Gemma-4-12B-IT	12B	🔄 Convert	`gemma-4-12b-it`
🟡 Medium	Phi-4	14B	✅ Native	`phi-4`
🟡 Medium	DeepSeek-R1-Distill-Qwen-14B	14B	✅ Native	`deepseek-r1-distill-qwen-14b`
🟡 Medium	Mistral-Small-24B-Instruct	24B	✅ Native	`mistral-small-24b-instruct`
🔴 Large	Qwen2.5-14B-Instruct	14B	✅ Native	`qwen2.5-14b-instruct`
🔴 Large	Qwen2.5-32B-Instruct	32B	✅ Native	`qwen2.5-32b-instruct`
🔴 Large	Llama-3.3-70B-Instruct	70B	✅ ONNX	`llama-3.3-70b-instruct`
🔴 Large	Mixtral-8x7B-Instruct-v0.1	8x7B	🔄 Convert	`mixtral-8x7b-instruct-v0.1`
🔴 Large	DeepSeek-R1-Distill-Llama-70B	70B	🔄 Convert	`deepseek-r1-distill-llama-70b`
🔴 Large	Command-R (35B)	35B	🔄 Convert	`command-r-35b`
🔴 Large	Gemma-4-26B-A4B-IT	25.2B (3.8B active)	🔄 Convert	`gemma-4-26b-a4b-it`
🔴 Large	Gemma-4-31B-IT	30.7B	🔄 Convert	`gemma-4-31b-it`

🔄 Convert = Use the conversion scripts in scripts/ to export ONNX locally before running the model.

Fine-Tuned Models

Pre-trained variants optimized for specific tasks. A fine-tuned 0.5B model often matches or exceeds a base 1.5B on its specialized task.

Model	Size	Task	HuggingFace ID
Qwen2.5-0.5B-ToolCalling	~1 GB	Tool/function calling	`elbruno/Qwen2.5-0.5B-LocalLLMs-ToolCalling`
Qwen2.5-0.5B-RAG	~1 GB	RAG with citations	`elbruno/Qwen2.5-0.5B-LocalLLMs-RAG`
Qwen2.5-0.5B-Instruct	~1 GB	General-purpose	`elbruno/Qwen2.5-0.5B-LocalLLMs-Instruct`

See the project documentation for detailed model cards, performance benchmarks, and selection guidance.

Samples

The repository includes samples covering:
Minimal console chat
Token-by-token streaming
Switch models at runtime
ASP.NET Core DI registration
Function calling and tool use
Fine-tuned model for improved tool calling
RAG pipeline with document retrieval
Zero-cloud RAG pipeline with real local embeddings and LLM inference
BitNet 1.58-bit model chat completion
Performance benchmark: BitNet vs ONNX models
Interactive console application

Requirements

.NET 8.0 or .NET 10.0
CPU (default), NVIDIA GPU (CUDA), or Windows GPU (DirectML)
~2-8 GB disk space per model (depending on size and quantization)

Building from Source

git clone https://github.com/elbruno/ElBruno.LocalLLMs.git
cd ElBruno.LocalLLMs
dotnet restore ElBruno.LocalLLMs.slnx
dotnet build ElBruno.LocalLLMs.slnx
dotnet test ElBruno.LocalLLMs.slnx --framework net8.0

Documentation

Installation, first steps, configuration
Full model reference with tiers, specs, decision tree
Setup and usage of 1.58-bit BitNet models
Design decisions and internal structure
Walkthrough of each sample application
How to run and interpret performance benchmarks
Using and training fine-tuned models
Converting HuggingFace models to ONNX format
NuGet package publishing with OIDC
How to contribute
Version history

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License; see the license file in the repository for details.

👋 About the Author

Made with ❤️ by Bruno Capuano (ElBruno)

📝 Blog: elbruno.com
📺 YouTube: youtube.com/elbruno
🔗 LinkedIn: linkedin.com/in/elbruno
𝕏 Twitter: twitter.com/elbruno
🎙️ Podcast: notienenombre.com

🙏 Acknowledgments

ONNX Runtime GenAI — inference engine
BitNet / bitnet.cpp — 1.58-bit ternary model inference
Microsoft.Extensions.AI — IChatClient interface
Hugging Face — model hosting and community

Product	Versions
.NET	Compatible: net8.0, net10.0. Computed: net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows.


Dependencies

net10.0
- ElBruno.LocalLLMs (>= 0.18.0)
- Microsoft.Data.Sqlite (>= 9.0.3)
- Microsoft.Extensions.AI.Abstractions (>= 10.4.0)
net8.0
- ElBruno.LocalLLMs (>= 0.18.0)
- Microsoft.Data.Sqlite (>= 9.0.3)
- Microsoft.Extensions.AI.Abstractions (>= 10.4.0)

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
0.18.0	95	6/5/2026
0.17.0	88	6/3/2026
0.16.0	106	4/17/2026
0.15.0	101	4/16/2026
0.11.0	116	4/4/2026

URL: https://www.nuget.org/packages/ElBruno.LocalLLMs.Rag
