dotnet add package ElBruno.LocalLLMs --version 0.18.0
NuGet\Install-Package ElBruno.LocalLLMs -Version 0.18.0
<PackageReference Include="ElBruno.LocalLLMs" Version="0.18.0" />
Directory.Packages.props: <PackageVersion Include="ElBruno.LocalLLMs" Version="0.18.0" />
Project file: <PackageReference Include="ElBruno.LocalLLMs" />
paket add ElBruno.LocalLLMs --version 0.18.0
#r "nuget: ElBruno.LocalLLMs, 0.18.0"
#:package ElBruno.LocalLLMs@0.18.0
Install as a Cake Addin: #addin nuget:?package=ElBruno.LocalLLMs&version=0.18.0
Install as a Cake Tool: #tool nuget:?package=ElBruno.LocalLLMs&version=0.18.0
Run local LLMs in .NET through IChatClient — the same interface you'd use for Azure OpenAI, Ollama, or any other provider. Powered by ONNX Runtime GenAI and BitNet.
- … E2B, E4B, 12B Unified, 26B-A4B, 31B) via conversion workflows.
- … 0.14.1 across library, tests, samples, and benchmarks.
- IChatClient implementation: seamless integration with Microsoft.Extensions.AI
- … AddLocalLLMs() or AddBitNetChatClient() in ASP.NET Core
- … GetStreamingResponseAsync

| Package | Description |
|---|---|
| ElBruno.LocalLLMs | Core library: ONNX Runtime GenAI models via IChatClient |
| ElBruno.LocalLLMs.Rag | RAG pipeline: document chunking, indexing, retrieval |
| ElBruno.LocalLLMs.BitNet | BitNet 1.58-bit models via bitnet.cpp + IChatClient |
dotnet add package ElBruno.LocalLLMs
Then add one runtime package depending on your target hardware:
# 🖥️ CPU (works everywhere — required for CPU-only apps):
dotnet add package Microsoft.ML.OnnxRuntimeGenAI
# 🟢 NVIDIA GPU (CUDA):
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda
# 🔵 Any Windows GPU — AMD, Intel, NVIDIA (DirectML):
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML
⚠️ Add exactly one runtime package. Do not reference both Microsoft.ML.OnnxRuntimeGenAI and Microsoft.ML.OnnxRuntimeGenAI.Cuda simultaneously: the native binaries conflict and GPU support will silently fail.
🚀 The library defaults to ExecutionProvider.Auto: it tries GPU first and falls back to CPU automatically. No code changes needed.
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;
// Create a local chat client (downloads Phi-3.5 mini on first run)
using var client = await LocalChatClient.CreateAsync();
var response = await client.GetResponseAsync([
new(ChatRole.User, "What is the capital of France?")
]);
Console.WriteLine(response.Text);
The first time you create a LocalChatClient, the model is downloaded from HuggingFace to your local cache directory (~2-4 GB). This typically takes 30-60 seconds depending on your internet connection.
Track download progress:
using var client = await LocalChatClient.CreateAsync(
new LocalLLMsOptions { Model = KnownModels.Phi35MiniInstruct },
progress: new Progress<ModelDownloadProgress>(p =>
{
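// Floating-point division keeps the fractional part; guard against an unknown (zero) total
var percent = p.TotalBytes > 0 ? p.BytesDownloaded * 100.0 / p.TotalBytes : 0.0;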
Console.WriteLine($"{p.FileName}: {percent:F1}%");
})
);
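The percentage shown in a progress callback can be isolated in a small helper. The sketch below is illustrative only: PercentComplete is a hypothetical function, not part of ElBruno.LocalLLMs, and it assumes the byte counts are 64-bit integers and that an unknown total is reported as zero.

```csharp
using System;

// Hypothetical helper (not part of ElBruno.LocalLLMs): converts byte counts to a percentage.
static double PercentComplete(long bytesDownloaded, long totalBytes)
{
    // ASSUMPTION: an unknown total size is reported as 0 (or a negative value).
    if (totalBytes <= 0) return 0.0;

    // Multiplying by 100.0 forces floating-point division, so the fraction is kept.
    double percent = bytesDownloaded * 100.0 / totalBytes;

    // Clamp in case the reported total is smaller than the bytes actually received.
    return Math.Clamp(percent, 0.0, 100.0);
}

Console.WriteLine($"{PercentComplete(512, 2048):F1}%");
```

Inside the callback, the helper would be called with p.BytesDownloaded and p.TotalBytes.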
Subsequent runs load instantly from cache (%LOCALAPPDATA%/ElBruno/LocalLLMs/models).
Skip auto-download if using a pre-downloaded model:
var options = new LocalLLMsOptions
{
Model = KnownModels.Phi35MiniInstruct,
ModelPath = "/path/to/local/model",
EnsureModelDownloaded = false
};
using var client = await LocalChatClient.CreateAsync(options);
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;
using var client = await LocalChatClient.CreateAsync(new LocalLLMsOptions
{
Model = KnownModels.Phi35MiniInstruct
});
await foreach (var update in client.GetStreamingResponseAsync([
new(ChatRole.System, "You are a helpful assistant."),
new(ChatRole.User, "Explain quantum computing in simple terms.")
]))
{
Console.Write(update.Text);
}
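When the full reply is needed after streaming (for logging or for appending to chat history), the chunks can be accumulated while they are printed. The following is a minimal sketch under stated assumptions: CollectAsync and FakeStream are hypothetical names, and the fake stream stands in for the update.Text values of a real model stream so the example runs without downloading a model.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper: forwards each chunk as it arrives and returns the full text.
static async Task<string> CollectAsync(IAsyncEnumerable<string> chunks, Action<string> onChunk)
{
    var buffer = new StringBuilder();
    await foreach (var chunk in chunks)
    {
        onChunk(chunk);        // e.g. write the chunk to the console immediately
        buffer.Append(chunk);  // keep it for the final result
    }
    return buffer.ToString();
}

// Stand-in for a model stream; a real one would yield update.Text for each update.
static async IAsyncEnumerable<string> FakeStream()
{
    foreach (var piece in new[] { "Qubits ", "can be ", "0 and 1." })
    {
        await Task.Yield();
        yield return piece;
    }
}

string full = await CollectAsync(FakeStream(), Console.Write);
Console.WriteLine();
Console.WriteLine($"Collected {full.Length} characters.");
```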
By default, ExecutionProvider.Auto tries GPU first (CUDA → DirectML) and falls back to CPU automatically:
// Use an explicit GPU provider (fails if CUDA is not installed; use Auto to fall back to CPU)
var options = new LocalLLMsOptions
{
ExecutionProvider = ExecutionProvider.Cuda
};
// Multi-GPU systems: select device ID
var options2 = new LocalLLMsOptions
{
ExecutionProvider = ExecutionProvider.Cuda,
GpuDeviceId = 1 // Use second GPU
};
Auto fallback behavior: CUDA is tried first, then DirectML, then CPU.
See for debugging GPU issues.
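The fallback idea can be illustrated with a generic "first provider that initializes wins" loop. This is a sketch of the general technique, not the library's actual implementation: CreateWithFallbackAsync is a hypothetical helper, and providers are simulated with strings so the example runs on any machine.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper: tries each candidate in order and returns the first that initializes.
static async Task<(TProvider Provider, TClient Client)> CreateWithFallbackAsync<TProvider, TClient>(
    IReadOnlyList<TProvider> candidates, Func<TProvider, Task<TClient>> factory)
{
    var failures = new List<Exception>();
    foreach (var provider in candidates)
    {
        try
        {
            return (provider, await factory(provider));
        }
        catch (Exception ex)
        {
            // ASSUMPTION: with the real library one would likely catch only
            // ExecutionProviderException here; this sketch catches everything.
            failures.Add(ex);
        }
    }
    throw new AggregateException("No execution provider could be initialized.", failures);
}

// Simulated machine without a GPU: only "Cpu" succeeds.
var (chosen, _) = await CreateWithFallbackAsync<string, string>(
    new[] { "Cuda", "DirectML", "Cpu" },
    name => name == "Cpu"
        ? Task.FromResult($"client on {name}")
        : Task.FromException<string>(new InvalidOperationException($"{name} not available")));

Console.WriteLine(chosen); // prints "Cpu"
```

Collecting the failures rather than discarding them makes it possible to report why each GPU provider was skipped.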
Inspect model capabilities at runtime — context window size, model name, and vocabulary:
using var client = await LocalChatClient.CreateAsync();
var metadata = client.ModelInfo;
Console.WriteLine($"Model: {metadata?.ModelName}");
Console.WriteLine($"Context window: {metadata?.MaxSequenceLength}");
Console.WriteLine($"Vocab size: {metadata?.VocabSize}");
This is useful for prompt-length validation, adaptive chunking, and model selection logic.
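As an illustration of prompt-length validation, the sketch below checks whether a prompt plus a reply budget fits within a context window. EstimateTokens and FitsContext are hypothetical helpers, and the four-characters-per-token ratio is a rough assumption for English text; a real tokenizer will give different counts.

```csharp
using System;

// ASSUMPTION: roughly 4 characters per token for English text (rounded up).
static int EstimateTokens(string text) => (text.Length + 3) / 4;

// Hypothetical check: does the prompt plus the reply budget fit in the context window?
// With the library, contextWindow could come from client.ModelInfo?.MaxSequenceLength.
static bool FitsContext(string text, int maxNewTokens, int contextWindow) =>
    EstimateTokens(text) + maxNewTokens <= contextWindow;

string longPrompt = new string('a', 8000);                // about 2000 estimated tokens
Console.WriteLine(FitsContext(longPrompt, 512, 4096));    // True
Console.WriteLine(FitsContext(longPrompt, 512, 2048));    // False
```

When the check fails, the caller can truncate or chunk the input before sending it to the model.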
builder.Services.AddLocalLLMs(options =>
{
options.Model = KnownModels.Phi35MiniInstruct;
options.ExecutionProvider = ExecutionProvider.DirectML;
});
// Inject IChatClient anywhere
public class MyService(IChatClient chatClient) { ... }
The library provides structured exception types for graceful error handling:
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;
try
{
using var client = await LocalChatClient.CreateAsync();
var response = await client.GetResponseAsync([
new(ChatRole.User, "Your question here")
]);
}
catch (ExecutionProviderException ex)
{
// GPU/provider-specific error (no CUDA, DirectML not available, etc.)
Console.WriteLine($"Provider error: {ex.Message}");
}
catch (ModelCapacityExceededException ex)
{
// Prompt/response too long for model's context window
Console.WriteLine($"Capacity error: {ex.Message}");
// Solution: use a larger model or truncate the prompt
}
catch (InvalidOperationException ex)
{
// General operation error (model not found, download failed, etc.)
Console.WriteLine($"Operation error: {ex.Message}");
}
GPU not working? Use ExecutionProvider.Cpu explicitly. See .
Out of memory? Try a smaller model:
var options = new LocalLLMsOptions
{
Model = KnownModels.Qwen25_05BInstruct // 0.5B instead of 3.8B
};
Model download fails?
- … HF_TOKEN environment variable

For detailed troubleshooting, see .
| Tier | Model | Parameters | ONNX | ID |
|---|---|---|---|---|
| ⚪ Tiny | TinyLlama-1.1B-Chat | 1.1B | ✅ Native | tinyllama-1.1b-chat |
| ⚪ Tiny | SmolLM2-1.7B-Instruct | 1.7B | ✅ Native | smollm2-1.7b-instruct |
| ⚪ Tiny | Qwen2.5-0.5B-Instruct | 0.5B | ✅ Native | qwen2.5-0.5b-instruct |
| ⚪ Tiny | Qwen2.5-1.5B-Instruct | 1.5B | ✅ Native | qwen2.5-1.5b-instruct |
| ⚪ Tiny | Gemma-2B-IT | 2B | ✅ Native | gemma-2b-it |
| ⚪ Tiny | Gemma-4-E2B-IT | 5.1B (2B active) | 🔄 Convert | gemma-4-e2b-it |
| ⚪ Tiny | StableLM-2-1.6B-Chat | 1.6B | 🔄 Convert | stablelm-2-1.6b-chat |
| 🟢 Small | Phi-3.5 mini instruct | 3.8B | ✅ Native | phi-3.5-mini-instruct |
| 🟢 Small | Qwen2.5-3B-Instruct | 3B | ✅ Native | qwen2.5-3b-instruct |
| 🟢 Small | Llama-3.2-3B-Instruct | 3B | ✅ Native | llama-3.2-3b-instruct |
| 🟢 Small | Gemma-2-2B-IT | 2B | ✅ Native | gemma-2-2b-it |
| 🟢 Small | Gemma-4-E4B-IT | 8B (4B active) | 🔄 Convert | gemma-4-e4b-it |
| 🟡 Medium | Qwen2.5-7B-Instruct | 7B | ✅ Native | qwen2.5-7b-instruct |
| 🟡 Medium | Llama-3.1-8B-Instruct | 8B | ✅ Native | llama-3.1-8b-instruct |
| 🟡 Medium | Mistral-7B-Instruct-v0.3 | 7B | ✅ Native | mistral-7b-instruct-v0.3 |
| 🟡 Medium | Gemma-2-9B-IT | 9B | ✅ Native | gemma-2-9b-it |
| 🟡 Medium | Gemma-4-12B-IT | 12B | 🔄 Convert | gemma-4-12b-it |
| 🟡 Medium | Phi-4 | 14B | ✅ Native | phi-4 |
| 🟡 Medium | DeepSeek-R1-Distill-Qwen-14B | 14B | ✅ Native | deepseek-r1-distill-qwen-14b |
| 🟡 Medium | Mistral-Small-24B-Instruct | 24B | ✅ Native | mistral-small-24b-instruct |
| 🔴 Large | Qwen2.5-14B-Instruct | 14B | ✅ Native | qwen2.5-14b-instruct |
| 🔴 Large | Qwen2.5-32B-Instruct | 32B | ✅ Native | qwen2.5-32b-instruct |
| 🔴 Large | Llama-3.3-70B-Instruct | 70B | ✅ ONNX | llama-3.3-70b-instruct |
| 🔴 Large | Mixtral-8x7B-Instruct-v0.1 | 8x7B | 🔄 Convert | mixtral-8x7b-instruct-v0.1 |
| 🔴 Large | DeepSeek-R1-Distill-Llama-70B | 70B | 🔄 Convert | deepseek-r1-distill-llama-70b |
| 🔴 Large | Command-R (35B) | 35B | 🔄 Convert | command-r-35b |
| 🔴 Large | Gemma-4-26B-A4B-IT | 25.2B (3.8B active) | 🔄 Convert | gemma-4-26b-a4b-it |
| 🔴 Large | Gemma-4-31B-IT | 30.7B | 🔄 Convert | gemma-4-31b-it |
🔄 Convert = Use the conversion scripts in scripts/ to export ONNX locally before running the model.
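Model selection by size can be sketched as a lookup over a few rows of the table above. The selector below is hypothetical and works only with the ID strings; how an ID maps to a KnownModels entry is not shown here, and the parameter budget is a stand-in for whatever memory limit applies on the target machine.

```csharp
using System;
using System.Linq;

// A few rows copied from the table above: (ID, parameters in billions, native ONNX available).
var catalog = new (string Id, double ParamsB, bool Native)[]
{
    ("qwen2.5-0.5b-instruct", 0.5, true),
    ("tinyllama-1.1b-chat", 1.1, true),
    ("phi-3.5-mini-instruct", 3.8, true),
    ("gemma-4-e4b-it", 8.0, false),   // needs conversion, so it is skipped below
    ("phi-4", 14.0, true),
};

// Hypothetical selector: largest native model at or under the budget, or null if none fits.
string? PickLargestNative(double maxParamsB) =>
    catalog.Where(m => m.Native && m.ParamsB <= maxParamsB)
           .OrderByDescending(m => m.ParamsB)
           .Select(m => m.Id)
           .FirstOrDefault();

Console.WriteLine(PickLargestNative(4.0));             // phi-3.5-mini-instruct
Console.WriteLine(PickLargestNative(0.1) ?? "(none)"); // (none)
```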
Pre-trained variants optimized for specific tasks. A fine-tuned 0.5B model often matches or exceeds a base 1.5B on its specialized task.
| Model | Size | Task | HuggingFace ID |
|---|---|---|---|
| Qwen2.5-0.5B-ToolCalling | ~1 GB | Tool/function calling | elbruno/Qwen2.5-0.5B-LocalLLMs-ToolCalling |
| Qwen2.5-0.5B-RAG | ~1 GB | RAG with citations | elbruno/Qwen2.5-0.5B-LocalLLMs-RAG |
| Qwen2.5-0.5B-Instruct | ~1 GB | General-purpose | elbruno/Qwen2.5-0.5B-LocalLLMs-Instruct |
See the for detailed model cards, performance benchmarks, and selection guidance.
| Sample | Description |
|---|---|
| Minimal console chat | |
| Token-by-token streaming | |
| Switch models at runtime | |
| ASP.NET Core DI registration | |
| Function calling and tool use | |
| Fine-tuned model for improved tool calling | |
| RAG pipeline with document retrieval | |
| Zero-cloud RAG pipeline with real local embeddings and LLM inference | |
| BitNet 1.58-bit model chat completion | |
| Performance benchmark: BitNet vs ONNX models | |
| Interactive console application | |
git clone https://github.com/elbruno/ElBruno.LocalLLMs.git
cd ElBruno.LocalLLMs
dotnet restore ElBruno.LocalLLMs.slnx
dotnet build ElBruno.LocalLLMs.slnx
dotnet test ElBruno.LocalLLMs.slnx --framework net8.0
Contributions are welcome! Please:
1. Create a feature branch (git checkout -b feature/amazing-feature)
2. Commit your changes (git commit -m 'Add amazing feature')
3. Push the branch (git push origin feature/amazing-feature)

This project is licensed under the MIT License; see the license file for details.
Made with ❤️ by Bruno Capuano (ElBruno)
| Product | Compatible and computed target framework versions |
|---|---|
| .NET | Compatible: net8.0, net10.0. Computed: net9.0, plus the android, browser, ios, maccatalyst, macos, tvos, and windows variants of net8.0, net9.0, and net10.0. |
Showing the top 3 NuGet packages that depend on ElBruno.LocalLLMs:
| Package | Description |
|---|---|
| ElBruno.ModelContextProtocol.MCPToolRouter | Semantic routing for Model Context Protocol (MCP) tool definitions using local embeddings. Indexes MCP tools and returns the most relevant tools for a given prompt via vector search. |
| ElBruno.LocalLLMs.Rag | RAG (Retrieval-Augmented Generation) pipeline for ElBruno.LocalLLMs. Provides document chunking, embedding storage, and semantic search. |
| ElBruno.LocalLLMs.BitNet | BitNet 1.58-bit LLM inference using bitnet.cpp. IChatClient implementation for Microsoft.Extensions.AI. |
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 0.18.0 | 140 | 6/5/2026 |
| 0.17.0 | 129 | 6/3/2026 |
| 0.16.0 | 205 | 4/17/2026 |
| 0.15.0 | 146 | 4/16/2026 |
| 0.11.0 | 149 | 4/4/2026 |
| 0.9.0 | 120 | 4/4/2026 |
| 0.7.2 | 201 | 3/28/2026 |
| 0.7.1 | 132 | 3/28/2026 |
| 0.7.0 | 122 | 3/28/2026 |
| 0.6.1 | 128 | 3/28/2026 |
| 0.6.0 | 123 | 3/28/2026 |
| 0.5.0 | 163 | 3/28/2026 |
| 0.1.8 | 111 | 3/19/2026 |
| 0.1.7 | 108 | 3/18/2026 |
| 0.1.6 | 106 | 3/18/2026 |
| 0.1.0 | 109 | 3/18/2026 |