URL: https://www.nuget.org/packages/Tokenizers.HuggingFace/

Tokenizers.HuggingFace 3.23.1

  • .NET CLI:
    dotnet add package Tokenizers.HuggingFace --version 3.23.1
  • Package Manager Console in Visual Studio (this command uses the NuGet module's version of Install-Package):
    NuGet\Install-Package Tokenizers.HuggingFace -Version 3.23.1
  • PackageReference (for projects that support it, copy this XML node into the project file):
    <PackageReference Include="Tokenizers.HuggingFace" Version="3.23.1" />
  • Central Package Management (CPM) (for projects that support it, version the package in the solution Directory.Packages.props file and reference it from the project file):
    Directory.Packages.props: <PackageVersion Include="Tokenizers.HuggingFace" Version="3.23.1" />
    Project file: <PackageReference Include="Tokenizers.HuggingFace" />
  • Paket (the NuGet Team does not provide support for this client; contact its maintainers for support):
    paket add Tokenizers.HuggingFace --version 3.23.1
  • F# Interactive and Polyglot Notebooks (copy this #r directive into the interactive tool or the script's source code):
    #r "nuget: Tokenizers.HuggingFace, 3.23.1"
  • C# file-based apps, starting in .NET 10 preview 4 (copy this #:package directive into a .cs file before any lines of code):
    #:package Tokenizers.HuggingFace@3.23.1
  • Cake (the NuGet Team does not provide support for this client; contact its maintainers for support):
    Install as a Cake Addin: #addin nuget:?package=Tokenizers.HuggingFace&version=3.23.1
    Install as a Cake Tool: #tool nuget:?package=Tokenizers.HuggingFace&version=3.23.1

Tokenizers.HuggingFace

.NET bindings for huggingface/tokenizers that communicate with the native library through protobuf messages over a C ABI.

How to install

dotnet add package Tokenizers.HuggingFace

Supported targets

  • linux-arm64
  • linux-x64
  • osx-arm64
  • osx-x64
  • win-x64
  • win-arm64

Usage

Cases:

  • Normalization
  • PreTokenization
  • Tokenizer (Encode, Decode, Load From File, Train)

Examples

Basic tokenization from file

using Tokenizers.HuggingFace.Tokenizer;

var tk = Tokenizer.FromFile("./tokenizer.json");
var encodings = tk.Encode("Hello, World!", true).First();
Console.WriteLine($"{string.Join(",", encodings.Ids)}");
// Optionally dispose the tokenizer if no longer needed
// If not disposed, it will be cleaned up by the finalizer
tk.Dispose();

Test pipeline with normalization and pre-tokenization

var lowerCase = new Tokenizers.HuggingFace.Normalizers.Lowercase();
Tokenizers.HuggingFace.Normalizers.Sequence normalizer = new([
 new Tokenizers.HuggingFace.Normalizers.Nfd(),
 lowerCase,
 new Tokenizers.HuggingFace.Normalizers.StripAccents()
]);
// Optionally dispose the normalizer if no longer needed
// If not disposed, it will be cleaned up by the finalizer
// Disposing this won't affect the sequence we created
lowerCase.Dispose();
Tokenizers.HuggingFace.PreTokenizers.Bert bert = new();
var testString = new Tokenizers.HuggingFace.PipelineString.PipelineString("Héllo, Wörld!");
normalizer.Normalize(testString);
bert.PreTokenize(testString);
var splits = testString.GetSplits(
 Tokenizers.HuggingFace.PipelineString.OffsetReferential.Original,
 Tokenizers.HuggingFace.PipelineString.OffsetType.Char,
 includeOffsets: true
);
Console.WriteLine($"Tokens: [{string.Join(",", splits.Select(split=> $"'{split.Item1}'"))}]");
Console.WriteLine($"Offsets: [{string.Join(",", splits.Select(split => split.Item2))}]");
bert.Dispose();
normalizer.Dispose();
testString.Dispose();
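
To make the effect of the Nfd, Lowercase and StripAccents steps concrete, here is a minimal sketch that approximates them with standard .NET string APIs only. It is not how Tokenizers.HuggingFace implements normalization; `NormalizationSketch` and `Apply` are hypothetical names introduced for illustration, and the sample input is an assumption.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Sample input with two accented letters (é and ö), written as escapes.
Console.WriteLine(NormalizationSketch.Apply("H\u00e9llo, W\u00f6rld!"));

// Hypothetical helper, not part of Tokenizers.HuggingFace.
static class NormalizationSketch
{
    public static string Apply(string text)
    {
        // NFD: decompose each accented letter into a base letter plus combining marks.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        // Lowercase without depending on the current culture.
        string lowered = decomposed.ToLowerInvariant();
        // Strip accents: drop the combining marks (Unicode category Mn).
        var kept = lowered.Where(c =>
            CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
        return new string(kept.ToArray());
    }
}
```

The order matters: accents can only be removed as separate characters once NFD has split them from their base letters.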

Train a tokenizer (following "All together: a BERT tokenizer from scratch")

var normalizer = new Tokenizers.HuggingFace.Normalizers.Sequence([
 new Tokenizers.HuggingFace.Normalizers.Nfd(),
 new Tokenizers.HuggingFace.Normalizers.Lowercase(),
 new Tokenizers.HuggingFace.Normalizers.StripAccents(),
]);
var preTokenizer = new Tokenizers.HuggingFace.PreTokenizers.Whitespace();
Tokenizers.HuggingFace.Processors.Token[] tokensProcessor = [
 new() { TokenPair = new() { Token = "[CLS]", TokenId = 1 } },
 new() { TokenPair = new() { Token = "[SEP]", TokenId = 2 } },
];
var processor = new Tokenizers.HuggingFace.Processors.TemplateProcessing()
{
 Single = "[CLS] $A [SEP]",
 Pair = "[CLS] $A [SEP] $B:1 [SEP]:1",
};
processor.Tokens = new();
processor.Tokens.Tokens_.AddRange(tokensProcessor);
Tokenizers.HuggingFace.Trainers.AddedToken[] tokensTrainer = [
 new() { Content = "[UNK]", Special = true },
 new() { Content = "[CLS]", Special = true },
 new() { Content = "[SEP]", Special = true },
 new() { Content = "[PAD]", Special = true },
 new() { Content = "[MASK]", Special = true },
];
var trainer = new Tokenizers.HuggingFace.Trainers.WordPieceTrainer() { VocabSize = 30522 };
trainer.SpecialTokens.AddRange(tokensTrainer);
var tk = Tokenizers.HuggingFace.Tokenizer.Tokenizer.FromTrain(
 files: ["corpus.txt"],
 savePath: "my_tokenizer.json",
 model: new() { WordPiece = new() }, // by default uses [UNK]
 trainer: new() { WordPiece = trainer },
 // From here on are optional parameters
 //normalizer: normalizer,
 //preTokenizer: preTokenizer,
 processors: [new() { TemplateProcessing = processor }]
);
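
The Single and Pair templates describe where special tokens are placed around the input sequences, and an optional `:n` suffix sets the type id of a piece. The sketch below expands such a template by hand to show the intended result. It is a simplified illustration, not the library's implementation: `TemplateSketch` and `Expand` are hypothetical names, the input tokens are made up, and a type id of 0 is assumed when no suffix is given.

```csharp
using System;
using System.Collections.Generic;

var (tokens, typeIds) = TemplateSketch.Expand(
    "[CLS] $A [SEP] $B:1 [SEP]:1",
    a: new[] { "hello", "world" },
    b: new[] { "good", "day" });
Console.WriteLine(string.Join(" ", tokens));   // [CLS] hello world [SEP] good day [SEP]
Console.WriteLine(string.Join(",", typeIds));  // 0,0,0,0,1,1,1

// Hypothetical helper, not part of Tokenizers.HuggingFace.
static class TemplateSketch
{
    public static (List<string> Tokens, List<int> TypeIds) Expand(
        string template, string[] a, string[] b)
    {
        var tokens = new List<string>();
        var typeIds = new List<int>();
        foreach (var piece in template.Split(' ', StringSplitOptions.RemoveEmptyEntries))
        {
            // An optional ":n" suffix sets the type id; assume 0 when it is absent.
            string name = piece;
            int typeId = 0;
            int colon = piece.LastIndexOf(':');
            if (colon > 0 && int.TryParse(piece.AsSpan(colon + 1), out int parsed))
            {
                name = piece[..colon];
                typeId = parsed;
            }
            // "$A" and "$B" stand for the input sequences; anything else is a special token.
            string[] expansion = name switch
            {
                "$A" => a,
                "$B" => b,
                _ => new[] { name },
            };
            foreach (var token in expansion)
            {
                tokens.Add(token);
                typeIds.Add(typeId);
            }
        }
        return (tokens, typeIds);
    }
}
```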

Sentence similarity with sentence-transformers/all-MiniLM-L6-v2

Steps:

  • Create a console app and add the packages from inside the project directory:
dotnet new console --name Sentences
cd Sentences
dotnet add package Microsoft.ML.OnnxRuntime
dotnet add package Tokenizers.HuggingFace
  • Add the following code to Program.cs:
using System.Numerics.Tensors;

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

using Tokenizers.HuggingFace.Tokenizer;


var a = SentenceSimilarityModel.GetEmbeddings("Hello, world");
var b = SentenceSimilarityModel.GetEmbeddings("Hello, world, good to be here");

Console.WriteLine($"E: {string.Join(',', a)}");
Console.WriteLine($"a-b: {TensorPrimitives.CosineSimilarity(a, b)}");

static class SentenceSimilarityModel
{
    // Embedding width produced by all-MiniLM-L6-v2.
    const int EmbeddingSize = 384;

    static readonly Tokenizer tk = Tokenizer.FromFile("./tokenizer.json");
    static readonly InferenceSession session = new InferenceSession("./model.onnx");

    // Tokenizes the text and builds the three model inputs, each shaped [1, sequenceLength].
    static (int, NamedOnnxValue[]) PrepareInputs(string text)
    {
        var encodings = tk.Encode(text, true, includeTypeIds: true, includeAttentionMask: true).First();
        var sequenceLength = encodings.Ids.Count;
        var input_ids = new DenseTensor<long>(encodings.Ids.Select(t => (long)t).ToArray(), [1, sequenceLength]);
        var type_ids = new DenseTensor<long>(encodings.TypeIds.Select(t => (long)t).ToArray(), [1, sequenceLength]);
        var attention_mask = new DenseTensor<long>(encodings.AttentionMask.Select(t => (long)t).ToArray(), [1, sequenceLength]);

        return (sequenceLength, [
            NamedOnnxValue.CreateFromTensor("input_ids", input_ids),
            NamedOnnxValue.CreateFromTensor("token_type_ids", type_ids),
            NamedOnnxValue.CreateFromTensor("attention_mask", attention_mask)
        ]);
    }

    // Runs the model and mean-pools the per-token outputs into one sentence embedding.
    public static float[] GetEmbeddings(string text)
    {
        var (sequenceLength, inputs) = PrepareInputs(text);
        using IDisposableReadOnlyCollection<DisposableNamedOnnxValue> results = session.Run(inputs);
        var outputTensor = results.First().AsEnumerable<float>().ToArray();
        float[] result = new float[EmbeddingSize];
        for (int i = 0; i < sequenceLength; i++)
        {
            // Add the i-th token's vector to the running sum.
            ReadOnlySpan<float> floats = new ReadOnlySpan<float>(outputTensor, i * EmbeddingSize, EmbeddingSize);
            TensorPrimitives.Add(floats, result, result);
        }
        // Divide the sum by the token count to get the mean.
        TensorPrimitives.Divide(result, sequenceLength, result);
        return result;
    }
}
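
The loop in GetEmbeddings is a mean pooling step: it averages the per-token vectors into one sentence vector, which is then compared with cosine similarity. The following dependency-free sketch shows the same arithmetic on tiny made-up data so it can be checked by hand. `PoolingSketch`, `MeanPool` and `Cosine` are hypothetical names introduced here, not part of any of the packages above.

```csharp
using System;

// Two tokens with a hidden size of 2, stored token after token as in a flattened model output.
float[] flat = { 1f, 2f, 3f, 4f };
float[] pooled = PoolingSketch.MeanPool(flat, tokenCount: 2, hiddenSize: 2);
Console.WriteLine(string.Join(",", pooled));                                    // 2,3
Console.WriteLine(PoolingSketch.Cosine(new[] { 1f, 0f }, new[] { 0f, 1f }));   // 0

// Hypothetical helper for illustration only.
static class PoolingSketch
{
    // Averages tokenCount vectors of width hiddenSize laid out back to back in flat.
    public static float[] MeanPool(float[] flat, int tokenCount, int hiddenSize)
    {
        var pooled = new float[hiddenSize];
        for (int t = 0; t < tokenCount; t++)
            for (int h = 0; h < hiddenSize; h++)
                pooled[h] += flat[t * hiddenSize + h];
        for (int h = 0; h < hiddenSize; h++)
            pooled[h] /= tokenCount;
        return pooled;
    }

    // Cosine similarity: dot product divided by the product of the vector lengths.
    public static float Cosine(float[] x, float[] y)
    {
        float dot = 0f, normX = 0f, normY = 0f;
        for (int i = 0; i < x.Length; i++)
        {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return dot / (MathF.Sqrt(normX) * MathF.Sqrt(normY));
    }
}
```

Averaging over every token is reasonable here because a single unpadded sentence is encoded at a time; with padded batches, the attention mask would normally be used to skip padding positions.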

Releasing

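If you know the target you are building your project for, pass it to the build with -r, where [target] is one of the supported targets listed above (for example, linux-x64):

dotnet build .\YourProject.csproj -c Release -r [target]

This way the build output does not include the native libraries for every supported target.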
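
If you are unsure which target matches the machine you are on, the runtime can report its own identifier. This is a small sketch using the standard `RuntimeInformation.RuntimeIdentifier` property. On some runtimes the value is more specific than the targets listed above (for example, a distribution-specific Linux identifier), so treat it as a hint and not as a guaranteed match.

```csharp
using System;
using System.Runtime.InteropServices;

// Runtime identifier of the machine running this code.
string rid = RuntimeInformation.RuntimeIdentifier;
Console.WriteLine($"Runtime identifier: {rid}");
Console.WriteLine($"dotnet build ./YourProject.csproj -c Release -r {rid}");
```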

Compatible and additional computed target framework versions

  • Included in the package (compatible): net6.0
  • Computed for net6.0: net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows
  • Computed for net7.0: net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows
  • Computed for net8.0: net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows
  • Computed for net9.0: net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows
  • Computed for net10.0: net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows

NuGet packages (6)

Showing the top 5 NuGet packages that depend on Tokenizers.HuggingFace:

  • SentenceTransformers.Qwen3: The wrapper provides a simple and easy-to-use interface for loading the Qwen3 embeddings model and generating embeddings for input text.
  • AllMpnetBaseV2Sharp: C# implementation of sentence-transformers/all-mpnet-base-v2 using ONNX Runtime and HuggingFace tokenizers.
  • SentenceTransformers.Harrier.Small: The wrapper provides a simple and easy-to-use interface for loading the Harrier Small (harrier-oss-v1-270m) multilingual embeddings model and generating embeddings for input text.
  • SentenceTransformers.Harrier.Medium: The wrapper provides a simple and easy-to-use interface for loading the Harrier Medium (harrier-oss-v1-0.6b) multilingual embeddings model and generating embeddings for input text.
  • SentenceTransformers.Harrier: The wrapper provides a simple and easy-to-use interface for loading the Harrier (harrier-oss-v1-0.6b) multilingual embeddings model and generating embeddings for input text.

GitHub repositories (1)

Showing the top 1 popular GitHub repositories that depend on Tokenizers.HuggingFace:

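  • axzxs2001/Asp.NetCoreExperiment: All of the original projects have been moved to the **OleVersion** directory to preserve them. The new examples mainly target .NET 5.0; some upgrade earlier examples, and some summarize previous work experience for everyone's reference.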

Version                 Downloads   Last Updated
3.23.1                  1,510       5/12/2026
2.23.1                  113         5/11/2026
2.21.4                  13,243      9/4/2025
2.21.4-rc.0             247         8/30/2025
1.21.4                  894         8/10/2025
1.21.4-rc.1             188         8/17/2025
1.0.1-experimental.1    251         6/26/2025
1.0.0                   1,421       5/10/2025
0.1.0                   326         5/8/2025