FastBertTokenizer 1.0.28

There is a newer prerelease version of this package available.
See the version list below for details.

dotnet add package FastBertTokenizer --version 1.0.28

NuGet\Install-Package FastBertTokenizer -Version 1.0.28

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="FastBertTokenizer" Version="1.0.28" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

For projects that support Central Package Management (CPM), copy this XML node into the solution's Directory.Packages.props file to version the package:

<PackageVersion Include="FastBertTokenizer" Version="1.0.28" />

Then reference the package, without a version, in the project file:

<PackageReference Include="FastBertTokenizer" />

paket add FastBertTokenizer --version 1.0.28

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: FastBertTokenizer, 1.0.28"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package FastBertTokenizer@1.0.28

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Install as a Cake Addin:

#addin nuget:?package=FastBertTokenizer&version=1.0.28

Install as a Cake Tool:

#tool nuget:?package=FastBertTokenizer&version=1.0.28

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

FastBertTokenizer

A fast and memory-efficient library for WordPiece tokenization as it is used by BERT. Tokenization correctness and speed are automatically evaluated in extensive unit tests and benchmarks. The library is Native AOT compatible and supports netstandard2.0.

Goals

Enabling you to run your AI workloads on .NET in production.
Correctness - Results that are equivalent to HuggingFace Transformers' AutoTokenizer's in all practical cases.
Speed - Tokenization should be as fast as reasonably possible.
Ease of use - The API should be easy to understand and use.

Getting Started

dotnet new console
dotnet add package FastBertTokenizer

using FastBertTokenizer;

var tok = new BertTokenizer();
await tok.LoadFromHuggingFaceAsync("bert-base-uncased");
var (inputIds, attentionMask, tokenTypeIds) = tok.Encode("Lorem ipsum dolor sit amet.");
Console.WriteLine(string.Join(", ", inputIds.ToArray()));
var decoded = tok.Decode(inputIds.Span);
Console.WriteLine(decoded);

// Output:
// 101, 19544, 2213, 12997, 17421, 2079, 10626, 4133, 2572, 3388, 1012, 102
// [CLS] lorem ipsum dolor sit amet. [SEP]
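
The output above contains more IDs than the input has words, because WordPiece splits words that are not in the vocabulary into smaller pieces. The following is a minimal sketch of the greedy longest-match-first idea behind WordPiece, not FastBertTokenizer's actual implementation. The vocabulary and the WordPiece helper are invented for illustration; the real bert-base-uncased vocabulary has about 30,000 entries and the real splits may differ.

```csharp
using System;
using System.Collections.Generic;

// Toy vocabulary, invented for illustration only (an assumption, not the real vocab).
var vocab = new HashSet<string> { "lore", "##m", "ip", "##sum", "sit" };

// Hypothetical helper: greedy longest-match-first split of a single lowercase word.
// Pieces that continue a word carry the "##" prefix.
static List<string> WordPiece(string word, HashSet<string> vocab)
{
    var pieces = new List<string>();
    var start = 0;
    while (start < word.Length)
    {
        var end = word.Length;
        var found = false;
        // Try the longest remaining substring first, then shrink it from the right.
        while (end > start)
        {
            var candidate = (start > 0 ? "##" : "") + word.Substring(start, end - start);
            if (vocab.Contains(candidate))
            {
                pieces.Add(candidate);
                found = true;
                break;
            }
            end--;
        }
        // If no piece matches at this position, the whole word becomes unknown.
        if (!found) return new List<string> { "[UNK]" };
        start = end;
    }
    return pieces;
}

Console.WriteLine(string.Join(" ", WordPiece("lorem", vocab))); // lore ##m
Console.WriteLine(string.Join(" ", WordPiece("ipsum", vocab))); // ip ##sum
Console.WriteLine(string.Join(" ", WordPiece("dolor", vocab))); // [UNK]
```

With a real vocabulary, each piece is then mapped to its integer ID, which is what Encode returns in inputIds.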

Comparison to BERTTokenizers

about 1 order of magnitude faster
allocates more than 1 order of magnitude less memory
better whitespace handling
handles unknown characters correctly
does not throw if text is longer than maximum sequence length
handles Unicode control chars
handles other alphabets such as Greek and right-to-left languages

Note that while BERTTokenizers handles token type incorrectly, it does support input of two pieces of text that are tokenized with a separator in between. FastBertTokenizer currently does not support this.

Speed / Benchmarks

tl;dr: FastBertTokenizer can encode 1 GB of text in around 2 s on a typical notebook CPU from 2020.

All benchmarks were performed on a typical end user notebook, a ThinkPad T14s Gen 1:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3527/23H2/2023Update/SunValley3)
AMD Ryzen 7 PRO 4750U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.204

Similar results can also be observed using GitHub Actions. Note, though, that shared CI runners have drawbacks for benchmarking and can lead to varying results.

on .NET 6.0 vs. on .NET 8.0

.NET 6.0.29 (6.0.2924.17105), X64 RyuJIT AVX2 vs .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
Workload: Encode up to 512 tokens from each of 15,000 articles from Simple English Wikipedia.
Results: Total tokens produced: 3,657,145; on .NET 8: ~11m tokens/s single threaded, 73m tokens/s multi threaded.

Method	Runtime	Mean	Error	StdDev	Ratio	Gen0	Gen1	Gen2	Allocated	Alloc Ratio
Singlethreaded	.NET 6.0	450.39 ms	7.340 ms	6.866 ms	1.00	-	-	-	2 MB	1.00
MultithreadedMemReuseBatched	.NET 6.0	72.46 ms	1.337 ms	1.251 ms	0.16	750.0000	250.0000	250.0000	12.75 MB	6.39
Singlethreaded	.NET 8.0	332.51 ms	6.574 ms	7.826 ms	1.00	-	-	-	1.99 MB	1.00
MultithreadedMemReuseBatched	.NET 8.0	50.83 ms	0.999 ms	1.995 ms	0.15	500.0000	-	-	12.75 MB	6.40
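
The tokens/s figures can be re-derived from the table above by dividing the total token count by the mean runtime. A small sketch of that arithmetic, using the .NET 8.0 rows (the TokensPerSecond helper is a name made up for this example):

```csharp
using System;
using System.Globalization;

// Figures copied from the workload description and the .NET 8.0 rows of the table.
const double totalTokens = 3_657_145;
const double singleThreadedMs = 332.51;
const double multiThreadedMs = 50.83;

// Hypothetical helper: tokens per second = tokens / (milliseconds / 1000).
static double TokensPerSecond(double tokens, double meanMs) => tokens / (meanMs / 1000.0);

var single = TokensPerSecond(totalTokens, singleThreadedMs);
var multi = TokensPerSecond(totalTokens, multiThreadedMs);

// Prints "11.0m tokens/s single threaded"
Console.WriteLine((single / 1e6).ToString("F1", CultureInfo.InvariantCulture) + "m tokens/s single threaded");
// Prints "71.9m tokens/s multi threaded"
Console.WriteLine((multi / 1e6).ToString("F1", CultureInfo.InvariantCulture) + "m tokens/s multi threaded");
```

The multi threaded value computed this way is slightly below the 73m quoted in the results line; the difference presumably comes from rounding or from a different benchmark run.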

vs. SharpToken

SharpToken v2.0.2
.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
Workload: Fully encode 15,000 articles from Simple English Wikipedia. Total tokens produced by FastBertTokenizer: 5,807,949 (~9.4m tokens/s single threaded).

This isn't an apples to apples comparison as BPE (what SharpToken does) and WordPiece encoding (what FastBertTokenizer does) are different tasks/algorithms. Both were applied to exactly the same texts/corpus though.

Method	Mean	Error	StdDev	Gen0	Gen1	Allocated
SharpTokenFullArticles	1,551.9 ms	25.82 ms	24.15 ms	5000.0000	2000.0000	32.56 MB
FastBertTokenizerFullArticles	620.3 ms	7.00 ms	6.21 ms	-	-	2.26 MB

vs. HuggingFace tokenizers (Rust)

tokenizers v0.19.1

I'm not really experienced in benchmarking Rust code, but my attempts using criterion.rs (see src/HuggingfaceTokenizer/BenchRust) suggest that it takes tokenizers around

single threaded: ~2 s (~2.9m tokens/s)
batched/multi threaded: ~10 s (~0.6m tokens/s)

to produce 5,807,947 tokens from the same 15k Simple English Wikipedia articles. Contrary to what one might expect, this means that FastBertTokenizer, being a managed implementation, outperforms tokenizers. It should be noted, though, that tokenizers has a much more complete feature set, while FastBertTokenizer is specifically optimized for WordPiece/BERT encoding.

The tokenizers repo states that it "Takes less than 20 seconds to tokenize a GB of text on a server's CPU." As 26 MB of text takes ~2 s on my notebook CPU, 1 GB would take roughly 80 s. I think it makes sense that "a server's CPU" might be 4x as fast as my notebook's CPU, so my results seem plausible. It is, however, also possible that I unintentionally handicapped tokenizers somehow. Please let me know if you think so!
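
The extrapolation above is simple enough to check in a few lines. This sketch assumes that 1 GB means 1024 MB and that runtime scales linearly with input size; neither assumption is stated in the original measurement.

```csharp
using System;
using System.Globalization;

const double corpusMb = 26;        // size of the 15k-article corpus, as stated above
const double corpusSeconds = 2;    // approximate single threaded time measured for tokenizers
const double claimedSeconds = 20;  // "less than 20 seconds" per GB, from the tokenizers repo

// Assumptions: 1 GB = 1024 MB, and runtime grows linearly with the amount of text.
var secondsPerGb = 1024 / corpusMb * corpusSeconds;
var requiredSpeedup = secondsPerGb / claimedSeconds;

// Prints "79 s per GB"
Console.WriteLine(secondsPerGb.ToString("F0", CultureInfo.InvariantCulture) + " s per GB");
// Prints "3.9x faster CPU needed to reach 20 s"
Console.WriteLine(requiredSpeedup.ToString("F1", CultureInfo.InvariantCulture) + "x faster CPU needed to reach 20 s");
```

Both numbers line up with the rough figures used in the text: about 80 s per GB, and a server CPU about 4x as fast as the notebook CPU.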

vs. BERTTokenizers

BERTTokenizers v1.2.0
.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
Workload: Prefixes of the contents of 15k Simple English Wikipedia articles, preprocessed to make them encodable by BERTTokenizers.

Method	Mean	Error	StdDev	Gen0	Gen1	Gen2	Allocated
NMZivkovic_BertTokenizers	2,576.0 ms	15.49 ms	13.73 ms	968000.0000	40000.0000	1000.0000	3430.51 MB
FastBertTokenizer_SameDataAsBertTokenizers	229.8 ms	4.55 ms	6.23 ms	-	-	-	1.03 MB

Logo

Created by combining https://icons.getbootstrap.com/icons/cursor-text/ in .NET brand color with https://icons.getbootstrap.com/icons/braces/.

Product	Compatible	Computed (additional target framework versions)
.NET	net6.0, net8.0	net5.0, net5.0-windows, net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows
.NET Core	-	netcoreapp2.0, netcoreapp2.1, netcoreapp2.2, netcoreapp3.0, netcoreapp3.1
.NET Standard	netstandard2.0	netstandard2.1
.NET Framework	-	net461, net462, net463, net47, net471, net472, net48, net481
MonoAndroid	-	monoandroid
MonoMac	-	monomac
MonoTouch	-	monotouch
Tizen	-	tizen40, tizen60
Xamarin.iOS	-	xamarinios
Xamarin.Mac	-	xamarinmac
Xamarin.TVOS	-	xamarintvos
Xamarin.WatchOS	-	xamarinwatchos

Dependencies

.NETStandard 2.0
- System.Memory (>= 4.5.5)
- System.Text.Json (>= 8.0.3)
net6.0
- System.Text.Json (>= 8.0.3)
net8.0
- No dependencies.

NuGet packages (9)

Showing the top 5 NuGet packages that depend on FastBertTokenizer:

Package	Description
Microsoft.SemanticKernel.Connectors.Onnx	Semantic Kernel connectors for the ONNX runtime. Contains clients for text embedding generation.
SmartComponents.LocalEmbeddings	Experimental, end-to-end AI features for .NET apps. Docs and info at https://github.com/dotnet-smartcomponents/smartcomponents
McpEngramMemory.Core	Cognitive engram memory engine with semantic search, knowledge graphs, clustering, lifecycle management, and hierarchical expert routing (HMoE). Core library for MCP Engram Memory.
Invarix.Guard	Plug and play AI safety middleware for .NET. Ships with ML models out of the box: prompt injection (DeBERTa v3), PII NER (XLM-RoBERTa), multilingual toxicity (DistilBERT, 104 languages), and semantic harm-category classification (multilingual e5-base embeddings, 50+ languages, 19 categories). Works as ASP.NET Core middleware or standalone.
ADCenterSpain.Infrastructure.AI	Common classes for AI development

GitHub repositories (1)

Showing the top 1 popular GitHub repositories that depend on FastBertTokenizer:

Repository	Description
microsoft/semantic-kernel	Integrate cutting-edge LLM technology quickly and easily into your apps

Version	Downloads	Last Updated
1.1.30-alpha	3,438	3/3/2025
1.0.28	642,506	4/30/2024
0.5.18-alpha	1,411	12/21/2023
0.4.67	220,326	12/11/2023
0.3.29	567	9/18/2023
0.2.7	647	9/14/2023

https://github.com/georg-jung/FastBertTokenizer/releases/tag/v1.0.28

URL: https://www.nuget.org/packages/FastBertTokenizer/
