dotnet add package FastBertTokenizer --version 1.0.28
NuGet\Install-Package FastBertTokenizer -Version 1.0.28
<PackageReference Include="FastBertTokenizer" Version="1.0.28" />
Directory.Packages.props:
<PackageVersion Include="FastBertTokenizer" Version="1.0.28" />
Project file:
<PackageReference Include="FastBertTokenizer" />
paket add FastBertTokenizer --version 1.0.28
#r "nuget: FastBertTokenizer, 1.0.28"
#:package FastBertTokenizer@1.0.28
Install as a Cake Addin:
#addin nuget:?package=FastBertTokenizer&version=1.0.28
Install as a Cake Tool:
#tool nuget:?package=FastBertTokenizer&version=1.0.28
A fast and memory-efficient library for WordPiece tokenization as used by BERT. Tokenization correctness and speed are automatically evaluated in extensive unit tests and benchmarks. Native AOT compatible, with support for netstandard2.0.
AutoTokenizer's in all practical cases.
dotnet new console
dotnet add package FastBertTokenizer
using FastBertTokenizer;
var tok = new BertTokenizer();
await tok.LoadFromHuggingFaceAsync("bert-base-uncased");
var (inputIds, attentionMask, tokenTypeIds) = tok.Encode("Lorem ipsum dolor sit amet.");
Console.WriteLine(string.Join(", ", inputIds.ToArray()));
var decoded = tok.Decode(inputIds.Span);
Console.WriteLine(decoded);
// Output:
// 101, 19544, 2213, 12997, 17421, 2079, 10626, 4133, 2572, 3388, 1012, 102
// [CLS] lorem ipsum dolor sit amet. [SEP]
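The output above contains more ids than the input has words, because WordPiece splits words that are not in the vocabulary into sub-word pieces. A minimal sketch of the greedy longest-match-first idea follows. It is not FastBertTokenizer's implementation: the `WordPiece` and `Piece` helpers and the tiny vocabulary are hypothetical, and a real tokenizer also handles normalization, punctuation splitting, and special tokens such as [CLS] and [SEP].

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy vocabulary, for illustration only.
// Real BERT vocabularies are far larger and are loaded from a model's files.
var vocab = new HashSet<string> { "[UNK]", "lore", "##m", "ip", "##sum", "sit" };

// Builds the candidate piece for word[start..end); pieces that continue a word get a "##" prefix.
static string Piece(string word, int start, int end)
    => (start > 0 ? "##" : "") + word.Substring(start, end - start);

// Greedy longest-match-first WordPiece for a single lowercased, pre-split word.
static List<string> WordPiece(string word, HashSet<string> vocab)
{
    var pieces = new List<string>();
    int start = 0;
    while (start < word.Length)
    {
        // Shrink the candidate from the right until it is found in the vocabulary.
        int end = word.Length;
        while (end > start && !vocab.Contains(Piece(word, start, end)))
        {
            end--;
        }

        // Nothing matched at this position: the whole word becomes the unknown token.
        if (end == start)
        {
            return new List<string> { "[UNK]" };
        }

        pieces.Add(Piece(word, start, end));
        start = end;
    }
    return pieces;
}

Console.WriteLine(string.Join(" ", WordPiece("lorem", vocab))); // lore ##m
Console.WriteLine(string.Join(" ", WordPiece("ipsum", vocab))); // ip ##sum
Console.WriteLine(string.Join(" ", WordPiece("dolor", vocab))); // [UNK]
```

With a real vocabulary, each resulting piece is then mapped to its integer id, which is what `Encode` returns in `inputIds`.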
Note that while BERTTokenizers handles token type incorrectly, it does support input of two pieces of text that are tokenized with a separator in between. FastBertTokenizer currently does not support this.
tl;dr: FastBertTokenizer can encode 1 GB of text in around 2 s on a typical notebook CPU from 2020.
All benchmarks were performed on a typical end user notebook, a ThinkPad T14s Gen 1:
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3527/23H2/2023Update/SunValley3)
AMD Ryzen 7 PRO 4750U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.204
Similar results can also be observed using GitHub Actions. Note, though, that shared CI runners have drawbacks for benchmarking and can produce varying results.
.NET 6.0.29 (6.0.2924.17105), X64 RyuJIT AVX2 vs .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2

| Method | Runtime | Mean | Error | StdDev | Ratio | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| Singlethreaded | .NET 6.0 | 450.39 ms | 7.340 ms | 6.866 ms | 1.00 | - | - | - | 2 MB | 1.00 |
| MultithreadedMemReuseBatched | .NET 6.0 | 72.46 ms | 1.337 ms | 1.251 ms | 0.16 | 750.0000 | 250.0000 | 250.0000 | 12.75 MB | 6.39 |
| Singlethreaded | .NET 8.0 | 332.51 ms | 6.574 ms | 7.826 ms | 1.00 | - | - | - | 1.99 MB | 1.00 |
| MultithreadedMemReuseBatched | .NET 8.0 | 50.83 ms | 0.999 ms | 1.995 ms | 0.15 | 500.0000 | - | - | 12.75 MB | 6.40 |
SharpToken v2.0.2
.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2
This isn't an apples-to-apples comparison, as BPE (what SharpToken does) and WordPiece encoding (what FastBertTokenizer does) are different tasks/algorithms. Both were applied to exactly the same texts/corpus, though.

| Method | Mean | Error | StdDev | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|
| SharpTokenFullArticles | 1,551.9 ms | 25.82 ms | 24.15 ms | 5000.0000 | 2000.0000 | 32.56 MB |
| FastBertTokenizerFullArticles | 620.3 ms | 7.00 ms | 6.21 ms | - | - | 2.26 MB |
tokenizers v0.19.1
I'm not really experienced in benchmarking Rust code, but my attempts using criterion.rs (see src/HuggingfaceTokenizer/BenchRust) suggest that it takes tokenizers around
to produce 5,807,947 tokens from the same 15k Simple English Wikipedia articles. Contrary to what one might expect, this does mean that FastBertTokenizer, being a managed implementation, outperforms tokenizers. It should be noted, though, that tokenizers has a much more complete feature set, while FastBertTokenizer is specifically optimized for WordPiece/BERT encoding.
The tokenizers repo states: "Takes less than 20 seconds to tokenize a GB of text on a server's CPU." As 26 MB of text takes ~2 s on my notebook CPU, 1 GB would take roughly 80 s. I think it makes sense that "a server's CPU" might be 4x as fast as my notebook's CPU, so my results seem plausible. It is, however, also possible that I unintentionally handicapped tokenizers somehow. Please let me know if you think so!
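The extrapolation above is a simple linear scaling, sketched below. The `ExtrapolateSeconds` helper is hypothetical. The tokenizers figures (26 MB, ~2 s, 20 s per GB) come from the text; applying the 50.83 ms MultithreadedMemReuseBatched result to the same 26 MB is an assumption that this benchmark ran on the same corpus.

```csharp
using System;
using System.Globalization;

// Linear extrapolation: time needed for targetMb, given a measured time for measuredMb.
static double ExtrapolateSeconds(double measuredMb, double measuredSeconds, double targetMb)
    => measuredSeconds * targetMb / measuredMb;

// tokenizers: ~2 s for 26 MB, as stated in the text.
double tokenizersPerGb = ExtrapolateSeconds(26, 2.0, 1024);
Console.WriteLine($"tokenizers: {tokenizersPerGb:F0} s per GB"); // tokenizers: 79 s per GB

// Ratio to the "less than 20 seconds" claim for a server CPU.
double ratio = tokenizersPerGb / 20.0;
Console.WriteLine("ratio: " + ratio.ToString("F1", CultureInfo.InvariantCulture) + "x"); // ratio: 3.9x

// FastBertTokenizer on .NET 8, multithreaded: 50.83 ms.
// ASSUMPTION: that benchmark processed the same 26 MB corpus.
double fastBertPerGb = ExtrapolateSeconds(26, 0.05083, 1024);
Console.WriteLine("FastBertTokenizer: " + fastBertPerGb.ToString("F1", CultureInfo.InvariantCulture) + " s per GB"); // FastBertTokenizer: 2.0 s per GB
```

Linear scaling ignores warm-up, cache, and memory effects, so these numbers are rough estimates only.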
BERTTokenizers v1.2.0
.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2

| Method | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|
| NMZivkovic_BertTokenizers | 2,576.0 ms | 15.49 ms | 13.73 ms | 968000.0000 | 40000.0000 | 1000.0000 | 3430.51 MB |
| FastBertTokenizer_SameDataAsBertTokenizers | 229.8 ms | 4.55 ms | 6.23 ms | - | - | - | 1.03 MB |
Created by combining https://icons.getbootstrap.com/icons/cursor-text/ in .NET brand color with https://icons.getbootstrap.com/icons/braces/.
| Product | Compatible | Computed |
|---|---|---|
| .NET | net6.0, net8.0 | net5.0, net5.0-windows, net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows |
| .NET Core | | netcoreapp2.0, netcoreapp2.1, netcoreapp2.2, netcoreapp3.0, netcoreapp3.1 |
| .NET Standard | netstandard2.0 | netstandard2.1 |
| .NET Framework | | net461, net462, net463, net47, net471, net472, net48, net481 |
| MonoAndroid | | monoandroid |
| MonoMac | | monomac |
| MonoTouch | | monotouch |
| Tizen | | tizen40, tizen60 |
| Xamarin.iOS | | xamarinios |
| Xamarin.Mac | | xamarinmac |
| Xamarin.TVOS | | xamarintvos |
| Xamarin.WatchOS | | xamarinwatchos |
Showing the top 5 NuGet packages that depend on FastBertTokenizer:

| Package | Description |
|---|---|
| Microsoft.SemanticKernel.Connectors.Onnx | Semantic Kernel connectors for the ONNX runtime. Contains clients for text embedding generation. |
| SmartComponents.LocalEmbeddings | Experimental, end-to-end AI features for .NET apps. Docs and info at https://github.com/dotnet-smartcomponents/smartcomponents |
| McpEngramMemory.Core | Cognitive engram memory engine with semantic search, knowledge graphs, clustering, lifecycle management, and hierarchical expert routing (HMoE). Core library for MCP Engram Memory. |
| Invarix.Guard | Plug and play AI safety middleware for .NET. Ships with ML models out of the box: prompt injection (DeBERTa v3), PII NER (XLM-RoBERTa), multilingual toxicity (DistilBERT, 104 languages), and semantic harm-category classification (multilingual e5-base embeddings, 50+ languages, 19 categories). Works as ASP.NET Core middleware or standalone. |
| ADCenterSpain.Infrastructure.AI | Common classes for AI development |

Showing the top 1 popular GitHub repositories that depend on FastBertTokenizer:

| Repository | Description |
|---|---|
| microsoft/semantic-kernel | Integrate cutting-edge LLM technology quickly and easily into your apps |

| Version | Downloads | Last Updated |
|---|---|---|
| 1.1.30-alpha | 3,438 | 3/3/2025 |
| 1.0.28 | 642,506 | 4/30/2024 |
| 0.5.18-alpha | 1,411 | 12/21/2023 |
| 0.4.67 | 220,326 | 12/11/2023 |
| 0.3.29 | 567 | 9/18/2023 |
| 0.2.7 | 647 | 9/14/2023 |