SemchunkNet 1.0.3

Semchunk.Net 🧩

Semchunk.Net is a C#/.NET port of the original semchunk library by Isaacus (Python).
All credit for the algorithm and design goes to the original author; this project re-implements it for the .NET ecosystem.

Semchunk.Net is a fast, lightweight, easy-to-use library for splitting text into semantically meaningful chunks in .NET.

- Works with any token counter (Func<string,int>)
- Optional plug-and-play tokenizers via SemchunkNet.Tiktoken and SemchunkNet.MicrosoftML
- Supports overlapping chunks and character offsets back into the original text
- Uses the same recursive, semantics-aware algorithm as the Python version

The goal is a faithful port of semchunk’s behaviour, with a .NET-idiomatic API.

Installation 📦

Core library:

dotnet add package SemchunkNet

Optional tokenizer flavours (each includes a ready-made ITokenizer implementation and pulls the right tokenizer dependency):

dotnet add package SemchunkNet.Tiktoken
dotnet add package SemchunkNet.MicrosoftML

PackageReference examples:

<ItemGroup>
 <PackageReference Include="SemchunkNet" Version="1.0.3" />
 
 <PackageReference Include="SemchunkNet.Tiktoken" Version="1.0.3" />
 <PackageReference Include="SemchunkNet.MicrosoftML" Version="1.0.3" />
</ItemGroup>

Semchunk.Net is designed to target:

- netstandard2.0 (broad compatibility: .NET Framework 4.6.1+, .NET Core 2.0+, etc.)
- a modern TFM (net8.0 in this package) for better performance if you multi-target.

You bring your own tokenizer/token counter; Semchunk.Net doesn’t force a specific tokenizer dependency.

Quickstart 👩‍💻

Example 1 – Using the packaged Tiktoken wrapper

using SemchunkNet;
using SemchunkNet.Tiktoken;

var tokenizer = new TiktokenTokenizer(modelName: "gpt-4", modelMaxLength: 8192);
var chunker = ChunkerFactory.Create(tokenizer, chunkSize: 512);

var text = "The quick brown fox jumps over the lazy dog.";

// Basic chunking (no overlap, no offsets):
var chunks = chunker.Chunk(text);

// Chunk with character offsets and 50% overlap:
var chunksWithOffsets = chunker.Chunk(
 text,
 out var offsets,
 overlap: 0.5
);

// Chunk a list of texts:
var manyChunks = chunker.ChunkMany(new[] { text });

Example 2 – Using the Microsoft.ML tokenizer wrapper

using SemchunkNet;
using SemchunkNet.MicrosoftML;

var tokenizer = MicrosoftMLTokenizer.ForTiktokenModel(modelName: "gpt-4", modelMaxLength: 8192);
var chunker = ChunkerFactory.Create(tokenizer, chunkSize: 512);

var text = "The quick brown fox jumps over the lazy dog.";
var chunks = chunker.Chunk(text, overlap: 0.25);

Example 3 – Simple custom token counter

If you don’t care about true tokens and just want a quick splitter:

using SemchunkNet;

// Each word = 1 "token"
Func<string, int> wordCounter = s =>
 string.IsNullOrWhiteSpace(s)
 ? 0
 : s.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;

const int chunkSize = 16;

var chunker = ChunkerFactory.Create(wordCounter, chunkSize);

var text = "The quick brown fox jumps over the lazy dog.";

// Non-overlapping chunks:
var chunks = chunker.Chunk(text);
// Overlapping chunks with offsets:
var overlapped = chunker.Chunk(text, out var offsets, overlap: 0.5);

Usage 🕹️

`ChunkerFactory.Create(...)`

This is the main entry point. It mirrors Python’s chunkerify(...).

From a token counter

public static Chunker Create(
 Func<string, int> tokenCounter,
 int chunkSize,
 int? maxTokenChars = null,
 bool memoize = true,
 int? cacheMaxSize = null
)

- tokenCounter – function that returns the token count for a string.
- chunkSize – maximum number of tokens per chunk.
- maxTokenChars – optional performance hint: the length, in characters, of the longest token. If provided, Semchunk.Net can short-circuit tokenization for very long inputs, just like the Python version.
- memoize – whether to memoize the token counter (LRU-style cache), so that repeated strings are counted only once.
- cacheMaxSize – reserved for future bounded-cache support; the cache is currently unbounded, as in Python’s default.

Returns a Chunker instance.
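
To make the memoize option concrete, here is a minimal sketch of what memoizing a token counter can look like. It is an illustration only, not the library's internal code: the class and method names (TokenCounterSketch, Memoize) are hypothetical, and the cache is a plain unbounded dictionary that is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of SemchunkNet.
public static class TokenCounterSketch
{
    // Wraps a token counter so each distinct string is counted only once.
    // The cache is an unbounded Dictionary; nothing is ever evicted.
    public static Func<string, int> Memoize(Func<string, int> tokenCounter)
    {
        var cache = new Dictionary<string, int>(StringComparer.Ordinal);
        return text =>
        {
            if (!cache.TryGetValue(text, out var count))
            {
                count = tokenCounter(text);
                cache[text] = count;
            }
            return count;
        };
    }
}
```

Memoization pays off when the same pieces of text are counted repeatedly, which a recursive splitter tends to do. The trade-off is memory: an unbounded cache grows with the number of distinct strings seen.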

From a tokenizer abstraction

If you have an ITokenizer implementation:

public interface ITokenizer
{
 int[] Encode(string text);
 int ModelMaxLength { get; }
}

You can construct a Chunker directly:

public static Chunker Create(
 ITokenizer tokenizer,
 int? chunkSize = null,
 int? maxTokenChars = null,
 bool memoize = true,
 int? cacheMaxSize = null
)

If chunkSize is null, Semchunk.Net uses tokenizer.ModelMaxLength (analogous to Python’s model_max_length heuristic).

You’re free to implement ITokenizer for:

- Tiktoken
- Microsoft.ML.Tokenizers / TiktokenTokenizer.CreateForModel("gpt-4")
- any other tokenizer you like.
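
As an illustration of how small an implementation can be, the sketch below implements the two members of the interface shown above with a toy whitespace tokenizer. The interface is redeclared locally (in a hypothetical TokenizerSketch namespace) only so the example compiles on its own; in a real project you would implement the ITokenizer that ships with the package. WhitespaceTokenizer is an invented name and is not a real model tokenizer.

```csharp
using System;
using System.Collections.Generic;

namespace TokenizerSketch
{
    // Local copy of the interface shown above so this sketch compiles on its own;
    // in a real project, implement the ITokenizer provided by the package instead.
    public interface ITokenizer
    {
        int[] Encode(string text);
        int ModelMaxLength { get; }
    }

    // Hypothetical toy tokenizer: one token per whitespace-separated word.
    public sealed class WhitespaceTokenizer : ITokenizer
    {
        private readonly Dictionary<string, int> _vocabulary =
            new Dictionary<string, int>(StringComparer.Ordinal);

        public WhitespaceTokenizer(int modelMaxLength)
        {
            ModelMaxLength = modelMaxLength;
        }

        public int ModelMaxLength { get; }

        public int[] Encode(string text)
        {
            if (string.IsNullOrWhiteSpace(text)) return new int[0];

            var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            var ids = new int[words.Length];
            for (var i = 0; i < words.Length; i++)
            {
                // Assign the next free id the first time a word is seen.
                if (!_vocabulary.TryGetValue(words[i], out var id))
                {
                    id = _vocabulary.Count;
                    _vocabulary[words[i]] = id;
                }
                ids[i] = id;
            }
            return ids;
        }
    }
}
```

Only the length of the array returned by Encode matters for chunking, so any deterministic mapping from text to token ids is enough for experiments.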

`Chunker`

This is the main object you work with once created.

Chunk a single text

public IReadOnlyList<string> Chunk(
 string text,
 double? overlap = null
);

overlap:
- null → no overlap.
- < 1.0 → treated as a ratio of chunkSize (e.g. 0.2 → 20% overlap).
- >= 1.0 → treated as an absolute token count.

Chunk with offsets

public IReadOnlyList<string> Chunk(
 string text,
 out IReadOnlyList<(int Start, int End)> offsets,
 double? overlap = null
);
offsets[i] = (start, end) such that
chunks[i] == text.Substring(start, end - start).

Offsets are character indices into the original string (0-based, end-exclusive).
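
The invariant above is easy to check in tests. The helper below is a small sketch of such a check; OffsetCheckSketch and OffsetsMatch are hypothetical names, not part of the package.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of SemchunkNet.
public static class OffsetCheckSketch
{
    // Returns true when every chunk equals the slice of `text` that its
    // (Start, End) pair describes, with End treated as exclusive.
    public static bool OffsetsMatch(
        string text,
        IReadOnlyList<string> chunks,
        IReadOnlyList<(int Start, int End)> offsets)
    {
        if (chunks.Count != offsets.Count) return false;

        for (var i = 0; i < chunks.Count; i++)
        {
            var (start, end) = offsets[i];
            if (start < 0 || end < start || end > text.Length) return false;

            var slice = text.Substring(start, end - start);
            if (!string.Equals(slice, chunks[i], StringComparison.Ordinal)) return false;
        }
        return true;
    }
}
```

Passing the chunks and offsets returned by the offsets overload of Chunk, together with the original text, should yield true if the invariant holds.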

Chunk multiple texts

public IReadOnlyList<IReadOnlyList<string>> ChunkMany(
 IReadOnlyList<string> texts,
 double? overlap = null
);

public IReadOnlyList<IReadOnlyList<string>> ChunkMany(
 IReadOnlyList<string> texts,
 out IReadOnlyList<IReadOnlyList<(int Start, int End)>> allOffsets,
 double? overlap = null
);

Returns one list of chunks per input text.

In the offsets overload, allOffsets[i] corresponds to offsets for texts[i].

`ChunkerCore.Chunk(...)` (low-level API)

For advanced usage, you can call the algorithm directly:

ChunkResult ChunkerCore.Chunk(
 string text,
 int chunkSize,
 Func<string, int> tokenCounter,
 bool memoize = true,
 bool returnOffsets = false,
 double? overlap = null,
 int? cacheMaxSize = null,
 int recursionDepth = 0,
 int startOffset = 0
);

Mirrors Python’s semchunk.chunk(...).

Returns ChunkResult with:

public readonly struct ChunkResult
{
 public IReadOnlyList<string> Chunks { get; }
 public IReadOnlyList<(int Start, int End)> Offsets { get; }
}

You usually don’t need this unless you’re doing very custom plumbing.

How It Works 🔍

Semchunk.Net implements the same algorithm as the Python version:

Pick the most meaningful splitter
For each text, it chooses the best splitter in this order:
- Largest run of newlines / carriage returns (\n, \r)
- Largest run of tabs (\t)
- Largest run of whitespace (\s); or, if the longest run is a single char and there exists whitespace preceded by one of the punctuation splitters below, those specific whitespace characters
- Sentence terminators: ., ?, !, *
- Clause separators: ;, ,, (, ), [, ], “, ”, ‘, ’, ', ", `
- Sentence interrupters: :, —, …
- Word joiners: /, \, –, &, -
- Fallback: individual characters
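
The priority order above can be sketched roughly as follows. This is a simplified illustration, not the library's code: SplitterSketch and PickSplitter are hypothetical names, and the sketch leaves out the special case where a single whitespace character preceded by punctuation is preferred.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of SemchunkNet.
public static class SplitterSketch
{
    // Non-whitespace splitters in the priority order listed above.
    private static readonly string[] NonWhitespaceSplitters =
    {
        ".", "?", "!", "*",
        ";", ",", "(", ")", "[", "]", "“", "”", "‘", "’", "'", "\"", "`",
        ":", "—", "…",
        "/", "\\", "–", "&", "-"
    };

    // Returns the splitter to use, or "" to signal the per-character fallback.
    public static string PickSplitter(string text)
    {
        // Whitespace classes first: newlines, then tabs, then any whitespace.
        foreach (var pattern in new[] { @"[\r\n]+", @"\t+", @"\s+" })
        {
            var longest = Regex.Matches(text, pattern)
                .Cast<Match>()
                .Select(m => m.Value)
                .OrderByDescending(v => v.Length)
                .FirstOrDefault();
            if (longest != null) return longest;
        }

        // Otherwise the first punctuation splitter that occurs in the text.
        foreach (var splitter in NonWhitespaceSplitters)
        {
            if (text.Contains(splitter)) return splitter;
        }

        return "";
    }
}
```

Preferring the longest run means a blank line between paragraphs wins over a single line break, so paragraph boundaries are tried before line boundaries.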

Recursive splitting

Text is split by the chosen splitter into pieces.

For any piece whose token count exceeds localChunkSize (the effective chunk size, which is smaller than chunkSize only when overlap is set), Semchunk.Net recursively re-chunks that piece.

Merge underfull splits

Adjacent splits are merged using a binary-search-like heuristic to approximate the target chunk size, using an adaptive tokens/characters ratio.

This continues until each chunk is at or below the desired token limit.
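
To show the idea of merging, the sketch below uses a plain greedy merge rather than the binary-search-like heuristic described above: it re-counts tokens for every candidate, which is simpler but slower. MergeSketch and MergeSplits are hypothetical names, and the sketch assumes oversized pieces have already been handled by the recursive step.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of SemchunkNet.
public static class MergeSketch
{
    // Greedily joins adjacent pieces with `splitter` while the merged text
    // still fits within chunkSize tokens according to tokenCounter.
    public static List<string> MergeSplits(
        IReadOnlyList<string> splits,
        string splitter,
        int chunkSize,
        Func<string, int> tokenCounter)
    {
        var chunks = new List<string>();
        foreach (var piece in splits)
        {
            if (chunks.Count > 0)
            {
                var candidate = chunks[chunks.Count - 1] + splitter + piece;
                if (tokenCounter(candidate) <= chunkSize)
                {
                    // The piece fits: extend the current chunk in place.
                    chunks[chunks.Count - 1] = candidate;
                    continue;
                }
            }
            // First piece, or the candidate was too large: start a new chunk.
            chunks.Add(piece);
        }
        return chunks;
    }
}
```

The greedy version calls the token counter once per piece on an ever-growing string. Estimating the cut point from a tokens/characters ratio, as the description above outlines, is a way to avoid most of those calls.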

Reattach punctuation

If the splitter is non-whitespace and it makes sense to do so, trailing splitters are attached to the preceding chunk without breaking the token budget.

Otherwise, the splitter becomes its own small chunk with proper offsets.

Strip whitespace-only chunks

After the top-level pass, any chunks that are empty or consist only of whitespace are removed.
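
A sketch of this filtering step, under the assumption that chunks and offsets are kept as two parallel lists of equal length (WhitespaceFilterSketch and RemoveBlank are hypothetical names):

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of SemchunkNet.
public static class WhitespaceFilterSketch
{
    // Drops chunks that are empty or whitespace-only, keeping each surviving
    // chunk paired with its original (Start, End) offsets.
    public static (List<string> Chunks, List<(int Start, int End)> Offsets) RemoveBlank(
        IReadOnlyList<string> chunks,
        IReadOnlyList<(int Start, int End)> offsets)
    {
        var keptChunks = new List<string>();
        var keptOffsets = new List<(int Start, int End)>();
        for (var i = 0; i < chunks.Count; i++)
        {
            if (string.IsNullOrWhiteSpace(chunks[i])) continue;
            keptChunks.Add(chunks[i]);
            keptOffsets.Add(offsets[i]);
        }
        return (keptChunks, keptOffsets);
    }
}
```

Filtering both lists in the same loop keeps the offsets aligned with the chunks that remain.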

Build overlapping windows (optional)

If overlap is set:

The algorithm reduces the effective chunk size internally to localChunkSize = min(overlapTokens, chunkSize - overlapTokens), where overlapTokens is:
- floor(chunkSize * overlap) if overlap < 1 (ratio), or
- min(overlap, chunkSize - 1) if overlap >= 1 (absolute tokens).
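
The arithmetic above can be written out as a short sketch. OverlapSketch and its method names are hypothetical, and treating a zero overlap as leaving the chunk size unchanged is an assumption made here, not something stated above.

```csharp
using System;

// Hypothetical helper, not part of SemchunkNet.
public static class OverlapSketch
{
    // Converts the user-facing overlap value into a token count, following
    // the two rules stated above (ratio below 1, absolute count otherwise).
    public static int OverlapTokens(int chunkSize, double overlap)
    {
        return overlap < 1
            ? (int)Math.Floor(chunkSize * overlap)
            : (int)Math.Min(overlap, chunkSize - 1);
    }

    // Effective chunk size used for the initial non-overlapping pass.
    public static int LocalChunkSize(int chunkSize, double overlap)
    {
        var overlapTokens = OverlapTokens(chunkSize, overlap);
        // Assumption: a zero overlap leaves the chunk size unchanged.
        return overlapTokens == 0
            ? chunkSize
            : Math.Min(overlapTokens, chunkSize - overlapTokens);
    }
}
```

For chunkSize = 512, an overlap of 0.25 gives overlapTokens = 128 and a local chunk size of min(128, 384) = 128. An absolute overlap of 600 is capped at 511, which gives a local chunk size of min(511, 1) = 1.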

It first builds non-overlapping subchunks of size localChunkSize.

Then it merges groups of subchunks into overlapping windows, sliding by a stride derived from the non-overlapped portion so that each final chunk overlaps the previous by the specified amount.
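
The windowing step can be sketched as a sliding window over subchunk offsets. This is an illustration under assumptions, not the library's code: WindowSketch and BuildWindows are hypothetical names, and the window size and stride are taken as inputs because the exact formulas are not given above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of SemchunkNet.
public static class WindowSketch
{
    // Groups consecutive subchunk offsets into windows of up to
    // `subchunksPerChunk` subchunks, advancing `stride` subchunks each time.
    // Each window spans from its first subchunk's Start to its last subchunk's End.
    public static List<(int Start, int End)> BuildWindows(
        IReadOnlyList<(int Start, int End)> subchunkOffsets,
        int subchunksPerChunk,
        int stride)
    {
        if (subchunksPerChunk < 1) throw new ArgumentOutOfRangeException(nameof(subchunksPerChunk));
        if (stride < 1) throw new ArgumentOutOfRangeException(nameof(stride));

        var windows = new List<(int Start, int End)>();
        var count = subchunkOffsets.Count;
        for (var first = 0; first < count; first += stride)
        {
            var last = Math.Min(first + subchunksPerChunk, count) - 1;
            windows.Add((subchunkOffsets[first].Start, subchunkOffsets[last].End));
            if (last == count - 1) break; // the final subchunk is covered; stop
        }
        return windows;
    }
}
```

Each window's text is then text.Substring(Start, End - Start), which keeps the separators between subchunks intact. As one plausible choice (an assumption), a chunk size of 512 with a local chunk size of 128 could use 4 subchunks per window and a stride of 3, so that consecutive windows share one subchunk.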

The result is a sequence of chunks that respect a token budget but align much better with human sentence/paragraph structure than naive fixed-window or simple recursive character chunkers.

Benchmarks 📊

The original Python semchunk README reports (on a Ryzen 9 7900X, 96 GB RAM, Python 3.12):

~3.0 s to chunk the entire NLTK Gutenberg corpus into 512-token GPT-4 chunks
vs ~24.8 s for semantic-text-splitter under similar conditions

Semchunk.Net includes an analogous benchmark against the same corpus and a GPT-4-style tokenizer (via a .NET tiktoken implementation). The C# version appears to be at least as fast using Tiktoken (tryAGI):

Python version:

Number of texts: 18
semchunk: 2.71s, total chunks: 7390
semantic_text_splitter: 22.05s, total chunks: 7277

C# version:

Number of texts: 18
Semchunk.Net: 1.82s, total chunks: 7390

Licence 📄

This project is licensed under the MIT License, consistent with the original semchunk library.

Please see LICENCE for details. The core algorithm and design are by Isaacus (semchunk, Python); Semchunk.Net is an independent C#/.NET implementation of that work.

Product	Compatible	Computed
.NET	net8.0	net5.0, net5.0-windows, net6.0, net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows
.NET Core		netcoreapp2.0, netcoreapp2.1, netcoreapp2.2, netcoreapp3.0, netcoreapp3.1
.NET Standard	netstandard2.0	netstandard2.1
.NET Framework		net461, net462, net463, net47, net471, net472, net48, net481
MonoAndroid		monoandroid
MonoMac		monomac
MonoTouch		monotouch
Tizen		tizen40, tizen60
Xamarin.iOS		xamarinios
Xamarin.Mac		xamarinmac
Xamarin.TVOS		xamarintvos
Xamarin.WatchOS		xamarinwatchos

.NETStandard 2.0
- No dependencies.
net8.0
- No dependencies.

NuGet packages (2)

Showing the top 2 NuGet packages that depend on SemchunkNet:

Package	Description
SemchunkNet.MicrosoftML	Semchunk.Net tokenizer wrapper for Microsoft.ML.Tokenizers.
SemchunkNet.Tiktoken	Semchunk.Net tokenizer wrapper for the Tiktoken package.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
1.0.3	1,900	11/25/2025
1.0.2	219	11/24/2025
1.0.1	236	11/24/2025
1.0.0	227	11/24/2025

URL: https://www.nuget.org/packages/SemchunkNet/
