dotnet add package Encamina.Enmarcha.SemanticKernel.Connectors.Document --version 10.0.5
NuGet\Install-Package Encamina.Enmarcha.SemanticKernel.Connectors.Document -Version 10.0.5
<PackageReference Include="Encamina.Enmarcha.SemanticKernel.Connectors.Document" Version="10.0.5" />
Directory.Packages.props:
<PackageVersion Include="Encamina.Enmarcha.SemanticKernel.Connectors.Document" Version="10.0.5" />
Project file:
<PackageReference Include="Encamina.Enmarcha.SemanticKernel.Connectors.Document" />
paket add Encamina.Enmarcha.SemanticKernel.Connectors.Document --version 10.0.5
#r "nuget: Encamina.Enmarcha.SemanticKernel.Connectors.Document, 10.0.5"
#:package Encamina.Enmarcha.SemanticKernel.Connectors.Document@10.0.5
Install as a Cake Addin:
#addin nuget:?package=Encamina.Enmarcha.SemanticKernel.Connectors.Document&version=10.0.5
Install as a Cake Tool:
#tool nuget:?package=Encamina.Enmarcha.SemanticKernel.Connectors.Document&version=10.0.5
Document Connectors specializes in reading information from files in various formats and subsequently chunking it. The most typical use case, within the context of generating document embeddings, is reading information from a variety of file formats (pdf, docx, pptx, etc.) and chunking its content into smaller parts.
First, install NuGet. Then, install Encamina.Enmarcha.SemanticKernel.Connectors.Document from the package manager console:
PM> Install-Package Encamina.Enmarcha.SemanticKernel.Connectors.Document
First, install .NET CLI. Then, install Encamina.Enmarcha.SemanticKernel.Connectors.Document from the .NET CLI:
dotnet add package Encamina.Enmarcha.SemanticKernel.Connectors.Document
Starting from a Program.cs or a similar entry point file in your project, add the following code:
// Entry point
var builder = WebApplication.CreateBuilder(new WebApplicationOptions
{
// ...
});
// ...
services.AddDefaultDocumentContentExtractor();
This extension method adds the default implementation of the IDocumentContentExtractor interface as a singleton. The default implementation is DefaultDocumentContentExtractor. With this, we can resolve the IDocumentContentExtractor interface and obtain the chunks of a file:
public class MyClass
{
    private readonly IDocumentContentExtractor documentContentExtractor;

    public MyClass(IDocumentContentExtractor documentContentExtractor)
    {
        this.documentContentExtractor = documentContentExtractor;
    }

    public IEnumerable<string> GetPdfChunks()
    {
        using var file = File.OpenRead("example.pdf");
        var pdfChunks = documentContentExtractor.GetDocumentContent(file, ".pdf");
        return pdfChunks;
    }
}
Alternatively, resolve the extractor directly from the service provider:
var serviceProvider = services.BuildServiceProvider();
var documentContentExtractor = serviceProvider.GetRequiredService<IDocumentContentExtractor>();
using var file = File.OpenRead("example.pdf");
var fileChunks = documentContentExtractor.GetDocumentContent(file, ".pdf");
For the above code to be fully functional, it is necessary to configure some additional services, specifically an implementation of the ITextSplitter interface and a length function (Func<string, int>).
The previous code, based on the file extension, searches for a suitable IDocumentConnector for the file type, processes the file to extract its text and finally, it uses an ITextSplitter to split the text into chunks.
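The three steps described above (pick a reader by extension, extract the text, split it into chunks) can be illustrated with a minimal, self-contained sketch. It does not use the package's types: ExtractionPipelineSketch, Readers, GetChunks and the naive fixed-size splitting are all hypothetical stand-ins chosen for illustration, and the real extractor delegates these steps to IDocumentConnector and ITextSplitter.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical stand-in for the extraction pipeline; none of these names come from the package.
public static class ExtractionPipelineSketch
{
    // Maps an upper-cased file extension to a function that turns a stream into plain text.
    private static readonly Dictionary<string, Func<Stream, string>> Readers = new()
    {
        [".TXT"] = ReadAllUtf8,
        [".MD"] = ReadAllUtf8,
    };

    // Note: this is an iterator, so exceptions surface when the result is enumerated.
    public static IEnumerable<string> GetChunks(Stream file, string fileExtension, int chunkSize)
    {
        if (chunkSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        }

        // Step 1: pick a reader from the file extension, or fail for unknown formats.
        if (!Readers.TryGetValue(fileExtension.ToUpperInvariant(), out var read))
        {
            throw new NotSupportedException(fileExtension);
        }

        // Step 2: extract the text.
        var text = read(file);

        // Step 3: split the text into fixed-size chunks (a deliberately naive splitter).
        for (var i = 0; i < text.Length; i += chunkSize)
        {
            yield return text.Substring(i, Math.Min(chunkSize, text.Length - i));
        }
    }

    private static string ReadAllUtf8(Stream stream)
    {
        using var reader = new StreamReader(stream, Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```

A real text splitter would normally cut on separators such as paragraphs or sentences rather than at fixed offsets; the fixed size here only keeps the sketch short.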
The default implementation, DefaultDocumentContentExtractor, uses the following IDocumentConnectors:
WordDocumentConnector: For .docx files, it extracts the text from the file by adding each paragraph on a new line.
For .pdf files, it extracts the raw text from the file (with all words separated by spaces) and removes common words, typically headers or footers, that appear in at least 25% of the document.
For .pptx files, it extracts the text from the file, with one line per paragraph found in each slide.
TxtDocumentConnector: For .txt files, it extracts the raw text from the file using UTF-8 as the character encoding.
For .md files, it extracts the raw text from the file using UTF-8 as the character encoding.
For .vtt files, it extracts the text from the subtitles while removing the timestamp marks. It uses UTF-8 as the character encoding.
For other formats, it throws a NotSupportedException.
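To make the .vtt behaviour described above more concrete, here is a minimal sketch of stripping timestamp marks from WebVTT subtitles. It is not the package's implementation: VttTextSketch and ExtractText are hypothetical names, and the sketch only drops the WEBVTT header, blank lines and cue timing lines, so cue identifiers and NOTE blocks would be kept as text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of the package.
public static class VttTextSketch
{
    // Matches cue timing lines such as "00:00:01.000 --> 00:00:04.000" (the hours part is optional).
    private static readonly Regex TimingLine =
        new(@"^(\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(\d{2,}:)?\d{2}:\d{2}\.\d{3}");

    public static string ExtractText(string vtt)
    {
        var kept = new List<string>();
        foreach (var rawLine in vtt.Split('\n'))
        {
            // Trim also removes a trailing '\r' from Windows line endings.
            var line = rawLine.Trim();
            if (line.Length == 0
                || line.StartsWith("WEBVTT", StringComparison.Ordinal)
                || TimingLine.IsMatch(line))
            {
                continue;
            }

            kept.Add(line);
        }

        return string.Join("\n", kept);
    }
}
```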
Additional IDocumentConnectors, not used by the default implementation:
For .pptx files, it extracts the text from the file with just one line for each slide found.
For .pdf files, it extracts the raw text from the file for each page (all words separated by spaces) and adds a line break between the text of each page.
PdfWithTocDocumentConnector: For .pdf files, it retrieves the Table of Contents and generates, for each Table of Contents item, a text with the section title, a colon (:), and the content of the section (e.g. Title1: Content of the Title1 section), adding a line break between sections. The output format of the text is configurable with the TocItemFormat property. Additionally, it removes common words, typically headers or footers, that appear in at least 25% of the document.
For .pdf files, it extracts the text from the file and attempts to preserve the document's formatting, including paragraphs, titles, and other structural elements. Additionally, it removes common words, typically headers or footers, that appear in at least 25% of the document, and it excludes non-horizontal text. Note that this process relies on OCR recognition, which is not perfect, so the results may vary depending on the quality of the PDF.
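Several of the PDF connectors above remove text that repeats across at least 25% of the document, typically headers or footers. The exact heuristic is not described here, so the following is only a sketch of one plausible approach working on whole lines per page. RepeatedLineFilterSketch, RemoveRepeatedLines and the "at least two pages" floor are assumptions made for illustration, not the package's behaviour.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the package's real heuristic may differ.
public static class RepeatedLineFilterSketch
{
    public static List<List<string>> RemoveRepeatedLines(List<List<string>> pages, double threshold = 0.25)
    {
        // Count, for every distinct line, the number of pages that contain it.
        var pageCounts = new Dictionary<string, int>();
        foreach (var page in pages)
        {
            foreach (var line in page.Distinct())
            {
                pageCounts.TryGetValue(line, out var seen);
                pageCounts[line] = seen + 1;
            }
        }

        // A line is dropped when it appears on at least threshold * pageCount pages.
        // The floor of 2 (an assumption) avoids wiping out very short documents.
        var limit = Math.Max(2, threshold * pages.Count);

        return pages
            .Select(page => page.Where(line => pageCounts[line] < limit).ToList())
            .ToList();
    }
}
```

With eight pages that all start with the same header line, the header is removed from every page while the page-specific lines are kept.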
To use your own IDocumentConnectors, you can derive from the DocumentContentExtractorBase base class and override the GetDocumentConnector method. This way, you can return your own IDocumentConnectors to handle a specific file format based on the file extension.
public class MyCustomDocumentContentExtractor : DocumentContentExtractorBase
{
    public MyCustomDocumentContentExtractor(ITextSplitter textSplitter, Func<string, int> lengthFunction) : base(textSplitter, lengthFunction)
    {
    }

    protected override IDocumentConnector GetDocumentConnector(string fileExtension)
    {
        // The extension is upper-cased, so the patterns must be upper-case as well.
        return fileExtension.ToUpperInvariant() switch
        {
            @".RTF" => new MyCustomRtfDocumentConnector(),
            @".PDF" => new PdfWithTocDocumentConnector(),
            @".TXT" => new TxtDocumentConnector(Encoding.UTF8),
            _ => throw new NotSupportedException(fileExtension),
        };
    }
}
Don't forget to register it.
// Entry point
var builder = WebApplication.CreateBuilder(new WebApplicationOptions
{
// ...
});
// ...
// Now we use our own implementation
// services.AddDefaultDocumentContentExtractor();
services.AddSingleton<IDocumentContentExtractor, MyCustomDocumentContentExtractor>();
With this, you will be able to use the extractor you need for each type of file.
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
This package is not used by any NuGet packages.
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 10.0.5 | 119 | 6/1/2026 |
| 10.0.4 | 576 | 4/8/2026 |
| 10.0.3 | 252 | 4/6/2026 |
| 10.0.2 | 491 | 12/17/2025 |
| 10.0.1 | 307 | 12/17/2025 |
| 10.0.0 | 313 | 12/16/2025 |
| 10.0.0-preview-09 | 431 | 11/19/2025 |
| 10.0.0-preview-08 | 446 | 11/18/2025 |
| 10.0.0-preview-07 | 711 | 10/22/2025 |
| 10.0.0-preview-06 | 321 | 10/14/2025 |
| 10.0.0-preview-05 | 211 | 10/8/2025 |
| 10.0.0-preview-04 | 209 | 10/7/2025 |
| 10.0.0-preview-03 | 343 | 9/16/2025 |
| 10.0.0-preview-02 | 342 | 9/16/2025 |
| 8.3.0 | 512 | 9/10/2025 |
| 8.3.0-preview-02 | 221 | 9/10/2025 |
| 8.3.0-preview-01 | 223 | 9/8/2025 |
| 8.2.1-preview-08 | 228 | 8/18/2025 |
| 8.2.1-preview-07 | 208 | 8/12/2025 |