- .NET CLI: `dotnet add package FieldCure.DocumentParsers --version 2.0.1`
- Package Manager: `NuGet\Install-Package FieldCure.DocumentParsers -Version 2.0.1`
- PackageReference: `<PackageReference Include="FieldCure.DocumentParsers" Version="2.0.1" />`
- Central Package Management: `<PackageVersion Include="FieldCure.DocumentParsers" Version="2.0.1" />` in `Directory.Packages.props`, plus `<PackageReference Include="FieldCure.DocumentParsers" />` in the project file
- Paket CLI: `paket add FieldCure.DocumentParsers --version 2.0.1`
- Script and Interactive: `#r "nuget: FieldCure.DocumentParsers, 2.0.1"`
- File-based apps: `#:package FieldCure.DocumentParsers@2.0.1`
- Cake Addin: `#addin nuget:?package=FieldCure.DocumentParsers&version=2.0.1`
- Cake Tool: `#tool nuget:?package=FieldCure.DocumentParsers&version=2.0.1`

Lightweight document text extraction for .NET — DOCX, HWPX, XLSX, PPTX, HTML, and PDF. Structured Markdown output for LLM / RAG consumption. Pure managed, no native binaries.

- PDF output with `## Page {n}` headers (auto-registered)
- Equations in DOCX (`m:oMath`) and HWPX (`hp:equation`) converted to `[math: LaTeX]`
- Metadata as YAML front matter (title, author, created, modified, subject, keywords, description)
- Footnotes and endnotes as `[^N]` / `[^enN]` inline references with definition sections
- Comments in `> **[Comment — author]:**` format
- `DocumentParserFactory.GetParser(".docx")` returns the right parser
- Targets `net8.0` and `net10.0`, no native binaries, no Windows-specific APIs
- Extensible: implement `IDocumentParser` and call `DocumentParserFactory.Register()`

Install: `dotnet add package FieldCure.DocumentParsers`

    using System;
    using System.IO;
    using FieldCure.DocumentParsers;

    // Auto-detect the parser by extension; PDF is registered out of the box.
    var parser = DocumentParserFactory.GetParser(".pdf");
    if (parser is not null)
    {
        var bytes = File.ReadAllBytes("report.pdf");
        var text = parser.ExtractText(bytes);
        Console.WriteLine(text);
    }

    // Check all supported extensions
    foreach (var ext in DocumentParserFactory.SupportedExtensions)
        Console.WriteLine(ext); // .docx, .hwpx, .xlsx, .pptx, .html, .htm, .pdf

    // Opt-out control for metadata, footnotes, etc.
    // The result is named docxText so it does not clash with `text` above.
    var docxParser = new DocxParser();
    var options = new ExtractionOptions
    {
        IncludeMetadata = false,
        IncludeFootnotes = false
    };
    var docxText = docxParser.ExtractText(File.ReadAllBytes("report.docx"), options);

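
In practice the extension usually comes from a file path rather than a literal. A minimal sketch of that step follows. `NormalizeExtension` is a hypothetical helper, not part of the package, and lowercasing is a defensive assumption, since the text above does not say whether `GetParser` matches case-insensitively.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of FieldCure.DocumentParsers): derive a
// lookup key such as ".pdf" from a file path. Lowercasing is an assumption;
// the package may already match extensions case-insensitively.
string? NormalizeExtension(string path)
{
    var ext = Path.GetExtension(path);   // empty string when there is no extension
    return string.IsNullOrEmpty(ext) ? null : ext.ToLowerInvariant();
}

Console.WriteLine(NormalizeExtension("Reports/Q1.PDF"));      // .pdf
Console.WriteLine(NormalizeExtension("notes") ?? "(none)");   // (none)
```

The result can be passed to `DocumentParserFactory.GetParser(...)`, keeping the null check shown in the usage code for paths with no extension or an unsupported one.
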

Headings are prefixed with `#` markers, and tables are rendered as Markdown pipe tables.
Documents with metadata start with YAML front matter; footnotes and endnotes are rendered as `[^N]` references with a definitions section:

    ---
    title: 2026 Business Plan
    author: Alice
    created: 2026-04-01
    ---

    > **[Header]:** Company Confidential

    # 2026 Business Plan

    Please refer to the table below[^1] for details.

    | Category | Q1 | Q2 |
    | --- | --- | --- |
    | Revenue | 100 | 150 |
    | Cost | 80 | 90 |

    > **[Footer]:** Page 1

    ## Footnotes

    [^1]: Source: internal finance report.

Pipe characters inside cells are escaped as \| to preserve table structure.
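
To illustrate why this escaping matters, here is a minimal sketch of building one table row from raw cell values. `EscapeCell` and `ToRow` are hypothetical names for illustration only. They are not the package's internals, and the sketch does not handle text that is already escaped.

```csharp
using System;
using System.Linq;

// Illustrative only: escape pipes so cell text cannot split a Markdown row.
string EscapeCell(string cell) => cell.Replace("|", "\\|");

// Join raw cell values into a single Markdown table row.
string ToRow(params string[] cells) =>
    "| " + string.Join(" | ", cells.Select(c => EscapeCell(c))) + " |";

Console.WriteLine(ToRow("Metric", "A|B ratio", "1.5"));
// | Metric | A\|B ratio | 1.5 |
```

Without the escape, the second cell would be read as two cells and the row would no longer line up with its header.
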
Use ExtractionOptions to selectively disable metadata, footnotes, comments, or headers/footers.
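
When metadata stays enabled, a consumer such as a RAG indexer may want the front matter separated from the body. The sketch below assumes the block is delimited by `---` lines as in the sample output and holds only flat `key: value` pairs. `SplitFrontMatter` is a hypothetical helper, not a package API, and it is not a general YAML parser.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: split "---"-delimited front matter from the Markdown body.
// Handles only flat "key: value" lines; anything richer needs a real YAML parser.
(Dictionary<string, string> Meta, string Body) SplitFrontMatter(string markdown)
{
    var meta = new Dictionary<string, string>();
    var lines = markdown.Replace("\r\n", "\n").Split('\n');
    if (lines[0].Trim() != "---")
        return (meta, markdown);               // no front matter present

    int end = Array.FindIndex(lines, 1, l => l.Trim() == "---");
    if (end < 0)
        return (meta, markdown);               // unterminated block: leave input as-is

    for (int i = 1; i < end; i++)
    {
        int colon = lines[i].IndexOf(':');     // split on the first colon only
        if (colon > 0)
            meta[lines[i][..colon].Trim()] = lines[i][(colon + 1)..].Trim();
    }
    var body = string.Join("\n", lines[(end + 1)..]).TrimStart('\n');
    return (meta, body);
}

var sample = "---\ntitle: 2026 Business Plan\nauthor: Alice\n---\n# 2026 Business Plan";
var (fields, rest) = SplitFrontMatter(sample);
Console.WriteLine(fields["title"]);   // 2026 Business Plan
Console.WriteLine(rest);              // # 2026 Business Plan
```
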
| Format | Extension | Parser | Description |
|---|---|---|---|
| Word | .docx | DocxParser | OpenXML (Office 2007+) |
| Hangul | .hwpx | HwpxParser | OWPML (Hancom Office) |
| Excel | .xlsx | XlsxParser | OpenXML spreadsheets |
| PowerPoint | .pptx | PptxParser | OpenXML presentations |
| HTML | .html, .htm | HtmlParser | SmartReader + ReverseMarkdown |
| PDF | .pdf | PdfParser | PdfPig (pure managed, text only) |

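
The `## Page {n}` headers in PDF output make per-page chunking straightforward. The sketch below assumes each header sits alone on its own line in exactly that form. `SplitPages` is a hypothetical helper and not part of the package.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: group extracted PDF text by its "## Page N" headers.
// Assumes each header occupies a whole line; text before the first header is dropped.
SortedDictionary<int, string> SplitPages(string markdown)
{
    var pages = new SortedDictionary<int, string>();
    var header = new Regex(@"^## Page (\d+)[ \t]*\r?$", RegexOptions.Multiline);
    var matches = header.Matches(markdown);
    for (int i = 0; i < matches.Count; i++)
    {
        // A page's content runs from the end of its header to the next header.
        int start = matches[i].Index + matches[i].Length;
        int end = i + 1 < matches.Count ? matches[i + 1].Index : markdown.Length;
        pages[int.Parse(matches[i].Groups[1].Value)] = markdown[start..end].Trim();
    }
    return pages;
}

var text = "## Page 1\nIntro text.\n\n## Page 2\nMore text.\n";
foreach (var (number, content) in SplitPages(text))
    Console.WriteLine($"{number}: {content}");
// 1: Intro text.
// 2: More text.
```

Keeping the page number as the key lets a downstream index cite the source page for each chunk.
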

PDF text extraction is built in. Two sibling packages add extra PDF capabilities that require native binaries:

- Imaging (`FieldCure.DocumentParsers.Imaging`): `PdfImageRenderer : IMediaDocumentParser` (PDFium via PDFtoImage).
- OCR (namespace `FieldCure.DocumentParsers.Ocr`): `OcrPdfParser` + `TesseractOcrEngine` (English + Korean tessdata).

Each one is enabled with a single registration call:

    using FieldCure.DocumentParsers.Imaging;
    using FieldCure.DocumentParsers.Ocr;

    // Imaging (page → PNG)
    DocumentParserFactoryImagingExtensions.AddImagingSupport();

    // OCR (scanned PDFs)
    using var ocr = DocumentParserFactoryOcrExtensions.AddOcrSupport();

MIT — Copyright (c) 2026 FieldCure Co., Ltd.
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | Compatible: net8.0, net10.0. Computed: net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows. |

Showing the top 4 NuGet packages that depend on FieldCure.DocumentParsers:
| Package | Description |
|---|---|
| FieldCure.Ai.Providers | AI provider clients for Claude, OpenAI, Gemini, Ollama, and Groq. Shared models and streaming support. |
| FieldCure.DocumentParsers.Audio | Audio transcription parser for FieldCure.DocumentParsers. Converts MP3, WAV, M4A, OGG, FLAC, and WebM audio into timestamped Markdown transcripts via Whisper.net. |
| FieldCure.DocumentParsers.Pdf | PDF text extraction and page image rendering for FieldCure.DocumentParsers |
| FieldCure.DocumentParsers.Imaging | PDF page image rendering for FieldCure.DocumentParsers via PDFtoImage (PDFium). Adds IMediaDocumentParser capability to the core PDF parser. |

This package is not used by any popular GitHub repositories.
v2.0.1 — `ExtractionOptions` is no longer sealed so downstream parser packages (e.g. FieldCure.DocumentParsers.Audio) can subclass it. Adds `ExtractionOptions.SourceExtension` so callers can hint the source format and let parsers skip probing.