What is Docling?

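Docling is an open-source document processing library that converts documents in various formats into structured outputs.
It plays an important part in a RAG (retrieval-augmented generation) pipeline, where documents have to be parsed into clean text before they can be indexed.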

I'll be taking you through the process of parsing PDFs into structured formats.

Step 1: Set up

Create the project structure in your terminal. The later commands assume you run this inside your ~/Documents folder:

mkdir docling_cli
cd docling_cli

Create your virtual environment and activate it. The activation command differs between Fedora and Windows.
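
For reference, here is a minimal sketch of this step using Python's built-in venv module. The folder name .venv is an assumption, not something taken from the original screenshots, and the Windows lines are shown as comments because they use a different shell.

```shell
# Create a virtual environment in a folder named .venv (the name is an assumption)
python3 -m venv .venv

# Activate it on Fedora or any other Linux shell ("." is the portable form of "source")
. .venv/bin/activate

# On Windows (PowerShell), the equivalent would be:
#   python -m venv .venv
#   .venv\Scripts\Activate.ps1
```

Once the environment is active, pip installs packages into .venv instead of the system Python.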

Step 2: Installing Docling

Install Docling with pip, then check the installed version. The commands are the same on Fedora and Windows:

pip install docling
docling --version

Step 3: Creating input and output folders

Create a folder called data where you will store the PDFs you want to convert.
Create a new folder named outputs, then inside it create three folders: markdown_outputs, html_outputs and json_outputs.
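
In a terminal, these folders can be created in one go. This is a minimal sketch, assuming you are inside the docling_cli folder and want the underscore names that the later --output paths use:

```shell
# Input folder for the PDFs you want to convert
mkdir -p data

# One output folder per format; -p also creates the parent "outputs" folder
mkdir -p outputs/markdown_outputs outputs/html_outputs outputs/json_outputs
```

The -p flag makes the command safe to rerun, since it does not complain about folders that already exist.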

Step 4: Default options

Start by running Docling with its default options. With no extra flags, it converts the PDF into Markdown format:

docling your-pdf

To get HTML instead, run this from the folder that holds your PDFs (so that *.pdf matches them) and pass --to html with an output folder:

docling --to html *.pdf --output ~/Documents/docling_cli/outputs/html_outputs

Step 5: Changing the PDFs into other formats

1. Markdown

docling --to md *.pdf --output ~/Documents/docling_cli/outputs/markdown_outputs

2. JSON

docling --to json *.pdf --output ~/Documents/docling_cli/outputs/json_outputs

3. Plain text

docling --to text *.pdf --output ~/Documents/docling_cli/outputs/plaintext_outputs

4. YAML

docling --to yaml *.pdf --output ~/Documents/docling_cli/outputs/yaml_outputs

5. html_split_page

docling --to html_split_page *.pdf --output ~/Documents/docling_cli/outputs/html_split_page_outputs

6. DocTags

docling --to doctags *.pdf --output ~/Documents/docling_cli/outputs/doctags_outputs

7. VTT

docling --to vtt *.pdf --output ~/Documents/docling_cli/outputs/vtt_outputs
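
The commands above differ only in the format name and the output folder, so they can be sketched as one loop. This is a minimal sketch, not something from my original run. It assumes you are in the project root, that the PDFs sit in data/, and that each output folder is named after the --to value (for example md_outputs rather than markdown_outputs). OUT_ROOT and DRY_RUN are hypothetical helper variables. By default the loop only prints each command, so you can check it before running anything.

```shell
# Hypothetical helper variables (not part of Docling itself)
OUT_ROOT="${OUT_ROOT:-outputs}"   # parent folder for all results
DRY_RUN="${DRY_RUN:-1}"           # set DRY_RUN=0 to actually call docling

# Format names are the --to values used in the steps above
for fmt in md html json text yaml html_split_page doctags vtt; do
  out_dir="$OUT_ROOT/${fmt}_outputs"
  mkdir -p "$out_dir"             # the folder is created even in a dry run

  if [ "$DRY_RUN" = "1" ]; then
    # Only show the command that would run
    echo "docling --to $fmt data/*.pdf --output $out_dir"
  else
    docling --to "$fmt" data/*.pdf --output "$out_dir"
  fi
done
```

Run it once as is to review the printed commands, then rerun with DRY_RUN=0 to do the conversions.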

Step 6: Analyzing the results

I used three types of PDFs:
one with tables, one with text and images, and one with tables and paragraphs. Here are my key findings:

1. PDF with tables

In HTML, the rows and columns came out looking better than they did in the original PDF.
The Markdown output was good too: it wrote the tables in Markdown table syntax without losing anything.
JSON broke everything down into nested objects.
Plain text was good too, but not as good as Markdown.

2. PDF with text and images

HTML lost the color of the images.

3. PDF with tables and paragraphs

Paragraphs came out nicely as text in all formats.

URL: https://dev.to/njeri_kimaru/docling-cli-to-parse-pdfs-and-export-it-to-multiple-formats-3cgc
