URL: https://dev.to/njeri_kimaru/docling-cli-to-parse-pdfs-and-export-it-to-multiple-formats-3cgc

Docling CLI to parse PDFs and export it to multiple formats - DEV Community


What is Docling?

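Docling is an open-source document processing library that converts various document formats into structured outputs.
Docling plays an important part in RAG (retrieval-augmented generation) pipelines, where documents have to be parsed into clean, structured text before they can be indexed and retrieved.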

I'll be taking you through the process of parsing PDFs into structured formats.

Step 1: Set up

  • Create the project structure in your terminal:
mkdir docling_cli
cd docling_cli
  • Create your virtual environment and activate it. The activation command differs between Fedora and Windows.
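
A minimal sketch of this step on Fedora or another Linux system, assuming Python 3 is installed and using .venv as the directory name (a common convention, not something this walkthrough prescribes):

```shell
# Create a virtual environment in the .venv directory
# (.venv is an assumed name; any directory name works).
python3 -m venv .venv

# Activate it for the current shell session (Linux/macOS).
. .venv/bin/activate
```

On Windows the environment is usually created with python -m venv .venv, and the activation script is .venv\Scripts\activate instead.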

Step 2: Installing docling

Install docling with pip, then check its version to confirm that the CLI is available. The commands are the same on Fedora and Windows:

pip install docling
docling --version

Step 3: Creating input and outputs folders

  • Create a folder called data where you will store the PDFs you want to convert.
  • Create a folder called outputs, then inside it create three folders: markdown_outputs, html_outputs and json_outputs.
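
The same layout can be created from the terminal. This is a small sketch that assumes you are in the docling_cli project root and that the folder names use underscores, matching the paths passed to --output in the later steps:

```shell
# Input folder for the PDFs to convert.
mkdir -p data

# One output folder per export format used first in this walkthrough.
for name in markdown html json; do
    mkdir -p "outputs/${name}_outputs"
done
```

mkdir -p does nothing if a folder already exists, so the snippet is safe to run more than once.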

Step 4: Default options

Start by running docling with its default options, which convert the PDF into Markdown format:

docling your-pdf

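Step 4: Changing the PDFs into HTML format

Run the command from the folder that holds your PDFs so that *.pdf matches them. The path after --output assumes the project lives in ~/Documents; adjust it to wherever you created docling_cli.

docling --to html *.pdf --output ~/Documents/docling_cli/outputs/html_outputs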

Step 5: Changing the PDFs into other formats

1. Markdown

docling --to md *.pdf --output ~/Documents/docling_cli/outputs/markdown_outputs

2. JSON

docling --to json *.pdf --output ~/Documents/docling_cli/outputs/json_outputs

3. Plain text

docling --to text *.pdf --output ~/Documents/docling_cli/outputs/plaintext_outputs

4. YAML

docling --to yaml *.pdf --output ~/Documents/docling_cli/outputs/yaml_outputs

5. html_split_page

docling --to html_split_page *.pdf --output ~/Documents/docling_cli/outputs/html_split_page_outputs

6. DocTags

docling --to doctags *.pdf --output ~/Documents/docling_cli/outputs/doctags_outputs

7. VTT

docling --to vtt *.pdf --output ~/Documents/docling_cli/outputs/vtt_outputs
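
Since the commands above differ only in the format name and the output folder, they can be wrapped in a loop. The sketch below is one possible way to do that, not part of the original walkthrough. It assumes the PDFs live in data/, that it runs from the project root, and that each output folder is named after the format value itself (so md_outputs and text_outputs rather than markdown_outputs and plaintext_outputs):

```shell
#!/bin/sh
# Sketch: run one docling conversion per output format.
# Assumptions: PDFs are in data/, the script runs from the project root,
# and folders are named <format>_outputs.
OUT_ROOT="outputs"
FORMATS="md html json text yaml html_split_page doctags vtt"

for fmt in $FORMATS; do
    out_dir="$OUT_ROOT/${fmt}_outputs"
    mkdir -p "$out_dir"
    if command -v docling >/dev/null 2>&1; then
        # Same flags as the individual commands above.
        docling --to "$fmt" --output "$out_dir" data/*.pdf \
            || echo "conversion to $fmt failed" >&2
    else
        # Lets the loop act as a dry run when docling is not installed.
        echo "docling not found, skipping: $fmt -> $out_dir"
    fi
done
```

A failed conversion is reported and the loop moves on to the next format, so one unsupported combination does not stop the rest.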

Step 6: Analyzing the results

I used three types of PDFs: one with tables, one with text and images, and one with tables and paragraphs. Here are my key findings:

1. PDF with tables

  • In HTML, the rows and columns came out better than they were in the original PDF.
  • Markdown output was good too: it wrote the tables in Markdown format without losing anything.
  • JSON broke everything down into nested objects.
  • Plain text was good too, but not as good as Markdown.
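
One quick way to sanity-check the table findings is to count the table rows in a Markdown output and compare the number with the source PDF. The helper below is a hypothetical sketch: count_table_rows is a name made up for this example, and it assumes the exported tables are pipe-style Markdown tables whose rows start with "|".

```shell
# Hypothetical helper: count lines that look like pipe-table rows
# in a Markdown file. The header separator row (|---|---|) is counted too.
count_table_rows() {
    # grep -c prints the number of matching lines; "|| true" keeps a
    # count of zero from being reported as a failure.
    grep -c '^[[:space:]]*|' "$1" || true
}

# Example usage (the file name is illustrative):
# count_table_rows outputs/markdown_outputs/report.md
```

If the count is far lower than the number of rows visible in the PDF, the table was probably flattened or split during conversion.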

2. PDF with text and images

  • The HTML output lost the color of the images.

3. PDF with tables and paragraphs

  • Paragraphs came out nicely as text in all formats.