Donut (base-sized model, pre-trained only)

Donut model, pre-trained only (not fine-tuned on a downstream task). It was introduced in the paper OCR-free Document Understanding Transformer by Geewook Kim et al. and first released in this repository.

Disclaimer: The team releasing Donut did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape (batch_size, seq_len, hidden_size). The decoder then generates text autoregressively, one token at a time, conditioned on the encoder's output.
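The encode-then-decode flow described above can be sketched in plain Python. This is only an illustration, not the Donut implementation. The 2560x1920 input size, the 32x total spatial downsampling, and the hidden size of 1024 are assumptions made for this example, since the card itself does not state them. `toy_scorer` is a hypothetical stand-in for a trained decoder.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed values for illustration only; they are not stated in this model card.
ASSUMED_DOWNSAMPLE = 32    # total spatial reduction of the vision encoder
ASSUMED_HIDDEN_SIZE = 1024  # width of each encoder embedding


def encoder_output_shape(batch_size: int, height: int, width: int,
                         downsample: int = ASSUMED_DOWNSAMPLE,
                         hidden_size: int = ASSUMED_HIDDEN_SIZE) -> Tuple[int, int, int]:
    """Return (batch_size, seq_len, hidden_size) for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsampling factor")
    # Each remaining spatial position becomes one element of the sequence.
    seq_len = (height // downsample) * (width // downsample)
    return (batch_size, seq_len, hidden_size)


def greedy_decode(next_token_scores: Callable[[Sequence[int], List[int]], List[float]],
                  encoder_states: Sequence[int],
                  bos_id: int, eos_id: int, max_new_tokens: int) -> List[int]:
    """Generate tokens one at a time, each conditioned on the encoder states
    and on all tokens generated so far (greedy autoregressive decoding)."""
    tokens = [bos_id]
    for _ in range(max_new_tokens):
        scores = next_token_scores(encoder_states, tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


# Hypothetical toy vocabulary and decoder, used only to exercise the loop.
VOCAB_SIZE, BOS, EOS = 5, 0, 1


def toy_scorer(encoder_states: Sequence[int], tokens: List[int]) -> List[float]:
    """Stand-in decoder: copies one id from the encoder states per step, then EOS."""
    step = len(tokens) - 1
    target = encoder_states[step] if step < len(encoder_states) else EOS
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]


print(encoder_output_shape(1, 2560, 1920))  # (1, 4800, 1024) under the assumed values
print(greedy_decode(toy_scorer, [3, 4, 2], BOS, EOS, max_new_tokens=10))  # [0, 3, 4, 2, 1]
```

In the real model the scoring function is the BART decoder attending to the Swin encoder output through cross-attention, and decoding strategies other than greedy selection (for example beam search) may be used.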

Intended uses & limitations

This model is meant to be fine-tuned on a downstream task, like document image classification or document parsing. See the model hub to look for fine-tuned versions on a task that interests you.

How to use

We refer to the documentation, which includes code examples.

BibTeX entry and citation info

@article{DBLP:journals/corr/abs-2111-15664,
 author = {Geewook Kim and
 Teakgyu Hong and
 Moonbin Yim and
 Jinyoung Park and
 Jinyeong Yim and
 Wonseok Hwang and
 Sangdoo Yun and
 Dongyoon Han and
 Seunghyun Park},
 title = {Donut: Document Understanding Transformer without {OCR}},
 journal = {CoRR},
 volume = {abs/2111.15664},
 year = {2021},
 url = {https://arxiv.org/abs/2111.15664},
 eprinttype = {arXiv},
 eprint = {2111.15664},
 timestamp = {Thu, 02 Dec 2021 10:50:44 +0100},
 biburl = {https://dblp.org/rec/journals/corr/abs-2111-15664.bib},
 bibsource = {dblp computer science bibliography, https://dblp.org}
}

URL: https://huggingface.co/naver-clova-ix/donut-base