TextDiffuser-MARIO-10M

Dataset description

MARIO-10M is a dataset of about 10 million text images drawn from a variety of sources, such as book covers, posters, and tickets. Alongside the images, the dataset provides OCR results and captions.

Download

The download process consists of three steps:

[1] Download all the tar files

for i in {0..500};
do wget -O "$i.tar.gz" "https://huggingface.co/datasets/JingyeChen22/TextDiffuser-MARIO-10M/resolve/main/$i.tar.gz?download=true";
done
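Because the download covers several hundred large archives, an interrupted run is likely. The following is a minimal sketch of a restartable variant of step [1], not part of the original instructions. The function names `shard_url` and `download_shard` and the log file `failed_shards.txt` are hypothetical; `-c` is wget's standard resume flag, and resuming only works if the server honours range requests.

```shell
# Base URL taken from the download command in step [1].
BASE_URL="https://huggingface.co/datasets/JingyeChen22/TextDiffuser-MARIO-10M/resolve/main"

# Print the download URL for shard number $1 (hypothetical helper).
shard_url() {
  printf '%s/%s.tar.gz?download=true\n' "$BASE_URL" "$1"
}

# Download one shard, resuming a partial file if one exists.
# Shards that fail are appended to failed_shards.txt (an assumed file name)
# so they can be retried later.
download_shard() {
  local i="$1"
  if ! wget -c -O "$i.tar.gz" "$(shard_url "$i")"; then
    printf '%s\n' "$i" >> failed_shards.txt
    return 1
  fi
}
```

To fetch everything, call `download_shard "$i"` inside the same `for i in {0..500}` loop as above; rerunning the loop after an interruption continues partially downloaded archives instead of restarting them.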

[2] Extract the top-level archives

for i in {0..500};
do tar -xvf "$i.tar.gz" --strip-components=5 && rm "$i.tar.gz";
done

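[3] Extract the second-level archives

# Each iteration runs in a subshell, so a failed cd cannot change the
# working directory of the outer loop.
for i in {0..500};
do
 (cd "$i" && for file in *.tar.gz; do tar -xvf "$file" --strip-components=5 && rm "$file"; done);
done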

Finally, the directory tree should look like this:

MARIO-10M/
│
├── 0/
│   ├── 00000/
│   │   ├── 000000012/
│   │   │   ├── caption.txt
│   │   │   ├── charseg.npy
│   │   │   ├── image.jpg
│   │   │   ├── info.json
│   │   │   ├── ocr.txt
...
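After extraction, it can be useful to confirm that every sample directory contains the five files shown in the tree. The following is a rough sketch of such a check, assuming the layout above (sample directories three levels below the dataset root). The function names `missing_files` and `check_dataset` are hypothetical and not part of the dataset's tooling.

```shell
# File names taken from the directory tree above.
EXPECTED_FILES="caption.txt charseg.npy image.jpg info.json ocr.txt"

# Print, one per line, the expected files missing from sample directory $1.
missing_files() {
  local dir="$1" name
  for name in $EXPECTED_FILES; do
    [ -f "$dir/$name" ] || printf '%s\n' "$name"
  done
}

# Walk <root>/<shard>/<subdir>/<sample>/ and report incomplete samples.
# find streams the paths, so the full list is never held in memory.
check_dataset() {
  local root="$1" sample missing
  find "$root" -mindepth 3 -maxdepth 3 -type d | while IFS= read -r sample; do
    missing=$(missing_files "$sample" | tr '\n' ' ')
    [ -z "$missing" ] || printf '%s: missing %s\n' "$sample" "${missing% }"
  done
}
```

Running `check_dataset MARIO-10M` prints one line per incomplete sample and nothing if every sample has all five files.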

Citation

If you find the MARIO dataset useful in your research, please cite the following papers:

@article{chen2024textdiffuser,
 title={Textdiffuser: Diffusion models as text painters},
 author={Chen, Jingye and Huang, Yupan and Lv, Tengchao and Cui, Lei and Chen, Qifeng and Wei, Furu},
 journal={Advances in Neural Information Processing Systems},
 volume={36},
 year={2024}
}

@article{chen2023textdiffuser,
 title={Textdiffuser-2: Unleashing the power of language models for text rendering},
 author={Chen, Jingye and Huang, Yupan and Lv, Tengchao and Cui, Lei and Chen, Qifeng and Wei, Furu},
 journal={European Conference on Computer Vision},
 year={2024}
}

License

Microsoft Open Source Code of Conduct


URL: https://huggingface.co/datasets/JingyeChen22/TextDiffuser-MARIO-10M
