Dataset Card for The Pile

This dataset card is a work in progress. Please also see our datasheet for more detailed information.

Dataset Summary

The Pile is an 825 GiB diverse, open-source language modelling dataset that combines 22 smaller, high-quality datasets.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

This dataset is in English (EN).

Dataset Structure

Data Instances

all

{
 'meta': {'pile_set_name': 'Pile-CC'},
 'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on...'
}

Data Fields

all

text (str): Text.
meta (dict): Metadata of the data instance with keys:
- pile_set_name: Name of the subset.
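
As a minimal sketch of how these fields might be used, the snippet below counts records per subset by reading `meta['pile_set_name']`. It assumes records are already available as Python dicts with the structure shown above. The sample records, the `PileRecord` and `PileMeta` type names, and the `count_by_subset` helper are hypothetical and exist only for illustration; they are not part of any library.

```python
from collections import Counter
from typing import Dict, Iterable, TypedDict


class PileMeta(TypedDict):
    # Name of the subset the record came from, e.g. 'Pile-CC'.
    pile_set_name: str


class PileRecord(TypedDict):
    # The raw document text.
    text: str
    # Metadata of the data instance.
    meta: PileMeta


def count_by_subset(records: Iterable[PileRecord]) -> Dict[str, int]:
    """Count how many records belong to each subset (hypothetical helper)."""
    return dict(Counter(record["meta"]["pile_set_name"] for record in records))


# Hypothetical in-memory records that mirror the documented structure.
# The text values are placeholders, not real dataset content.
sample_records = [
    {"meta": {"pile_set_name": "Pile-CC"}, "text": "It is done, and submitted."},
    {"meta": {"pile_set_name": "PubMed Central"}, "text": "Placeholder text."},
    {"meta": {"pile_set_name": "Pile-CC"}, "text": "Another placeholder."},
]

subset_counts = count_by_subset(sample_records)
print(subset_counts)
```

The same access pattern applies to filtering: keep only the records whose `meta['pile_set_name']` matches the subset of interest.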

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

This dataset was primarily curated by Leo Gao and Stella Biderman, with assistance from other authors of the Pile paper.

Licensing Information

Please refer to the license of the specific subset you use:

PubMed Central: MIT License

Citation Information

@article{gao2020pile,
 title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
 author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
 journal={arXiv preprint arXiv:2101.00027},
 year={2020}
}


@article{biderman2022datasheet,
 title={Datasheet for the pile},
 author={Biderman, Stella and Bicheno, Kieran and Gao, Leo},
 journal={arXiv preprint arXiv:2201.07311},
 year={2022}
}