Dataset Card for The Pile
This dataset card is a work in progress. Please also see our datasheet for more detailed info.
Dataset Summary
The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality datasets combined together.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
This dataset is in English (EN).
Dataset Structure
Data Instances
all
{
'meta': {'pile_set_name': 'Pile-CC'},
'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on...'
}
Data Fields
all
- text (str): Text.
- meta (dict): Metadata of the data instance with keys:
  - pile_set_name: Name of the subset.
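As a rough illustration of how these fields might be used, the sketch below groups and filters records by `pile_set_name`. It works on an in-memory list of dicts rather than the real dataset. The first record is the instance shown above; the second record and the helper names `count_by_subset` and `filter_subset` are assumptions made up for this example and are not part of any library.

```python
from collections import Counter

# The first record is the example instance from this card. The second is a
# hypothetical placeholder that only mirrors the same structure.
records = [
    {
        "meta": {"pile_set_name": "Pile-CC"},
        "text": "It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on...",
    },
    {
        "meta": {"pile_set_name": "PubMed Central"},
        "text": "Hypothetical placeholder text.",
    },
]


def count_by_subset(rows):
    """Count records per subset using the meta['pile_set_name'] key."""
    return Counter(row["meta"]["pile_set_name"] for row in rows)


def filter_subset(rows, name):
    """Return the text of every record that belongs to the named subset."""
    return [row["text"] for row in rows if row["meta"]["pile_set_name"] == name]


if __name__ == "__main__":
    print(count_by_subset(records))
    print(filter_subset(records, "Pile-CC"))
```

The same two helpers should work on any iterable of records with this shape, since they only rely on the `text` and `meta` keys described above.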
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
This dataset was primarily curated by Leo Gao and Stella Biderman, with assistance from other authors of the Pile paper.
Licensing Information
Please refer to the specific license depending on the subset you use:
- PubMed Central: MIT License
Citation Information
@article{gao2020pile,
title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
journal={arXiv preprint arXiv:2101.00027},
year={2020}
}
@article{biderman2022datasheet,
title={Datasheet for the pile},
author={Biderman, Stella and Bicheno, Kieran and Gao, Leo},
journal={arXiv preprint arXiv:2201.07311},
year={2022}
}
Contributions
Thanks to @github-username for adding this dataset.
