This is a cleaner version of the Github-code dataset. We apply the following filters:
- Average line length < 100
- Alphanumeric character fraction > 0.25
- Remove auto-generated files (keyword search)
In total, 3.39M files are removed, making up 2.94% of the dataset.
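
The three filters listed above can be sketched as a per-file predicate. This is a minimal illustration, not the actual cleaning script: the thresholds come from the list, but the keyword tuple, the number of header lines scanned, and all function names are hypothetical, since the card does not say which keywords were searched or where in the file.

```python
# Thresholds taken from the filter list in this card.
MAX_AVG_LINE_LENGTH = 100
MIN_ALNUM_FRACTION = 0.25

# Hypothetical keyword list: the card only says "keyword search".
AUTOGEN_KEYWORDS = (
    "auto-generated",
    "autogenerated",
    "automatically generated",
)


def average_line_length(text: str) -> float:
    """Mean number of characters per line (0.0 for empty text)."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


def alnum_fraction(text: str) -> float:
    """Share of characters that are letters or digits (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def looks_autogenerated(text: str, scan_lines: int = 5) -> bool:
    """Assumption: only the first few lines are searched, case-insensitively."""
    head = "\n".join(text.splitlines()[:scan_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep_file(text: str) -> bool:
    """Return True if the file passes all three filters."""
    return (
        average_line_length(text) < MAX_AVG_LINE_LENGTH
        and alnum_fraction(text) > MIN_ALNUM_FRACTION
        and not looks_autogenerated(text)
    )
```

For example, `keep_file("def add(a, b):\n    return a + b\n")` passes all three checks, while a single 500-character line or a file whose header says it is auto-generated is dropped.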
