URL: https://huggingface.co/datasets/codeparrot/github-code-clean

codeparrot/github-code-clean · Datasets at Hugging Face



This is a cleaner version of the Github-code dataset, obtained by applying the following filters:

  • Average line length < 100
  • Alphanumeric character fraction > 0.25
  • Remove auto-generated files (keyword search)
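
The three filters above can be sketched as a single per-file predicate. This is only an illustrative sketch, not the dataset authors' actual implementation: the function names, the keyword list, and the choice to scan only the first few lines for keywords are all assumptions, since the card does not specify them.

```python
# Hypothetical keyword list: the dataset card does not say which keywords are used.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def average_line_length(text: str) -> float:
    """Mean number of characters per line (0.0 for empty text)."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


def alphanumeric_fraction(text: str) -> float:
    """Share of characters that are letters or digits (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def looks_autogenerated(text: str, scan_lines: int = 5) -> bool:
    """Keyword search over the first `scan_lines` lines.

    Limiting the search to the file header is an assumption of this sketch.
    """
    head = "\n".join(text.splitlines()[:scan_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep_file(text: str) -> bool:
    """Return True if a file passes all three filters from the list above."""
    return (
        average_line_length(text) < 100
        and alphanumeric_fraction(text) > 0.25
        and not looks_autogenerated(text)
    )
```

For example, `keep_file("import os\nprint(os.getcwd())\n")` passes all three checks, while a file consisting of a single 500-character line, a file made up mostly of punctuation, or a file whose header says "automatically generated" would be dropped.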

In total, 3.39M files are removed, making up 2.94% of the dataset.
