URL: https://huggingface.co/datasets/EssentialAI/eai-taxonomy-code-w-dclm

Dataset viewer columns (preview rows omitted):

| Column | Type |
| --- | --- |
| id | int64 |
| text | string |
| metadata | dict |
| line_start_n_end_idx | dict |
| quality_signals | dict |
| eai_taxonomy | dict |
| pid | string |

💻 EAI-Taxonomy Code w/ DCLM

A 564 billion token dataset of high-quality code curated from web data using taxonomy-based filtering.

🎯 Dataset Overview

This dataset is part of the Essential-Web project, which introduces a new paradigm for dataset curation using expressive metadata and simple semantic filters. Unlike traditional code datasets that require complex domain-specific pipelines, our approach leverages a 12-category taxonomy to efficiently identify and extract high-quality code data.

💡 EAI-Taxonomy Code w/ DCLM (564B tokens): Documents targeting code that exhibit intermediate to advanced reasoning, combined with the DCLM classifier to filter for instruction-dense documents. Also includes mathematics content (51 - Mathematics) to match the scope of existing code datasets.

🏆 Performance

Our taxonomy-based approach achieves competitive results with significantly less curation effort:

| Dataset | HumanEval+ | MBPP+ | MMLU-CS | Curation Complexity |
| --- | --- | --- | --- | --- |
| DCLM-baseline | 28.0% | 45.5% | 32.0% | General web filtering |
| OpenCoder FW | 26.2% | 45.8% | 27.7% | Complex domain pipeline |
| EAI-Taxonomy Code | 27.4% | 46.6% | 29.0% | Simple semantic filter |
| EAI-Taxonomy Code w/ DCLM | 28.7% | 45.0% | 47.0% | + DCLM classifier |

Results show competitive code generation performance, with a 46.8% relative improvement in computer science knowledge (MMLU-CS, 47.0% vs. 32.0%) for EAI-Taxonomy Code w/ DCLM compared to the DCLM-baseline.
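
The improvement quoted above is a relative gain rather than a percentage-point difference. A quick check, using only the MMLU-CS values from the results table (the variable names are illustrative):

```python
# MMLU-CS accuracies copied from the results table (in percent).
baseline_mmlu_cs = 32.0        # DCLM-baseline
taxonomy_dclm_mmlu_cs = 47.0   # EAI-Taxonomy Code w/ DCLM

# Percentage-point difference and gain relative to the baseline.
absolute_gain = taxonomy_dclm_mmlu_cs - baseline_mmlu_cs
relative_gain = absolute_gain / baseline_mmlu_cs * 100

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.3f}%")
```

This evaluates to a 15.0 point absolute gain and a relative gain of 46.875%, in line with the roughly +46.8% quoted.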

🔍 Key Findings

  • Code Generation: All datasets perform within statistical error on single-function generation benchmarks (HumanEval+, MBPP+)
  • Code Knowledge: Clear impact on general computer science knowledge when using taxonomy-curated data
  • Efficiency: Achieves strong performance without complex domain-specific curation pipelines

Dataset Schema Documentation

Overview

This dataset contains web-crawled text data with comprehensive metadata, quality signals, and taxonomic classifications. Each record represents a document extracted from web archives with detailed provenance tracking and quality assessment metrics.

Core Fields

| Field | Type | Description | Path |
| --- | --- | --- | --- |
| id | Int64 | Unique identifier based on document hash | id |
| text | String | The main textual content of the document | text |
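
The dataset viewer also lists a line_start_n_end_idx column holding parallel line_start_idx and line_end_idx lists. A minimal sketch of how such offsets could be used to recover individual lines from text, assuming they are character offsets with exclusive end positions (an assumption inferred from the preview, not documented behaviour). The helper name and the toy record are hypothetical:

```python
from typing import Dict, List


def split_lines_by_offsets(text: str, spans: Dict[str, List[int]]) -> List[str]:
    """Slice `text` into lines using parallel start/end offset lists."""
    starts = spans["line_start_idx"]
    ends = spans["line_end_idx"]
    if len(starts) != len(ends):
        raise ValueError("line_start_idx and line_end_idx must have equal length")
    # Assumption: offsets index characters and each end offset is exclusive.
    return [text[start:end] for start, end in zip(starts, ends)]


# Toy record that mimics the column layout; not real dataset content.
toy_record = {
    "text": "first line\n\nthird line",
    "line_start_n_end_idx": {
        "line_start_idx": [0, 11, 12],
        "line_end_idx": [11, 12, 22],
    },
}

lines = split_lines_by_offsets(
    toy_record["text"], toy_record["line_start_n_end_idx"]
)
print(lines)
```

If the offset assumption holds, joining the slices reproduces the original text, which is a cheap sanity check to run on a few real records before relying on the spans.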

EAI Taxonomy Classification

Comprehensive hierarchical classification system with primary and secondary labels - the most important feature of this dataset. The taxonomy is designed to provide detailed subject categorization, document type identification, content quality assessment, and extraction quality indicators.
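
As a rough illustration of a "simple semantic filter", the sketch below selects records by their primary taxonomy code. The nested keys (eai_taxonomy, free_decimal_correspondence, primary, code) follow the structure visible in the dataset viewer preview. The helper names, the toy records, and the choice of the "005" prefix are illustrative assumptions, not the filter used to build this dataset:

```python
from typing import Any, Dict, Optional


def primary_fdc_code(record: Dict[str, Any]) -> Optional[str]:
    """Return the primary free_decimal_correspondence code, or None if absent."""
    taxonomy = record.get("eai_taxonomy") or {}
    fdc = taxonomy.get("free_decimal_correspondence") or {}
    primary = fdc.get("primary") or {}
    return primary.get("code")


def has_primary_prefix(record: Dict[str, Any], prefix: str = "005") -> bool:
    """True when the primary code starts with `prefix` (hypothetical criterion)."""
    code = primary_fdc_code(record)
    return code is not None and code.startswith(prefix)


# Toy records shaped like the preview rows; only the fields used here are set.
toy_records = [
    {"id": 1, "eai_taxonomy": {"free_decimal_correspondence": {"primary": {"code": "005.1"}}}},
    {"id": 2, "eai_taxonomy": {"free_decimal_correspondence": {"primary": {"code": "004.16"}}}},
    {"id": 3, "eai_taxonomy": {}},
]

kept_ids = [r["id"] for r in toy_records if has_primary_prefix(r)]
print(kept_ids)
```

The defensive `.get(...) or {}` chain is a design choice so that records with missing or null taxonomy fields are skipped rather than raising.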

How to Load the Dataset

This section provides examples of how to load the EssentialAI/eai-taxonomy-code-w-dclm dataset using different Python libraries and frameworks.

Using Hugging Face Datasets (Standard Method)

The simplest way to load the dataset is using the Hugging Face datasets library:

from datasets import load_dataset

# Load the entire dataset
dataset = load_dataset("EssentialAI/eai-taxonomy-code-w-dclm")

# View dataset structure
print(dataset)
print(f"Number of examples: {len(dataset['train'])}")

You can also load the dataset in streaming mode to avoid downloading the entire dataset at once:

from datasets import load_dataset

# Load in streaming mode
dataset = load_dataset("EssentialAI/eai-taxonomy-code-w-dclm", streaming=True)
data_stream = dataset["train"]

# Iterate through examples
for example in data_stream.take(5):
    print(example)
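
When streaming, it is often useful to stop as soon as enough matching records have been seen instead of scanning the whole split. A minimal sketch using only the standard library is shown below. The helper name and the toy generator are hypothetical, and the generator stands in for a streamed split, which is likewise an iterable of dict records:

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator


def take_matching(
    records: Iterable[Dict[str, Any]],
    predicate: Callable[[Dict[str, Any]], bool],
    limit: int,
) -> Iterator[Dict[str, Any]]:
    """Lazily yield up to `limit` records for which `predicate` is true."""
    return islice((record for record in records if predicate(record)), limit)


# Stand-in for a streamed split; real data is not downloaded here.
toy_stream = ({"id": i, "text": "x" * i} for i in range(1, 1000))

# Keep the first three records whose text has at least five characters.
long_docs = list(take_matching(toy_stream, lambda r: len(r["text"]) >= 5, limit=3))
print([r["id"] for r in long_docs])
```

With the real dataset, the same helper could be given the `data_stream` object from the streaming example in place of `toy_stream`. Because both the generator expression and `islice` are lazy, iteration stops after the third match.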

Using PySpark

For large-scale distributed processing, you can load the dataset using PySpark with the pyspark_huggingface library:

# First install the required library:
# pip install pyspark_huggingface

import pyspark_huggingface
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("EAI-Taxonomy-Code-w-DCLM").getOrCreate()

# Load the dataset using the "huggingface" data source
df = spark.read.format("huggingface").load("EssentialAI/eai-taxonomy-code-w-dclm")

# Basic dataset exploration
print(f"Dataset shape: {df.count()} rows, {len(df.columns)} columns")
df.show(10)
df.printSchema()

# Load only specific columns for efficiency
df_subset = (
    spark.read.format("huggingface")
    .option("columns", '["id", "text"]')  # Column names from the Core Fields table
    .load("EssentialAI/eai-taxonomy-code-w-dclm")
)

# Run SQL queries on the dataset
df.createOrReplaceTempView("eai_taxonomy_code_w_dclm_dataset")
result = spark.sql("""
 SELECT COUNT(*) as total_examples
 FROM eai_taxonomy_code_w_dclm_dataset
""")
result.show()

Using Daft

Daft provides a modern DataFrame library optimized for machine learning workloads. You can load the dataset directly from Hugging Face:

import daft

# Load the entire dataset
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-code-w-dclm")

# Basic exploration
print("Dataset schema:")
print(df.schema())

print("First 5 rows:")
df.show(5)

If you need to access private datasets or use authentication:

import daft
from daft.io import IOConfig, HTTPConfig

io_config = IOConfig(http=HTTPConfig(bearer_token="your_token"))
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-code-w-dclm", io_config=io_config)

Installation Requirements

Make sure you have the required libraries installed:

# For Hugging Face datasets
pip install datasets

# For PySpark with Hugging Face integration
pip install pyspark_huggingface

# For Daft
pip install daft

📜 License

Essential-Web-v1.0 contributions are made available under the ODC attribution license; however, users should also abide by the Common Crawl - Terms of Use. We do not alter the license of any of the underlying data.

📝 Citation

@misc{ai2025essentialwebv1024ttokens,
 title={Essential-Web v1.0: 24T tokens of organized web data}, 
 author={Essential AI and Andrew Hojel and Michael Pust and Tim Romanski and Yash Vanjani and Ritvik Kapila and Mohit Parmar and Adarsh Chaluvaraju and Alok Tripathy and Anil Thomas and Ashish Tanwer and Darsh J Shah and Ishaan Shah and Karl Stratos and Khoi Nguyen and Kurt Smith and Michael Callahan and Peter Rushton and Philip Monk and Platon Mazarakis and Saad Jamal and Saurabh Srivastava and Somanshu Singla and Ashish Vaswani},
 year={2025},
 eprint={2506.14111},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2506.14111}, 
}