💻 EAI-Taxonomy Code w/ DCLM
A 564 billion token dataset of high-quality code curated from web data using taxonomy-based filtering.
🎯 Dataset Overview
This dataset is part of the Essential-Web project, which introduces a new paradigm for dataset curation using expressive metadata and simple semantic filters. Unlike traditional code datasets that require complex domain-specific pipelines, our approach leverages a 12-category taxonomy to efficiently identify and extract high-quality code data.
💡 EAI-Taxonomy Code w/ DCLM (564B tokens): Documents targeting code that exhibit intermediate to advanced reasoning, combined with the DCLM classifier to filter for instruction-dense documents. Also includes mathematics content (51 - Mathematics) to match the scope of existing code datasets.
🏆 Performance
Our taxonomy-based approach achieves competitive results with significantly less curation effort:
| Dataset | HumanEval+ | MBPP+ | MMLU-CS | Curation Complexity |
|---|---|---|---|---|
| DCLM-baseline | 28.0% | 45.5% | 32.0% | General web filtering |
| OpenCoder FW | 26.2% | 45.8% | 27.7% | Complex domain pipeline |
| EAI-Taxonomy Code | 27.4% | 46.6% | 29.0% | Simple semantic filter |
| EAI-Taxonomy Code w/ DCLM | 28.7% | 45.0% | 47.0% | + DCLM classifier |
Results show competitive code generation performance and a relative improvement of about 47% in computer science knowledge (MMLU-CS: 47.0% versus 32.0% for DCLM-baseline).
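The relative MMLU-CS gain can be checked directly from the table. The short sketch below uses only the two MMLU-CS scores listed there; the variable names are illustrative.

```python
# MMLU-CS scores (percent) copied from the performance table.
baseline = 32.0    # DCLM-baseline
with_dclm = 47.0   # EAI-Taxonomy Code w/ DCLM

# Absolute gain in percentage points, and gain relative to the baseline score.
absolute_gain = with_dclm - baseline
relative_gain = absolute_gain / baseline * 100

print(f"absolute: +{absolute_gain:.1f} points")
print(f"relative: +{relative_gain:.1f}%")
```

The absolute gain is 15 percentage points; the relative figure is that gain divided by the baseline score.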
🔍 Key Findings
- Code Generation: All datasets perform within statistical error on single-function generation benchmarks (HumanEval+, MBPP+)
- Code Knowledge: Clear impact on general computer science knowledge when using taxonomy-curated data
- Efficiency: Achieves strong performance without complex domain-specific curation pipelines
Dataset Schema Documentation
Overview
This dataset contains web-crawled text data with comprehensive metadata, quality signals, and taxonomic classifications. Each record represents a document extracted from web archives with detailed provenance tracking and quality assessment metrics.
Core Fields
| Field | Type | Description | Path |
|---|---|---|---|
| id | Int64 | Unique identifier based on document hash | id |
| text | String | The main textual content of the document | text |
EAI Taxonomy Classification
Comprehensive hierarchical classification system with primary and secondary labels - the most important feature of this dataset. The taxonomy is designed to provide detailed subject categorization, document type identification, content quality assessment, and extraction quality indicators.
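As a rough illustration of how these labels might be read, the sketch below pulls the primary and secondary codes out of a single record. It assumes each record carries an `eai_taxonomy` dict whose categories (for example `free_decimal_correspondence`) hold `primary` and `secondary` entries with a `code` field; verify this layout against the actual schema before relying on it. The helper name `taxonomy_codes` and the sample record are hypothetical.

```python
from typing import Any, Optional, Tuple


def taxonomy_codes(
    record: dict,
    category: str = "free_decimal_correspondence",
) -> Tuple[Optional[str], Optional[str]]:
    """Return (primary_code, secondary_code) for one taxonomy category.

    Missing levels yield None instead of raising, since not every record
    is guaranteed to carry every label.
    """
    entry: dict = (record.get("eai_taxonomy") or {}).get(category) or {}
    primary: Any = (entry.get("primary") or {}).get("code")
    secondary: Any = (entry.get("secondary") or {}).get("code")
    return primary, secondary


# Hand-written sample record (assumed layout, for illustration only).
sample = {
    "id": 1,
    "text": "example document",
    "eai_taxonomy": {
        "free_decimal_correspondence": {
            "primary": {"code": "005.1", "labels": {"level_3": "Computer programming"}},
            "secondary": {"code": "004.678", "labels": {}},
        }
    },
}

print(taxonomy_codes(sample))
```

Returning `None` for absent levels keeps the helper usable inside a filter without extra error handling.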
How to Load the Dataset
This section provides examples of how to load the EssentialAI/eai-taxonomy-code-w-dclm dataset using different Python libraries and frameworks.
Using Hugging Face Datasets (Standard Method)
The simplest way to load the dataset is using the Hugging Face datasets library:
from datasets import load_dataset
# Load the entire dataset
dataset = load_dataset("EssentialAI/eai-taxonomy-code-w-dclm")
# View dataset structure
print(dataset)
print(f"Number of examples: {len(dataset['train'])}")
You can also load the dataset in streaming mode to avoid downloading the entire dataset at once:
from datasets import load_dataset
# Load in streaming mode
dataset = load_dataset("EssentialAI/eai-taxonomy-code-w-dclm", streaming=True)
data_stream = dataset["train"]
# Iterate through examples
for example in data_stream.take(5):
    print(example)
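When streaming, it is often useful to keep only the records that match a condition and stop after a fixed number of hits, without materializing the whole split. A minimal sketch of that pattern follows; the helper `take_matching` is hypothetical, and a small in-memory list stands in for the streamed split so the example runs without a download.

```python
from itertools import islice
from typing import Callable, Iterable, List


def take_matching(stream: Iterable[dict], keep: Callable[[dict], bool], limit: int) -> List[dict]:
    """Lazily scan `stream` and return at most `limit` records for which `keep` is true."""
    matches = (example for example in stream if keep(example))
    return list(islice(matches, limit))


# Stand-in for the streamed split; real records have more fields than these.
fake_stream = [
    {"id": 1, "text": "SELECT * FROM tags;"},
    {"id": 2, "text": "Drone build notes"},
    {"id": 3, "text": "SELECT COUNT(*) FROM posts;"},
]

sql_like = take_matching(fake_stream, lambda ex: "SELECT" in ex["text"], limit=2)
print([ex["id"] for ex in sql_like])
```

Because the generator is consumed through `islice`, scanning stops as soon as `limit` matches are found, which is what makes the pattern suitable for a streamed dataset.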
Using PySpark
For large-scale distributed processing, you can load the dataset using PySpark with the pyspark_huggingface library:
# First install the required library:
# pip install pyspark_huggingface
import pyspark_huggingface
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("EAI-Taxonomy-Code-w-DCLM").getOrCreate()
# Load the dataset using the "huggingface" data source
df = spark.read.format("huggingface").load("EssentialAI/eai-taxonomy-code-w-dclm")
# Basic dataset exploration
print(f"Dataset shape: {df.count()} rows, {len(df.columns)} columns")
df.show(10)
df.printSchema()
# Load only specific columns for efficiency
df_subset = (
    spark.read.format("huggingface")
    .option("columns", '["id", "text"]')  # JSON list of column names; here the two core fields
    .load("EssentialAI/eai-taxonomy-code-w-dclm")
)
# Run SQL queries on the dataset
df.createOrReplaceTempView("eai_taxonomy_code_w_dclm_dataset")
result = spark.sql("""
    SELECT COUNT(*) AS total_examples
    FROM eai_taxonomy_code_w_dclm_dataset
""")
result.show()
Using Daft
Daft provides a modern DataFrame library optimized for machine learning workloads. You can load the dataset directly from Hugging Face:
import daft
# Load the entire dataset
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-code-w-dclm")
# Basic exploration
print("Dataset schema:")
print(df.schema())
print("First 5 rows:")
df.show(5)
If you need to access private datasets or use authentication:
import daft
from daft.io import IOConfig, HTTPConfig
io_config = IOConfig(http=HTTPConfig(bearer_token="your_token"))
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-code-w-dclm", io_config=io_config)
Installation Requirements
Make sure you have the required libraries installed:
# For Hugging Face datasets
pip install datasets
# For PySpark with Hugging Face integration
pip install pyspark_huggingface
# For Daft
pip install daft
📜 License
Essential-Web-v1.0 contributions are made available under the ODC attribution license; however, users should also abide by the Common Crawl - Terms of Use. We do not alter the license of any of the underlying data.
📝 Citation
@misc{ai2025essentialwebv1024ttokens,
      title={Essential-Web v1.0: 24T tokens of organized web data},
      author={Essential AI and Andrew Hojel and Michael Pust and Tim Romanski and Yash Vanjani and Ritvik Kapila and Mohit Parmar and Adarsh Chaluvaraju and Alok Tripathy and Anil Thomas and Ashish Tanwer and Darsh J Shah and Ishaan Shah and Karl Stratos and Khoi Nguyen and Kurt Smith and Michael Callahan and Peter Rushton and Philip Monk and Platon Mazarakis and Saad Jamal and Saurabh Srivastava and Somanshu Singla and Ashish Vaswani},
      year={2025},
      eprint={2506.14111},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.14111},
}
