
URL: https://huggingface.co/datasets/EssentialAI/eai-taxonomy-stem-w-dclm

EssentialAI/eai-taxonomy-stem-w-dclm · Datasets at Hugging Face


Dataset Preview
Column | Type
id | int64
text | string
metadata | dict
line_start_n_end_idx | dict
quality_signals | dict
eai_taxonomy | dict
pid | string

🔬 EAI-Taxonomy STEM w/ DCLM

🏆 Website | 🖥️ Code | 📖 Paper

A high-quality STEM dataset curated from web data using taxonomy-based filtering, containing 1742 billion tokens of science, technology, engineering, and mathematics content.

🎯 Dataset Overview

This dataset is part of the Essential-Web project, which introduces a new paradigm for dataset curation using expressive metadata and simple semantic filters. Unlike traditional STEM datasets that require complex domain-specific pipelines, our approach leverages a 12-category taxonomy to efficiently identify and extract high-quality STEM content.

🧪 EAI-Taxonomy STEM w/ DCLM (1742B tokens): Documents targeting science, engineering, medical, and computer science content that exhibit reasoning, combined with the DCLM classifier to filter for instruction-dense documents.

🏆 Performance

Our taxonomy-based approach achieves superior results with significantly less curation effort:

Dataset | MMLU-STEM | Curation Complexity
DCLM-baseline | 27.7% | General web filtering
FineWeb-Edu | 26.7% | Educational filtering
EAI-Taxonomy STEM | 29.1% | Simple semantic filter
EAI-Taxonomy STEM w/ DCLM | 34.5% | + DCLM classifier

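EAI-Taxonomy STEM w/ DCLM improves MMLU-STEM by +24.5% over DCLM-baseline (27.7% to 34.5%) and by +29.2% over FineWeb-Edu (26.7% to 34.5%). These are relative gains, not percentage-point differences.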
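The improvement figures can be reproduced from the MMLU-STEM column of the table. The sketch below assumes they are relative gains, that is, (new - baseline) / baseline; the helper name `relative_improvement` is illustrative and not part of any released code.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return the relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# MMLU-STEM accuracies (in %) copied from the performance table.
baselines = {
    "DCLM-baseline": 27.7,
    "FineWeb-Edu": 26.7,
}
best = 34.5  # EAI-Taxonomy STEM w/ DCLM

for name, score in baselines.items():
    gain = relative_improvement(best, score)
    print(f"{name}: {gain:+.1f}% relative improvement")
```

In percentage points the same gaps are 6.8 and 7.8, so it matters which convention is being quoted.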

🔍 Key Findings

  • Strong STEM Performance: Outperforms baseline and educational datasets beyond standard error
  • Efficient Curation: Achieves superior results without complex domain-specific pipelines
  • Broad Coverage: Encompasses science, engineering, medical, and computer science domains
  • Quality Focus: Selects high-quality document types and filters for reasoning content

Dataset Schema Documentation

Overview

This dataset contains web-crawled text data with comprehensive metadata, quality signals, and taxonomic classifications. Each record represents a document extracted from web archives with detailed provenance tracking and quality assessment metrics.

Core Fields

Field | Type | Description | Path
id | Int64 | Unique identifier based on document hash | id
text | String | The main textual content of the document | text
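Besides `id` and `text`, each record carries a `line_start_n_end_idx` dict holding `line_start_idx` and `line_end_idx` lists. The sketch below shows one way to recover individual lines from `text`. It assumes the offsets are half-open character ranges that include the trailing newline; that reading is inferred from sample values rather than documented, and the helper `split_lines` and the toy record are hypothetical.

```python
from typing import Dict, List


def split_lines(text: str, idx: Dict[str, List[int]]) -> List[str]:
    """Slice `text` into lines using paired start/end offsets.

    Assumption: offsets are character positions and each span is
    half-open, i.e. text[start:end].
    """
    starts = idx["line_start_idx"]
    ends = idx["line_end_idx"]
    return [text[start:end] for start, end in zip(starts, ends)]


# Toy record shaped like a dataset row (values are made up).
record = {
    "text": "Title\n\nBody text\n",
    "line_start_n_end_idx": {
        "line_start_idx": [0, 6, 7],
        "line_end_idx": [6, 7, 17],
    },
}

lines = split_lines(record["text"], record["line_start_n_end_idx"])
print([line.rstrip("\n") for line in lines])  # ['Title', '', 'Body text']
```

If the offsets turn out to be byte positions, encode `text` to UTF-8 before slicing and decode each span afterwards.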

EAI Taxonomy Classification

The EAI taxonomy is a comprehensive hierarchical classification system with primary and secondary labels, and it is the most important feature of this dataset. It is designed to provide detailed subject categorization, document type identification, content quality assessment, and extraction quality indicators.
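As an illustration of the primary/secondary layout, the sketch below reads the primary subject label from one record. The nesting (`eai_taxonomy` → `free_decimal_correspondence` → `primary` → `code` / `labels`) follows the records published with this dataset; the helper `primary_subject` is hypothetical, and other taxonomy categories are not shown.

```python
from typing import Any, Dict, Optional


def primary_subject(record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the primary subject code and its non-empty labels, if present."""
    taxonomy = record.get("eai_taxonomy") or {}
    fdc = taxonomy.get("free_decimal_correspondence") or {}
    primary = fdc.get("primary")
    if not primary:
        return None
    labels = primary.get("labels") or {}
    # Keep the label levels in order, skipping empty strings.
    ordered = [labels.get(key) for key in ("level_1", "level_2", "level_3")]
    return {"code": primary.get("code"), "labels": [lab for lab in ordered if lab]}


# Sample shaped like a dataset row; only the fields used here are included.
sample = {
    "eai_taxonomy": {
        "free_decimal_correspondence": {
            "primary": {
                "code": "616.042",
                "labels": {
                    "level_1": "Industrial arts, Technology, and Engineering",
                    "level_2": "Medicine",
                    "level_3": "Pathology and Diseases",
                },
            }
        }
    }
}

print(primary_subject(sample))
```

The same access pattern should apply to the `secondary` entry, assuming it mirrors `primary`.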

How to Load the Dataset

This section provides examples of how to load the EssentialAI/eai-taxonomy-stem-w-dclm dataset using different Python libraries and frameworks.

Using Hugging Face Datasets (Standard Method)

The simplest way to load the dataset is using the Hugging Face datasets library:

from datasets import load_dataset

# Load the entire dataset
dataset = load_dataset("EssentialAI/eai-taxonomy-stem-w-dclm")

# View dataset structure
print(dataset)
print(f"Number of examples: {len(dataset['train'])}")

You can also load the dataset in streaming mode to avoid downloading the entire dataset at once:

from datasets import load_dataset

# Load in streaming mode
dataset = load_dataset("EssentialAI/eai-taxonomy-stem-w-dclm", streaming=True)
data_stream = dataset["train"]

# Iterate through examples
for example in data_stream.take(5):
    print(example)
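Because a streamed dataset is just an iterable of dicts, records can be filtered lazily without materializing anything. The sketch below keeps records whose primary subject code starts with a given prefix. The generator `filter_by_code_prefix` and the stand-in records are hypothetical; with the real dataset, the `data_stream` object from the streaming example would be passed in instead.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def filter_by_code_prefix(
    records: Iterable[Dict[str, Any]], prefix: str
) -> Iterator[Dict[str, Any]]:
    """Lazily yield records whose primary subject code starts with `prefix`."""
    for record in records:
        taxonomy = record.get("eai_taxonomy") or {}
        fdc = taxonomy.get("free_decimal_correspondence") or {}
        code = (fdc.get("primary") or {}).get("code") or ""
        if code.startswith(prefix):
            yield record


def _row(row_id: int, code: str) -> Dict[str, Any]:
    """Build a minimal stand-in record (illustrative only)."""
    return {
        "id": row_id,
        "eai_taxonomy": {"free_decimal_correspondence": {"primary": {"code": code}}},
    }


# Stand-in for `data_stream`; the last record has no taxonomy at all.
fake_stream = [_row(1, "616.042"), _row(2, "005.4"), _row(3, "005.74"),
               {"id": 4, "eai_taxonomy": None}]

# Take at most two matches; nothing beyond what is needed gets consumed.
for record in islice(filter_by_code_prefix(fake_stream, "005"), 2):
    print(record["id"])
```

`islice` stops the iteration early, which matters when the underlying stream covers the full dataset.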

Using PySpark

For large-scale distributed processing, you can load the dataset using PySpark with the pyspark_huggingface library:

# First install the required library:
# pip install pyspark_huggingface

import pyspark_huggingface
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("EAI-Taxonomy-STEM-w-DCLM").getOrCreate()

# Load the dataset using the "huggingface" data source
df = spark.read.format("huggingface").load("EssentialAI/eai-taxonomy-stem-w-dclm")

# Basic dataset exploration
print(f"Dataset shape: {df.count()} rows, {len(df.columns)} columns")
df.show(10)
df.printSchema()

# Load only specific columns for efficiency
df_subset = (
    spark.read.format("huggingface")
    .option("columns", '["id", "text"]')  # Any subset of the dataset's column names
    .load("EssentialAI/eai-taxonomy-stem-w-dclm")
)

# Run SQL queries on the dataset
df.createOrReplaceTempView("eai_taxonomy_stem_w_dclm_dataset")
result = spark.sql("""
    SELECT COUNT(*) AS total_examples
    FROM eai_taxonomy_stem_w_dclm_dataset
""")
result.show()

Using Daft

Daft provides a modern DataFrame library optimized for machine learning workloads. You can load the dataset directly from Hugging Face:

import daft

# Load the entire dataset
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-stem-w-dclm")

# Basic exploration
print("Dataset schema:")
print(df.schema())

print("First 5 rows:")
df.show(5)

If you need to access private datasets or use authentication:

import daft
from daft.io import IOConfig, HTTPConfig

io_config = IOConfig(http=HTTPConfig(bearer_token="your_token"))
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-stem-w-dclm", io_config=io_config)

Installation Requirements

Make sure you have the required libraries installed:

# For Hugging Face datasets
pip install datasets

# For PySpark with Hugging Face integration
pip install pyspark_huggingface

# For Daft
pip install daft

📜 License

Essential-Web-v1.0 contributions are made available under the ODC Attribution License (ODC-By); users should also abide by the Common Crawl Terms of Use. We do not alter the license of any of the underlying data.

📝 Citation

@misc{ai2025essentialwebv1024ttokens,
 title={Essential-Web v1.0: 24T tokens of organized web data}, 
 author={Essential AI and Andrew Hojel and Michael Pust and Tim Romanski and Yash Vanjani and Ritvik Kapila and Mohit Parmar and Adarsh Chaluvaraju and Alok Tripathy and Anil Thomas and Ashish Tanwer and Darsh J Shah and Ishaan Shah and Karl Stratos and Khoi Nguyen and Kurt Smith and Michael Callahan and Peter Rushton and Philip Monk and Platon Mazarakis and Saad Jamal and Saurabh Srivastava and Somanshu Singla and Ashish Vaswani},
 year={2025},
 eprint={2506.14111},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2506.14111}, 
}