EAI-Taxonomy STEM w/ DCLM
Website | Code | Paper
A high-quality STEM dataset curated from web data using taxonomy-based filtering, containing 1742 billion tokens of science, technology, engineering, and mathematics content.
Dataset Overview
This dataset is part of the Essential-Web project, which introduces a new paradigm for dataset curation using expressive metadata and simple semantic filters. Unlike traditional STEM datasets that require complex domain-specific pipelines, our approach leverages a 12-category taxonomy to efficiently identify and extract high-quality STEM content.
EAI-Taxonomy STEM w/ DCLM (1742B tokens): Documents targeting science, engineering, medical, and computer science content that exhibit reasoning, combined with the DCLM classifier to filter for instruction-dense documents.
Performance
Our taxonomy-based approach achieves superior results with significantly less curation effort:
| Dataset | MMLU-STEM | Curation Complexity |
|---|---|---|
| DCLM-baseline | 27.7% | General web filtering |
| FineWeb-Edu | 26.7% | Educational filtering |
| EAI-Taxonomy STEM | 29.1% | Simple semantic filter |
| EAI-Taxonomy STEM w/ DCLM | 34.5% | + DCLM classifier |
On MMLU-STEM, EAI-Taxonomy STEM w/ DCLM shows a +24.5% relative improvement over DCLM-baseline and a +29.2% relative improvement over FineWeb-Edu.
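The improvement figures quoted here are relative gains rather than percentage-point differences. As a quick check, the sketch below recomputes them from the MMLU-STEM scores in the table; the helper name `relative_improvement` is ours, introduced only for illustration.

```python
# MMLU-STEM scores copied from the performance table (percent).
scores = {
    "DCLM-baseline": 27.7,
    "FineWeb-Edu": 26.7,
    "EAI-Taxonomy STEM w/ DCLM": 34.5,
}


def relative_improvement(new: float, old: float) -> float:
    """Return the relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


ours = scores["EAI-Taxonomy STEM w/ DCLM"]
gain_over_dclm = relative_improvement(ours, scores["DCLM-baseline"])
gain_over_fineweb_edu = relative_improvement(ours, scores["FineWeb-Edu"])

print(f"vs DCLM-baseline: +{gain_over_dclm:.1f}%")
print(f"vs FineWeb-Edu:   +{gain_over_fineweb_edu:.1f}%")
```

In absolute terms the gaps are 6.8 and 7.8 percentage points respectively.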
Key Findings
- Strong STEM Performance: Outperforms baseline and educational datasets beyond standard error
- Efficient Curation: Achieves superior results without complex domain-specific pipelines
- Broad Coverage: Encompasses science, engineering, medical, and computer science domains
- Quality Focus: Selects high-quality document types and filters for reasoning content
Dataset Schema Documentation
Overview
This dataset contains web-crawled text data with comprehensive metadata, quality signals, and taxonomic classifications. Each record represents a document extracted from web archives with detailed provenance tracking and quality assessment metrics.
Core Fields
| Field | Type | Description | Path |
|---|---|---|---|
| id | Int64 | Unique identifier based on document hash | id |
| text | String | The main textual content of the document | text |
EAI Taxonomy Classification
Comprehensive hierarchical classification system with primary and secondary labels - the most important feature of this dataset. The taxonomy is designed to provide detailed subject categorization, document type identification, content quality assessment, and extraction quality indicators.
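To make the primary/secondary structure concrete, here is a minimal sketch of reading the primary subject code from one record and filtering on it. It assumes each record carries an `eai_taxonomy` dict with a `free_decimal_correspondence` entry holding `primary` and `secondary` codes; the sample record and the helper names `primary_code` and `has_primary_prefix` are hypothetical, so check them against the actual schema before relying on them.

```python
from typing import Any, Dict, Optional

# Hypothetical record, trimmed to the fields used below (values are illustrative).
sample_record: Dict[str, Any] = {
    "id": 1,
    "text": "example document text",
    "eai_taxonomy": {
        "free_decimal_correspondence": {
            "primary": {
                "code": "005.4",
                "labels": {"level_3": "Computer programming"},
            },
            "secondary": {"code": "004.27", "labels": {}},
        }
    },
}


def primary_code(record: Dict[str, Any]) -> Optional[str]:
    """Return the primary subject code, or None if any level is missing."""
    node = record.get("eai_taxonomy") or {}
    for key in ("free_decimal_correspondence", "primary"):
        node = node.get(key) or {}
    return node.get("code")


def has_primary_prefix(record: Dict[str, Any], prefix: str) -> bool:
    """True when the primary subject code starts with `prefix`."""
    code = primary_code(record)
    return code is not None and code.startswith(prefix)


print(primary_code(sample_record))
print(has_primary_prefix(sample_record, "005"))
```

Matching on a code prefix rather than an exact code keeps the filter tolerant of finer-grained subdivisions under the same subject.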
How to Load the Dataset
This section provides examples of how to load the EssentialAI/eai-taxonomy-stem-w-dclm dataset using different Python libraries and frameworks.
Using Hugging Face Datasets (Standard Method)
The simplest way to load the dataset is using the Hugging Face datasets library:
from datasets import load_dataset
# Load the entire dataset
dataset = load_dataset("EssentialAI/eai-taxonomy-stem-w-dclm")
# View dataset structure
print(dataset)
print(f"Number of examples: {len(dataset['train'])}")
You can also load the dataset in streaming mode to avoid downloading the entire dataset at once:
from datasets import load_dataset
# Load in streaming mode
dataset = load_dataset("EssentialAI/eai-taxonomy-stem-w-dclm", streaming=True)
data_stream = dataset["train"]
# Iterate through examples
for example in data_stream.take(5):
    print(example)
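Streaming pairs naturally with lazy filtering, since you rarely want the first N records unconditionally. The sketch below collects the first few records that satisfy a predicate and stops reading as soon as it has enough. It runs on a small stand-in list so it works offline; `take_matching`, `mentions_kernel`, and the sample records are hypothetical names and data introduced for illustration.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


def take_matching(
    records: Iterable[Record],
    predicate: Callable[[Record], bool],
    limit: int,
) -> List[Record]:
    """Return up to `limit` records for which `predicate` is true, reading lazily."""
    matches = (record for record in records if predicate(record))
    return list(islice(matches, limit))


# Stand-in for a streamed split: any iterable of dicts with a "text" field works.
stand_in_stream: List[Record] = [
    {"id": 1, "text": "Linux kernel scheduling notes"},
    {"id": 2, "text": "Recipe for bread"},
    {"id": 3, "text": "Kernel methods in machine learning"},
    {"id": 4, "text": "kernel density estimation"},
]


def mentions_kernel(record: Record) -> bool:
    """Case-insensitive keyword check on the document text."""
    return "kernel" in record["text"].lower()


first_two = take_matching(stand_in_stream, mentions_kernel, limit=2)
print([record["id"] for record in first_two])
```

Because a streamed split is itself an iterable of dicts, the same helper should accept it in place of `stand_in_stream`.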
Using PySpark
For large-scale distributed processing, you can load the dataset using PySpark with the pyspark_huggingface library:
# First install the required library:
# pip install pyspark_huggingface
import pyspark_huggingface
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("EAI-Taxonomy-STEM-w-DCLM").getOrCreate()
# Load the dataset using the "huggingface" data source
df = spark.read.format("huggingface").load("EssentialAI/eai-taxonomy-stem-w-dclm")
# Basic dataset exploration
print(f"Dataset shape: {df.count()} rows, {len(df.columns)} columns")
df.show(10)
df.printSchema()
# Load only specific columns for efficiency
df_subset = (
    spark.read.format("huggingface")
    .option("columns", '["id", "text"]')  # Core fields; adjust to the columns you need
    .load("EssentialAI/eai-taxonomy-stem-w-dclm")
)
# Run SQL queries on the dataset
df.createOrReplaceTempView("eai_taxonomy_stem_w_dclm_dataset")
result = spark.sql("""
    SELECT COUNT(*) AS total_examples
    FROM eai_taxonomy_stem_w_dclm_dataset
""")
result.show()
Using Daft
Daft provides a modern DataFrame library optimized for machine learning workloads. You can load the dataset directly from Hugging Face:
import daft
# Load the entire dataset
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-stem-w-dclm")
# Basic exploration
print("Dataset schema:")
print(df.schema())
print("First 5 rows:")
df.show(5)
If you need to access private datasets or use authentication:
import daft
from daft.io import IOConfig, HTTPConfig
io_config = IOConfig(http=HTTPConfig(bearer_token="your_token"))
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-stem-w-dclm", io_config=io_config)
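Rather than hard-coding a token in the script, you may prefer to read it from the environment. The helper below is a small sketch of that idea: `hf_token_from_env` is a hypothetical name, and `HF_TOKEN` is assumed to be the variable your environment uses for a Hugging Face access token.

```python
import os


def hf_token_from_env(var_name: str = "HF_TOKEN") -> str:
    """Read an access token from the environment, failing loudly if it is unset."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "export it before running this script."
        )
    return token


# Intended use with the Daft example above (not executed here):
#   io_config = IOConfig(http=HTTPConfig(bearer_token=hf_token_from_env()))
```

Failing early with a clear message tends to be easier to debug than an authentication error raised partway through a read.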
Installation Requirements
Make sure you have the required libraries installed:
# For Hugging Face datasets
pip install datasets
# For PySpark with Hugging Face integration
pip install pyspark_huggingface
# For Daft
pip install daft
License
Essential-Web-v1.0 contributions are made available under the ODC attribution license; however, users should also abide by the Common Crawl - Terms of Use. We do not alter the license of any of the underlying data.
Citation
@misc{ai2025essentialwebv1024ttokens,
  title={Essential-Web v1.0: 24T tokens of organized web data},
  author={Essential AI and Andrew Hojel and Michael Pust and Tim Romanski and Yash Vanjani and Ritvik Kapila and Mohit Parmar and Adarsh Chaluvaraju and Alok Tripathy and Anil Thomas and Ashish Tanwer and Darsh J Shah and Ishaan Shah and Karl Stratos and Khoi Nguyen and Kurt Smith and Michael Callahan and Peter Rushton and Philip Monk and Platon Mazarakis and Saad Jamal and Saurabh Srivastava and Somanshu Singla and Ashish Vaswani},
  year={2025},
  eprint={2506.14111},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.14111},
}
