id int64 -9,223,356,290,422,122,000 9,223,338,631B | text stringlengths 232 1.03M | metadata dict | line_start_n_end_idx dict | quality_signals dict | eai_taxonomy dict | pid stringclasses 2
values |
|---|---|---|---|---|---|---|
5,272,971,680,948,108,000 | Cindy R. Gunn
Why you need to End up being Asleep in the Nude
Why you need to End up being Asleep in the Nude
. to possess improved sleep and you will overall health benefits as well as weight loss.
What is their bed consistent? Might you clad your self for the comfortable pajamas? Sleep-in undergarments and you wi... | {
"url": "https://cindyrgunn.com/2022/06/29/why-you-need-to-end-up-being-asleep-in-the-nude/",
"source_domain": "cindyrgunn.com",
"snapshot_id": "CC-MAIN-2024-18",
"warc_metadata": {
"Content-Length": "119970",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:BX3M6KIC... | {
"line_start_idx": [
0,
14,
15,
63,
64,
112,
113,
202,
203,
382,
383,
586,
587,
958,
959,
1493,
1494,
1954,
1955,
2000,
2001,
2764,
2765,
4213,
4214,
4806,
4807,
5084,
5473,
5922,
6184,
618... | {
"red_pajama_v2": {
"ccnet_original_length": 6311,
"ccnet_original_nlines": 31,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 7,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.46134868264198303,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.00575657980... | {
"free_decimal_correspondence": {
"primary": {
"code": "613.69",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Medicine",
"level_3": "Health and Hygiene"
}
},
"secondary": {
"code": "613.7",
"labels": {
"le... | f177b5043c75ce8646dc8ec41dbca083 |
-6,427,989,513,668,166,000 | Article Text
Original research
Src/lck inhibitor dasatinib reversibly switches off cytokine release and T cell cytotoxicity following stimulation with T cell bispecific antibodies
1. Gabrielle Leclercq1,2,
2. Hélène Haegel1,
3. Anneliese Schneider1,
4. Anna Maria Giusti1,
5. Estelle Marrer-Berger3,
6. Chri... | {
"url": "https://jitc.bmj.com/content/9/7/e002582",
"source_domain": "jitc.bmj.com",
"snapshot_id": "CC-MAIN-2023-23",
"warc_metadata": {
"Content-Length": "285903",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:U4MABL3NNGO5OUF7GYPVGXO6H3T6Y24A",
"WARC-Concurr... | {
"line_start_idx": [
0,
13,
14,
32,
181,
209,
230,
257,
282,
311,
337,
365,
384,
404,
426,
452,
474,
494,
518,
541,
647,
778,
879,
954,
955,
964,
965,
1837,
1838,
2639,
2640,
3095,
3096... | {
"red_pajama_v2": {
"ccnet_original_length": 47864,
"ccnet_original_nlines": 262,
"rps_doc_curly_bracket": 0.000041790000977925956,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.2538166046142578,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_cap... | {
"free_decimal_correspondence": {
"primary": {
"code": "615.5076",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Medicine",
"level_3": "Materia medica, Drugs, and Pharmacy"
}
},
"secondary": {
"code": "615.507",
"l... | f177b5043c75ce8646dc8ec41dbca083 |
5,867,576,976,267,964,000 | Skip to main content
Table 3 Survival for the Patients with high SCC level
From: Preoperative SCC-Ag as a predictive marker for the use of adjuvant chemotherapy in cervical squamous cell carcinoma with intermediate-risk factors
Group
Adjuvant chemo-radiotherapy (n = 84)
Adjuvant radiotherapy (n = 67)
p value
3-y... | {
"url": "https://bmccancer.biomedcentral.com/articles/10.1186/s12885-020-06928-9/tables/3",
"source_domain": "bmccancer.biomedcentral.com",
"snapshot_id": "CC-MAIN-2023-23",
"warc_metadata": {
"Content-Length": "215152",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sh... | {
"line_start_idx": [
0,
21,
22,
76,
77,
230,
231,
237,
238,
275,
276,
307,
308,
316,
317,
324,
325,
332,
333,
340,
341,
348,
349,
352,
353,
360,
361,
368,
369,
376,
377,
384,
385,
3... | {
"red_pajama_v2": {
"ccnet_original_length": 544,
"ccnet_original_nlines": 47,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.13533835113048553,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.045112781226... | {
"free_decimal_correspondence": {
"primary": {
"code": "616.99442",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Medicine",
"level_3": "Pathology and Diseases"
}
},
"secondary": {
"code": "616.9944",
"labels": {
... | f177b5043c75ce8646dc8ec41dbca083 |
7,740,751,997,318,403,000 | Skip to content
Herbal Medicine
How to Treat Graves Disease With Acupuncture and TCM
Share
By Qineng Tan, L.Ac., Ph.D. and Xiaomei Cai, L.Ac., Ph.D.
checking for goiter
Checking for goiter, or enlarged thyroid gland.
Goiter? Bulging eyes? Red eyes, eye pain? Feeling anxious and irritable? Hand tremor? These ca... | {
"url": "https://myartofwellness.com/category/herbal-medicine/",
"source_domain": "myartofwellness.com",
"snapshot_id": "CC-MAIN-2023-50",
"warc_metadata": {
"Content-Length": "187334",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:6NLJVZNWFZ3FOPPMN6K6TR44BUSLPQFF... | {
"line_start_idx": [
0,
16,
17,
33,
34,
87,
88,
94,
95,
153,
154,
156,
157,
177,
225,
226,
468,
469,
674,
675,
934,
935,
1227,
1228,
1454,
1455,
1457,
1458,
1486,
1487,
1725,
1726,
1996... | {
"red_pajama_v2": {
"ccnet_original_length": 42251,
"ccnet_original_nlines": 557,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.3590516149997711,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.0210898593... | {
"free_decimal_correspondence": {
"primary": {
"code": "615.5",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Medicine",
"level_3": "Materia medica, Drugs, and Pharmacy"
}
},
"secondary": {
"code": "615.857",
"labe... | f177b5043c75ce8646dc8ec41dbca083 |
-9,010,945,881,264,296,000 |
Health Benefits of Cranberry Juice
You may have heard that drinking cranberry juice can help with a urinary tract infection (UTI), but that’s not the only benefit. It is also beneficial in preventing stomach disorders and diabetes, as well as gum diseases caused by dental plaque. Phytonutrients, which are naturally... | {
"url": "https://www.sandybook.in/latestsms/health-is-wealth/923-health-benefits-of-cranberry-juice",
"source_domain": "www.sandybook.in",
"snapshot_id": "CC-MAIN-2023-14",
"warc_metadata": {
"Content-Length": "55406",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1... | {
"line_start_idx": [
0,
2,
3,
38,
39,
441,
442,
685,
686,
888,
889,
1025,
1026,
1028,
1029,
1054,
1055,
1247,
1248,
1520,
1521,
1595,
1596,
1620,
1621,
1947,
1948,
2192,
2193,
2230,
2231,
2... | {
"red_pajama_v2": {
"ccnet_original_length": 11279,
"ccnet_original_nlines": 151,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.35223281383514404,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.009533369... | {
"free_decimal_correspondence": {
"primary": {
"code": "613.2",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Medicine",
"level_3": "Health and Hygiene"
}
},
"secondary": {
"code": "615.5",
"labels": {
"lev... | f177b5043c75ce8646dc8ec41dbca083 |
3,450,643,204,740,642,300 | Book Review #4: The Body Keeps the Score.
book review Apr 26, 2020
“IT IS NOT THAT SOMETHING DIFFERENT IS SEEN, BUT THAT ONE SEES DIFFERENTLY." – CARL JUNG
The Body Keeps The Score: Brain, Mind, and Body in the Healing of Trauma. By Dr. Bessel van der Kolk. Penguin Books, New York, NY. (2014)
The most effective coa... | {
"url": "https://www.michelleboland-training.com/blog/book-review-4-the-body-keeps-the-score",
"source_domain": "www.michelleboland-training.com",
"snapshot_id": "CC-MAIN-2024-26",
"warc_metadata": {
"Content-Length": "33954",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest... | {
"line_start_idx": [
0,
42,
43,
68,
69,
158,
159,
297,
298,
732,
733,
990,
991,
1363,
1364,
1839,
1840,
2400,
2401,
2433,
2434,
2459,
2460,
2844,
2845,
3229,
3230,
3656,
3657,
3918,
3919,
3... | {
"red_pajama_v2": {
"ccnet_original_length": 9474,
"ccnet_original_nlines": 80,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 1,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.3502509891986847,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.067484661936... | {
"free_decimal_correspondence": {
"primary": {
"code": "616.858",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Medicine",
"level_3": "Pathology and Diseases"
}
},
"secondary": {
"code": "612.82",
"labels": {
... | f177b5043c75ce8646dc8ec41dbca083 |
4,927,777,015,434,544,000 | CBD Oil Strength Explained
Aug 17, 2021 | CBD Oils
If you browse our CBD oils, you’ll find a range of strengths, from 300 mg CBD Tinctures to 1500 mg CBDA Tinctures. Before you begin a CBD regimen, make sure you understand what those strength numbers refer to.
Read the Labels Carefully
You need to know how much CBD ... | {
"url": "https://southerncomfortwellness.com/cbd-oil-strength-explained/",
"source_domain": "southerncomfortwellness.com",
"snapshot_id": "CC-MAIN-2023-40",
"warc_metadata": {
"Content-Length": "329445",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:AALLNWSAGSR5VJ... | {
"line_start_idx": [
0,
27,
28,
52,
53,
263,
264,
290,
648,
649,
669,
670,
1117,
1118,
1148,
1149,
1566,
1567,
1719,
1720,
1754,
1755,
1974,
1975,
1991,
1992,
2206,
2368,
2548,
2549,
2646,
... | {
"red_pajama_v2": {
"ccnet_original_length": 9314,
"ccnet_original_nlines": 137,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.3953116834163666,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.04901437833... | {
"free_decimal_correspondence": {
"primary": {
"code": "615.3",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Medicine",
"level_3": "Materia medica, Drugs, and Pharmacy"
}
},
"secondary": {
"code": "615.9",
"labels... | f177b5043c75ce8646dc8ec41dbca083 |
-747,280,241,509,980,800 | Are you a physician?
Go to Physician Area
Get Started
The Symptoms of Basal Cell Carcinoma
The symptoms of basal cell carcinoma occasionally resemble the features of non-cancerous skin conditions like psoriasis or eczema. Only a trained physician or specialist can decide for sure if it is basal cell carcinoma. If yo... | {
"url": "https://sensushealthcare.com/symptoms-basal-cell-carcinoma/",
"source_domain": "sensushealthcare.com",
"snapshot_id": "crawl=CC-MAIN-2022-33",
"warc_metadata": {
"Content-Length": "171429",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:5SGXOKWKCA6FKY5P3DW... | {
"line_start_idx": [
0,
21,
22,
43,
55,
56,
93,
94,
392,
393,
447,
448,
462,
463,
691,
692,
719,
720,
986,
987,
1013,
1014,
1261,
1262,
1282,
1283,
1488,
1489,
1507,
1508,
1773,
1774,
1... | {
"red_pajama_v2": {
"ccnet_original_length": 2471,
"ccnet_original_nlines": 39,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.3441295623779297,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.016194330528... | {
"free_decimal_correspondence": {
"primary": {
"code": "616.292",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Medicine",
"level_3": "Pathology and Diseases"
}
},
"secondary": {
"code": "616.29",
"labels": {
... | f177b5043c75ce8646dc8ec41dbca083 |
-5,771,397,859,995,973,000 | Mon. Jun 24th, 2024
Playing video games for long hours can cause vision-related problems in kids such as eye strain or computer vision syndrome. When your eyes are exposed to excessive screen time, you may experience symptoms such as eye fatigue, blurry vision, eye discomfort and headaches.
Kids get so addicted to vi... | {
"url": "https://cybersectors.com/can-video-games-ruin-your-eye-health/",
"source_domain": "cybersectors.com",
"snapshot_id": "CC-MAIN-2024-26",
"warc_metadata": {
"Content-Length": "114972",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:RCYJNRTAV76ISF7XIFBJNBQWQ3... | {
"line_start_idx": [
0,
20,
21,
293,
294,
514,
515,
552,
553,
910,
911,
1073,
1074,
1415,
1416,
1455,
1456,
1734,
1735,
1939,
1940,
2224,
2225,
2347,
2348,
2366,
2367,
2585,
2586,
2785,
2786,
... | {
"red_pajama_v2": {
"ccnet_original_length": 4075,
"ccnet_original_nlines": 42,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.4403443932533264,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0,
"rps_do... | {
"free_decimal_correspondence": {
"primary": {
"code": "617.7",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Medicine",
"level_3": "Surgery and Dentistry"
}
},
"secondary": {
"code": "617.8",
"labels": {
"... | f177b5043c75ce8646dc8ec41dbca083 |
-6,488,302,194,022,573,000 | "If You Think You Get Resources, Then Read This\n\nImportant Information On Ketosis\n\nPeople who ar(...TRUNCATED) | {"url":"https://www.witchhunteronline.com/if-you-think-you-get-resources-then-read-this/","source_do(...TRUNCATED) | {"line_start_idx":[0,47,48,81,82,437,438,923,924,1502,1503,1857,1858,2325,2326,2599,2600,2640,2641],(...TRUNCATED) | {"red_pajama_v2":{"ccnet_original_length":2678.0,"ccnet_original_nlines":18.0,"rps_doc_curly_bracket(...TRUNCATED) | {"free_decimal_correspondence":{"primary":{"code":"613.2","labels":{"level_1":"Industrial arts, Tech(...TRUNCATED) | f177b5043c75ce8646dc8ec41dbca083 |
🏥 Taxonomy Med w/ DCLM
A high-quality medical dataset curated from web data using taxonomy-based filtering, containing 205 billion tokens of medical content.
🎯 Dataset Overview
This dataset is part of the Essential-Web project, which introduces a new paradigm for dataset curation using expressive metadata and simple semantic filters. Unlike traditional medical datasets that require complex domain-specific pipelines, our approach leverages a 12-category taxonomy to efficiently identify and extract high-quality medical content.
🔬 EAI-Taxonomy Med w/ DCLM (205B tokens): Documents targeting scientific medical content that exhibit reasoning and are technically correct, combined with the DCLM classifier to filter for instruction-dense documents.
🏆 Performance
Our taxonomy-based approach achieves superior results with significantly less curation effort:
| Dataset | CareQA-en | MedMCQA | MedQA-USMLE | PubMedQA | MMLU-Med | Curation Complexity |
|---|---|---|---|---|---|---|
| DCLM-baseline | 26.9% | 31.6% | 25.9% | 70.6% | 31.0% | General web filtering |
| TheBlueScrubs-v1 | 25.1% | 32.2% | 25.3% | 69.2% | 25.7% | Complex domain pipeline |
| EAI-Taxonomy Med | 27.7% | 32.5% | 28.1% | 67.0% | 29.5% | Simple semantic filter |
| EAI-Taxonomy Med w/ DCLM | 31.5% | 32.7% | 30.1% | 68.6% | 39.2% | + DCLM classifier |
🔍 Key Findings
- Robust Performance: Achieves best or near-best performance across all medical evaluations
- Above Random Performance: Successfully performs above chance (~25%) on MedQA-USMLE where baseline methods fail
- Consistent Improvements: +13.8% average improvement over existing specialized medical datasets
- Efficiency: Strong medical knowledge without complex domain-specific curation pipelines
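The +13.8% figure can be sanity-checked against the performance table. The sketch below recomputes it under one assumption that the source does not state: that the figure is the relative gain in the five-benchmark mean of EAI-Taxonomy Med w/ DCLM over TheBlueScrubs-v1. The helper names are illustrative only.

```python
# Scores copied from the performance table (percent accuracy), in the order
# CareQA-en, MedMCQA, MedQA-USMLE, PubMedQA, MMLU-Med.
scores = {
    "DCLM-baseline": [26.9, 31.6, 25.9, 70.6, 31.0],
    "TheBlueScrubs-v1": [25.1, 32.2, 25.3, 69.2, 25.7],
    "EAI-Taxonomy Med": [27.7, 32.5, 28.1, 67.0, 29.5],
    "EAI-Taxonomy Med w/ DCLM": [31.5, 32.7, 30.1, 68.6, 39.2],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def relative_gain_percent(new, old):
    """Relative change of mean(new) over mean(old), in percent."""
    return 100.0 * (mean(new) - mean(old)) / mean(old)


# ASSUMPTION: the comparison target is TheBlueScrubs-v1, the only
# specialized medical dataset listed in the table.
gain = relative_gain_percent(
    scores["EAI-Taxonomy Med w/ DCLM"], scores["TheBlueScrubs-v1"]
)
print(f"Relative gain in mean score: {gain:.2f}%")
```

Under that assumption the means are 40.42 and 35.50, and the relative gain is about 13.86%, which agrees with the quoted +13.8% if the value was truncated rather than rounded.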
Dataset Schema Documentation
Overview
This dataset contains web-crawled text data with comprehensive metadata, quality signals, and taxonomic classifications. Each record represents a document extracted from web archives with detailed provenance tracking and quality assessment metrics.
Core Fields
| Field | Type | Description | Path |
|---|---|---|---|
| id | Int64 | Unique identifier based on document hash | id |
| text | String | The main textual content of the document | text |
EAI Taxonomy Classification
Comprehensive hierarchical classification system with primary and secondary labels - the most important feature of this dataset. The taxonomy is designed to provide detailed subject categorization, document type identification, content quality assessment, and extraction quality indicators.
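As an illustration of how the hierarchical labels might be read from a record, here is a minimal sketch. It assumes the nesting visible in the sample rows on this page (`eai_taxonomy` → `free_decimal_correspondence` → `primary` → `labels` → `level_1`/`level_2`/`level_3`); the helper name `primary_fdc_label` is hypothetical, and other taxonomy groups are not covered.

```python
from typing import Any, Dict, Optional

Record = Dict[str, Any]


def primary_fdc_label(record: Record, level: str = "level_2") -> Optional[str]:
    """Return one primary free-decimal-correspondence label, or None if absent.

    ASSUMPTION: the field layout follows the sample rows shown on this page.
    """
    node = record.get("eai_taxonomy") or {}
    for key in ("free_decimal_correspondence", "primary", "labels"):
        node = node.get(key) or {}
    return node.get(level)


# A hand-built record mirroring the first sample row (text shortened).
sample_record = {
    "id": 5272971680948108000,
    "text": "Cindy R. Gunn ...",
    "eai_taxonomy": {
        "free_decimal_correspondence": {
            "primary": {
                "code": "613.69",
                "labels": {
                    "level_1": "Industrial arts, Technology, and Engineering",
                    "level_2": "Medicine",
                    "level_3": "Health and Hygiene",
                },
            }
        }
    },
}

print(primary_fdc_label(sample_record))             # level_2 label
print(primary_fdc_label(sample_record, "level_3"))  # level_3 label
print(primary_fdc_label({"id": 1, "text": "no taxonomy"}))  # missing -> None
```

Returning `None` for missing keys keeps the helper usable as a filter predicate on records whose taxonomy block is incomplete.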
How to Load the Dataset
This section provides examples of how to load the EssentialAI/eai-taxonomy-med-w-dclm dataset using different Python libraries and frameworks.
Using Hugging Face Datasets (Standard Method)
The simplest way to load the dataset is using the Hugging Face datasets library:
from datasets import load_dataset
# Load the entire dataset
dataset = load_dataset("EssentialAI/eai-taxonomy-med-w-dclm")
# View dataset structure
print(dataset)
print(f"Number of examples: {len(dataset['train'])}")
You can also load the dataset in streaming mode to avoid downloading the entire dataset at once:
from datasets import load_dataset
# Load in streaming mode
dataset = load_dataset("EssentialAI/eai-taxonomy-med-w-dclm", streaming=True)
data_stream = dataset["train"]
# Iterate through examples
for example in data_stream.take(5):
    print(example)
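When streaming, it is often useful to stop as soon as enough matching records have been seen instead of scanning the whole split. The sketch below shows one way to do that with the standard library; `take_matching` and `fake_stream` are hypothetical names, and the generator is only a local stand-in so the example runs without network access. Any iterable of record dictionaries, such as `data_stream` above, could presumably be passed in its place.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def take_matching(
    stream: Iterable[Record],
    predicate: Callable[[Record], bool],
    limit: int,
) -> List[Record]:
    """Lazily filter `stream` and return at most `limit` matching records."""
    matches = (record for record in stream if predicate(record))
    return list(islice(matches, limit))


def fake_stream() -> Iterator[Record]:
    """Stand-in for a streamed split: record i has a text of length i."""
    for i in range(1000):
        yield {"id": i, "text": "x" * i}


# Keep the first three records whose text has at least 500 characters.
sample = take_matching(fake_stream(), lambda r: len(r["text"]) >= 500, limit=3)
print([r["id"] for r in sample])
```

Because both the generator expression and `islice` are lazy, iteration stops right after the third match rather than consuming the remaining records.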
Using PySpark
For large-scale distributed processing, you can load the dataset using PySpark with the pyspark_huggingface library:
# First install the required library:
# pip install pyspark_huggingface
import pyspark_huggingface
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("EAI-Taxonomy-Med-w-DCLM").getOrCreate()
# Load the dataset using the "huggingface" data source
df = spark.read.format("huggingface").load("EssentialAI/eai-taxonomy-med-w-dclm")
# Basic dataset exploration
print(f"Dataset shape: {df.count()} rows, {len(df.columns)} columns")
df.show(10)
df.printSchema()
# Load only specific columns for efficiency
df_subset = (
    spark.read.format("huggingface")
    .option("columns", '["id", "text"]')  # Columns listed under Core Fields
    .load("EssentialAI/eai-taxonomy-med-w-dclm")
)
# Run SQL queries on the dataset
df.createOrReplaceTempView("eai_taxonomy_med_w_dclm_dataset")
result = spark.sql("""
    SELECT COUNT(*) as total_examples
    FROM eai_taxonomy_med_w_dclm_dataset
""")
result.show()
Using Daft
Daft provides a modern DataFrame library optimized for machine learning workloads. You can load the dataset directly from Hugging Face:
import daft
# Load the entire dataset
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-med-w-dclm")
# Basic exploration
print("Dataset schema:")
print(df.schema())
print("First 5 rows:")
df.show(5)
If you need to access private datasets or use authentication:
import daft
from daft.io import IOConfig, HTTPConfig
io_config = IOConfig(http=HTTPConfig(bearer_token="your_token"))
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-med-w-dclm", io_config=io_config)
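Rather than hardcoding a token in source, it can be read from the environment. This is a minimal sketch: `hf_token_from_env` is a hypothetical helper, and `HF_TOKEN` is used as the default variable name because it is the one the Hugging Face Hub client libraries read. Whether Daft picks that variable up on its own is not assumed here.

```python
import os
from typing import Optional


def hf_token_from_env(var_name: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in `var_name`, or None if unset or blank."""
    token = os.environ.get(var_name, "").strip()
    return token or None


token = hf_token_from_env()
if token is None:
    print("No token found; only public datasets will be accessible.")
else:
    print("Token loaded from the environment.")
```

The returned value could then be passed as `HTTPConfig(bearer_token=token)` in the snippet above.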
Installation Requirements
Make sure you have the required libraries installed:
# For Hugging Face datasets
pip install datasets
# For PySpark with Hugging Face integration
pip install pyspark_huggingface
# For Daft
pip install daft
📜 License
Essential-Web-v1.0 contributions are made available under the ODC attribution license; however, users should also abide by the Common Crawl - Terms of Use. We do not alter the license of any of the underlying data.
📝 Citation
@misc{ai2025essentialwebv1024ttokens,
  title={Essential-Web v1.0: 24T tokens of organized web data},
  author={Essential AI and Andrew Hojel and Michael Pust and Tim Romanski and Yash Vanjani and Ritvik Kapila and Mohit Parmar and Adarsh Chaluvaraju and Alok Tripathy and Anil Thomas and Ashish Tanwer and Darsh J Shah and Ishaan Shah and Karl Stratos and Khoi Nguyen and Kurt Smith and Michael Callahan and Peter Rushton and Philip Monk and Platon Mazarakis and Saad Jamal and Saurabh Srivastava and Somanshu Singla and Ashish Vaswani},
  year={2025},
  eprint={2506.14111},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.14111},
}
