URL: https://huggingface.co/datasets/EssentialAI/eai-taxonomy-med-w-dclm

EssentialAI/eai-taxonomy-med-w-dclm · Datasets at Hugging Face


Dataset preview columns:

Column                Type    Notes
id                    int64   values range from about -9.22e18 to 9.22e18
text                  string  lengths from 232 to 1.03M
metadata              dict
line_start_n_end_idx  dict
quality_signals       dict
eai_taxonomy          dict
pid                   string  2 distinct values

🏥 Taxonomy Med w/ DCLM

🏆 Website | 🖥️ Code | 📖 Paper

A high-quality medical dataset curated from web data using taxonomy-based filtering, containing 205 billion tokens of medical content.

🎯 Dataset Overview

This dataset is part of the Essential-Web project, which introduces a new paradigm for dataset curation using expressive metadata and simple semantic filters. Unlike traditional medical datasets that require complex domain-specific pipelines, our approach leverages a 12-category taxonomy to efficiently identify and extract high-quality medical content.

🔬 EAI-Taxonomy Med w/ DCLM (205B tokens): Documents targeting scientific medical content that exhibit reasoning and are technically correct, combined with the DCLM classifier to filter for instruction-dense documents.

🏆 Performance

Our taxonomy-based approach achieves superior results with significantly less curation effort:

Dataset                   CareQA-en  MedMCQA  MedQA-USMLE  PubMedQA  MMLU-Med  Curation Complexity
DCLM-baseline             26.9%      31.6%    25.9%        70.6%     31.0%     General web filtering
TheBlueScrubs-v1          25.1%      32.2%    25.3%        69.2%     25.7%     Complex domain pipeline
EAI-Taxonomy Med          27.7%      32.5%    28.1%        67.0%     29.5%     Simple semantic filter
EAI-Taxonomy Med w/ DCLM  31.5%      32.7%    30.1%        68.6%     39.2%     + DCLM classifier

🔍 Key Findings

  • Robust Performance: Achieves best or near-best performance across all medical evaluations
  • Above Random Performance: Successfully performs above chance (~25%) on MedQA-USMLE where baseline methods fail
  • Consistent Improvements: +13.8% average improvement over existing specialized medical datasets
  • Efficiency: Strong medical knowledge without complex domain-specific curation pipelines
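
The card does not say how the +13.8% figure was computed. One reading that is consistent with the performance table is a relative gain in the five-benchmark average of EAI-Taxonomy Med w/ DCLM over TheBlueScrubs-v1. The sketch below works through that arithmetic under this assumption; the helper names are illustrative only.

```python
# Scores copied from the performance table (percent accuracy), in the order
# CareQA-en, MedMCQA, MedQA-USMLE, PubMedQA, MMLU-Med.
scores = {
    "DCLM-baseline": [26.9, 31.6, 25.9, 70.6, 31.0],
    "TheBlueScrubs-v1": [25.1, 32.2, 25.3, 69.2, 25.7],
    "EAI-Taxonomy Med": [27.7, 32.5, 28.1, 67.0, 29.5],
    "EAI-Taxonomy Med w/ DCLM": [31.5, 32.7, 30.1, 68.6, 39.2],
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def relative_gain(new, old):
    """Relative improvement of mean(new) over mean(old), in percent."""
    return 100.0 * (mean(new) - mean(old)) / mean(old)


# Per-dataset average across the five benchmarks.
averages = {name: mean(vals) for name, vals in scores.items()}

# Assumed comparison: best EAI variant vs. the specialized medical baseline.
gain = relative_gain(
    scores["EAI-Taxonomy Med w/ DCLM"], scores["TheBlueScrubs-v1"]
)

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
print(f"Relative gain over TheBlueScrubs-v1: {gain:.2f}%")
```

Under this reading the averages are 35.5 and about 40.4, a gap of roughly 4.9 percentage points, and the relative gain falls between 13.8% and 13.9%.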

Dataset Schema Documentation

Overview

This dataset contains web-crawled text data with comprehensive metadata, quality signals, and taxonomic classifications. Each record represents a document extracted from web archives with detailed provenance tracking and quality assessment metrics.

Core Fields

Field  Type    Description                               Path
id     Int64   Unique identifier based on document hash  id
text   String  The main textual content of the document  text

EAI Taxonomy Classification

Comprehensive hierarchical classification system with primary and secondary labels - the most important feature of this dataset. The taxonomy is designed to provide detailed subject categorization, document type identification, content quality assessment, and extraction quality indicators.
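
As a rough illustration of working with the primary and secondary labels, the sketch below reads them from a single record. It assumes the taxonomy is stored under `eai_taxonomy` as nested dicts with a `free_decimal_correspondence` entry holding `primary` and `secondary` nodes, each with a `code` and `labels`; the sample record and the helper names are illustrative, not part of the dataset's API.

```python
from typing import Any, Dict, Optional

# Illustrative record; the nested field layout is an assumption.
sample: Dict[str, Any] = {
    "id": 1,
    "text": "Sample document text.",
    "eai_taxonomy": {
        "free_decimal_correspondence": {
            "primary": {
                "code": "613.69",
                "labels": {
                    "level_1": "Industrial arts, Technology, and Engineering",
                    "level_2": "Medicine",
                    "level_3": "Health and Hygiene",
                },
            },
            # Labels deliberately left out of this illustration.
            "secondary": {"code": "613.7", "labels": {}},
        }
    },
}


def _fdc_node(record: Dict[str, Any], which: str) -> Dict[str, Any]:
    """Return the primary/secondary taxonomy node, or {} if absent."""
    taxonomy = record.get("eai_taxonomy") or {}
    fdc = taxonomy.get("free_decimal_correspondence") or {}
    return fdc.get(which) or {}


def fdc_code(record: Dict[str, Any], which: str = "primary") -> Optional[str]:
    """Subject code of the chosen node, or None if missing."""
    return _fdc_node(record, which).get("code")


def fdc_label(
    record: Dict[str, Any], which: str = "primary", level: str = "level_3"
) -> Optional[str]:
    """Label at the given level of the chosen node, or None if missing."""
    return (_fdc_node(record, which).get("labels") or {}).get(level)


def has_code_prefix(record: Dict[str, Any], prefix: str) -> bool:
    """True if the primary or secondary code starts with `prefix`."""
    codes = (fdc_code(record, "primary"), fdc_code(record, "secondary"))
    return any(code is not None and code.startswith(prefix) for code in codes)


print(fdc_code(sample), fdc_label(sample, level="level_2"))
```

Matching on a code prefix is one simple way to express a semantic filter of the kind described above, since every missing level degrades to `None` rather than raising.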

How to Load the Dataset

This section provides examples of how to load the EssentialAI/eai-taxonomy-med-w-dclm dataset using different Python libraries and frameworks.

Using Hugging Face Datasets (Standard Method)

The simplest way to load the dataset is using the Hugging Face datasets library:

from datasets import load_dataset

# Load the entire dataset
dataset = load_dataset("EssentialAI/eai-taxonomy-med-w-dclm")

# View dataset structure
print(dataset)
print(f"Number of examples: {len(dataset['train'])}")

You can also load the dataset in streaming mode to avoid downloading the entire dataset at once:

from datasets import load_dataset

# Load in streaming mode
dataset = load_dataset("EssentialAI/eai-taxonomy-med-w-dclm", streaming=True)
data_stream = dataset["train"]

# Iterate through examples
for example in data_stream.take(5):
    print(example)
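
Streaming also makes it possible to filter records on the fly instead of materializing the dataset first. The sketch below is one way to do that over any iterable of records. It assumes each record carries `text` and a `quality_signals` dict with a `red_pajama_v2` entry containing `rps_doc_stop_word_fraction`; the thresholds and function names are arbitrary choices for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def keep(
    example: Dict[str, Any],
    min_chars: int = 1000,
    min_stop_word_fraction: float = 0.3,
) -> bool:
    """Keep documents that are long enough and read like natural prose.

    Both thresholds are illustrative, not recommended values.
    """
    signals = (example.get("quality_signals") or {}).get("red_pajama_v2") or {}
    fraction = signals.get("rps_doc_stop_word_fraction")
    if fraction is None:
        return False
    return (
        len(example.get("text", "")) >= min_chars
        and fraction >= min_stop_word_fraction
    )


def first_matching(
    stream: Iterable[Dict[str, Any]], n: int, **thresholds: Any
) -> List[Dict[str, Any]]:
    """Lazily scan `stream` and return the first `n` records that pass keep()."""
    matches = (example for example in stream if keep(example, **thresholds))
    return list(islice(matches, n))
```

With the streaming dataset from the snippet above, `first_matching(data_stream, 5)` would stop reading as soon as five matching documents are found.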

Using PySpark

For large-scale distributed processing, you can load the dataset using PySpark with the pyspark_huggingface library:

# First install the required library:
# pip install pyspark_huggingface

import pyspark_huggingface
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("EAI-Taxonomy-Med-w-DCLM").getOrCreate()

# Load the dataset using the "huggingface" data source
df = spark.read.format("huggingface").load("EssentialAI/eai-taxonomy-med-w-dclm")

# Basic dataset exploration
# Note: count() triggers a full scan, which can be slow on a dataset of this size
print(f"Dataset shape: {df.count()} rows, {len(df.columns)} columns")
df.show(10)
df.printSchema()

# Load only specific columns for efficiency
df_subset = (
    spark.read.format("huggingface")
    .option("columns", '["id", "text"]')  # JSON list of column names to load
    .load("EssentialAI/eai-taxonomy-med-w-dclm")
)

# Run SQL queries on the dataset
df.createOrReplaceTempView("eai_taxonomy_med_w_dclm_dataset")
result = spark.sql("""
 SELECT COUNT(*) as total_examples
 FROM eai_taxonomy_med_w_dclm_dataset
""")
result.show()

Using Daft

Daft provides a modern DataFrame library optimized for machine learning workloads. You can load the dataset directly from Hugging Face:

import daft

# Load the entire dataset
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-med-w-dclm")

# Basic exploration
print("Dataset schema:")
print(df.schema())

print("First 5 rows:")
df.show(5)

If you need to access private datasets or use authentication:

import daft
from daft.io import IOConfig, HTTPConfig

io_config = IOConfig(http=HTTPConfig(bearer_token="your_token"))
df = daft.read_parquet("hf://datasets/EssentialAI/eai-taxonomy-med-w-dclm", io_config=io_config)
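
Rather than hardcoding the token as in the snippet above, it can be read from the environment. The helper below is a minimal sketch; the function name is hypothetical, and the use of an `HF_TOKEN` environment variable is an assumption about how the token is stored on your machine.

```python
import os


def resolve_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in `env_var`, stripped of whitespace.

    Raises RuntimeError if the variable is unset or empty, so a missing
    token fails early instead of surfacing as an HTTP error later.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set or is empty."
        )
    return token
```

The result can then be passed in place of the literal string, for example `HTTPConfig(bearer_token=resolve_hf_token())`.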

Installation Requirements

Make sure you have the required libraries installed:

# For Hugging Face datasets
pip install datasets

# For PySpark with Hugging Face integration
pip install pyspark_huggingface

# For Daft
pip install daft
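
To check which of these libraries are already available before running the examples, a short standard-library script can help. This is only a sketch: it assumes the importable module names match the pip package names listed above, and the helper name is illustrative.

```python
from importlib.util import find_spec
from typing import Dict, List

# Importable module name -> pip package name (assumed to be identical here).
REQUIRED: Dict[str, str] = {
    "datasets": "datasets",
    "pyspark_huggingface": "pyspark_huggingface",
    "daft": "daft",
}


def missing_packages(modules: Dict[str, str]) -> List[str]:
    """Return pip names for modules that cannot be found in this environment."""
    return [
        pip_name
        for module, pip_name in modules.items()
        if find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```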

📜 License

Essential-Web-v1.0 contributions are made available under the ODC attribution license; however, users should also abide by the Common Crawl - Terms of Use. We do not alter the license of any of the underlying data.

📝 Citation

@misc{ai2025essentialwebv1024ttokens,
 title={Essential-Web v1.0: 24T tokens of organized web data},
 author={{Essential AI} and Andrew Hojel and Michael Pust and Tim Romanski and Yash Vanjani and Ritvik Kapila and Mohit Parmar and Adarsh Chaluvaraju and Alok Tripathy and Anil Thomas and Ashish Tanwer and Darsh J Shah and Ishaan Shah and Karl Stratos and Khoi Nguyen and Kurt Smith and Michael Callahan and Peter Rushton and Philip Monk and Platon Mazarakis and Saad Jamal and Saurabh Srivastava and Somanshu Singla and Ashish Vaswani},
 year={2025},
 eprint={2506.14111},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2506.14111},
}