id int64 | text string | metadata dict | line_start_n_end_idx dict | quality_signals dict | eai_taxonomy dict | pid string |
|---|---|---|---|---|---|---|
-3,908,994,749,044,929,500 | Wednesday, May 25, 2011
NEVER GROW UP
These are two people I am always happy to meet, and not only because they have amazing style.
The bubbles were in front of Monki's Helsinki store last Saturday, when Monki and Weekday were celebrating their first year in Helsinki.
I was an extra at the Monki party in the evening and I ... | {
"url": "http://indielovescake.blogspot.com/2011_05_01_archive.html",
"source_domain": "indielovescake.blogspot.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "97639",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:CORHYPXHOA3CG5... | {
"line_start_idx": [
0,
24,
25,
39,
40,
41,
42,
133,
134,
269,
383,
470,
589,
663,
664,
665,
666,
690,
691,
702,
703,
704,
705,
706,
766,
767,
824,
1069,
1177,
1262,
1263,
1286,
1287,
... | {
"red_pajama_v2": {
"ccnet_original_length": 3176,
"ccnet_original_nlines": 93,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.35227271914482117,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.078125,
... | {
"free_decimal_correspondence": {
"primary": {
"code": "746.92",
"labels": {
"level_1": "Arts",
"level_2": "Drawing, Decoration and ornament, and Design",
"level_3": "Needlework and Fancy work"
}
},
"secondary": {
"code": "306.482",
"labels": {
... | 63cf76e62422470f51c1029c86092532 |
6,279,127,140,263,448,000 | Foobooz - Your guide to food and drink in Philadelphia
• Neighborhoods
• Features
• Tip Jar
Have a food or drink tip? tips@foobooz.com (AIM:foobooz)
• Opening Soon
• Upcoming Events
• Categories
• Archives
• Masthead
• Fun Things To Do in Philly
• Subscribe
Speck Food and ... | {
"url": "http://philadelphia.foobooz.com/tag/speck-food-and-wine/",
"source_domain": "philadelphia.foobooz.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "107958",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:JOEZHX4RY2WDSBAYSP... | {
"line_start_idx": [
0,
55,
73,
74,
80,
93,
94,
100,
112,
113,
174,
175,
192,
193,
213,
214,
229,
230,
243,
244,
257,
258,
289,
290,
304,
305,
325,
326,
345,
346,
382,
383,
544,
545... | {
"red_pajama_v2": {
"ccnet_original_length": 6602,
"ccnet_original_nlines": 188,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.25222551822662354,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.0051928800... | {
"free_decimal_correspondence": {
"primary": {
"code": "647.95",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Home economics",
"level_3": "Household employees, Caterers and catering, and Real estate management"
}
},
"secondar... | 63cf76e62422470f51c1029c86092532 |
-3,179,952,510,865,221,000 | Just $125.00 away from free shipping!
Details
L/S Halftime Henley
Item# LCK02395
$50.00
$37.50
Pick your team color and gear up for the game in this sporty Halftime Henley. Features 1x1 rib knit, Henley neckline, scoop neck, six-button placket, natural shoulder, piecing at hem, cuffed sleeves and metal buttons. 9... | {
"url": "http://www.cutterbuck.com/holiday-shop/women/tops/l-s-halftime-henley-105.html",
"source_domain": "www.cutterbuck.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "131068",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:7H... | {
"line_start_idx": [
0,
38,
39,
47,
48,
68,
69,
84,
85,
92,
93,
100,
101,
367,
379,
380,
398,
399,
419,
447,
473,
481,
482,
494,
495,
516,
517,
522,
533
],
"line_end_idx": [
38,
39,
47,... | {
"red_pajama_v2": {
"ccnet_original_length": 537,
"ccnet_original_nlines": 28,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.1043478325009346,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.0347826108336... | {
"free_decimal_correspondence": {
"primary": {
"code": "646.7",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Home economics",
"level_3": "Clothing and dress, Fashion, and Beauty, Personal"
}
},
"secondary": {
"code": "6... | 63cf76e62422470f51c1029c86092532 |
-230,021,332,694,416,350 | Top Sources
By Region
Classifieds
Reviews in history
Reviews of significant work in all fields of historical interest. Sign up for email alerts
history.ac.uk
BBIH: a new bibliography
Search over 500,000 books and articles about British and Irish history in the new BBIH
history.ac.uk
Latest questions
Ebenezer Chap... | {
"url": "http://www.british-history.ac.uk/mapsheet.aspx?sheetid=1526&compid=55183",
"source_domain": "www.british-history.ac.uk",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "35576",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:LZ... | {
"line_start_idx": [
0,
12,
13,
23,
24,
25,
37,
38,
57,
148,
162,
187,
274,
288,
289,
306,
307,
377,
440,
518,
519,
596,
597,
602,
603,
652,
653,
670,
671,
676,
677,
686,
687,
891,
... | {
"red_pajama_v2": {
"ccnet_original_length": 1082,
"ccnet_original_nlines": 42,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.1608695685863495,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.043478261679... | {
"free_decimal_correspondence": {
"primary": {
"code": "912.03",
"labels": {
"level_1": "History and Geography",
"level_2": "Geography and Voyages and travels",
"level_3": "Maps and Atlases"
}
},
"secondary": {
"code": "942",
"labels": {
"leve... | 63cf76e62422470f51c1029c86092532 |
4,531,201,455,813,753,300 | ASSOCIATED PRESS COVERAGE
Sales gains lift Dollar Tree's 1Q profit 15 pct
RICHMOND, Va. (AP) -- Discount retailer Dollar Tree Inc. said Thursday that its net income increased 15 percent in the first quarter as consumers spent more at its stores, which sell goods for $1 or less....
Stocks fall on Fed, weak Chinese man... | {
"url": "http://customwire.ap.org/dynamic/fronts/MONEY_COMPLETE?SITE=WVHUN&SCTION=MONEY_COMPLETE",
"source_domain": "customwire.ap.org",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "32157",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "... | {
"line_start_idx": [
0,
26,
27,
75,
283,
284,
331,
565,
566,
603,
784,
785,
823,
1112,
1113,
1157,
1316,
1317,
1367,
1589,
1590,
1640,
1881,
1882,
1933,
2106,
2107,
2150,
2321,
2322,
2371,
... | {
"red_pajama_v2": {
"ccnet_original_length": 2944,
"ccnet_original_nlines": 46,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.24416516721248627,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.07001794874... | {
"free_decimal_correspondence": {
"primary": {
"code": "332.022",
"labels": {
"level_1": "Social sciences",
"level_2": "Economics",
"level_3": "Finance"
}
},
"secondary": {
"code": "330.9",
"labels": {
"level_1": "Social sciences",
"le... | 63cf76e62422470f51c1029c86092532 |
-7,210,321,145,682,641,000 |
Listen to free music by S Club 7
Sign Up for Blip.fm
Blip.fm is internet radio made social. It's easy to search for, play, and discover free music recommended by real people. Join today to create your own free station.
You
What artist / song do you want to Blip?
What artist / song do you want to Blip... | {
"url": "http://blip.fm/listen/S+Club+7",
"source_domain": "blip.fm",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "142722",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:JIE3YRXL6KAQ5RIIPBNNSZHAM3UVGTC3",
"WARC-Concurrent-To": ... | {
"line_start_idx": [
0,
2,
3,
36,
37,
57,
58,
224,
225,
229,
231,
242,
282,
322,
337,
341,
343,
344,
358,
359,
481,
482,
484,
485,
494,
495,
504,
508,
517,
519,
586,
588,
677,
679,
... | {
"red_pajama_v2": {
"ccnet_original_length": 1919,
"ccnet_original_nlines": 83,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.14685314893722534,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.06526806950... | {
"free_decimal_correspondence": {
"primary": {
"code": "782.0",
"labels": {
"level_1": "Arts",
"level_2": "Music",
"level_3": "Dramatic music"
}
},
"secondary": {
"code": "004.67",
"labels": {
"level_1": "General works, books and libraries, in... | 63cf76e62422470f51c1029c86092532 |
8,547,462,437,287,591,000 | Definition of Supermarkets
1. Noun. (plural of supermarket) ¹
¹ Source: wiktionary.com
Definition of Supermarkets
1. supermarket [n] - See also: supermarket
Supermarkets Pictures
Click the following link to bring up a new window with an automated collection of images related to the term: Supermarkets Images
Lexi... | {
"url": "http://www.lexic.us/definition-of/Supermarkets",
"source_domain": "www.lexic.us",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "16036",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:IIRFHSJYXSPIURBNXF3LXBJGGPH7EMFR",
"W... | {
"line_start_idx": [
0,
27,
28,
63,
64,
89,
90,
117,
118,
161,
162,
184,
185,
315,
316,
358,
359,
370,
380,
391,
400,
413,
427,
448,
462,
477,
487,
499,
514,
528,
543,
559,
572,
582... | {
"red_pajama_v2": {
"ccnet_original_length": 2578,
"ccnet_original_nlines": 71,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.2827586233615875,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.006896549835... | {
"free_decimal_correspondence": {
"primary": {
"code": "658.82",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Business",
"level_3": "Management"
}
},
"secondary": {
"code": "658.8",
"labels": {
"level_1": ... | 63cf76e62422470f51c1029c86092532 |
-7,357,380,952,551,394,000 | Last updated: May 26, 2013
Weather: Sydney 9°C - 21°C . Sunny.
NN SPORT AFL TEAM Flag Adelaide
NN Sport AFL Flag Brisbane
NN Sport AFL Flag Carlton
NN Sport AFL Flag Collingwood
NN Sport AFL Flag Essendon
NN Sport AFL Flag Geelong
NN Sport AFL Flag Gold Coast
NN Sport AFL Flag GWS
NN Sport ... | {
"url": "http://www.news.com.au/sport/afl/stage-show-relives-best-and-and-worst-of-footy-legend-ron-barassi/story-fnelctok-1226476108891?from=public_rss",
"source_domain": "www.news.com.au",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "67377",
"Content-Type": "applicati... | {
"line_start_idx": [
0,
27,
28,
64,
65,
97,
124,
150,
180,
207,
234,
260,
289,
311,
338,
366,
400,
432,
462,
489,
514,
543,
578,
579,
588,
589,
655,
656,
668,
669,
910,
911,
972,
97... | {
"red_pajama_v2": {
"ccnet_original_length": 3738,
"ccnet_original_nlines": 121,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.31081080436706543,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.0810810774... | {
"free_decimal_correspondence": {
"primary": {
"code": "796.334",
"labels": {
"level_1": "Arts",
"level_2": "Amusements and Recreation",
"level_3": "Sports and Athletics"
}
},
"secondary": {
"code": "791.4372",
"labels": {
"level_1": "Arts",
... | 63cf76e62422470f51c1029c86092532 |
3,457,151,915,812,829,700 | WIFR - News - Operation Safer Streets
Operation Safer Streets Headlines
More Operation Safer Streets Headlines
Contact Crime Stoppers
Rockford Crime Stoppers Administrative Office
P.O. Box 4535
Rockford, IL 61110
Phone: 1-815-963-7867
Toll Free: 1-888-769-STOP
Fax: 1-815-961-3206
Today's Poll
Do you agree with... | {
"url": "http://www.wifr.com/news/operationsaferstreets",
"source_domain": "www.wifr.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "72151",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:RWPFPKLCKGT5UXFE3WZ4NQGJIDZ7XLCJ",
"W... | {
"line_start_idx": [
0,
39,
40,
74,
75,
114,
115,
138,
139,
140,
186,
200,
219,
220,
242,
268,
288,
289,
302,
303,
420,
421,
425,
428,
447,
448,
449,
450,
451,
582,
687
],
"line_end_idx": [... | {
"red_pajama_v2": {
"ccnet_original_length": 708,
"ccnet_original_nlines": 30,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.07975459843873978,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.049079749733... | {
"free_decimal_correspondence": {
"primary": {
"code": "363.25",
"labels": {
"level_1": "Social sciences",
"level_2": "Social service and Societies",
"level_3": "Political activists"
}
},
"secondary": {
"code": "320.11",
"labels": {
"level_1":... | 63cf76e62422470f51c1029c86092532 |
334,083,588,758,840,700 | YOUR FRIENDS' ACTIVITY
Elie Saab Lace and Valentino White? Looks Like Jessica Biel Can't Wait For Her Wedding Day!
Jessica Biel has had more than her fair share of red carpet moments lately, as she goes on an international tour to promote new film Total Recall. We’ve loved her gorgeous and super stylish outfi... | {
"url": "http://uk.lifestyle.yahoo.com/elie-saab-lace-valentino-white-looks-like-jessica-165910603.html",
"source_domain": "uk.lifestyle.yahoo.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "100845",
"Content-Type": "application/http; msgtype=response",
"WARC-Blo... | {
"line_start_idx": [
0,
23,
24,
120,
121,
418,
419,
724,
725,
1013,
1014,
1210,
1211,
1657,
1658,
1854,
1855,
2248,
2249,
2445,
2446,
2813,
2814,
2815,
2836,
2837,
2838,
2839,
2898,
2952
],
"li... | {
"red_pajama_v2": {
"ccnet_original_length": 3009,
"ccnet_original_nlines": 29,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.3300165832042694,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.008291870355... | {
"free_decimal_correspondence": {
"primary": {
"code": "746.92",
"labels": {
"level_1": "Arts",
"level_2": "Drawing, Decoration and ornament, and Design",
"level_3": "Needlework and Fancy work"
}
},
"secondary": {
"code": "791.4372",
"labels": {
... | 63cf76e62422470f51c1029c86092532 |
-2,995,140,954,298,257,000 |
The first table below looks at the prevalence of the term Norwegian Language in IT jobs advertised for the Cambridge region. Included is a guide to the average salaries offered in IT jobs that have cited Norwegian Language over the 3 months to 19 June 2013 with a comparison to the same period in the previous 2 yea... | {
"url": "http://www.itjobswatch.co.uk/jobs/cambridge/norwegian%20language.do",
"source_domain": "www.itjobswatch.co.uk",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "28218",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:RUFFLNTVIWQ... | {
"line_start_idx": [
0,
2,
4,
5,
447,
448,
457,
469,
481,
515,
534,
545,
592,
649,
684,
716,
737,
796,
818,
833,
861,
871,
912,
929,
941,
1013,
1051,
1090,
1127,
1142,
1200,
1258,
1317,... | {
"red_pajama_v2": {
"ccnet_original_length": 3478,
"ccnet_original_nlines": 94,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.1540403962135315,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.026515150442... | {
"free_decimal_correspondence": {
"primary": {
"code": "331.7044",
"labels": {
"level_1": "Social sciences",
"level_2": "Economics",
"level_3": "Labor and Capital"
}
},
"secondary": {
"code": "004.0285",
"labels": {
"level_1": "General works, ... | 63cf76e62422470f51c1029c86092532 |
-9,165,003,614,621,566,000 | Tuesday, October 9, 2012
Nature Study with an Autumn Scavenger Hunt (and a free printable!)
I love this time of year. Crisp, cool mornings, changing leaves, hot apple cider.
And nature study.
This is just the best time of year to take your kids on a trail hike, visit a local park, or even explore what's in your o... | {
"url": "http://www.benandme.com/2012/10/nature-study-with-autumn-scavenger-hunt.html",
"source_domain": "www.benandme.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "191214",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:6STGVK... | {
"line_start_idx": [
0,
25,
26,
93,
94,
177,
178,
197,
198,
334,
335,
589,
590,
714,
715,
716,
717,
735,
736,
822,
823,
835,
836,
849,
850,
865,
866,
893,
894
],
"line_end_idx": [
25,
26,
... | {
"red_pajama_v2": {
"ccnet_original_length": 922,
"ccnet_original_nlines": 28,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.3700000047683716,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.0099999997764... | {
"free_decimal_correspondence": {
"primary": {
"code": "508",
"labels": {
"level_1": "Science and Natural history",
"level_2": "",
"level_3": ""
}
},
"secondary": {
"code": "646.7",
"labels": {
"level_1": "Industrial arts, Technology, and Engi... | 63cf76e62422470f51c1029c86092532 |
-6,929,643,867,930,982,000 | Actuarial Outpost
FlashChat Actuarial Discussion Preliminary Exams CAS/SOA Exams Cyberchat Around the World Suggestions
Berlin - Madrid - Rome - Paris - Hamburg - Warsaw
Barcelona - Vienna - Milan - Munich - Prag... | {
"url": "http://www.actuarialoutpost.com/actuarial_discussion_forum/showthread.php?t=5066&page=8",
"source_domain": "www.actuarialoutpost.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "71121",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Dig... | {
"line_start_idx": [
0,
18,
20,
125,
227,
228,
278,
333,
358,
402,
447,
448,
449,
455,
457,
484,
492,
517,
552,
559,
561,
581,
602,
613,
621,
622,
781,
782,
851,
868,
876,
901,
938,
... | {
"red_pajama_v2": {
"ccnet_original_length": 3712,
"ccnet_original_nlines": 164,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.23104692995548248,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.0457280389... | {
"free_decimal_correspondence": {
"primary": {
"code": "368.01",
"labels": {
"level_1": "Social sciences",
"level_2": "Social service and Societies",
"level_3": "Insurance"
}
},
"secondary": {
"code": "025.044",
"labels": {
"level_1": "General... | 63cf76e62422470f51c1029c86092532 |
8,604,370,070,522,151,000 | Stevie Gorrie
Art Director, The Kit
What I Want
Kenneth Jay Lane 18-karat gold-plated cubic zirconia frog necklace, $275, net-a-porter.com
“I never splurge on statement jewellery. I love that this is gorgeous but still playful.”
What I’d Give
Gucci leather briefcase, $1,590, mrporter.com
“My boyfriend’s briefcase ... | {
"url": "http://www.thekit.ca/wish-list/stevie-gorrie-2/",
"source_domain": "www.thekit.ca",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "63231",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:PK6DOAARZVVUQGHJVD74IZW5FTDHY4U3",
... | {
"line_start_idx": [
0,
14,
15,
37,
38,
50,
51,
142,
232,
233,
247,
248,
294,
451,
452,
504,
657,
658,
696,
810,
811,
854,
1030,
1031,
1056,
1084,
1123,
1135,
1147,
1159,
1171,
1172,
12... | {
"red_pajama_v2": {
"ccnet_original_length": 1343,
"ccnet_original_nlines": 37,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.3148788809776306,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.020761249586... | {
"free_decimal_correspondence": {
"primary": {
"code": "745.5",
"labels": {
"level_1": "Arts",
"level_2": "Drawing, Decoration and ornament, and Design",
"level_3": "Decorative arts"
}
},
"secondary": {
"code": "646.7",
"labels": {
"level_1": ... | 63cf76e62422470f51c1029c86092532 |
1,944,958,002,346,721,800 | Sunday, 20 May 2012
Hollywood Rides a Bike
Published earlier this year, Hollywood Rides a Bike - Cycling with the Stars fuses glamour and style with one of my favourite pastimes - cycling. Its photographs of stars on their bikes show how the sport need not necessarily involve sweat, Lycra and carbon fibre.
The book ... | {
"url": "http://www.greyfoxblog.com/2012/05/hollywood-rides-bike.html",
"source_domain": "www.greyfoxblog.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "154038",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:WYROAUBH6BV62OWU43R... | {
"line_start_idx": [
0,
20,
21,
44,
45,
310,
311,
512,
513,
514,
618,
619,
620,
621,
622,
634,
635,
661,
662,
680,
681,
698,
699,
727,
728,
747,
748,
1010,
1011,
1376,
1377,
1613,
1614,... | {
"red_pajama_v2": {
"ccnet_original_length": 2384,
"ccnet_original_nlines": 40,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.43207547068595886,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.02452830038... | {
"free_decimal_correspondence": {
"primary": {
"code": "796.5",
"labels": {
"level_1": "Arts",
"level_2": "Amusements and Recreation",
"level_3": "Sports and Athletics"
}
},
"secondary": {
"code": "770",
"labels": {
"level_1": "Arts",
... | 63cf76e62422470f51c1029c86092532 |
4,172,841,470,441,152,500 | spacer
Advanced Search
Astrobiology Magazine Facebook Astrobiology Magazine Twitter
Spiral Galaxy NGC 1672 from Hubble
05/13/12
Many spiral galaxies have bars across their centers. Even our own Milky Way Galaxy is thought to have a modest central bar. Prominently barred spiral galaxy NGC 1672, pictured above, was ca... | {
"url": "http://www.astrobio.net/index.php?option=com_galleryimg&task=imageofday&imageId=1126&pageNo=18",
"source_domain": "www.astrobio.net",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "37468",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Dige... | {
"line_start_idx": [
0,
7,
9,
25,
87,
122,
131,
940,
975,
991,
992,
993,
995,
996,
1123,
1124,
1126,
1127,
1129,
1130,
1132,
1133,
1142,
1153,
1159,
1167,
1184,
1210,
1237,
1258,
1306
],
"l... | {
"red_pajama_v2": {
"ccnet_original_length": 1336,
"ccnet_original_nlines": 30,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.20522387325763702,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.02985074929... | {
"free_decimal_correspondence": {
"primary": {
"code": "523.132",
"labels": {
"level_1": "Science and Natural history",
"level_2": "Astronomy",
"level_3": ""
}
},
"secondary": {
"code": "523.13",
"labels": {
"level_1": "Science and Natural his... | 63cf76e62422470f51c1029c86092532 |
-6,563,246,493,409,928,000 |
My Family Tree (as I know it)
Entries: 24734 Updated: 2008-12-27 20:29:12 UTC (Sat) Contact: Lori Home Page: Visit my family blog if you have a moment.
Lee/Ferguson, Johnson/Kokko, Ellsworth/Smith/Christensen, Robbins/Peltier, Dowdle/Capener. This file is a work in progress. It is a compilation of my own re... | {
"url": "http://wc.rootsweb.ancestry.com/cgi-bin/igm.cgi?db=famtree10",
"source_domain": "wc.rootsweb.ancestry.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "6798",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:AWXKHDMJMVMZ563C... | {
"line_start_idx": [
0,
1,
31,
32,
163,
164,
554,
555,
596,
612,
679,
799,
800,
922,
923,
990,
991
],
"line_end_idx": [
1,
31,
32,
163,
164,
554,
555,
596,
612,
679,
799,
800,
922,
923,
... | {
"red_pajama_v2": {
"ccnet_original_length": 1274,
"ccnet_original_nlines": 16,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.2836363613605499,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.119999997317... | {
"free_decimal_correspondence": {
"primary": {
"code": "929.1",
"labels": {
"level_1": "History and Geography",
"level_2": "Biography",
"level_3": ""
}
},
"secondary": {
"code": "025.2",
"labels": {
"level_1": "General works, books and librari... | 63cf76e62422470f51c1029c86092532 |
-1,235,873,293,340,587,300 | Cheap FSD to Sheridan flights
Book your FSD to SHR flights with Expedia and find last-minute Sioux Falls to Sheridan airfare. Expedia offers discount airfare on multiple airline carriers that fly direct and indirect routes between FSD and SHR, with new flight deals and promotions almost daily. When you book your next ... | {
"url": "http://flights.expedia.com/flights-from-sioux-falls-to-sheridan-fsd-to-shr/",
"source_domain": "flights.expedia.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "80403",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:YA4L6... | {
"line_start_idx": [
0,
30,
31,
517,
518,
572,
573,
811,
847,
893,
894,
931,
932,
1226,
1231,
1239,
1244,
1252,
1298,
1299,
1358,
1359,
1396,
1397,
1431,
1432,
1466,
1473,
1480,
1481,
1502,
... | {
"red_pajama_v2": {
"ccnet_original_length": 2069,
"ccnet_original_nlines": 45,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.2864721417427063,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.045092839747... | {
"free_decimal_correspondence": {
"primary": {
"code": "388.068",
"labels": {
"level_1": "Social sciences",
"level_2": "Commerce and Communication and traffic",
"level_3": "Local transit"
}
},
"secondary": {
"code": "388.06",
"labels": {
"leve... | 63cf76e62422470f51c1029c86092532 |
-2,920,490,037,508,219,400 | Madhya-līlā Chapter 14: Performance of the Vṛndāvana Pastimes
Bhaktivedanta VedaBase: Śrī Caitanya Caritāmṛta Madhya 14.139
ińho nija-sampatti saba prakaṭa kariyā
priyera upara yāya sainya sājāñā
SYNONYMS
ińho — this; nija-sampatti — her opulence; saba — all; prakaṭa kariyā — manifesting; priyera upara — agains... | {
"url": "http://vedabase.net/cc/madhya/14/139/",
"source_domain": "vedabase.net",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "4157",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:4Z7LLNLY3YA7T2FH52NBOL7ORMZOS4BK",
"WARC-Concur... | {
"line_start_idx": [
0,
62,
63,
126,
127,
167,
168,
201,
202,
211,
212,
397,
398,
410,
411,
576,
577,
585,
586,
1358,
1359,
1367,
1368,
1440
],
"line_end_idx": [
62,
63,
126,
127,
167,
168,
... | {
"red_pajama_v2": {
"ccnet_original_length": 1564,
"ccnet_original_nlines": 23,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.33220338821411133,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.02372881025... | {
"free_decimal_correspondence": {
"primary": {
"code": "294.542",
"labels": {
"level_1": "Religion",
"level_2": "Religions",
"level_3": "Hinduism, Brahmanism, Buddhism, and Sikhism"
}
},
"secondary": {
"code": "294.54",
"labels": {
"level_1": ... | 63cf76e62422470f51c1029c86092532 |
-3,302,979,337,496,770,000 | Join Us for English: Activity Book (Paperback)
By
Gunter Gerngross
(Author),
Herbert Puchta
(Author)
Write a Review
Cash On Delivery Available
List Price: Rs. 814 + (Sourcing Fee: 57)
Our Price: Rs. 871 ($ 16.55) (£ 9.81)
Imported Edition. FREE Shipping in India!
Ships in 5-7 business days.
Call +91-79-4026... | {
"url": "http://www.infibeam.com/Books/info/Gunter-Gerngross/Join-Us-1-Activity-Book/0521681170.html",
"source_domain": "www.infibeam.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "120074",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest... | {
"line_start_idx": [
0,
47,
48,
52,
53,
70,
71,
83,
84,
99,
100,
111,
126,
153,
195,
233,
234,
276,
304,
342,
381,
420,
421,
473,
474,
1652,
1653,
1666,
1667,
1668,
1710,
1748,
1789,
... | {
"red_pajama_v2": {
"ccnet_original_length": 4006,
"ccnet_original_nlines": 94,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.22480620443820953,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.00645995000... | {
"free_decimal_correspondence": {
"primary": {
"code": "428.2",
"labels": {
"level_1": "Philology; or, Language and languages",
"level_2": "English language",
"level_3": "English language — Study and teaching and English language — Usage"
}
},
"secondary": {
... | 63cf76e62422470f51c1029c86092532 |
-6,155,150,646,723,492,000 | Close
4.7500 5 4 $ Grand Marais Lighthouse Restaurant
Grand Marais Lighthouse Restaurant
5802 Lake Drive
Centreville, IL 62205
Phone: (618) 398-9958
Cuisine: American
Price: $
Map & Directions
• Hours
Monday: 9:00 AM - 9:00 PM
Tuesday: 9:00 AM - 9:00 PM
Wednesday: 9:00 AM - 9:00 PM
Thurs... | {
"url": "http://www.restaurant.com/grand-marais-lighthouse-restaurant-centreville-american-pid=54617",
"source_domain": "www.restaurant.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "36998",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Diges... | {
"line_start_idx": [
0,
6,
54,
64,
65,
100,
101,
117,
139,
161,
179,
188,
189,
206,
207,
217,
247,
278,
311,
343,
373,
405,
435,
476,
523,
541,
558,
613,
654,
704,
796,
837,
887,
88... | {
"red_pajama_v2": {
"ccnet_original_length": 2098,
"ccnet_original_nlines": 58,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.20232558250427246,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.03720929846... | {
"free_decimal_correspondence": {
"primary": {
"code": "647.95",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Home economics",
"level_3": "Household employees, Caterers and catering, and Real estate management"
}
},
"secondar... | 63cf76e62422470f51c1029c86092532 |
4,873,512,426,590,693,000 | 10 Things I Love Tuesday / Green Living
10 Things I Love Tuesday
1. .WHEN ACTRESS + INDIE ARTIST = BAND. [source] 2. .CROWDED SIDEWALKS. I’ve always been obsessed with Natalie Portman’s sidewalk scene in Closer. I’ve even listened to “The Blower’s Daughter” by Damien Rice in crowded commutes to recreate the same ex... | {
"url": "http://lavieboston.com/tag/reusable-bags/",
"source_domain": "lavieboston.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "32686",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:QXTTDGS522JLR5GLZRZEYKNOR5XRZ2J3",
"WAR... | {
"line_start_idx": [
0,
40,
41,
66,
67
],
"line_end_idx": [
40,
41,
66,
67,
433
]
} | {
"red_pajama_v2": {
"ccnet_original_length": 433,
"ccnet_original_nlines": 4,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.17000000178813934,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.1400000005960... | {
"free_decimal_correspondence": {
"primary": {
"code": "700.0",
"labels": {
"level_1": "Arts",
"level_2": "",
"level_3": ""
}
},
"secondary": {
"code": "363.7",
"labels": {
"level_1": "Social sciences",
"level_2": "Social service and S... | 63cf76e62422470f51c1029c86092532 |
6,198,972,250,942,115,000 | Home | FDCI Members
Hemant Khandelwaal
With photographs published in magazines such as Vogue, Cosmopolitan and Elle, and with clients among the top fashion designers in India and the world, it may be surprising that Hemant Khandelwaal says he entered the field of photography purely by chance. After working for seve... | {
"url": "http://www.fdci.org/Member.aspx?mid=1074608218&cat=6",
"source_domain": "www.fdci.org",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "44349",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:ONJ2G4ORSPSPF4UFROX65KULGKCT2DPV",
... | {
"line_start_idx": [
0,
22,
23,
42,
43,
1350,
1351,
1364,
1374,
1395,
1418,
1424,
1439,
1459,
1479,
1488,
1518,
1530,
1556,
1569,
1598,
1912,
1930,
1952,
1966,
1981
],
"line_end_idx": [
22,
23,
42,... | {
"red_pajama_v2": {
"ccnet_original_length": 2028,
"ccnet_original_nlines": 25,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.3076923191547394,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.007957560010... | {
"free_decimal_correspondence": {
"primary": {
"code": "770.0",
"labels": {
"level_1": "Arts",
"level_2": "Photography",
"level_3": ""
}
},
"secondary": {
"code": "741.0",
"labels": {
"level_1": "Arts",
"level_2": "Drawing, Decoration ... | 63cf76e62422470f51c1029c86092532 |
8,202,382,779,431,155,000 | La Boca
•
•
– The Emperor Machine DCR91
PDF Portfolio
Download a high-resolution print quality PDF of La Boca’s portfolio. mydébut account holders can create custom PDFs by exporting their lightboxes.
Please note the usage of this PDF is subject to the Client Terms & Conditions.
Request Portfolio
Complet... | {
"url": "http://www.debutart.com/illustration/la-boca/the-emperor-machine-dcr91",
"source_domain": "www.debutart.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "34443",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:65M35IXZ3NSJX... | {
"line_start_idx": [
0,
8,
9,
15,
21,
50,
51,
65,
66,
213,
214,
293,
294,
312,
313,
373,
374,
403,
404,
453,
454,
464,
476,
477,
662,
663,
704,
705,
1081,
1082
],
"line_end_idx": [
8,
9... | {
"red_pajama_v2": {
"ccnet_original_length": 1347,
"ccnet_original_nlines": 29,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.26070040464401245,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.02723735012... | {
"free_decimal_correspondence": {
"primary": {
"code": "741.6",
"labels": {
"level_1": "Arts",
"level_2": "Drawing, Decoration and ornament, and Design",
"level_3": "Drawing — Technique and Caricatures and cartoons"
}
},
"secondary": {
"code": "745.5",
... | 63cf76e62422470f51c1029c86092532 |
-5,650,358,683,257,280,000 | new
Compare Quotes from TRUSTED Repair Shops
100% Satisfaction Guarantee PLUS 5% Cash Back on Repairs
Available in:
Bakersfield, CA
Spokane, WA
Long Island, NY
Get Quotes
Home » Find a Shop » CA » Hesperia » GMC
GMC Repair Shops in Hesperia, California
Find your way to the best GMC service center in Hesperia, Cali... | {
"url": "http://www.automd.com/shops/CA/hesperia/gmc/",
"source_domain": "www.automd.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "48119",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:OOZJHE3WAJLQ5T5Q3LCFZD5SSWDG63GX",
"W... | {
"line_start_idx": [
0,
4,
5,
46,
47,
104,
105,
119,
135,
147,
163,
174,
215,
216,
257,
258,
340,
341,
775,
776,
797,
798,
845,
846
],
"line_end_idx": [
4,
5,
46,
47,
104,
105,
119,
135... | {
"red_pajama_v2": {
"ccnet_original_length": 851,
"ccnet_original_nlines": 23,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.34090909361839294,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.079545453190... | {
"free_decimal_correspondence": {
"primary": {
"code": "629.22",
"labels": {
"level_1": "Industrial arts, Technology, and Engineering",
"level_2": "Engineering",
"level_3": ""
}
},
"secondary": {
"code": "658.85",
"labels": {
"level_1": "Indus... | 63cf76e62422470f51c1029c86092532 |
-6,341,579,750,969,416,000 |
Go Symbol Lookup
Loading...
How Bad is the DVD Decline and Who Suffers?
Text Size
Published: Friday, 17 Jul 2009 | 4:28 PM ET
Julia Boorstin By:
CNBC Media and Entertainment Reporter
DVD sales used to be the bread and butter of the movie studios business, even more important to the bottom line than box office ... | {
"url": "http://www.cnbc.com/id/31969248",
"source_domain": "www.cnbc.com",
"snapshot_id": "crawl=CC-MAIN-2013-20",
"warc_metadata": {
"Content-Length": "67804",
"Content-Type": "application/http; msgtype=response",
"WARC-Block-Digest": "sha1:WDIZDKXAUSNT6K46YMQV7PPVJIUQQZNR",
"WARC-Concurrent-... | {
"line_start_idx": [
0,
2,
19,
30,
31,
75,
76,
89,
133,
152,
153,
191,
192,
512,
513,
516,
517,
684,
685,
829,
830,
852,
853,
962,
963,
1137,
1138,
1348,
1349,
1358,
1359,
1482,
1483,
... | {
"red_pajama_v2": {
"ccnet_original_length": 2439,
"ccnet_original_nlines": 73,
"rps_doc_curly_bracket": 0,
"rps_doc_ldnoobw_words": 0,
"rps_doc_lorem_ipsum": 0,
"rps_doc_stop_word_fraction": 0.3368644118309021,
"rps_doc_ut1_blacklist": 0,
"rps_doc_frac_all_caps_words": 0.027542369440... | {
"free_decimal_correspondence": {
"primary": {
"code": "338.4",
"labels": {
"level_1": "Social sciences",
"level_2": "Economics",
"level_3": "Industries, Prices, and Microeconomics"
}
},
"secondary": {
"code": "004.67",
"labels": {
"level_1": ... | 63cf76e62422470f51c1029c86092532 |
🌐 Essential-Web: Complete 24-Trillion Token Dataset
🏆 Website | 🖥️ Code | 📖 Paper | ☁️ AWS
📋 Dataset Description
Essential-Web is a 24-trillion-token web dataset with document-level metadata designed for flexible dataset curation. The dataset provides metadata including subject matter classification, web page type, content complexity, and document quality scores for each of the 23.6 billion documents.
Researchers can filter and curate specialized datasets using the provided metadata, reducing the need for custom preprocessing pipelines and domain-specific classifiers.
🔍 Free Decimal Correspondence (FDC) Taxonomy
Essential-Web uses the Free Decimal Correspondence, a Dewey Decimal-inspired open taxonomy with 12 main categories for classifying web content. This systematic approach enables precise domain filtering and dataset curation.
For help navigating FDC codes, see: https://www.librarything.com/mds
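As a rough illustration of how FDC codes can drive domain filtering, the sketch below pulls the primary and secondary codes off a record and checks them against a code prefix. The nested path `eai_taxonomy.free_decimal_correspondence.{primary,secondary}.code` follows the records shown in the dataset preview; the helper names `fdc_codes` and `in_fdc_class` and the sample record are hypothetical and not part of any Essential-Web tooling.

```python
from typing import Any, Dict, List


def fdc_codes(record: Dict[str, Any]) -> List[str]:
    """Return the primary and secondary FDC codes found on a record, in that order."""
    fdc = record.get("eai_taxonomy", {}).get("free_decimal_correspondence", {})
    codes = []
    for slot in ("primary", "secondary"):
        code = (fdc.get(slot) or {}).get("code")
        if code:
            codes.append(code)
    return codes


def in_fdc_class(record: Dict[str, Any], prefix: str) -> bool:
    """True if any FDC code on the record starts with the given prefix."""
    return any(code.startswith(prefix) for code in fdc_codes(record))


# Hypothetical record mirroring the nested layout seen in the preview rows.
sample = {
    "eai_taxonomy": {
        "free_decimal_correspondence": {
            "primary": {"code": "770.0", "labels": {"level_1": "Arts", "level_2": "Photography"}},
            "secondary": {"code": "741.0", "labels": {"level_1": "Arts"}},
        }
    }
}

print(fdc_codes(sample))
print(in_fdc_class(sample, "77"))
```

Because FDC codes are decimal strings, plain prefix matching is enough to select a whole branch of the hierarchy: `"7"` would cover the Arts class, while `"77"` narrows it to the 770s.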
⚙️ Dataset Creation
Essential-Web was created using a comprehensive processing pipeline starting from Common Crawl data:
📥 Source Data
- DCLM Pool: 89 resiliparse-extracted Common Crawl WARC snapshots (CC-MAIN-2013-20 to CC-MAIN-2022-49)
- Additional Snapshots: 12 additional snapshots extracted from CC-MAIN-2023-06 to CC-MAIN-2024-38 using resiliparse
- Total: 101 Common Crawl snapshots processed
🔧 Processing Pipeline
- Document ID Generation: Using xxhash.xxh3_64_intdigest for unique document identification
- Global Deduplication: Hash-based deduplication across all 101 snapshots
- Minhash LSH Deduplication: Snapshot-level deduplication with Jaccard threshold of 0.7 (14 bands, 9 rows per band)
- Quality Annotation: Statistical and model-based quality signals using RedPajama-Data-V2 pipeline variant, including DCLM-baseline fastText classifier
- Quality Filtering: Manually tuned filters to retain high-quality English documents while preserving math and code content
- Taxonomy Labeling: Classification of every document using EAI-Taxonomy-0.5b (~90,000 AMD MI300x GPU-hours)
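To give a feel for what the MinHash LSH settings in the list above imply, here is a minimal sketch of the standard banding analysis: with `b` bands of `r` rows each, two documents with Jaccard similarity `s` become a candidate pair with probability `1 - (1 - s^r)^b`. The source states only the parameters (14 bands, 9 rows, threshold 0.7); applying this textbook formula to them is an assumption, and the function names are hypothetical.

```python
def lsh_candidate_probability(similarity: float, bands: int = 14, rows: int = 9) -> float:
    """Probability that two documents share at least one identical LSH band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approximate_threshold(bands: int = 14, rows: int = 9) -> float:
    """Similarity at which the S-curve rises most steeply, (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    print(f"approximate threshold: {approximate_threshold():.3f}")
    for s in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"similarity {s:.1f} -> candidate probability {lsh_candidate_probability(s):.3f}")
```

Under this analysis the steep part of the curve sits close to the stated 0.7 threshold, so near-duplicates well above it are almost always compared while pairs well below it rarely are.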
🎯 Performance & Validation
We've curated example domain-specific datasets from Essential-Web using simple metadata filters. They perform competitively with top-performing web-curated datasets:
- 🧮 Math: within 8.0% of web-curated baselines
- 💻 Web Code: 14.3% above web-curated baselines
- 🔬 STEM: 24.5% above web-curated baselines
- 🩺 Medical: 8.6% above web-curated baselines
Note: These represent initial examples with significant room for further curation and improvement. Comparisons are against web-sourced datasets rather than specialized synthetic datasets.
🚀 Related Datasets & Models
Domain-Specific Datasets
We've curated high-quality domain-specific datasets from Essential-Web:
- Math: EssentialAI/eai-taxonomy-math-w-fm
- Code: EssentialAI/eai-taxonomy-code-w-dclm
- Medical: EssentialAI/eai-taxonomy-med-w-dclm
- STEM: EssentialAI/eai-taxonomy-stem-w-dclm
Classification Model
- EAI-Taxonomy-0.5b: EssentialAI/eai-taxonomy-0.5b - The efficient classifier used to label Essential-Web documents
🎯 Intended Use
Essential-Web enables researchers to:
- 🚀 Rapid Curation: Create multi-billion-token domain-specific datasets in minutes using SQL-like filters
- 🔍 Flexible Exploration: Explore web content across subjects, quality levels, and content types
- 🏗️ Custom Pipelines: Build specialized training corpora without custom classification infrastructure
- 🔄 Iterative Improvement: Easily modify and refine dataset composition based on training results
- 📊 Quality Control: Filter out low-quality content (ads, product listings) while preserving reasoning-dense documents
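The kind of metadata filter described above can be sketched as a lazy generator over records. This is only an illustration: the field paths (`eai_taxonomy.free_decimal_correspondence.primary.code` and `quality_signals.red_pajama_v2.rps_doc_stop_word_fraction`) follow the preview rows, while the function name `curate`, the threshold value, and the toy records are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator


def curate(
    records: Iterable[Dict[str, Any]],
    fdc_prefix: str,
    min_stop_word_fraction: float = 0.0,
) -> Iterator[Dict[str, Any]]:
    """Yield records whose primary FDC code matches a prefix and that pass a quality floor."""
    for record in records:
        fdc = record.get("eai_taxonomy", {}).get("free_decimal_correspondence", {})
        code = (fdc.get("primary") or {}).get("code") or ""
        signals = record.get("quality_signals", {}).get("red_pajama_v2", {})
        stop_fraction = signals.get("rps_doc_stop_word_fraction", 0.0)
        if code.startswith(fdc_prefix) and stop_fraction >= min_stop_word_fraction:
            yield record


def make_record(doc_id: int, code: str, stop_fraction: float) -> Dict[str, Any]:
    """Build a toy record with only the fields the filter reads."""
    return {
        "id": doc_id,
        "eai_taxonomy": {"free_decimal_correspondence": {"primary": {"code": code}}},
        "quality_signals": {"red_pajama_v2": {"rps_doc_stop_word_fraction": stop_fraction}},
    }


toy_records = [
    make_record(1, "770.0", 0.31),
    make_record(2, "629.22", 0.34),
    make_record(3, "700.0", 0.17),
]

kept_ids = [r["id"] for r in curate(toy_records, "7", min_stop_word_fraction=0.25)]
print(kept_ids)
```

Since `curate` accepts any iterable, the same predicate could in principle be applied to a streamed split of the dataset rather than an in-memory list.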
Dataset Schema Documentation
Overview
This dataset contains web-crawled text data with comprehensive metadata, quality signals, and taxonomic classifications. Each record represents a document extracted from web archives with detailed provenance tracking and quality assessment metrics.
Core Fields
| Field | Type | Description | Path |
|---|---|---|---|
| id | Int64 | Unique identifier based on document hash | id |
| text | String | The main textual content of the document | text |
EAI Taxonomy Classification
Comprehensive hierarchical classification system with primary and secondary labels - the most important feature of this dataset. The taxonomy is designed to provide detailed subject categorization, document type identification, content quality assessment, and extraction quality indicators.
How to Load the Dataset
This section provides examples of how to load the EssentialAI/essential-web-v1.0 dataset using different Python libraries and frameworks.
Using Hugging Face Datasets (Standard Method)
The simplest way to load the dataset is using the Hugging Face datasets library:
from datasets import load_dataset
# Load the entire dataset
dataset = load_dataset("EssentialAI/essential-web-v1.0")
# View dataset structure
print(dataset)
print(f"Number of examples: {len(dataset['train'])}")
You can also load the dataset in streaming mode to avoid downloading the entire dataset at once:
from datasets import load_dataset
# Load in streaming mode
dataset = load_dataset("EssentialAI/essential-web-v1.0", streaming=True)
data_stream = dataset["train"]
# Iterate through examples
for example in data_stream.take(5):
    print(example)
Using PySpark
For large-scale distributed processing, you can load the dataset using PySpark with the pyspark_huggingface library:
# First install the required library:
# pip install pyspark_huggingface
import pyspark_huggingface
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("EAI-Taxonomy-Web").getOrCreate()
# Load the dataset using the "huggingface" data source
df = spark.read.format("huggingface").load("EssentialAI/essential-web-v1.0")
# Basic dataset exploration
print(f"Dataset shape: {df.count()} rows, {len(df.columns)} columns")
df.show(10)
df.printSchema()
# Load only specific columns for efficiency
df_subset = (
    spark.read.format("huggingface")
    .option("columns", '["column1", "column2"]')  # Replace with actual column names
    .load("EssentialAI/essential-web-v1.0")
)
# Run SQL queries on the dataset
df.createOrReplaceTempView("eai_web_dataset")
result = spark.sql("""
    SELECT COUNT(*) AS total_examples
    FROM eai_web_dataset
""")
result.show()
Using Daft
Daft provides a modern DataFrame library optimized for machine learning workloads. You can load the dataset directly from Hugging Face:
import daft
# Load the entire dataset
df = daft.read_parquet("hf://datasets/EssentialAI/essential-web-v1.0")
# Basic exploration
print("Dataset schema:")
print(df.schema())
print("First 5 rows:")
df.show(5)
If you need to access private datasets or use authentication:
import daft
from daft.io import IOConfig, HTTPConfig
io_config = IOConfig(http=HTTPConfig(bearer_token="your_token"))
df = daft.read_parquet("hf://datasets/EssentialAI/essential-web-v1.0", io_config=io_config)
Installation Requirements
Make sure you have the required libraries installed:
# For Hugging Face datasets
pip install datasets
# For PySpark with Hugging Face integration
pip install pyspark_huggingface
# For Daft
pip install daft
📜 License
Essential-Web-v1.0 contributions are made available under the ODC attribution license; however, users should also abide by the Common Crawl - Terms of Use. We do not alter the license of any of the underlying data.
📝 Citation
@misc{ai2025essentialwebv1024ttokens,
  title={Essential-Web v1.0: 24T tokens of organized web data},
  author={Essential AI and Andrew Hojel and Michael Pust and Tim Romanski and Yash Vanjani and Ritvik Kapila and Mohit Parmar and Adarsh Chaluvaraju and Alok Tripathy and Anil Thomas and Ashish Tanwer and Darsh J Shah and Ishaan Shah and Karl Stratos and Khoi Nguyen and Kurt Smith and Michael Callahan and Peter Rushton and Philip Monk and Platon Mazarakis and Saad Jamal and Saurabh Srivastava and Somanshu Singla and Ashish Vaswani},
  year={2025},
  eprint={2506.14111},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.14111},
}
