Pricing
Pay per event
CBS 60 Minutes Transcripts Scraper
Collects full interview transcripts from CBS 60 Minutes. Discovers pages via the CBS News article sitemap, extracts the Q&A body, correspondent name, broadcast date, speaker labels, and topic tags. Video-only segments without a published transcript are skipped.
Scrapes full Q&A interview transcripts from CBS News 60 Minutes, the most-recognised US investigative news magazine. Returns one record per transcript page: title, correspondent, broadcast date, subject list, speaker-labeled body text, and topic metadata. Discovers transcript pages automatically from the CBS News article sitemap. Video-only segments without a published transcript are skipped.
60 Minutes is the most-watched US news magazine, known for long-form sit-down interviews with heads of state, CEOs, whistleblowers, and scientists. Each transcript runs 5,000-30,000 words of clean, on-the-record Q&A, which makes it high-signal content for media research, RAG pipelines, and investigative journalism datasets.
What It Scrapes
Targets two URL patterns on cbsnews.com:
/news/<slug>-60-minutes-transcript/ (primary transcript pattern)
/news/read-the-full-transcript-of-<slug>/ (extended interview variant)
Discovery walks the CBS News monthly article sitemaps, filters by these patterns, and scrapes each matching page. Video-only stories (e.g. /news/<slug>-60-minutes/) are explicitly excluded.
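The URL filter described above could be sketched as follows. The regular expressions and the function name `is_transcript_url` are illustrative assumptions derived from the two documented path patterns, not the actor's actual code.

```python
import re
from urllib.parse import urlparse

# Hypothetical regexes derived from the two documented path patterns.
TRANSCRIPT_PATTERNS = (
    re.compile(r"^/news/[^/]+-60-minutes-transcript/?$"),
    re.compile(r"^/news/read-the-full-transcript-of-[^/]+/?$"),
)


def is_transcript_url(url: str) -> bool:
    """Return True when the URL is on cbsnews.com and matches a transcript pattern."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("cbsnews.com", "www.cbsnews.com"):
        return False
    return any(pattern.match(parsed.path) for pattern in TRANSCRIPT_PATTERNS)
```

Because the first pattern requires the `-60-minutes-transcript` suffix, a video-only path such as /news/<slug>-60-minutes/ does not match and is filtered out.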
Output Schema
| Field | Type | Description |
|---|---|---|
| story_slug | string | URL slug of the transcript page |
| story_title | string | Article headline |
| story_url | string | Canonical CBS News URL |
| aired_date | string | Broadcast date (YYYY-MM-DD) |
| published_date | string | CBS News publish timestamp (ISO 8601) |
| segment_type | string | Inferred type: interview, investigation, or profile |
| correspondent | string | CBS News correspondent (e.g. Major Garrett, Lesley Stahl) |
| subjects | string | Interviewed subjects extracted from speaker labels (comma-separated) |
| synopsis | string | Article dek / meta description |
| body_html | string | Full transcript HTML preserving Q&A paragraph structure |
| body_text | string | Plain-text version of the transcript |
| speakers | string | All speaker labels found in the transcript (comma-separated) |
| is_transcript | boolean | Always true; non-transcripts are skipped |
| has_video_only_variant | boolean | True when a paired video-only story exists |
| related_story_urls | string | Related CBS News links on the page (comma-separated) |
| topics | string | CBS News topic tags (comma-separated) |
| canonical_url | string | Canonical URL from page head |
| source | string | Fixed: cbsnews.com/60-minutes |
| scraped_at | datetime | ISO 8601 scrape timestamp |
Speaker labels follow two CBS conventions: Major Garrett: (Title Case) and MAJOR GARRETT: (ALL-CAPS, used in the extended-interview variant). Both formats are normalized and extracted.
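As a rough illustration of how both label conventions might be handled, the sketch below matches a paragraph-leading `Name:` prefix and converts ALL-CAPS labels to Title Case. The regex, the five-word limit on names, and the function name `extract_speakers` are assumptions made for this example; the actor's real normalization rules are not documented here.

```python
import re

# Assumed shape of a label: one to five capitalized words followed by a colon.
SPEAKER_RE = re.compile(r"^([A-Z][A-Za-z.'-]*(?: [A-Z][A-Za-z.'-]*){0,4}):\s+")


def extract_speakers(paragraphs):
    """Return unique speaker labels in order of first appearance."""
    seen = []
    for text in paragraphs:
        match = SPEAKER_RE.match(text.strip())
        if not match:
            continue
        label = match.group(1)
        # ALL-CAPS labels (extended-interview variant) are folded to Title Case.
        if label.isupper():
            label = label.title()
        if label not in seen:
            seen.append(label)
    return seen
```

A pattern this simple would also treat a paragraph starting with "Note: " as a speaker, so a production version would likely need a stop list or a check against the known correspondent and subject names.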
Input Options
maxItems (integer, required): Maximum number of transcript records to scrape. Set a higher value for bulk runs.
startDate (string, optional): Limit sitemap discovery to a given month onwards (YYYY-MM format, e.g. "2024-01"). Defaults to all available months when omitted.
startUrls (array, optional): One or more direct CBS News transcript URLs. When provided, sitemap discovery is skipped and only the supplied URLs are scraped. Useful for targeted re-runs of specific episodes.
Example: Specific episode
{"maxItems":1,"startUrls":[{"url":"https://www.cbsnews.com/news/netanyahu-us-israel-iran-60-minutes-transcript/"}]}
Example: All 2025 transcripts
{"maxItems":200,"startDate":"2025-01"}
Example: Full archive (all available transcripts)
{"maxItems":1000}
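For readers who build these inputs programmatically, a minimal validation sketch is shown below. It mirrors the three documented options; the function name `validate_input`, the `useSitemap` flag, and the rule that maxItems must be at least 1 are assumptions for illustration rather than documented behaviour.

```python
import re

# startDate is documented as YYYY-MM, e.g. "2024-01".
START_DATE_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")


def validate_input(actor_input: dict) -> dict:
    """Check the documented options and report whether sitemap discovery applies."""
    max_items = actor_input.get("maxItems")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 1:
        raise ValueError("maxItems is required and must be a positive integer")

    start_date = actor_input.get("startDate")
    if start_date is not None:
        if not isinstance(start_date, str) or not START_DATE_RE.match(start_date):
            raise ValueError("startDate must use YYYY-MM format, e.g. 2024-01")

    start_urls = [entry["url"] for entry in actor_input.get("startUrls", [])]
    return {
        "maxItems": max_items,
        "startDate": start_date,
        "startUrls": start_urls,
        # Supplying startUrls skips sitemap discovery, as described above.
        "useSitemap": not start_urls,
    }
```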
How It Works
Discovery uses the CBS News sitemap index at cbsnews.com/xml-sitemap/index.xml. Monthly article sitemaps (article-YYYY-MM.xml) are walked in order, newest first. Each sitemap lists 3,000+ news articles; only URLs matching the transcript patterns are fetched.
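The newest-first walk with an optional startDate cutoff might look like the sketch below. It assumes the index uses the standard sitemaps.org namespace and that monthly files end in `article-YYYY-MM.xml`; the function name `monthly_sitemaps` is hypothetical, and the code operates on an XML string so that fetching is left to the caller.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional

# Assumption: the index follows the standard sitemap protocol namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MONTHLY_RE = re.compile(r"article-(\d{4}-\d{2})\.xml$")


def monthly_sitemaps(index_xml: str, start_date: Optional[str] = None) -> List[str]:
    """Return monthly article sitemap URLs, newest first, from start_date onwards."""
    root = ET.fromstring(index_xml)
    found = []
    for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        match = MONTHLY_RE.search(url)
        if not match:
            continue  # not a monthly article sitemap
        month = match.group(1)
        # YYYY-MM strings sort chronologically, so string comparison is enough.
        if start_date is None or month >= start_date:
            found.append((month, url))
    found.sort(reverse=True)
    return [url for _, url in found]
```

Each returned sitemap would then be fetched and its URLs passed through the transcript pattern filter, stopping once maxItems records have been collected.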
Metadata is parsed from the JSON-LD NewsArticle blocks present on every CBS article page, which give a reliable correspondent name, publish date, and keywords. The transcript body lives in <section class="content__body"> as a sequence of <p> tags. Speaker labels are extracted from paragraph-leading Name: patterns. Ad wrappers are stripped before body extraction.
CBS News pages are server-rendered behind a Varnish edge cache, and no bot protection was observed, so neither a proxy nor a headless browser is required.
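A minimal standard-library sketch of this parsing step follows. The class and function names are hypothetical, and ad-wrapper stripping is left out because the source does not specify the wrapper markup.

```python
import json
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect JSON-LD script bodies and <p> text inside section.content__body."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.jsonld_blocks = []
        self.paragraphs = []
        self._in_jsonld = False
        self._section_depth = 0  # > 0 while inside the transcript section
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld, self._buf = True, []
        elif tag == "section":
            classes = (attrs.get("class") or "").split()
            if self._section_depth or "content__body" in classes:
                self._section_depth += 1
        elif tag == "p" and self._section_depth:
            self._in_p, self._buf = True, []

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buf))
            self._in_jsonld = False
        elif tag == "section" and self._section_depth:
            self._section_depth -= 1
        elif tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buf).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._in_jsonld or self._in_p:
            self._buf.append(data)


def parse_article(html: str):
    """Return (NewsArticle metadata dict, list of transcript paragraphs)."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    article = {}
    for raw in parser.jsonld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the page
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "NewsArticle":
                article = item
    return article, parser.paragraphs
```

The returned paragraphs keep inline text such as bolded speaker names, so they can be fed directly into a speaker-label extractor.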
Coverage Notes
60 Minutes airs approximately 45 episodes per US broadcast season, with 3-4 segments per episode. Roughly 50-70% of segments receive a published transcript; the remainder are video-only. This scraper covers transcript-bearing segments only and makes that boundary explicit in every record: is_transcript is always true, and video-only pages are skipped. The active transcript archive covers approximately 5 years back, with sparser coverage for earlier seasons.
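The figures above imply a rough range for how many transcripts a season yields, which can help when choosing maxItems. The calculation below is back-of-the-envelope arithmetic on the stated approximations, not a measured count, and the variable names are illustrative.

```python
# Approximate figures quoted in the coverage notes.
EPISODES_PER_SEASON = 45
SEGMENTS_PER_EPISODE = (3, 4)      # low and high estimates
TRANSCRIPT_SHARE_PCT = (50, 70)    # percent of segments with a transcript

# Integer percentages keep the arithmetic exact before the final division.
low = EPISODES_PER_SEASON * SEGMENTS_PER_EPISODE[0] * TRANSCRIPT_SHARE_PCT[0] / 100
high = EPISODES_PER_SEASON * SEGMENTS_PER_EPISODE[1] * TRANSCRIPT_SHARE_PCT[1] / 100

print(f"Estimated transcripts per season: {low} to {high}")
```

Under these assumptions a season yields roughly 68 to 126 transcripts.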
Pricing
Charged per transcript record scraped. Long-form interviews (5,000-30,000 words each) are priced at a modest premium reflecting their per-record research value versus wire-copy or short-form corpora.
