Instead of piling more and more tools onto my workflow, I look for ways to make the ones I already use work better. There are almost always hidden features, or some way to squeeze more value out of tools I already rely on. These days, the tool I use most is NotebookLM.
I rely on it extensively. I've paired it with practically every productivity tool and explored everything it has to offer. The one thing I hadn't done was extend it with a few simple Python scripts. So naturally, that's exactly what I tried, and it made my NotebookLM workflow far more powerful and efficient.
Web Scraper
Quickly pull content from any website
Unlike typical AI tools, NotebookLM is designed to help you interact with sources you upload to notebooks. This means the best way to use NotebookLM efficiently is by populating your notebooks with high-quality sources. Ultimately, a significant part of your workflow will likely consist of adding and organizing sources.
Something I've heard a lot of people say is that uploading sources as .txt files works better than using PDFs, since text files are easier to parse and search within NotebookLM. So, a script I've been relying on heavily is one that extracts the text from a list of URLs and saves each page as a .txt file!
The script fetches each web page, removes unwanted elements like scripts, styles, headers, footers, navigation, images, forms, and buttons, and then cleans up extra spaces and blank lines. You can add as many URLs as you want to the research_urls list, and it will clean each one and save it as a separate .txt file named after the page title.
import os
import re

import requests
from bs4 import BeautifulSoup


def clean_html_content(html_content):
    """Return the page's visible text with non-article elements removed."""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Drop elements that rarely carry article text.
    for element in soup(["script", "style", "header", "footer", "nav", "aside", "form", "button", "iframe", "img"]):
        element.decompose()
    text = soup.get_text()
    text = re.sub(r'\n\s*\n', '\n\n', text)  # collapse runs of blank lines
    text = re.sub(r' +', ' ', text)  # collapse repeated spaces
    return text.strip()


def scrape_and_save(urls):
    output_dir = "notebooklm_sources_web"
    os.makedirs(output_dir, exist_ok=True)
    print(f"Saving cleaned articles to the '{output_dir}' directory...")
    for i, url in enumerate(urls):
        try:
            print(f"Fetching: {url}")
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            clean_text = clean_html_content(response.content)
            # Use the page title as the filename, falling back to a numbered name
            # when the title is missing or empty.
            title_soup = BeautifulSoup(response.content, 'html.parser')
            page_title = title_soup.title.string if title_soup.title else None
            page_title = page_title or f"article_{i+1}"
            # Replace characters that are not allowed in filenames.
            filename_base = re.sub(r'[\\/:*?"<>|]', '_', page_title).strip()
            filename = f"{filename_base[:50].strip() or f'article_{i+1}'}.txt"
            output_path = os.path.join(output_dir, filename)
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(clean_text)
            print(f" -> Success: Saved as {filename}")
        except requests.exceptions.RequestException as e:
            print(f" -> Error fetching {url}: {e}")
        except Exception as e:
            print(f" -> An unexpected error occurred with {url}: {e}")


research_urls = [
    # Insert URLs here
    "https://www.xda-developers.com/proton-launches-excel-and-google-sheets-alternative/",
    "https://www.xda-developers.com/im-never-going-back-to-adobe-acrobat-after-mastering-free-open-source-tool/",
]

scrape_and_save(research_urls)
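Page titles make handy filenames, but they can contain characters that some file systems reject. Here is a minimal sketch of that one step as a standalone helper. The name safe_filename and its max_length parameter are hypothetical and not part of the script above, and the set of blocked characters is an assumption based on what Windows disallows.

```python
import re

# Assumption: block the characters Windows rejects, plus control characters.
_UNSAFE = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, fallback, max_length=50):
    """Turn a page title into a .txt filename, or use the fallback if nothing is left."""
    cleaned = re.sub(r"\s+", " ", title or "")  # newlines and tabs become single spaces
    cleaned = _UNSAFE.sub("_", cleaned).strip(" .")
    cleaned = cleaned[:max_length].strip(" .")
    return f"{cleaned or fallback}.txt"


print(safe_filename("Proton launches: a Sheets alternative?", "article_1"))
print(safe_filename(None, "article_2"))
```

Keeping this logic in one small function makes it easy to adjust the length cap or the blocked characters without touching the scraping loop.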
YouTube Transcript Scraper
Videos to text in seconds
Something I use NotebookLM extensively for is "watching" YouTube videos. It's one of my favorite ways to use NotebookLM, and lately I've been using a script to automatically extract a video's transcript into a clean .txt file. The same point as above applies: .txt files are much easier for NotebookLM to parse.
The script below can handle multiple languages and automatically groups the transcript into paragraphs so it's easy to read. It cleans up line breaks and extra whitespace, then saves the result as a neatly formatted .txt file. To use it, run the following command, replacing "link here" with the video URL:
python youtube_transcript.py "link here" -o transcript.txt
#!/usr/bin/env python3
import argparse
import re
import sys

try:
    from youtube_transcript_api import YouTubeTranscriptApi
    from youtube_transcript_api._errors import TranscriptsDisabled, NoTranscriptFound
except ImportError:
    print("Required package not found. Install with:")
    print(" pip install youtube-transcript-api")
    sys.exit(1)


def extract_video_id(url_or_id: str) -> str:
    """Accept a YouTube URL or a bare 11-character video ID."""
    patterns = [
        r'(?:youtube\.com\/watch\?v=|youtu\.be\/|youtube\.com\/embed\/)([a-zA-Z0-9_-]{11})',
        r'^([a-zA-Z0-9_-]{11})$'
    ]
    for pattern in patterns:
        match = re.search(pattern, url_or_id)
        if match:
            return match.group(1)
    raise ValueError(f"Could not extract video ID from: {url_or_id}")


def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS, or HH:MM:SS for videos over an hour."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    if hours > 0:
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def get_transcript(video_id: str, languages: list = None, include_timestamps: bool = False) -> str:
    api = YouTubeTranscriptApi()
    if languages:
        transcript_list = api.list(video_id)
        transcript = transcript_list.find_transcript(languages)
        transcript_data = transcript.fetch()
    else:
        try:
            transcript_data = api.fetch(video_id, languages=['en'])
        except NoTranscriptFound:
            # No English track: fall back to the first transcript that exists.
            # (Calling fetch() without languages would just request English again.)
            available = list(api.list(video_id))
            if not available:
                raise
            transcript_data = available[0].fetch()
    if include_timestamps:
        lines = []
        for entry in transcript_data:
            timestamp = format_timestamp(entry.start)
            text = entry.text.replace('\n', ' ')
            lines.append(f"[{timestamp}] {text}")
        return '\n'.join(lines)
    else:
        texts = [entry.text.replace('\n', ' ') for entry in transcript_data]
        paragraphs = []
        current_paragraph = []
        for text in texts:
            current_paragraph.append(text)
            combined = ' '.join(current_paragraph)
            # Start a new paragraph after roughly 100 words, or at a sentence
            # end once the paragraph holds more than three caption snippets.
            if len(combined.split()) > 100 or (text.rstrip().endswith(('.', '?', '!')) and len(current_paragraph) > 3):
                paragraphs.append(combined)
                current_paragraph = []
        if current_paragraph:
            paragraphs.append(' '.join(current_paragraph))
        return '\n\n'.join(paragraphs)


def get_video_info(video_id: str) -> dict:
    return {
        'video_id': video_id,
        'url': f"https://www.youtube.com/watch?v={video_id}"
    }


def main():
    parser = argparse.ArgumentParser(
        description='Download YouTube transcripts for NotebookLM',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python youtube_transcript.py "https://youtube.com/watch?v=VIDEO_ID"
  python youtube_transcript.py VIDEO_ID -o transcript.txt
  python youtube_transcript.py VIDEO_ID --timestamps
  python youtube_transcript.py VIDEO_ID --language es
"""
    )
    parser.add_argument('video', help='YouTube URL or video ID')
    parser.add_argument('-o', '--output', help='Output file (default: print to stdout)')
    parser.add_argument('--timestamps', action='store_true', help='Include timestamps in output')
    parser.add_argument('--language', '-l', help='Preferred language code (e.g., en, es, fr)')
    parser.add_argument('--list-languages', action='store_true', help='List available transcript languages')
    args = parser.parse_args()
    try:
        video_id = extract_video_id(args.video)
        api = YouTubeTranscriptApi()
        if args.list_languages:
            transcript_list = api.list(video_id)
            print("Available transcripts:")
            for transcript in transcript_list:
                auto = " (auto-generated)" if transcript.is_generated else ""
                print(f" {transcript.language_code}: {transcript.language}{auto}")
            return
        languages = [args.language] if args.language else None
        transcript = get_transcript(video_id, languages, args.timestamps)
        video_info = get_video_info(video_id)
        output = f"""# YouTube Video Transcript
**Video URL:** {video_info['url']}
**Video ID:** {video_info['video_id']}
---
{transcript}
"""
        if args.output:
            with open(args.output, 'w', encoding='utf-8') as f:
                f.write(output)
            print(f"Saved to {args.output}")
        else:
            print(output)
    except TranscriptsDisabled:
        print("Error: Transcripts are disabled for this video", file=sys.stderr)
        sys.exit(1)
    except NoTranscriptFound:
        print("Error: No transcript found for this video", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == '__main__':
    main()
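One thing to watch for: the URL pattern in the script expects v= to be the first query parameter, so a link like watch?feature=share&v=... would not match. Below is a rough sketch of a more forgiving parser built on the standard library. The function name video_id_from_url is hypothetical, the /shorts/ and /live/ path forms are assumptions about links you might paste in, and abcdefghijk is a placeholder ID.

```python
import re
from urllib.parse import parse_qs, urlparse

_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")


def video_id_from_url(url_or_id):
    """Return the 11-character video ID, or raise ValueError if none is found."""
    candidate = url_or_id.strip()
    if _ID.match(candidate):
        return candidate  # already a bare ID
    # urlparse needs a scheme to recognise the host, so add one if missing.
    parsed = urlparse(candidate if "://" in candidate else f"https://{candidate}")
    host = (parsed.hostname or "").lower()
    parts = [p for p in parsed.path.split("/") if p]
    if host == "youtu.be" and parts:
        candidate = parts[0]
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            # parse_qs finds v= wherever it appears in the query string.
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        elif len(parts) >= 2 and parts[0] in {"embed", "shorts", "live"}:
            candidate = parts[1]
    if _ID.match(candidate):
        return candidate
    raise ValueError(f"Could not extract video ID from: {url_or_id}")


print(video_id_from_url("https://www.youtube.com/watch?feature=share&v=abcdefghijk"))
```

Parsing the URL instead of pattern-matching the whole string also avoids accepting look-alike links from other domains.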
File Splitter
From large files to manageable chunks
One of NotebookLM's biggest problems is its source limit. At the time of writing, each source can have a maximum of 500,000 words. One thing I've noticed is that having overly lengthy sources also slows down searches and seems to cause NotebookLM to struggle with parsing effectively.
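To see how close a file is to that limit before uploading it, a quick word count is enough. This is a minimal sketch: the helper name limit_usage is hypothetical, and it counts whitespace-separated words, which is an assumption and may not match exactly how NotebookLM counts.

```python
WORD_LIMIT = 500_000  # per-source word limit quoted above


def limit_usage(text, limit=WORD_LIMIT):
    """Return (word_count, fraction_of_limit) for a block of text."""
    words = len(text.split())  # whitespace-separated tokens
    return words, words / limit


sample = "NotebookLM works best with focused sources. " * 1000
count, fraction = limit_usage(sample)
print(f"{count:,} words ({fraction:.1%} of the limit)")
```

For a real file, read it with open(path, encoding="utf-8") and pass the contents to limit_usage.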
Instead of splitting files manually, I use a script to automatically break large text files into smaller, manageable chunks. It caps each chunk at 400,000 characters, which keeps every file well below the 500,000-word limit. The script below works on .txt, .csv, .md, and .log files placed in a files_to_split folder, and saves each chunk to a notebooklm_sources_split folder as a new file with _Part1, _Part2, etc., appended to the original filename for easy uploading into NotebookLM.
import os

MAX_CHARS_PER_FILE = 400000
INPUT_DIR = "files_to_split"
OUTPUT_DIR = "notebooklm_sources_split"


def split_file(input_path, output_dir):
    filename = os.path.basename(input_path)
    base_name, ext = os.path.splitext(filename)
    # Only plain-text formats are handled.
    if ext.lower() not in ['.txt', '.csv', '.md', '.log']:
        return
    try:
        with open(input_path, 'r', encoding='utf-8') as f:
            content = f.read()
    except (OSError, UnicodeDecodeError) as e:
        print(f"Skipping {filename}: {e}")
        return
    # Note: splitting on whitespace and re-joining with single spaces
    # discards the original line breaks.
    words = content.split()
    current_chunk = []
    current_char_count = 0
    file_count = 1
    for word in words:
        # Write out the current chunk once the next word would exceed the cap.
        if current_chunk and current_char_count + len(word) + 1 > MAX_CHARS_PER_FILE:
            chunk_content = ' '.join(current_chunk)
            output_filename = f"{base_name}_Part{file_count}{ext}"
            output_path = os.path.join(output_dir, output_filename)
            with open(output_path, 'w', encoding='utf-8') as out_f:
                out_f.write(chunk_content)
            current_chunk = [word]
            current_char_count = len(word)
            file_count += 1
        else:
            current_chunk.append(word)
            current_char_count += len(word) + 1
    # Write whatever is left as the final part.
    if current_chunk:
        chunk_content = ' '.join(current_chunk)
        output_filename = f"{base_name}_Part{file_count}{ext}"
        output_path = os.path.join(output_dir, output_filename)
        with open(output_path, 'w', encoding='utf-8') as out_f:
            out_f.write(chunk_content)


def process_directory(input_dir, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    if not os.path.exists(input_dir):
        # First run: create the input folder so files can be dropped into it.
        os.makedirs(input_dir)
        print(f"Created '{input_dir}'. Add files to it and run the script again.")
        return
    for filename in os.listdir(input_dir):
        input_path = os.path.join(input_dir, filename)
        if os.path.isfile(input_path):
            split_file(input_path, output_dir)


process_directory(INPUT_DIR, OUTPUT_DIR)
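Because the script splits on whitespace and joins the words back with single spaces, line breaks are lost, which matters for .csv rows and Markdown structure. If you want to keep them, here is a minimal sketch of a variant that only breaks at line ends. It is a different technique from the script above, and the function name split_on_lines is hypothetical.

```python
def split_on_lines(text, max_chars=400_000):
    """Yield chunks of at most max_chars characters, breaking at line ends where possible."""
    chunk, size = [], 0
    for line in text.splitlines(keepends=True):
        # A single line longer than the cap is cut into fixed-size pieces.
        pieces = [line[i:i + max_chars] for i in range(0, len(line), max_chars)]
        for piece in pieces:
            if chunk and size + len(piece) > max_chars:
                yield "".join(chunk)
                chunk, size = [], 0
            chunk.append(piece)
            size += len(piece)
    if chunk:
        yield "".join(chunk)


rows = "name,score\n" + "alice,10\n" * 5
for number, part in enumerate(split_on_lines(rows, max_chars=30), start=1):
    print(f"Part {number}: {len(part)} characters")
```

Joining the chunks back together reproduces the original text exactly, so nothing is dropped or reformatted along the way.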
File Format Converter
No more online converters
One task I find myself doing constantly when using NotebookLM is opening random online file converters to get my documents into the right format. Converting a file with an online tool is simple and doesn't take much time, but having to open the tool, upload the file, wait for the conversion, and then download the result breaks my workflow and adds unnecessary friction.
The script below can convert files to and from common formats like TXT, Markdown, DOCX, HTML, and even PDFs. Everything happens right on your computer and takes literally seconds. It also supports batch conversions of multiple files at once and can create an output folder if needed.
#!/usr/bin/env python3
import argparse
import re
import sys
from pathlib import Path

DOCX_AVAILABLE = False
MARKDOWNIFY_AVAILABLE = False
PDF_AVAILABLE = False
try:
    from docx import Document
    DOCX_AVAILABLE = True
except ImportError:
    pass
try:
    from markdownify import markdownify
    MARKDOWNIFY_AVAILABLE = True
except ImportError:
    pass
try:
    import PyPDF2
    PDF_AVAILABLE = True
except ImportError:
    pass

def txt_to_md(text, title=None):
    lines = text.split('\n')
    result = []
    if title:
        result.append(f"# {title}\n")
    for line in lines:
        line = line.rstrip()
        if line.isupper() and 3
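To give a sense of how a local batch conversion can work, here is a minimal sketch that uses only the standard library to turn a folder of HTML files into .txt files. It is not the script above: html_to_text, convert_folder, and the folder names in the usage note are hypothetical, and it puts each text fragment on its own line rather than reconstructing paragraphs.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def convert_folder(src_dir, out_dir):
    """Convert every .html file in src_dir to a .txt file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    written = []
    for src in sorted(Path(src_dir).glob("*.html")):
        target = out / f"{src.stem}.txt"
        target.write_text(html_to_text(src.read_text(encoding="utf-8")), encoding="utf-8")
        written.append(target)
    return written


print(html_to_text("<h1>Title</h1><script>var x = 1;</script><p>Hello</p>"))
```

Calling convert_folder("html_in", "txt_out") would process a whole folder in one go, with both folder names being placeholders.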
I wish I had started using these Python scripts sooner
I've been using NotebookLM since it launched, and I'm surprised it took me so long to begin using Python scripts to speed up my workflow. Nonetheless, I'm glad I did!
