URL: https://dev.to/aairom/markdown-everything-my-new-personal-project-htmlurl-to-markdown-converter-a2c

Markdown Everything: My New Personal Project: HTML/URL to Markdown Converter - DEV Community


Simplifying Documentation using IBM Bob to Create My New Personal Project: HTML/URL to Markdown Converter!

Introduction (and spoiler)

I know, I know: take a deep breath and try to remain calm. This project features exactly zero autonomous AI agents, no sophisticated RAG pipelines, and (brace yourselves) not a single "groundbreaking" LLM integration to disrupt the industry. I'm truly sorry to disappoint the hype train, but as often happens when I'm left to my own devices, I had a burning need for a tool that actually does something simple. So, instead of building a sentient toaster, I decided to create my own URL-to-Markdown converter for Chrome. It's just a humble extension designed to solve a real problem, though, in a classic display of "do as I say, not as I do," I naturally just sat back and made Bob do all the actual manual labor to realize it for me. Simple, right?


The Implementation

The application is structured as a modular utility that can handle both local files and live web URLs, converting messy HTML into clean, standardized Markdown. The core logic relies on the Turndown library, enhanced by custom rules for GitHub-Flavored Markdown (GFM).

Core Conversion Engine

At the heart of every script lies the TurndownService. The logic follows a consistent pipeline:

  • Initialization: Configures headingStyle (ATX vs Setext) and codeBlockStyle (fenced).
  • Plugin Integration: Uses turndown-plugin-gfm to ensure tables and task lists are preserved.
  • Custom Rules: Specifically targets elements like <pre> tags to ensure syntax highlighting is maintained in the output.
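
To make the custom-rule idea concrete, here is a minimal sketch of the kind of replacement logic a <pre> rule could use. The helper name toFencedCodeBlock and the "language-xxx" class convention are my assumptions, not code from the project; in Turndown, a function like this would typically be called from the replacement callback of a rule registered with addRule.

```javascript
// Hypothetical helper: builds a fenced Markdown code block from the text of a
// <pre><code> element and its class attribute (for example "hljs language-js").
function toFencedCodeBlock(code, className = '') {
  // Assumption: the highlighter marks the language as "language-xxx" or "lang-xxx".
  const match = className.match(/(?:^|\s)lang(?:uage)?-([\w+#-]+)/);
  const language = match ? match[1] : '';

  // Pick a fence longer than any backtick run inside the code, so the block
  // cannot be closed early by its own content.
  const runs = code.match(/`+/g) || [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  const fence = '`'.repeat(Math.max(3, longest + 1));

  // Drop one trailing newline so the closing fence sits directly under the code.
  return `\n\n${fence}${language}\n${code.replace(/\n$/, '')}\n${fence}\n\n`;
}

console.log(toFencedCodeBlock('const x = 1;\n', 'hljs language-js'));
```

Keeping the language tag on the opening fence is what lets Markdown renderers re-apply syntax highlighting later.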
#!/usr/bin/env node

/**
 * Node.js CLI script to convert HTML files to Markdown
 * Usage: node scripts/convert-file.js <input-file> [output-file]
 */

const fs = require('fs');
const path = require('path');

// Load the converter
const HtmlToMarkdown = require('../src/html-to-markdown.js');

// Verify JSDOM is available for Node.js
try {
 require('jsdom');
} catch (e) {
 console.error('Error: JSDOM is required for Node.js usage.');
 console.error('Install it with: npm install jsdom');
 process.exit(1);
}

// Parse command line arguments
const args = process.argv.slice(2);

if (args.length === 0 || args.includes('--help') || args.includes('-h')) {
 console.log(`
HTML to Markdown Converter - CLI Tool

Usage:
 node scripts/convert-file.js <input-file> [output-file]
 node scripts/convert-file.js --help

Arguments:
 input-file Path to the HTML file to convert
 output-file (Optional) Path for the output Markdown file
 If not provided, will create a timestamped file in output/

Options:
 --help, -h Show this help message

Examples:
 node scripts/convert-file.js input/sample.html
 node scripts/convert-file.js input/sample.html output/result.md
 node scripts/convert-file.js page.html
 `);
 process.exit(0);
}

const inputFile = args[0];
let outputFile = args[1];

// Check if input file exists
if (!fs.existsSync(inputFile)) {
 console.error(`Error: Input file '${inputFile}' not found.`);
 process.exit(1);
}

// Read the HTML file
console.log(`Reading HTML from: ${inputFile}`);
const html = fs.readFileSync(inputFile, 'utf8');

// Create converter instance
const converter = new HtmlToMarkdown({
 headingStyle: 'atx',
 codeBlockStyle: 'fenced',
 bulletListMarker: '-',
 strongDelimiter: '**',
 emDelimiter: '_'
});

// Convert to Markdown
console.log('Converting HTML to Markdown...');
const markdown = converter.convert(html);

// Generate output filename if not provided
if (!outputFile) {
 const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
 const inputBasename = path.basename(inputFile, path.extname(inputFile));
 outputFile = path.join('output', `${inputBasename}-${timestamp}.md`);

 // Ensure output directory exists
 const outputDir = path.dirname(outputFile);
 if (!fs.existsSync(outputDir)) {
 fs.mkdirSync(outputDir, { recursive: true });
 }
}

// Write the Markdown file
fs.writeFileSync(outputFile, markdown, 'utf8');

console.log(`✓ Conversion successful!`);
console.log(`Output saved to: ${outputFile}`);
console.log(`File size: ${markdown.length} characters`);

// Made with Bob

Data Acquisition Logic

The application handles three distinct entry points:

| Entry Point | Logic Flow |
| --------------- | ------------------------------------------------------------ |
| **Local File** | Reads raw HTML from the filesystem using `fs.readFileSync`. |
| **Simple URL** | Fetches remote HTML via `axios`, then parses the DOM. |
| **Complex URL** | Uses **JSDOM** to simulate a browser environment, allowing for cleaner extraction of titles and metadata before conversion. |
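
As a rough illustration of how a single command could choose between these entry points, the sketch below classifies an argument as a URL or a local path. The function classifyInput and its return shape are hypothetical; the project itself ships separate scripts per entry point.

```javascript
// Hypothetical dispatcher: decides which entry point an argument belongs to.
function classifyInput(input) {
  let parsed;
  try {
    parsed = new URL(input);
  } catch {
    // Relative paths such as "input/sample.html" are not valid absolute URLs.
    return { kind: 'file', path: input };
  }

  if (parsed.protocol === 'http:' || parsed.protocol === 'https:') {
    return { kind: 'url', url: parsed.href };
  }

  // Other "schemes" (including Windows drive letters like "C:") are treated
  // as file paths and left for the filesystem check to accept or reject.
  return { kind: 'file', path: input };
}

console.log(classifyInput('https://example.com'));
console.log(classifyInput('input/sample.html'));
```

Checking the protocol explicitly matters because new URL() on its own accepts schemes other than http and https.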

Image Handling & Asset Management

As detailed in Image-Handling.md, the tool doesn't just convert text; it manages media context:

  • Absolute URL Resolution: Logic converts relative paths (/img/photo.png) into absolute URLs based on the source domain.
  • Alt-Text Preservation: Ensures accessibility by mapping alt attributes to Markdown image syntax: ![alt](url).
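
Both bullets can be sketched with the built-in URL class. This is an illustration under my own assumptions (the helper name toMarkdownImage and the bracket escaping are not taken from Image-Handling.md):

```javascript
// Hypothetical helper: turns an <img> src/alt pair into Markdown image syntax,
// resolving relative paths against the URL of the page they came from.
function toMarkdownImage(src, alt, pageUrl) {
  // new URL(relative, base) handles "/img/a.png", "a.png" and absolute URLs alike.
  const absolute = new URL(src, pageUrl).href;

  // Escape square brackets so the alt text cannot break the ![alt](url) syntax.
  const safeAlt = (alt || '').replace(/[\[\]]/g, '\\$&');

  return `![${safeAlt}](${absolute})`;
}

console.log(toMarkdownImage('/img/photo.png', 'A photo', 'https://example.com/blog/post'));
// ![A photo](https://example.com/img/photo.png)
```

Passing the page URL rather than just the domain means document-relative paths (photo.png) resolve correctly as well as root-relative ones (/img/photo.png).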
#!/usr/bin/env node

/**
 * Node.js CLI script to convert HTML from a URL to Markdown
 * Usage: node scripts/convert-url.js <url> [output-file]
 */

const fs = require('fs');
const path = require('path');
const https = require('https');
const http = require('http');

// Load the converter
const HtmlToMarkdown = require('../src/html-to-markdown.js');

// Verify JSDOM is available
try {
 require('jsdom');
} catch (e) {
 console.error('Error: JSDOM is required for Node.js usage.');
 console.error('Install it with: npm install jsdom');
 process.exit(1);
}

// Parse command line arguments
const args = process.argv.slice(2);

if (args.length === 0 || args.includes('--help') || args.includes('-h')) {
 console.log(`
HTML to Markdown Converter - URL Fetcher

Usage:
 node scripts/convert-url.js <url> [output-file]
 node scripts/convert-url.js --help

Arguments:
 url URL of the webpage to convert
 output-file (Optional) Path for the output Markdown file
 If not provided, will create a timestamped file in output/

Options:
 --help, -h Show this help message

Examples:
 node scripts/convert-url.js https://example.com
 node scripts/convert-url.js https://example.com output/example.md
 node scripts/convert-url.js https://github.com/user/repo/blob/main/README.md
 `);
 process.exit(0);
}

const url = args[0];
let outputFile = args[1];

// Validate URL
try {
 new URL(url);
} catch (e) {
 console.error(`Error: Invalid URL '${url}'`);
 console.error('Please provide a valid URL starting with http:// or https://');
 process.exit(1);
}

/**
 * Fetch HTML content from URL
 */
function fetchUrl(url) {
 return new Promise((resolve, reject) => {
 const protocol = url.startsWith('https') ? https : http;

 console.log(`Fetching content from: ${url}`);

 const request = protocol.get(url, {
 headers: {
 'User-Agent': 'Mozilla/5.0 (compatible; HTML-to-Markdown-Converter/1.0)'
 }
 }, (response) => {
 // Handle redirects
 if (response.statusCode >= 300 && response.statusCode < 400 && response.headers.location) {
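 // The Location header may be relative, so resolve it against the current URL
 const redirectUrl = new URL(response.headers.location, url).href;
 console.log(`Following redirect to: ${redirectUrl}`);
 fetchUrl(redirectUrl).then(resolve).catch(reject);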
 return;
 }

 if (response.statusCode !== 200) {
 reject(new Error(`HTTP ${response.statusCode}: ${response.statusMessage}`));
 return;
 }

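 // Decode as UTF-8 so multi-byte characters split across chunks stay intact
 response.setEncoding('utf8');
 let data = '';
 response.on('data', chunk => data += chunk);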
 response.on('end', () => resolve(data));
 });

 request.on('error', reject);
 request.setTimeout(30000, () => {
 request.destroy();
 reject(new Error('Request timeout after 30 seconds'));
 });
 });
}

/**
 * Extract domain name from URL for filename
 */
function getDomainName(url) {
 try {
 const urlObj = new URL(url);
 return urlObj.hostname.replace(/^www\./, '');
 } catch (e) {
 return 'webpage';
 }
}

// Main execution
(async () => {
 try {
 // Fetch HTML from URL
 const html = await fetchUrl(url);

 if (!html || html.trim().length === 0) {
 console.error('Error: No content received from URL');
 process.exit(1);
 }

 console.log(`✓ Content fetched (${html.length} characters)`);

 // Create converter instance
 const converter = new HtmlToMarkdown({
 headingStyle: 'atx',
 codeBlockStyle: 'fenced',
 bulletListMarker: '-',
 strongDelimiter: '**',
 emDelimiter: '_'
 });

 // Convert to Markdown
 console.log('Converting HTML to Markdown...');
 const markdown = converter.convert(html);

 // Generate output filename if not provided
 if (!outputFile) {
 const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
 const domain = getDomainName(url);
 outputFile = path.join('output', `${domain}-${timestamp}.md`);

 // Ensure output directory exists
 const outputDir = path.dirname(outputFile);
 if (!fs.existsSync(outputDir)) {
 fs.mkdirSync(outputDir, { recursive: true });
 }
 }

 // Write the Markdown file
 fs.writeFileSync(outputFile, markdown, 'utf8');

 console.log(`✓ Conversion successful!`);
 console.log(`Output saved to: ${outputFile}`);
 console.log(`File size: ${markdown.length} characters`);
 console.log(`\nPreview (first 500 characters):`);
 console.log('─'.repeat(60));
 console.log(markdown.substring(0, 500) + (markdown.length > 500 ? '...' : ''));
 console.log('─'.repeat(60));

 } catch (error) {
 console.error(`\n✗ Error: ${error.message}`);

 if (error.code === 'ENOTFOUND') {
 console.error('Could not resolve hostname. Please check the URL and your internet connection.');
 } else if (error.code === 'ECONNREFUSED') {
 console.error('Connection refused. The server may be down or unreachable.');
 } else if (error.message.includes('timeout')) {
 console.error('Request timed out. The server may be slow or unresponsive.');
 }

 process.exit(1);
 }
})();

// Made with Bob
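
One thing worth guarding in a recursive redirect follower like fetchUrl is a redirect loop. The sketch below shows one possible way to resolve the Location header and cap the number of hops; resolveRedirect, the hops counter, and the limit of 5 are my assumptions, not part of the script above.

```javascript
// Hypothetical helper: computes the next URL to fetch after a 3xx response,
// refusing to continue once a hop limit is reached.
function resolveRedirect(currentUrl, locationHeader, hops, maxHops = 5) {
  if (hops >= maxHops) {
    throw new Error(`Too many redirects (limit ${maxHops})`);
  }
  // Location may be absolute or relative; resolving against the current URL
  // covers both cases.
  return new URL(locationHeader, currentUrl).href;
}

console.log(resolveRedirect('https://example.com/a/b', '/login', 0));
// https://example.com/login
```

A caller would pass hops + 1 on each recursive fetch, so a page that redirects to itself fails fast instead of recursing until the timeout.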

Application Workflow

The following logic flow governs the transition from a URL to a downloaded Markdown file:

  • Request Phase: User provides a URL or file path.
  • Extraction Phase: The system fetches the content and isolates the main content body (stripping headers and footers where such logic applies).
  • Transformation Phase: HTML tags are mapped to Markdown equivalents (e.g., <h1> becomes #), and code blocks are wrapped in triple backticks.
  • Finalization: Metadata (Title, Date, Source URL) is prepended as Front Matter, and the resulting string is saved as a .md file.
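
The finalization step could look roughly like the sketch below. The field names (title, source, date) and the helper withFrontMatter are assumptions on my part; the point is only to show metadata being prepended as a YAML block before the Markdown body.

```javascript
// Hypothetical helper: prepends YAML front matter to a Markdown string.
function withFrontMatter(markdown, { title, sourceUrl, date = new Date() }) {
  const header = [
    '---',
    // JSON.stringify yields double-quoted, escaped strings, which YAML parsers
    // generally accept; that keeps colons and quotes in titles safe.
    `title: ${JSON.stringify(title)}`,
    `source: ${JSON.stringify(sourceUrl)}`,
    `date: ${date.toISOString()}`,
    '---',
    '',
  ].join('\n');

  // One blank line separates the front matter from the document body.
  return header + '\n' + markdown;
}

console.log(withFrontMatter('# Hello', {
  title: 'Example: a "quoted" title',
  sourceUrl: 'https://example.com/post',
}));
```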

Last but not Least: Extension for a Browser

The extension follows the standard Chrome "Manifest V3" architecture, distributed across three main functional areas:

The Manifest (Orchestrator)

  • File: manifest.json
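  • Role: Defines the permissions and entry points. It specifically requests activeTab access to read the page content and the downloads permission to save the final .md file to your computer. Because popup.js calls chrome.scripting.executeScript, the scripting permission is needed as well.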

The Popup (User Interface)

  • Files: popup.html, popup.js
/**
 * Chrome Extension Popup Script
 * Handles user interactions and conversion logic
 */

let currentMarkdown = '';

// Initialize when popup opens
document.addEventListener('DOMContentLoaded', () => {
 const convertBtn = document.getElementById('convert-btn');
 const copyBtn = document.getElementById('copy-btn');
 const downloadBtn = document.getElementById('download-btn');
 const markdownOutput = document.getElementById('markdown-output');
 const headingStyle = document.getElementById('heading-style');
 const codeStyle = document.getElementById('code-style');

 // Convert button click
 convertBtn.addEventListener('click', async () => {
 try {
 showStatus('Converting...', 'info');
 convertBtn.disabled = true;

 // Get the active tab
 const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });

 // Execute script in the page to get HTML
 const results = await chrome.scripting.executeScript({
 target: { tabId: tab.id },
 func: getPageHTML // "func" replaces the deprecated "function" key
 });

 if (results && results[0] && results[0].result) {
 const html = results[0].result;

 // Convert HTML to Markdown
 const converter = new HtmlToMarkdown({
 headingStyle: headingStyle.value,
 codeBlockStyle: codeStyle.value
 });

 currentMarkdown = converter.convert(html);
 markdownOutput.value = currentMarkdown;

 // Enable buttons
 copyBtn.disabled = false;
 downloadBtn.disabled = false;

 showStatus('✓ Conversion successful!', 'success');
 } else {
 throw new Error('Could not retrieve page content');
 }
 } catch (error) {
 console.error('Conversion error:', error);
 showStatus('✗ Error: ' + error.message, 'error');
 } finally {
 convertBtn.disabled = false;
 }
 });

 // Copy button click
 copyBtn.addEventListener('click', async () => {
 try {
 await navigator.clipboard.writeText(currentMarkdown);
 showStatus('✓ Copied to clipboard!', 'success');

 // Visual feedback
 const originalText = copyBtn.textContent;
 copyBtn.textContent = '✓ Copied!';
 setTimeout(() => {
 copyBtn.textContent = originalText;
 }, 2000);
 } catch (error) {
 console.error('Copy error:', error);
 showStatus('✗ Failed to copy', 'error');
 }
 });

 // Download button click
 downloadBtn.addEventListener('click', () => {
 try {
 const blob = new Blob([currentMarkdown], { type: 'text/markdown' });
 const url = URL.createObjectURL(blob);
 const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
 const filename = `converted-${timestamp}.md`;

 const a = document.createElement('a');
 a.href = url;
 a.download = filename;
 document.body.appendChild(a);
 a.click();
 document.body.removeChild(a);
 URL.revokeObjectURL(url);

 showStatus('✓ Downloaded!', 'success');
 } catch (error) {
 console.error('Download error:', error);
 showStatus('✗ Failed to download', 'error');
 }
 });

 // Update conversion when options change
 headingStyle.addEventListener('change', () => {
 if (currentMarkdown) {
 convertBtn.click();
 }
 });

 codeStyle.addEventListener('change', () => {
 if (currentMarkdown) {
 convertBtn.click();
 }
 });
});

/**
 * Function to be executed in the page context
 * Gets the HTML content of the page
 */
function getPageHTML() {
 // Get the main content, preferring article or main tags
 const article = document.querySelector('article');
 const main = document.querySelector('main');
 const body = document.body;

 // Return the most relevant content
 if (article) {
 return article.outerHTML;
 } else if (main) {
 return main.outerHTML;
 } else {
 return body.innerHTML;
 }
}

/**
 * Show status message
 */
function showStatus(message, type) {
 const status = document.getElementById('status');
 status.textContent = message;
 status.className = `status ${type}`;
 status.style.display = 'block';

 // Auto-hide after 3 seconds for success messages
 if (type === 'success') {
 setTimeout(() => {
 status.style.display = 'none';
 }, 3000);
 }
}

// Made with Bob
  • Logic: This is the "control center."
  • Configuration: It allows users to toggle settings like "Heading Style" (ATX vs. Setext) and "Code Block Style" (Fenced vs. Indented) directly from a dropdown.
  • Communication: When the "Convert" button is clicked, popup.js runs getPageHTML in the active tab through chrome.scripting.executeScript to grab the page data.
  • Trigger: Once the Markdown is ready, the "Download" button creates a Blob and triggers a download through a temporary link.
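
The download handler names every file converted-<timestamp>.md. If you would rather have the page title in the filename, something like the following sketch could work. The helper titleToFilename and the 60-character cap are my own assumptions; the timestamp format is the same one the scripts already use.

```javascript
// Hypothetical helper: derives a filesystem-friendly .md filename from a page title.
function titleToFilename(title, now = new Date()) {
  const slug = (title || '')
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '') // strip combining accents (é -> e)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')     // collapse everything else into dashes
    .slice(0, 60)                    // keep filenames reasonably short
    .replace(/^-+|-+$/g, '');        // trim leading/trailing dashes

  // Same timestamp shape as the CLI scripts: 2024-01-02T03-04-05
  const timestamp = now.toISOString().replace(/[:.]/g, '-').slice(0, -5);

  return `${slug || 'converted'}-${timestamp}.md`;
}

console.log(titleToFilename('Markdown Everything: HTML/URL'));
```

In the popup, the title could come from the tab object returned by chrome.tabs.query, which is already fetched before conversion.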

The Content Script (The "Brain")

  • File: content.js
  • Logic: This script lives inside the webpage you are viewing.
  • Extraction: It scrapes the current DOM; in the listing below it answers getHTML messages with document.documentElement.outerHTML.
  • Transformation: It bundles the Turndown library logic to convert that HTML string into Markdown on the fly.
  • Metadata: It automatically grabs the page <title> and URL to create a header for your document.
/**
 * Content Script
 * Runs in the context of web pages
 * Can be used for additional features like context menu conversion
 */

// Listen for messages from the extension
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
 if (request.action === 'getHTML') {
 // Get the HTML content
 const html = document.documentElement.outerHTML;
 sendResponse({ html });
 }
 return true;
});

// Optional: Add keyboard shortcut for quick conversion
document.addEventListener('keydown', (e) => {
 // Ctrl+Shift+M or Cmd+Shift+M to trigger conversion
 if ((e.ctrlKey || e.metaKey) && e.shiftKey && e.key === 'M') {
 e.preventDefault();
 // Open the extension popup programmatically
 chrome.runtime.sendMessage({ action: 'openPopup' });
 }
});

console.log('HTML to Markdown extension loaded');

// Made with Bob

Technical Stack Summary

  • Runtime: Node.js
  • Parsing: jsdom (for DOM manipulation)
  • Conversion: turndown + turndown-plugin-gfm
  • HTTP Client: axios
  • CLI Assets: create-icons.sh (for generating Chrome Extension icons)

Note from Bob: The logic is designed to be "plug-and-play." Whether you are running a script in the terminal or clicking the extension button, the underlying conversion rules remain identical to ensure consistent output quality.
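
One way to keep the rules identical across entry points is a single shared options module. The sketch below is an assumption about how that could be organized (DEFAULT_OPTIONS and resolveOptions are hypothetical names); the default values are the ones the CLI scripts pass to the converter, and the allowed values are the ones offered in the popup.

```javascript
// Hypothetical shared defaults, mirroring the options used by the CLI scripts.
const DEFAULT_OPTIONS = Object.freeze({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced',
  bulletListMarker: '-',
  strongDelimiter: '**',
  emDelimiter: '_',
});

// Values the popup dropdowns offer for the two user-facing settings.
const ALLOWED_VALUES = {
  headingStyle: ['atx', 'setext'],
  codeBlockStyle: ['fenced', 'indented'],
};

// Merges user overrides onto the defaults and rejects unknown style names.
function resolveOptions(overrides = {}) {
  const options = { ...DEFAULT_OPTIONS, ...overrides };
  for (const [key, allowed] of Object.entries(ALLOWED_VALUES)) {
    if (!allowed.includes(options[key])) {
      throw new RangeError(`${key} must be one of: ${allowed.join(', ')}`);
    }
  }
  return options;
}

console.log(resolveOptions({ headingStyle: 'setext' }));
```

With this in place, the popup could pass only its two dropdown values and still end up with the same bullet and emphasis markers as the CLI.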


Conclusion

To wrap this all up, there is a distinct, borderline-obsessive satisfaction in using a tool where you know every single line of the source code. Sure, there are dozens of converters out there, but this one is my precious code, and that makes it inherently superior.

The real kicker? The entire transition from "I have a burning need" to a fully functional Chrome extension and CLI suite happened in less than 30 minutes. By offloading the heavy lifting to Bob (who, per my strict instructions, ensured everything was backed by unit tests), the development cycle was essentially at warp speed.

>>> Thanks for reading <<<


Links