
URL: https://dev.to/murroughfoley/how-to-use-rs-trafilatura-with-crawl4ai-3nfd

How to Use rs-trafilatura with crawl4ai


crawl4ai is an async web crawler built for producing LLM-friendly output. By default, it converts pages to Markdown using its own scraping pipeline. But if you want page-type-aware content extraction with quality scoring, you can swap in rs-trafilatura as the extraction strategy.

This tutorial shows how to set that up.

Install

pip install rs-trafilatura crawl4ai

If this is your first time with crawl4ai, you also need Playwright browsers:

python -m playwright install chromium

Basic Usage

rs-trafilatura provides RsTrafilaturaStrategy, a drop-in replacement for crawl4ai's built-in extraction strategies:

import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

async def main():
    strategy = RsTrafilaturaStrategy()
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

    data = json.loads(result.extracted_content)
    item = data[0]

    print(f"Title: {item['title']}")
    print(f"Page type: {item['page_type']}")
    print(f"Quality: {item['extraction_quality']}")
    print(f"Content: {item['main_content'][:200]}")

asyncio.run(main())

The extracted content is a JSON array with one item containing the extraction result. crawl4ai serialises it automatically, so you only need to call json.loads() on the extracted_content field.
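If a crawl fails or produces nothing, extracted_content may be None or empty, and calling json.loads() on it directly would raise. Here is a minimal defensive sketch, assuming the JSON-array shape described above. The name first_item is a hypothetical helper, not part of either library:

```python
import json
from typing import Any, Optional


def first_item(extracted_content: Optional[str]) -> Optional[dict[str, Any]]:
    """Return the first extraction result, or None if nothing usable came back.

    Hypothetical helper: assumes extracted_content is a JSON array of dicts.
    """
    if not extracted_content:
        return None
    try:
        data = json.loads(extracted_content)
    except json.JSONDecodeError:
        return None
    # Only accept the documented shape: a non-empty list whose first entry is a dict.
    if isinstance(data, list) and data and isinstance(data[0], dict):
        return data[0]
    return None
```

In the example above, item = first_item(result.extracted_content) could replace the json.loads() and data[0] lines, followed by an "if item is None" check before printing.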

What You Get Back

Each extraction result is a dict with these fields:

Field               Description
title               Page title
author              Author name (if detected)
date                Publication date (ISO 8601)
main_content        Clean extracted text
content_markdown    Markdown output (if enabled)
page_type           article, forum, product, collection, listing, documentation, service
extraction_quality  0.0–1.0 confidence score
language            Detected language
sitename            Site name
description         Meta description
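For type-checked code, the fields in the table can be written down as a TypedDict. This is only a sketch: the Python types, and the assumption that any field may be missing or None, are inferred from the table rather than taken from the library. ExtractionItem and describe are hypothetical names:

```python
from typing import Optional, TypedDict


class ExtractionItem(TypedDict, total=False):
    """Assumed shape of one extraction result (types inferred, not official)."""
    title: Optional[str]
    author: Optional[str]
    date: Optional[str]              # ISO 8601 string
    main_content: Optional[str]
    content_markdown: Optional[str]  # only when Markdown output is enabled
    page_type: Optional[str]
    extraction_quality: Optional[float]
    language: Optional[str]
    sitename: Optional[str]
    description: Optional[str]


def describe(item: ExtractionItem) -> str:
    """Build a one-line summary that tolerates missing fields."""
    title = item.get("title") or "(untitled)"
    page_type = item.get("page_type") or "unknown"
    quality = item.get("extraction_quality")
    score = f"{quality:.2f}" if quality is not None else "n/a"
    return f"[{page_type}] {title} (quality: {score})"
```

Reading fields with .get() avoids a KeyError for optional fields such as author, or content_markdown when Markdown output is off.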

Enabling Markdown Output

Pass output_markdown=True to get Markdown alongside plain text:

# Run this inside an async function such as main() above.
strategy = RsTrafilaturaStrategy(output_markdown=True)
config = CrawlerRunConfig(extraction_strategy=strategy)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)

data = json.loads(result.extracted_content)
markdown = data[0]["content_markdown"]

This gives you GitHub Flavored Markdown with headings, lists, tables, bold/italic, code blocks, and links preserved.

Precision vs Recall

By default, rs-trafilatura balances precision and recall. You can tip the scale:

# Stricter filtering — less noise, may miss some content
strategy = RsTrafilaturaStrategy(favor_precision=True)

# More inclusive — captures more content, may include some boilerplate
strategy = RsTrafilaturaStrategy(favor_recall=True)
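If the mode comes from configuration, a small mapping keeps the two flags from being set together. This is a sketch under assumptions: strategy_kwargs and the mode names "balanced", "precision" and "recall" are hypothetical, and only the favor_precision and favor_recall keyword names come from the snippets above:

```python
def strategy_kwargs(mode: str = "balanced") -> dict[str, bool]:
    """Map a mode name to keyword arguments for the strategy constructor.

    Hypothetical helper: the mode names are made up for this example.
    """
    options: dict[str, dict[str, bool]] = {
        "balanced": {},                           # library defaults
        "precision": {"favor_precision": True},   # stricter filtering
        "recall": {"favor_recall": True},         # more inclusive
    }
    if mode not in options:
        raise ValueError(f"unknown mode: {mode!r}")
    # Return a copy so callers cannot mutate the shared table.
    return dict(options[mode])
```

Usage would look like RsTrafilaturaStrategy(**strategy_kwargs("precision")).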

Crawling Multiple Pages

crawl4ai handles the asynchronous fetching, and rs-trafilatura runs extraction in a thread per page, so it doesn't block the async crawl loop. The example below reuses one crawler and visits the URLs one after another:

async def main():
    strategy = RsTrafilaturaStrategy(output_markdown=True)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    urls = [
        "https://example.com/blog/post-1",
        "https://example.com/products/widget",
        "https://example.com/docs/getting-started",
        "https://forum.example.com/thread/123",
    ]

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, config=config)
            data = json.loads(result.extracted_content)
            item = data[0]
            print(f"[{item['page_type']}] {item['title']} (quality: {item['extraction_quality']:.2f})")

asyncio.run(main())

Each page gets classified into its type and extracted with the appropriate profile. A product page gets JSON-LD fallback. A forum thread gets comment-as-content handling. A docs page gets sidebar removal. All automatic.
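The loop above awaits each page before starting the next. To overlap the fetches, one option is a bounded asyncio.gather. The sketch below is generic on purpose: crawl_all and max_concurrency are hypothetical names, and fetch stands for any coroutine function that takes a URL, for example a small wrapper around crawler.arun(url=url, config=config). Whether concurrent arun() calls on one crawler suit your workload is an assumption to verify. crawl4ai also has its own batch method, arun_many, which may be the simpler choice.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")


async def crawl_all(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[T]],
    max_concurrency: int = 4,
) -> list[T]:
    """Run fetch(url) for every URL with at most max_concurrency in flight.

    Results come back in the same order as the input URLs.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> T:
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(bounded(url) for url in urls))
```

Inside the async with block, this could be called as results = await crawl_all(urls, lambda u: crawler.arun(url=u, config=config)).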

Using the Quality Score for Hybrid Pipelines

The extraction_quality field tells you how confident rs-trafilatura is in its extraction. You can use this to build a hybrid pipeline — fast heuristic extraction for most pages, with LLM fallback for the hard cases:

import json

from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def extract_with_fallback(crawler, url, config):
    result = await crawler.arun(url=url, config=config)
    data = json.loads(result.extracted_content)
    item = data[0]

    if item["extraction_quality"] < 0.80:
        # Low confidence: use crawl4ai's built-in LLM extraction as fallback.
        # Newer crawl4ai releases may expect llm_config=LLMConfig(...) instead
        # of provider; check the version you have installed.
        llm_config = CrawlerRunConfig(
            extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o-mini")
        )
        result = await crawler.arun(url=url, config=llm_config)
        return result.extracted_content

    return item["main_content"]

On the WCXB benchmark, about 8% of pages score below 0.80. Routing just those pages to a neural fallback improves the overall F1 from 0.859 to 0.862 on the development set and from 0.893 to 0.910 on the held-out test set.
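Before paying for LLM calls, it can help to see how many pages a given threshold would send to the fallback. This is a minimal sketch of that routing step on already-extracted items. split_by_quality is a hypothetical helper, and treating a missing score as low confidence is an assumption made for this example:

```python
from typing import Any


def split_by_quality(
    items: list[dict[str, Any]],
    threshold: float = 0.80,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Partition items into (confident, needs_fallback) by extraction_quality."""
    confident: list[dict[str, Any]] = []
    needs_fallback: list[dict[str, Any]] = []
    for item in items:
        score = item.get("extraction_quality")
        # Assumption: a missing score is treated as low confidence.
        if score is None or score < threshold:
            needs_fallback.append(item)
        else:
            confident.append(item)
    return confident, needs_fallback
```

The share of pages routed to the fallback is then len(needs_fallback) / len(items), which you can compare across thresholds on your own data.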

How It Works Under the Hood

RsTrafilaturaStrategy inherits from crawl4ai's ExtractionStrategy when crawl4ai is installed, so it passes the isinstance() check in CrawlerRunConfig. It sets input_format="html", which tells crawl4ai to pass raw HTML (not Markdown) and to skip chunking. The extraction runs in Rust via PyO3, so there is no subprocess to spawn and no binary to locate.
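The "inherits when crawl4ai is installed" pattern can be sketched roughly as follows. This is not the library's source: SketchStrategy is a hypothetical stand-in, the extract(url, html, ...) signature is an assumption about crawl4ai's interface, and the regex title lookup is only a placeholder for the call into the Rust extension.

```python
import re
from typing import Any

try:
    # Use crawl4ai's real base class when it is available.
    from crawl4ai.extraction_strategy import ExtractionStrategy as _Base
except ImportError:
    class _Base:  # type: ignore[no-redef]
        """Fallback base so this module still imports without crawl4ai."""

        def __init__(self, **kwargs: Any) -> None:
            pass


class SketchStrategy(_Base):
    """Hypothetical stand-in for RsTrafilaturaStrategy (illustration only)."""

    def __init__(self, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        # Ask the crawler for raw HTML rather than Markdown.
        self.input_format = "html"

    def extract(self, url: str, html: str, *args: Any, **kwargs: Any) -> list[dict[str, Any]]:
        # Placeholder logic: the real strategy would call into Rust here.
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        title = match.group(1).strip() if match else None
        return [{"title": title, "url": url}]
```

Because the base class is chosen at import time, the same module works as a plain extractor without crawl4ai and as a recognised strategy when crawl4ai is present.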
