URL: https://dev.to/murroughfoley/how-to-use-rs-trafilatura-with-spider-rs-de4

How to Use rs-trafilatura with spider-rs


spider is a high-performance async web crawler written in Rust. It discovers, fetches, and queues URLs — but content extraction is left to you. rs-trafilatura slots in as the extraction layer, giving you page-type-aware content extraction with quality scoring on every crawled page.

Setup

Add both crates to your Cargo.toml:

[dependencies]
rs-trafilatura = { version = "0.2", features = ["spider"] }
spider = "2"
tokio = { version = "1", features = ["full"] }

The spider feature flag enables rs_trafilatura::spider_integration, which provides convenience functions that accept spider's Page type directly.

Basic: Crawl Then Extract

The simplest approach — crawl a site, then extract content from every page:

use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    website.crawl().await;

    for page in website.get_pages().into_iter().flatten() {
        match extract_page(&page) {
            Ok(result) => {
                println!("[{}] {} (confidence: {:.2})",
                    result.metadata.page_type.unwrap_or_default(),
                    result.metadata.title.unwrap_or_default(),
                    result.extraction_quality,
                );
                println!(" Content: {} chars", result.content_text.len());
            }
            Err(e) => eprintln!(" Extraction failed: {e}"),
        }
    }
}

extract_page takes a &Page and returns Result<ExtractResult>. The page URL is automatically passed to the classifier for page type detection.
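Once every page has been extracted, a per-type tally is a quick way to see what the crawl actually hit. The sketch below is a minimal, standalone illustration: `tally_page_types` is a hypothetical helper (not part of rs-trafilatura), and it assumes you have already collected each result's `metadata.page_type` as an `Option<&str>`.

```rust
use std::collections::BTreeMap;

/// Counts how many pages were detected for each page type.
/// Pages with no detected type (`None`) are counted under "unknown".
/// (Hypothetical helper for illustration; not part of rs-trafilatura.)
fn tally_page_types<'a, I>(page_types: I) -> BTreeMap<String, usize>
where
    I: IntoIterator<Item = Option<&'a str>>,
{
    let mut counts = BTreeMap::new();
    for page_type in page_types {
        // Insert the key with a zero count on first sight, then increment.
        *counts
            .entry(page_type.unwrap_or("unknown").to_string())
            .or_insert(0) += 1;
    }
    counts
}
```

A `BTreeMap` keeps the keys sorted, so printing the tally gives a stable order from run to run.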

Streaming: Extract As Pages Arrive

For large crawls, you don't want to wait until everything is fetched. spider's subscribe channel lets you process pages as they arrive:

use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(0).unwrap();

    let handle = tokio::spawn(async move {
        let mut count = 0;
        while let Ok(page) = rx.recv().await {
            if let Ok(result) = extract_page(&page) {
                count += 1;
                println!("[{count}] {} → {} ({:.2})",
                    page.get_url(),
                    result.metadata.page_type.unwrap_or_default(),
                    result.extraction_quality,
                );
            }
        }
        println!("Extracted {count} pages");
    });

    website.crawl().await;
    website.unsubscribe();
    let _ = handle.await;
}

Each page is extracted in the spawned task as soon as spider fetches it. Extraction takes ~44ms per page, so it easily keeps up with typical crawl rates.
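As a back-of-envelope check on that figure, the sketch below converts a per-page extraction time into the page rate a single sequential extraction task can sustain. The function name is hypothetical, and the 44 ms input is simply the number quoted above, not something measured here.

```rust
use std::time::Duration;

/// Pages per second that one sequential extraction task can sustain,
/// given the average time spent extracting a single page.
/// (Hypothetical helper for illustration; not part of rs-trafilatura.)
fn max_pages_per_second(per_page: Duration) -> f64 {
    1.0 / per_page.as_secs_f64()
}
```

Calling it with `Duration::from_millis(44)` gives a little under 23 pages per second. A crawl that fetches faster than that would likely need extraction spread across more than one task.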

Custom Options

Use extract_page_with_options for fine-grained control:

use rs_trafilatura::{Options, spider_integration::extract_page_with_options};
use rs_trafilatura::page_type::PageType;

// Fragment: assumes `page` is a spider Page from a crawl and that the
// enclosing function returns a Result, so `?` can propagate errors.
let options = Options {
    output_markdown: true,              // Get GFM Markdown output
    include_images: true,               // Extract image metadata
    favor_precision: true,              // Stricter filtering
    page_type: Some(PageType::Product), // Force page type
    ..Options::default()
};

let result = extract_page_with_options(&page, &options)?;

if let Some(md) = &result.content_markdown {
    println!("Markdown:\n{}", md);
}

for img in &result.images {
    println!("Image: {} (hero: {})", img.src, img.is_hero);
}

If you provide url in the options, it takes precedence over the page URL for classification. If you don't, the page URL is used automatically.
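The precedence rule can be pictured as a simple fallback. This is only an illustration of the behaviour described above, not the library's internal code: `classification_url` is a hypothetical function, and it assumes the `url` option is an optional string.

```rust
/// Picks the URL handed to the page-type classifier: an explicit URL from
/// the options wins, otherwise the crawled page's own URL is used.
/// (Illustrative only; the real logic lives inside rs-trafilatura.)
fn classification_url<'a>(options_url: Option<&'a str>, page_url: &'a str) -> &'a str {
    options_url.unwrap_or(page_url)
}
```

Overriding the URL could be useful when a page is served from a cache or proxy address that does not resemble the original one.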

Quality-Gated Processing

The extraction quality score lets you filter or flag low-confidence results:

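for page in website.get_pages().into_iter().flatten() {
    let url = page.get_url().to_string();
    // Note: `?` ends the loop on the first extraction error; match on the
    // Result instead if failed pages should be skipped.
    let result = extract_page(&page)?;

    if result.extraction_quality < 0.80 {
        eprintln!("⚠ Low confidence on {url}: {:.2}", result.extraction_quality);
        // Log for manual review, or route to a fallback extractor
        continue;
    }

    // Process high-confidence extractions
    // (`save_to_database` is a placeholder for your own storage code)
    save_to_database(&result);
}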

On the WCXB benchmark, about 8% of pages score below 0.80. These are typically product pages with content in JSON-LD, forums with unusual markup, or service pages with highly distributed content.
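If you would rather collect everything first and decide afterwards, the same gate can be written as a partition. The sketch below is self-contained and uses a hypothetical `Scored` struct as a stand-in for the two fields it needs; the real `ExtractResult` carries more than this.

```rust
/// Stand-in for the parts of an extraction result used here.
/// (Hypothetical type; the real ExtractResult has more fields.)
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    url: String,
    extraction_quality: f64,
}

/// Splits results into (accepted, needs_review) around a quality threshold.
/// A score exactly at the threshold is accepted, matching the `< 0.80`
/// check in the loop above. A NaN score fails the comparison and lands
/// in the review bucket.
fn split_by_quality(results: Vec<Scored>, threshold: f64) -> (Vec<Scored>, Vec<Scored>) {
    results
        .into_iter()
        .partition(|r| r.extraction_quality >= threshold)
}
```

Keeping the low-confidence bucket around, rather than dropping it, makes it straightforward to route those pages to a fallback extractor or a manual review queue later.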

What extract_page Returns

ExtractResult gives you:

| Field | Type | Description |
| --- | --- | --- |
| content_text | String | Main content as plain text |
| content_markdown | Option<String> | GFM Markdown (when enabled) |
| content_html | Option<String> | Extracted content as HTML |
| metadata.title | Option<String> | Page title |
| metadata.author | Option<String> | Author name |
| metadata.date | Option<DateTime> | Publication date |
| metadata.page_type | Option<String> | Detected page type |
| extraction_quality | f64 | 0.0–1.0 confidence score |
| images | Vec<ImageData> | Image URLs, alt text, captions |
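Several of these fields are optional, so downstream code has to decide what a missing value should look like. The sketch below shows one way to do that without consuming the result. `Meta` is a hypothetical stand-in for three of the metadata fields listed above, and `summary_line` is an illustrative helper, not a library function.

```rust
/// Stand-in for a subset of the metadata fields listed in the table.
/// (Hypothetical type for illustration.)
struct Meta {
    title: Option<String>,
    author: Option<String>,
    page_type: Option<String>,
}

/// Builds a one-line summary, substituting visible placeholders for
/// missing fields instead of empty strings. Borrows the metadata, so the
/// caller can keep using it afterwards.
fn summary_line(meta: &Meta) -> String {
    format!(
        "[{}] {} by {}",
        meta.page_type.as_deref().unwrap_or("unknown"),
        meta.title.as_deref().unwrap_or("(untitled)"),
        meta.author.as_deref().unwrap_or("(no author)"),
    )
}
```

Using `as_deref()` borrows the inner string rather than moving it, which differs from the `unwrap_or_default()` calls in the earlier examples that take ownership of each field.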

Why Not spider_transformations?

spider ships with its own spider_transformations crate that can convert pages to Markdown or plain text. It works, but it's a basic readability-style extractor without:

  • ML page type classification
  • Type-specific extraction profiles (forum comment handling, multi-section merge, JSON-LD fallback)
  • Extraction quality scoring
  • Structured metadata extraction from JSON-LD, Open Graph, and Dublin Core

rs-trafilatura gives you all of these. For article-heavy crawls, spider_transformations is fine. For crawls that hit diverse page types, rs-trafilatura produces substantially better results.
