
URL: https://dev.to/murroughfoley/how-to-use-rs-trafilatura-with-scrapy-1i9b

How to Use rs-trafilatura with Scrapy


Scrapy is a widely used Python framework for web scraping. It handles crawling, scheduling, and data pipelines. rs-trafilatura plugs into Scrapy as an item pipeline: your spider yields items containing HTML, and the pipeline adds structured extraction results automatically.

Install

pip install rs-trafilatura scrapy

Setup

Add the pipeline to your Scrapy project's settings.py:

ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}

That's it. Every item that passes through the pipeline with a body (bytes) or html (string) field will get an extraction dict added to it.

Writing the Spider

Your spider yields items with the response body and URL:

import scrapy

class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {
            "url": response.url,
            "body": response.body,  # raw bytes; rs-trafilatura auto-detects encoding
        }

        # Follow links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)

The pipeline picks up body (bytes) or html (string). When it finds one, it runs extraction and adds the results under item["extraction"].

What the Pipeline Adds

Each processed item gets an extraction dict:

{
    "url": "https://example.com/blog/post",
    "body": b"<html>...",
    "extraction": {
        "title": "Blog Post Title",
        "author": "John Doe",
        "date": "2026-01-15T00:00:00+00:00",
        "main_content": "The full extracted text...",
        "content_markdown": "# Blog Post Title\n\nThe full extracted text...",
        "page_type": "article",
        "extraction_quality": 0.95,
        "language": "en",
        "sitename": "Example Blog",
        "description": "A blog post about...",
    }
}
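
As a rough illustration of consuming this dict, the sketch below condenses an item into a few fields for logging. The helper name `summarize` is hypothetical, and it assumes the `date` value is an ISO 8601 string like the one in the sample above.

```python
from datetime import datetime


def summarize(item: dict) -> dict:
    """Condense an item into a few fields for logging or a quick report."""
    ext = item.get("extraction") or {}
    raw_date = ext.get("date")
    # Assumption: dates look like the sample above (ISO 8601 with a UTC offset).
    published = datetime.fromisoformat(raw_date) if raw_date else None
    text = ext.get("main_content") or ""
    return {
        "url": item.get("url"),
        "title": ext.get("title"),
        "published": published,
        "word_count": len(text.split()),
        "page_type": ext.get("page_type"),
    }


sample = {
    "url": "https://example.com/blog/post",
    "extraction": {
        "title": "Blog Post Title",
        "date": "2026-01-15T00:00:00+00:00",
        "main_content": "The full extracted text...",
        "page_type": "article",
    },
}
summary = summarize(sample)
```

Using `.get()` throughout means an item that skipped extraction produces a summary with empty values rather than raising a `KeyError`.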

Enabling Markdown Output

Add to settings.py:

RS_TRAFILATURA_MARKDOWN = True

This populates item["extraction"]["content_markdown"] with GitHub Flavored Markdown.
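
Once the Markdown field is populated, one thing you might do with it is write each page to its own file. The following is a minimal sketch using only the standard library; `markdown_path`, `save_markdown`, and the slug naming scheme are all hypothetical choices, not part of rs-trafilatura.

```python
import re
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def markdown_path(url: str, out_dir: Path) -> Path:
    """Build a filesystem-safe .md path from a URL (hypothetical naming scheme)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-") or "index"
    return out_dir / f"{slug}.md"


def save_markdown(item: dict, out_dir: Path) -> Optional[Path]:
    """Write content_markdown to disk; return the path, or None if there is none."""
    markdown = (item.get("extraction") or {}).get("content_markdown")
    if not markdown:
        return None  # Markdown output disabled, or nothing was extracted
    path = markdown_path(item["url"], out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return path
```

Two different URLs can collapse to the same slug under this scheme, so a real crawl would likely want a hash suffix or a directory per host.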

Filtering by Page Type

The page type classification lets you route items differently based on what kind of page they are:

import scrapy

class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    custom_settings = {
        "ITEM_PIPELINES": {
            "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
            "myproject.pipelines.PageTypeRouter": 400,
        },
    }

    def parse(self, response):
        yield {"url": response.url, "body": response.body}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)

# myproject/pipelines.py
# save_product, save_forum_post, save_article, and save_generic are
# placeholders for your own storage functions.
class PageTypeRouter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        page_type = ext.get("page_type", "article")

        if page_type == "product":
            # Save to products table
            save_product(item)
        elif page_type == "forum":
            # Save to discussions table
            save_forum_post(item)
        elif page_type == "article":
            # Save to articles table
            save_article(item)
        else:
            # Default handling
            save_generic(item)

        return item
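
The `save_*` functions in the router are left undefined. As one possible way to fill them in while prototyping, the sketch below swaps the if/elif chain for a lookup table and writes to in-memory lists. The names `TABLES`, `ROUTES`, and `route_item` are hypothetical, and only the three page types mentioned above are mapped.

```python
from collections import defaultdict

# In-memory stand-in for real storage; keys play the role of table names.
TABLES = defaultdict(list)

# Page type -> table name. Anything not listed falls through to "generic".
ROUTES = {
    "product": "products",
    "forum": "discussions",
    "article": "articles",
}


def route_item(item: dict) -> str:
    """Append the item to the table for its page type and return the table name."""
    ext = item.get("extraction") or {}
    # Same default as the router: items with no page type are treated as articles.
    page_type = ext.get("page_type", "article")
    table = ROUTES.get(page_type, "generic")
    TABLES[table].append(item)
    return table
```

A `process_item` method could call `route_item(item)` and then return the item. Adding a new page type then only means adding one entry to `ROUTES`.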

Filtering by Extraction Quality

Drop items where extraction quality is low:

from scrapy.exceptions import DropItem

class QualityFilter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        quality = ext.get("extraction_quality", 0)

        if quality < 0.5:
            raise DropItem(
                f"Low extraction quality ({quality:.2f}): {item['url']}"
            )

        return item

Add it before the router:

ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
    "myproject.pipelines.QualityFilter": 350,
    "myproject.pipelines.PageTypeRouter": 400,
}
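
Scrapy runs item pipelines in ascending order of their number, so 350 places the filter between extraction (300) and routing (400). The sketch below simulates that ordering without Scrapy, to show that a dropped item never reaches the later stage. `DropItem` here is a local stand-in for `scrapy.exceptions.DropItem`, and `run_chain` is a hypothetical helper, not how Scrapy is implemented.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


def quality_filter(item, threshold=0.5):
    """Raise DropItem when extraction_quality is below the threshold."""
    quality = (item.get("extraction") or {}).get("extraction_quality", 0)
    if quality < threshold:
        raise DropItem(f"Low extraction quality ({quality:.2f})")
    return item


def run_chain(item, stages):
    """Run stages in ascending priority order.

    Returns (item_or_None, names_of_stages_that_ran).
    """
    ran = []
    for _priority, name, stage in sorted(stages, key=lambda s: s[0]):
        ran.append(name)
        try:
            item = stage(item)
        except DropItem:
            return None, ran  # later stages never see a dropped item
    return item, ran


# Declared out of order on purpose; the priority numbers decide the order.
stages = [
    (400, "router", lambda item: item),
    (350, "quality", quality_filter),
]
```

With these stages, an item scoring 0.2 stops at `quality`, while an item scoring 0.9 passes through `quality` and then `router`.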

Exporting to JSON Lines

Scrapy's built-in feed exports work out of the box:

scrapy crawl content -o output.jsonl

Each line in output.jsonl will contain the full item including the extraction dict. You can then process it with any tool that reads JSON Lines.
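
For example, a few lines of standard-library Python are enough to tally page types across a crawl. This is a sketch that assumes each line is a JSON object shaped like the items above; `page_type_counts` and the `"unknown"` fallback label are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def page_type_counts(path) -> Counter:
    """Count items per page_type in a JSON Lines file."""
    counts = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            item = json.loads(line)
            ext = item.get("extraction") or {}
            counts[ext.get("page_type", "unknown")] += 1
    return counts
```

Calling `page_type_counts("output.jsonl").most_common()` then lists the page types seen, most frequent first.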

Performance

rs-trafilatura extracts in ~44ms per page via compiled Rust (PyO3, no subprocess). On a typical Scrapy crawl, that overhead is small compared to network latency. The pipeline processes items synchronously in the Scrapy reactor thread, so each extraction holds the reactor for that long, but on a network-bound crawl this is rarely the bottleneck.

For very high-throughput crawls (1000+ pages/second), consider moving extraction into a separate worker process and having the item pipeline hand the HTML off to it.

Items Without HTML

If an item doesn't have body or html, the pipeline passes it through unchanged:

# This item has no HTML — pipeline ignores it
yield {"url": response.url, "custom_data": "something"}
# → No "extraction" key added, item passes through as-is
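
To make the field rule concrete, here is a minimal sketch of how a pipeline could decide whether an item has usable HTML. It is not rs-trafilatura's actual code: the helper name `pick_html_source` and the choice to check `body` before `html` are assumptions made for illustration.

```python
from typing import Optional, Tuple, Union

HtmlSource = Union[bytes, str]


def pick_html_source(item: dict) -> Optional[Tuple[str, HtmlSource]]:
    """Return (field_name, value) for the first usable HTML field, or None.

    Hypothetical helper: checks "body" (bytes) first, then "html" (str).
    The precedence is an assumption, not documented library behaviour.
    """
    body = item.get("body")
    if isinstance(body, bytes) and body:
        return "body", body
    html = item.get("html")
    if isinstance(html, str) and html:
        return "html", html
    return None  # no HTML: the item should pass through unchanged
```

An item such as `{"url": ..., "custom_data": "something"}` yields `None` here, which corresponds to the pass-through case described above.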
