URL: https://apify.com/awesome_highboy/rss-news-structurer

RSS & Atom Feeds to RAG Markdown for Embeddings · Apify


πŸ‘ RSS & Atom Feeds to RAG Markdown Chunks avatar

RSS & Atom Feeds to RAG Markdown Chunks

Under maintenance

Pricing

Pay per event

Turn RSS/Atom feeds into clean, full-article Markdown and token-bounded RAG chunks for embeddings and vector DBs. sha256 cross-item dedup means you pay only for net-new articles, not syndicated copies. robots.txt is honored, and every run is gated on an ownership attestation.

Rating

0.0

(0)

Developer

πŸ‘ Adam

Adam

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 1

Monthly active users: 0

Last modified: 8 days ago

RSS & Atom Feeds to RAG Markdown

Turn RSS and Atom feeds into RAG-ready Markdown chunks, with content-hash deduplication so you only pay for net-new articles.

What it does

For each feed URL you supply, this Actor:

  1. Fetches the feed and parses it into normalized items (title, link, guid, isoDate). Both RSS <item> (with <link>text</link>) and Atom <entry> (with <link href="...">) layouts are supported. Fields that are not present in the feed are left null; nothing is invented.
  2. Computes a deterministic sha256 content hash for each item and compares it against a cross-run seen-hash snapshot kept in the Actor's key-value store, so already-processed items are skipped as duplicates.
  3. For each net-new item, fetches the linked article, cleans the HTML down to plain Markdown (scripts/styles/tags stripped), and splits it into token-bounded chunks via the shared chunker.
  4. Emits one chunk record per chunk, carrying the source URL, feed item GUID, chunk text, token count, content hash, and the extracted feed fields.
  5. Writes one run_summary record with the run totals.

Feeds or articles that fail to fetch are skipped (and not billed), so one bad URL never fails the whole run.
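
The feed-parsing step above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the Actor's actual code: the function names `parse_feed` and `_text` are hypothetical, and the choices to map RSS `pubDate` and Atom `updated`/`published` onto `isoDate` and to take the first Atom `<link>` are assumptions.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def _text(element, tag):
    """Return the stripped text of a child element, or None if absent or empty."""
    child = element.find(tag)
    if child is None or child.text is None:
        return None
    return child.text.strip() or None


def _rss_date_to_iso(value):
    """Convert an RFC 822 date string to ISO 8601, or None if it cannot be parsed."""
    if value is None:
        return None
    try:
        return parsedate_to_datetime(value).isoformat()
    except (TypeError, ValueError):
        return None


def parse_feed(xml_text):
    """Parse RSS or Atom XML into normalized items; absent fields stay None."""
    root = ET.fromstring(xml_text)
    items = []
    # RSS layout: <item> elements whose <link> carries the URL as text.
    for item in root.iter("item"):
        items.append({
            "title": _text(item, "title"),
            "link": _text(item, "link"),
            "guid": _text(item, "guid"),
            "isoDate": _rss_date_to_iso(_text(item, "pubDate")),
        })
    # Atom layout: <entry> elements whose <link> carries the URL in href.
    for entry in root.iter(ATOM_NS + "entry"):
        link = entry.find(ATOM_NS + "link")  # assumption: first <link> wins
        items.append({
            "title": _text(entry, ATOM_NS + "title"),
            "link": link.get("href") if link is not None else None,
            "guid": _text(entry, ATOM_NS + "id"),
            "isoDate": _text(entry, ATOM_NS + "updated")
            or _text(entry, ATOM_NS + "published"),
        })
    return items
```

An RSS item without a `<guid>` comes back with `"guid": None` rather than a synthesized value, which mirrors the "nothing is invented" behaviour described above.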

Input

| Field | Type | Required | Description |
|---|---|---|---|
| feedUrls | array | Yes | Public RSS or Atom feed URLs you are authorized to read. |
| chunking | object | No | { maxTokens, overlapTokens } for token-bounded chunking. Defaults: maxTokens 512, overlapTokens 64. |
| ownership_attestation | boolean | Yes | You must confirm you are authorized to fetch the supplied feeds and their linked articles. The run is rejected before any work or billing if this is not true. |
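
As an illustration of the input contract, the sketch below applies the documented defaults and rejects a run that lacks the attestation. The function name `normalize_input` is hypothetical, and the check that `overlapTokens` is smaller than `maxTokens` is an assumption rather than documented behaviour.

```python
DEFAULT_CHUNKING = {"maxTokens": 512, "overlapTokens": 64}


def normalize_input(raw):
    """Validate Actor input and fill in the documented chunking defaults."""
    # Gate first: nothing else happens unless the attestation is exactly True.
    if raw.get("ownership_attestation") is not True:
        raise ValueError("ownership_attestation must be true")
    feed_urls = raw.get("feedUrls")
    if not isinstance(feed_urls, list) or not feed_urls:
        raise ValueError("feedUrls must be a non-empty array")
    chunking = {**DEFAULT_CHUNKING, **(raw.get("chunking") or {})}
    # Assumption: the overlap must leave room for forward progress.
    if not 0 <= chunking["overlapTokens"] < chunking["maxTokens"]:
        raise ValueError("overlapTokens must be >= 0 and < maxTokens")
    return {
        "feedUrls": feed_urls,
        "chunking": chunking,
        "ownership_attestation": True,
    }


example_input = normalize_input({
    "feedUrls": ["https://example.com/feed.xml"],  # placeholder URL
    "ownership_attestation": True,
})
```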

Output

Every record has a record_type field.

chunk (one per emitted RAG chunk):

| Field | Type | Description |
|---|---|---|
| record_type | string | Always "chunk". |
| source_url | string | The article URL the chunk came from. |
| feed_item_guid | string or null | The feed item GUID/id (or null if the feed omitted it). |
| chunk_index | integer | Zero-based index of this chunk within its article. |
| chunk_text | string | The chunk's Markdown text. |
| token_count | integer | Estimated token count of the chunk (never exceeds maxTokens). |
| content_hash | string | sha256:<64 hex> hash of the chunk text. |
| extracted_fields | object | The feed fields extracted for the item (title, link, guid, isoDate); absent fields are null. |
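
To make the chunk record shape concrete, here is a rough sketch of overlapping token-bounded chunking that produces records with the fields listed above. The Actor's own chunker and tokenizer are not documented here, so whitespace-separated words stand in for tokens (an assumption that also flattens Markdown line breaks), and `chunk_tokens`, `content_hash`, and `build_chunk_records` are hypothetical names.

```python
import hashlib


def content_hash(text):
    """Return the sha256 of the text in the documented 'sha256:<64 hex>' form."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_tokens(text, max_tokens=512, overlap_tokens=64):
    """Split text into token lists of at most max_tokens, overlapping by overlap_tokens."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and < max_tokens")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = max_tokens - overlap_tokens
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def build_chunk_records(source_url, item, markdown, max_tokens=512, overlap_tokens=64):
    """Build one 'chunk' record per chunk of an article's Markdown."""
    records = []
    for index, tokens in enumerate(chunk_tokens(markdown, max_tokens, overlap_tokens)):
        text = " ".join(tokens)
        records.append({
            "record_type": "chunk",
            "source_url": source_url,
            "feed_item_guid": item.get("guid"),
            "chunk_index": index,
            "chunk_text": text,
            "token_count": len(tokens),
            "content_hash": content_hash(text),
            "extracted_fields": {
                key: item.get(key) for key in ("title", "link", "guid", "isoDate")
            },
        })
    return records
```

Because the hash depends only on the chunk text, identical text always yields the same `content_hash`, which is what makes the hashes usable as cache keys.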

run_summary (exactly one per run):

| Field | Type | Description |
|---|---|---|
| record_type | string | Always "run_summary". |
| items_in_feed | integer | Total feed items seen across all feeds. |
| articles_fetched | integer | Net-new articles fetched and chunked. |
| duplicates_skipped | integer | Items skipped because their content hash was already seen. |
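
Since both record types land in the same dataset, a consumer typically branches on record_type. The sketch below assumes the dataset has already been loaded as a list of dicts; `split_records` is a hypothetical helper, not part of the Actor.

```python
def split_records(records):
    """Group chunk records by article URL and return the single run_summary."""
    chunks_by_url = {}
    summaries = []
    for record in records:
        if record["record_type"] == "chunk":
            chunks_by_url.setdefault(record["source_url"], []).append(record)
        elif record["record_type"] == "run_summary":
            summaries.append(record)
    if len(summaries) != 1:
        raise ValueError(f"expected exactly one run_summary, found {len(summaries)}")
    # Restore reading order within each article before embedding.
    for chunks in chunks_by_url.values():
        chunks.sort(key=lambda record: record["chunk_index"])
    return chunks_by_url, summaries[0]
```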

Pricing

Pay-Per-Event:

| Event | When it fires | Price |
|---|---|---|
| actor_run_start | Once per run, after the gates pass | $0.02 |
| article_processed | Per net-new article fetched, cleaned, and chunked | $0.008 |
| field_extracted | Per non-null structured field extracted from a feed item | $0.004 |

Duplicates skipped by content hash are not billed: you only pay for net-new work.

Example: a feed with 5 net-new articles, each with 4 fields = $0.02 + 5 x $0.008 + 20 x $0.004 = $0.14.
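
The same arithmetic can be expressed as a small estimator, using the prices from the table above. `estimate_run_cost` is a hypothetical helper, and it assumes, as the example does, that fields are billed only for net-new items; `Decimal` is used so the fractional cents add up exactly.

```python
from decimal import Decimal

# Prices copied from the Pay-Per-Event table.
PRICES = {
    "actor_run_start": Decimal("0.02"),
    "article_processed": Decimal("0.008"),
    "field_extracted": Decimal("0.004"),
}


def estimate_run_cost(net_new_articles, non_null_fields):
    """Estimate the cost in USD of one run that passes the gates."""
    return (
        PRICES["actor_run_start"]
        + net_new_articles * PRICES["article_processed"]
        + non_null_fields * PRICES["field_extracted"]
    )


# 5 net-new articles with 4 non-null fields each, as in the example above.
example_cost = estimate_run_cost(5, 5 * 4)
```

Under these assumptions a re-run that finds only duplicates costs just the $0.02 run-start fee.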

Why this Actor

  • Deterministic and idempotent. Feed parsing, hashing, dedup, and chunking are pure functions; the same feed yields the same content hashes every run, so you can cache and detect changes safely.
  • Content-hash dedup across runs. A key-value seen-hash snapshot means re-running a feed only processes (and bills) genuinely new items.
  • No hallucination. Item fields come straight from the feed XML; missing fields are null, never fabricated. There are no LLM calls and no API keys.
  • Pre-chunked for RAG. Output is already split into bounded chunks with stable hashes, ready to embed.
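
The cross-run dedup described above can be sketched as a pure function over a set of previously seen hashes. What exactly the Actor hashes per item is not specified, so hashing a canonical JSON form of the four feed fields is an assumption, as are the names `item_hash` and `split_new_items`; loading and saving the snapshot in the key-value store is left out.

```python
import hashlib
import json

FEED_FIELDS = ("title", "link", "guid", "isoDate")


def item_hash(item):
    """Hash a canonical JSON form of the item's feed fields (assumed recipe)."""
    canonical = json.dumps(
        {key: item.get(key) for key in FEED_FIELDS},
        sort_keys=True,
        ensure_ascii=False,
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def split_new_items(items, seen_hashes):
    """Return (net-new items, duplicate count, updated seen-hash set)."""
    updated = set(seen_hashes)  # copy, so the caller's snapshot is not mutated
    new_items = []
    duplicates = 0
    for item in items:
        digest = item_hash(item)
        if digest in updated:
            duplicates += 1
            continue
        updated.add(digest)
        new_items.append(item)
    return new_items, duplicates, updated
```

Feeding the returned set back in on the next run makes a repeat of the same feed yield no net-new items, which is the property that keeps re-runs from being billed twice.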

About

This Actor is AI-authored and operated under the publisher's LLC. Actor.charge() is used only to bill the customer for the Pay-Per-Event units described above; the Actor has no payout or money-out capability of any kind.
