URL: https://apify.com/orbiscribe/url-list-to-vector-jsonl

URL List to RAG & Vector JSONL Converter · Apify


Pricing

$1.00 / 1,000 URLs converted to vector JSONL

URL List to RAG & Vector JSONL

Paste a curated URL list and get clean Markdown, document JSONL, vector chunks, ingest manifest, and failed URL report.

Rating

0.0 (0)

Developer

Orbiscribe Labs

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: a month ago

Use this Actor when you already know the URLs you want to ingest and need a controlled conversion step for a vector database or RAG pipeline.

It fetches each public URL, extracts readable content, creates stable chunks, and writes JSONL artifacts plus a manifest so failed URLs are easy to inspect. It does not crawl around the site unless you explicitly provide more URLs.
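The chunking step can be pictured with a minimal sketch. This is not the Actor's actual implementation: the function name `chunk_text`, the record fields, and the choice to derive IDs from a SHA-256 hash of URL, position, and content are all assumptions made for illustration. The default sizes mirror the sample input on this page.

```python
import hashlib


def chunk_text(text: str, url: str, chunk_size: int = 2500, overlap: int = 250) -> list[dict]:
    """Split text into overlapping character windows with deterministic IDs.

    Hypothetical helper: the Actor's real chunking rules are not documented here.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each window starts this many characters after the previous one
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        body = text[start:start + chunk_size]
        # Same URL, position, and content always hash to the same ID,
        # so re-running over unchanged pages yields identical chunk IDs.
        digest = hashlib.sha256(f"{url}\n{index}\n{body}".encode("utf-8")).hexdigest()[:16]
        chunks.append({"id": digest, "url": url, "index": index, "start": start, "text": body})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Because the ID depends only on the inputs, a downstream vector store can upsert by ID and skip chunks that have not changed between runs.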

What you get

  • Dataset rows for document records and chunks.
  • Clean Markdown, main text, headings, links, canonical URL, content hash, and output preset metadata.
  • Key-value outputs: RAG_CHUNKS_JSONL, VECTOR_CHUNKS_JSONL, DOCUMENTS_JSONL, INGEST_MANIFEST, FAILED_URLS, MARKDOWN_BUNDLE, BUYER_BRIEF, and RUN_SUMMARY.

Common workflows

  • Convert a curated URL export into vector-store JSONL.
  • Reprocess a known page list without a crawler wandering through the site.
  • Send failed URLs to a cleanup queue.
  • Keep document and chunk records side by side for debugging retrieval.

Input

Provide urls and choose an outputPreset such as openai_vector_store, langchain, llamaindex, pinecone, qdrant, or generic_jsonl. The preset is included in chunk metadata so downstream jobs can route or transform the JSONL.
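As a rough illustration of routing on the preset, the sketch below groups JSONL chunk records by preset. The field path `metadata.outputPreset` and the fallback to `generic_jsonl` are assumptions; inspect a real output line to confirm where the Actor stores the preset before relying on this.

```python
import json
from collections import defaultdict


def group_by_preset(jsonl_text: str, default: str = "generic_jsonl") -> dict[str, list[dict]]:
    """Group JSONL records by the preset found in their metadata.

    The key path metadata.outputPreset is hypothetical, not confirmed by the listing.
    """
    groups = defaultdict(list)
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        preset = (record.get("metadata") or {}).get("outputPreset", default)
        groups[preset].append(record)
    return dict(groups)
```

A downstream job could then hand each group to the loader that matches its preset.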

Use includeUrlPatterns, excludeUrlPatterns, maxUrls, chunkSizeChars, and chunkOverlapChars to control scope, cost, and chunk shape. The default run processes three public Apify docs URLs so a first Store run produces real Markdown and JSONL without extra setup.
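To preview which URLs a given scope setting would keep, a small local filter can help. The sketch below assumes patterns are matched as plain substrings and that an empty include list means "keep everything"; the Actor may instead use regular expressions or globs, so treat `select_urls` as a hypothetical approximation.

```python
def select_urls(urls: list[str], include_patterns: list[str],
                exclude_patterns: list[str], max_urls: int) -> list[str]:
    """Keep URLs that match an include pattern, match no exclude pattern, up to max_urls.

    Substring matching is an assumption, not the Actor's documented behavior.
    """
    selected = []
    for url in urls:
        if len(selected) >= max_urls:
            break  # the cap bounds both scope and cost
        if include_patterns and not any(p in url for p in include_patterns):
            continue
        if any(p in url for p in exclude_patterns):
            continue
        selected.append(url)
    return selected
```

Running this over a URL export before a large batch gives a quick count of how many sources would actually be processed.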

{
  "urls": [
    {"url": "https://docs.apify.com/academy/getting-started"},
    {"url": "https://docs.apify.com/academy/web-scraping-for-beginners"},
    {"url": "https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actor-description"}
  ],
  "outputPreset": "openai_vector_store",
  "includeUrlPatterns": ["/academy/"],
  "excludeUrlPatterns": [],
  "maxUrls": 3,
  "chunkSizeChars": 2500,
  "chunkOverlapChars": 250,
  "dryRun": false
}
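Before submitting an input like the one above, a few local sanity checks can catch obvious mistakes. The sketch below is an illustration only: `check_run_input` and its rules are assumptions, not the Actor's own validation, and the preset set lists only the presets named on this page, which may not be exhaustive.

```python
import json

# Presets named in this listing; the Actor may accept others.
NAMED_PRESETS = {"openai_vector_store", "langchain", "llamaindex",
                 "pinecone", "qdrant", "generic_jsonl"}


def check_run_input(run_input: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    urls = run_input.get("urls") or []
    if not urls:
        problems.append("urls is empty")
    for entry in urls:
        url = entry.get("url", "") if isinstance(entry, dict) else ""
        if not url.startswith(("http://", "https://")):
            problems.append(f"not an http(s) URL entry: {entry!r}")
    if run_input.get("outputPreset") not in NAMED_PRESETS:
        problems.append("outputPreset is not one of the presets named in the listing")
    size = run_input.get("chunkSizeChars", 0)
    overlap = run_input.get("chunkOverlapChars", 0)
    if size <= 0 or not 0 <= overlap < size:
        problems.append("chunkOverlapChars must be >= 0 and smaller than chunkSizeChars")
    if run_input.get("maxUrls", 0) < 1:
        problems.append("maxUrls must be at least 1")
    return problems


SAMPLE_INPUT = json.loads("""
{
  "urls": [
    {"url": "https://docs.apify.com/academy/getting-started"},
    {"url": "https://docs.apify.com/academy/web-scraping-for-beginners"},
    {"url": "https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actor-description"}
  ],
  "outputPreset": "openai_vector_store",
  "includeUrlPatterns": ["/academy/"],
  "excludeUrlPatterns": [],
  "maxUrls": 3,
  "chunkSizeChars": 2500,
  "chunkOverlapChars": 250,
  "dryRun": false
}
""")
```

Calling `check_run_input(SAMPLE_INPUT)` on the sample returns an empty list; setting `chunkOverlapChars` above `chunkSizeChars` produces one reported problem.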

Pricing

Recommended monetization: Pay per Event at $0.001 per vector-jsonl-url.

When pay-per-event pricing is enabled, dry runs are uncharged and free-plan callers get the first 25 processed sources without this Actor's custom event charge. Users should still set Apify spending limits before large batches.
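The pricing rules above reduce to simple arithmetic, sketched below for budgeting. Two assumptions are baked in: the 25-source free-plan allowance is applied once to the whole batch, and only this Actor's custom event charge is counted, not any other Apify platform usage. The function name is hypothetical.

```python
from decimal import Decimal

PRICE_PER_URL = Decimal("0.001")  # $0.001 per vector-jsonl-url event
FREE_PLAN_ALLOWANCE = 25          # first processed sources without the custom event charge


def estimate_event_charge(url_count: int, free_plan: bool = False, dry_run: bool = False) -> Decimal:
    """Estimate the custom event charge in USD for a batch of processed URLs."""
    if url_count < 0:
        raise ValueError("url_count must not be negative")
    if dry_run:
        return Decimal("0")  # dry runs are uncharged
    billable = url_count - FREE_PLAN_ALLOWANCE if free_plan else url_count
    return PRICE_PER_URL * max(billable, 0)
```

Under these assumptions, 1,000 URLs on a paid plan come to $1.00, and the same batch on the free plan comes to $0.975.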

Limits and compliance

Public URLs only. This Actor does not bypass logins, paywalls, robots policies, or access controls. It is intentionally a URL-list converter, not a broad crawler.
