bioRxiv + medRxiv Scraper for RAG

Pricing

from $20.00 / 1,000 papers

Scrape bioRxiv and medRxiv preprints by server, category, and date range. Returns RAG-ready JSON with JATS full-text chunks (cl100k_base, 512/50) when available and abstract fallback otherwise. Drop-in for LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector. $0.02 per preprint.

Developer

GetAScraper

Maintained by Community

Last modified

10 days ago

bioRxiv and medRxiv scraper for RAG: chunked JSON

Scrape bioRxiv and medRxiv preprints into RAG-ready JSON in one call. Pulls preprints by server, category, and posting-date range. Fetches the JATS full-text XML when available and falls back to the abstract otherwise. Returns fixed-token chunks (512 tokens, 50 overlap) with full metadata, ready to drop into LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector, or Chroma. Built for biomedical AI teams, pharma/biotech researchers, drug-discovery AI, and clinical-evidence tooling.

What does this Actor do?

This Apify Actor scrapes bioRxiv and medRxiv preprints matching your server, category, and date window, fetches the JATS full-text XML where available, and splits the resulting plain text into tokenizer-aware chunks (512 tokens, 50-token overlap, tiktoken cl100k_base) ready to embed or feed into a RAG index.

Each output record contains clean metadata (DOI, server, version, title, authors, category, posting date, license) and a chunks array of { idx, text, tokens } ready for direct ingestion into a vector database.
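The fixed-token windowing described above can be sketched as follows. This is a minimal illustration and not the Actor's actual code: `chunk_text`, `toy_encode`, and `toy_decode` are hypothetical names, and the whitespace tokenizer is only a stand-in so the sketch runs without extra dependencies. For counts that match cl100k_base, one would presumably pass the `encode` and `decode` methods of `tiktoken.get_encoding("cl100k_base")` instead.

```python
from typing import Callable, Dict, List, Sequence, Union


def chunk_text(
    text: str,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
    size: int = 512,
    overlap: int = 50,
) -> List[Dict[str, Union[int, str]]]:
    """Split text into windows of at most `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = encode(text)
    step = size - overlap  # each window starts this many tokens after the previous one
    chunks: List[Dict[str, Union[int, str]]] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append({"idx": len(chunks), "text": decode(window), "tokens": len(window)})
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
def toy_encode(s: str) -> List[str]:
    return s.split()


def toy_decode(tokens: Sequence) -> str:
    return " ".join(tokens)


demo = chunk_text(" ".join(f"w{i}" for i in range(1000)), toy_encode, toy_decode)
print([c["tokens"] for c in demo])  # [512, 512, 76]
```

With a 462-token stride, a 1,000-token text yields three chunks, and the last one is shorter than 512 tokens, which is why the tokens field is documented as "≤ 512".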

Try it in the Apify Console. Pick one or both servers, an optional category slug, a date range, a preprint cap, and hit Start. Download results as JSON, CSV, or Excel.

Because it runs on the Apify platform, you also get scheduled runs, HTTP API access, integrations with Zapier and Make, proxy rotation, monitoring, and alerts. No infrastructure to run yourself.

Why use bioRxiv and medRxiv scraper for RAG?

Preprints, not published-only: The fastest-moving biomedical evidence is on bioRxiv and medRxiv months before it reaches PubMed.
Skip the JATS XML grind: Clean preprint records, not raw <article> trees with boilerplate to strip.
Full-text when it exists: Every upstream API record carries a jatsxml URL. When that XML parses and produces useful prose, the output record is marked source: "full_text". Otherwise it falls back to the abstract and is marked source: "abstract". See the Limits section for the realistic coverage rate.
Both servers, one run: Pass servers: ["biorxiv", "medrxiv"] and filter downstream by the server field on each record.
Category filtering per server: bioRxiv and medRxiv use different taxonomies, both supported.
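Pre-chunked for RAG: tiktoken cl100k_base tokenization. Chunks work with OpenAI text-embedding-3, Claude, Cohere, and most BGE/E5/nomic embedding models. Token counts are exact under cl100k_base and approximate for models that use a different tokenizer.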
Vector-DB neutral: Drop into Qdrant, Pinecone, Weaviate, pgvector (Supabase / Neon), Chroma, or Milvus without reformatting.
Framework-ready: Works with LangChain, LlamaIndex, Haystack, or LangGraph.
Respectful rate limiting: 3 requests per second total across both servers. No API key needed.
Cheap: $0.02 per preprint. A month of medRxiv oncology (~300 preprints) costs around $6.

How to use bioRxiv and medRxiv scraper for RAG

Open the Actor in Apify Console.
Pick servers (one or both of biorxiv, medrxiv).
Set category to an optional, server-specific slug. Leave empty for all categories.
Set dateFrom / dateTo in YYYY-MM-DD format.
Set maxPreprints to cap the run. The cap is global across both servers.
Click Start. Expect roughly 100 to 200 preprints per minute under the 3 req/s ceiling.
Download results from the Storage tab.
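The same steps can be driven from code instead of the Console. The sketch below assumes the official `apify-client` Python package and an Apify API token. `build_run_input` and `run_actor` are hypothetical helper names, and the Actor ID is taken from this page's URL. Treat it as a starting point under those assumptions, not as the only supported way to call the Actor.

```python
from datetime import date
from typing import Dict, Iterator, List, Optional

ACTOR_ID = "getascraper/biorxiv-medrxiv-rag-extractor"  # from the Actor's URL


def build_run_input(
    date_from: str,
    date_to: str,
    servers: Optional[List[str]] = None,
    category: str = "",
    max_preprints: int = 10,
) -> Dict[str, object]:
    """Validate the fields set in the steps above and return the run input."""
    servers = servers or ["biorxiv", "medrxiv"]
    unknown = set(servers) - {"biorxiv", "medrxiv"}
    if unknown:
        raise ValueError(f"unknown servers: {sorted(unknown)}")
    # date.fromisoformat raises ValueError on malformed dates
    if date.fromisoformat(date_from) > date.fromisoformat(date_to):
        raise ValueError("dateFrom must not be after dateTo")
    if not 1 <= max_preprints <= 100000:
        raise ValueError("maxPreprints must be between 1 and 100000")
    return {
        "servers": servers,
        "category": category,
        "dateFrom": date_from,
        "dateTo": date_to,
        "maxPreprints": max_preprints,
    }


def run_actor(token: str, run_input: Dict[str, object]) -> Iterator[dict]:
    """Start a run, wait for it to finish, and yield the dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()
```

A call would then look like `for record in run_actor("<APIFY_TOKEN>", build_run_input("2024-01-01", "2024-01-02")): ...`, with the token placeholder replaced by your own.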

Input

Field	Type	Required	Description
`servers`	array of strings	No	One or both of `biorxiv`, `medrxiv`. Default: `["biorxiv", "medrxiv"]`.
`category`	string	No	Server-specific category slug. Empty matches all categories.
`dateFrom`	string	Yes	Inclusive posting-date lower bound in YYYY-MM-DD format. Default: `"2024-01-01"`.
`dateTo`	string	Yes	Inclusive posting-date upper bound in YYYY-MM-DD format. Default: `"2024-01-02"`.
`maxPreprints`	integer	No	Global cap across both servers (1 to 100000). Default: `10`.

Example input (combined run):

{
  "servers": ["biorxiv", "medrxiv"],
  "category": "",
  "dateFrom": "2024-01-01",
  "dateTo": "2024-01-02",
  "maxPreprints": 10
}

Example input (medRxiv oncology only):

{
  "servers": ["medrxiv"],
  "category": "oncology",
  "dateFrom": "2024-01-01",
  "dateTo": "2024-01-31",
  "maxPreprints": 500
}

Category slugs

bioRxiv: animal_behavior_and_cognition, biochemistry, bioengineering, bioinformatics, biophysics, cancer_biology, cell_biology, developmental_biology, ecology, evolutionary_biology, genetics, genomics, immunology, microbiology, molecular_biology, neuroscience, paleontology, pathology, pharmacology_and_toxicology, physiology, plant_biology, scientific_communication_and_education, synthetic_biology, systems_biology, zoology.

medRxiv: addiction_medicine, allergy_and_immunology, anesthesia, cardiovascular_medicine, dentistry_and_oral_medicine, dermatology, emergency_medicine, endocrinology, epidemiology, gastroenterology, genetic_and_genomic_medicine, geriatric_medicine, health_economics, health_informatics, health_policy, health_systems_and_quality_improvement, hematology, hiv_aids, infectious_diseases, intensive_care_and_critical_care_medicine, medical_education, medical_ethics, nephrology, neurology, nursing, nutrition, obstetrics_and_gynecology, occupational_and_environmental_health, oncology, ophthalmology, orthopedics, otolaryngology, pain_medicine, palliative_medicine, pathology, pediatrics, pharmacology_and_therapeutics, primary_care_research, psychiatry_and_clinical_psychology, public_and_global_health, radiology_and_imaging, rehabilitation_medicine_and_physical_therapy, respiratory_medicine, rheumatology, sexual_and_reproductive_health, sports_medicine, surgery, toxicology, transplantation, urology.

Category slug mismatch warning. Setting category: "neuroscience" with servers: ["medrxiv"] returns zero medRxiv records because medRxiv has no neuroscience slug. The Actor logs a warning in this case but does not fail. Split the run into two calls or leave category empty if you want everything.
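A small pre-flight check can catch the mismatch before you pay for a run. The sketch below is a hypothetical client-side helper, not part of the Actor. The `SLUGS` sets are deliberately abbreviated for illustration and would need to be filled in from the full lists above.

```python
from typing import Dict, List, Set

# Abbreviated slug sets (assumption: fill in from the full lists above).
SLUGS: Dict[str, Set[str]] = {
    "biorxiv": {"neuroscience", "immunology", "pathology", "genomics"},
    "medrxiv": {"oncology", "neurology", "pathology", "epidemiology"},
}


def servers_for_category(
    servers: List[str], category: str, slugs: Dict[str, Set[str]] = SLUGS
) -> List[str]:
    """Return the servers on which `category` can match anything."""
    if category == "":
        return list(servers)  # an empty category matches all categories
    return [s for s in servers if category in slugs[s]]


print(servers_for_category(["biorxiv", "medrxiv"], "neuroscience"))  # ['biorxiv']
print(servers_for_category(["biorxiv", "medrxiv"], "pathology"))     # ['biorxiv', 'medrxiv']
print(servers_for_category(["medrxiv"], "neuroscience"))             # []
```

An empty result means the run would return no records, so it is a signal to split the request into per-server calls with matching slugs.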

Output

Each preprint becomes one dataset item. You can download the dataset in JSON, HTML, CSV, or Excel.

{
  "doi": "10.1101/2024.03.15.585219",
  "server": "biorxiv",
  "version": "1",
  "title": "A concise title here",
  "abstract": "The abstract as returned by the bioRxiv API.",
  "authors": ["Smith, J.", "Doe, J."],
  "category": "neuroscience",
  "publication_date": "2024-03-15",
  "preprint_url": "https://www.biorxiv.org/content/10.1101/2024.03.15.585219v1",
  "license": "cc_by",
  "source": "full_text",
  "chunks": [
    {"idx": 0, "text": "...", "tokens": 487},
    {"idx": 1, "text": "...", "tokens": 512}
  ]
}

source is "full_text" when JATS XML parsed into useful prose, "abstract" when it fell back.

Data table

Field	Type	Description
`doi`	string	DOI (primary identifier, e.g. `10.1101/2024.03.15.585219`)
`server`	`"biorxiv"` \| `"medrxiv"`	Which server the preprint came from
`version`	string	Preprint version returned by the API (latest at fetch time)
`title`	string	Preprint title
`abstract`	string	Abstract as returned by the bioRxiv API
`authors`	string[]	Author display names in the order the API returned them
`category`	string	Server-specific category slug
`publication_date`	ISO date	`YYYY-MM-DD` posting date
`preprint_url`	string	Canonical preprint landing page
`license`	string?	Normalized license key: `cc_by`, `cc_by_nc`, `cc_by_nd`, `cc_by_nc_nd`, `cc0`, `none`, or null
`source`	`"full_text"` \| `"abstract"`	Text origin
`chunks`	Chunk[]	Token-aware chunks for RAG
`chunks[].idx`	number	0-indexed position
`chunks[].text`	string	Chunk text
`chunks[].tokens`	number	Token count (≤ 512)
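Most vector stores want one document per chunk rather than one per preprint. The sketch below flattens a dataset item into per-chunk documents carrying the record's metadata. The function name, the choice of metadata fields, and the `doi` + version + idx identifier scheme are assumptions made for illustration. Adapt them to whatever your vector database expects.

```python
from typing import Dict, Iterator

# Record-level fields copied onto every chunk (a choice made for this sketch).
METADATA_FIELDS = (
    "doi", "server", "version", "title", "category",
    "publication_date", "preprint_url", "license", "source",
)


def to_chunk_documents(record: Dict) -> Iterator[Dict]:
    """Yield one {id, text, metadata} document per chunk of a dataset item."""
    base = {key: record.get(key) for key in METADATA_FIELDS}
    for chunk in record["chunks"]:
        yield {
            # Hypothetical ID scheme: stable across re-runs of the same version.
            "id": f'{record["doi"]}v{record["version"]}#{chunk["idx"]}',
            "text": chunk["text"],
            "metadata": {**base, "chunk_idx": chunk["idx"], "tokens": chunk["tokens"]},
        }


# Abbreviated copy of the example output above.
record = {
    "doi": "10.1101/2024.03.15.585219",
    "server": "biorxiv",
    "version": "1",
    "title": "A concise title here",
    "category": "neuroscience",
    "publication_date": "2024-03-15",
    "license": "cc_by",
    "source": "full_text",
    "chunks": [
        {"idx": 0, "text": "...", "tokens": 487},
        {"idx": 1, "text": "...", "tokens": 512},
    ],
}
docs = list(to_chunk_documents(record))
print(docs[0]["id"])  # 10.1101/2024.03.15.585219v1#0
```

Keeping `source` in the metadata lets you filter retrieval to full-text chunks only, or weight abstract-only records differently.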

Pricing

$0.02 per preprint (PPR, pay per result).

How much does it cost to scrape bioRxiv and medRxiv?

Volume	Estimated cost
10 preprints	~$0.20
100 preprints	~$2.00
1,000 preprints	~$20.00
10,000 preprints	~$200.00
100,000 preprints	~$2,000.00

No subscription. No minimum. You pay only for successful records.
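The table is simply the record count times $0.02. A tiny helper for budgeting a run before setting maxPreprints might look like this. The function name is hypothetical, and the arithmetic is done in whole cents to avoid floating-point drift.

```python
PRICE_CENTS_PER_PREPRINT = 2  # $0.02 per result, as stated above


def estimate_cost_usd(n_preprints: int) -> float:
    """Upper-bound cost in USD for a run that returns n_preprints records."""
    if n_preprints < 0:
        raise ValueError("n_preprints must be non-negative")
    return n_preprints * PRICE_CENTS_PER_PREPRINT / 100


for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n:>7,} preprints  ~${estimate_cost_usd(n):,.2f}")
```

Because maxPreprints is a global cap, `estimate_cost_usd(maxPreprints)` bounds the cost of a single run.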

Limits you should know before you run

Full-text coverage is roughly 40 to 80 percent of returned records: bioRxiv and medRxiv publish JATS XML for most recent preprints, but availability varies by category, server, and how recently the preprint was posted. The remaining records fall back to abstract-only (source: "abstract"). Budget your ingest pipeline with this in mind.
bioRxiv is behind Cloudflare: The Actor handles security challenges in the background. Occasional transient 403s are retried automatically.
Only the latest version of each preprint is returned. Version history is a v2 feature.
No figure or table extraction: Captions stay inline as text inside body chunks. Figure and table content is dropped during the JATS strip pass.
No citation graph: Reference lists are stripped from body text to keep chunks dense. Reference extraction is a v2 feature.
No section-aware chunking: Chunks are fixed-token (512 with 50 overlap). Section-level splitting (Abstract / Introduction / Methods / Results / Discussion) is deferred.
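Since full-text coverage varies by category, server, and date, it is worth measuring it on a small sample run before a large backfill. The helper below is a hypothetical sketch that only relies on the `source` field of each output record.

```python
from collections import Counter
from typing import Dict, Iterable


def source_shares(records: Iterable[Dict]) -> Dict[str, float]:
    """Return the share of records per `source` value, e.g. full_text vs abstract."""
    counts = Counter(r["source"] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}  # no records, nothing to report
    return {source: n / total for source, n in counts.items()}


sample = [
    {"doi": "a", "source": "full_text"},
    {"doi": "b", "source": "full_text"},
    {"doi": "c", "source": "abstract"},
    {"doi": "d", "source": "full_text"},
]
print(source_shares(sample))  # {'full_text': 0.75, 'abstract': 0.25}
```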

Tips

Split large backfills into month-sized windows and run them in parallel Apify runs. The 3 req/s limiter is per-run, so parallel runs scale linearly.
Pair with the sister Actor, PubMed Scraper for RAG (getascraper/pubmed-rag-extractor), for the fast + validated flow: preprints today, peer-reviewed tomorrow.
Track the same DOIs over time by running weekly and diffing on doi and version. New versions appear as new records with higher version numbers.
source: "abstract" records are still useful: Abstracts are dense, well-structured, and often the most information-rich section of a preprint.
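Two of the tips above lend themselves to small helpers: splitting a backfill into month-sized windows, and spotting new versions between two runs. Both functions below are hypothetical sketches. They assume inclusive dateFrom / dateTo bounds, as the Input table states, and that `version` is a numeric string, as in the example output.

```python
from calendar import monthrange
from datetime import date
from typing import Dict, Iterable, Iterator, List, Tuple


def month_windows(date_from: str, date_to: str) -> Iterator[Tuple[str, str]]:
    """Yield inclusive (dateFrom, dateTo) pairs, one per calendar month."""
    start, end = date.fromisoformat(date_from), date.fromisoformat(date_to)
    while start <= end:
        last_day = date(start.year, start.month, monthrange(start.year, start.month)[1])
        yield start.isoformat(), min(last_day, end).isoformat()
        # Move to the first day of the following month.
        if start.month == 12:
            start = date(start.year + 1, 1, 1)
        else:
            start = date(start.year, start.month + 1, 1)


def new_or_updated(previous: Iterable[Dict], current: Iterable[Dict]) -> List[Dict]:
    """Return records in `current` whose DOI is new or whose version is higher."""
    seen: Dict[str, int] = {}
    for r in previous:
        seen[r["doi"]] = max(seen.get(r["doi"], 0), int(r["version"]))
    return [r for r in current if int(r["version"]) > seen.get(r["doi"], 0)]


print(list(month_windows("2024-01-15", "2024-03-10")))
# [('2024-01-15', '2024-01-31'), ('2024-02-01', '2024-02-29'), ('2024-03-01', '2024-03-10')]
```

Each window can be passed to a separate run. Because the bounds are inclusive and the windows do not overlap, no preprint should be fetched twice.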

Publish your output as a HuggingFace dataset

If you extract a category-scoped corpus (e.g. every bioRxiv immunology preprint from 2024), consider publishing the output as a HuggingFace dataset:

pip install datasets
# Then, in Python:
# from datasets import Dataset
# ds = Dataset.from_json("output.json")
# ds.push_to_hub("your-username/biorxiv-immunology-2024")

Disclaimers and support

Disclaimer: This Actor retrieves publicly available preprint data. Make sure your usage complies with applicable guidelines. This tool is not affiliated with, endorsed by, or sponsored by bioRxiv or medRxiv.
Support: Submit an issue from the Issues tab for bug reports, questions, or custom requests.
Need a custom scraper? If you need custom output columns or massive enterprise volumes, reach out to us through our Apify profile page.


URL: https://apify.com/getascraper/biorxiv-medrxiv-rag-extractor
