URL: https://apify.com/draouadmohamed/arxiv-semantic-search

Pricing

from $5.00 / 1,000 relevant papers found

Arxiv Semantic Search

Scrape arXiv papers by category and find relevant research using AI-powered semantic search. Get papers from any field (AI, physics, biology, economics, etc.) with embeddings for RAG systems. Find your categories at: https://arxiv.org/category_taxonomy

Rating: 0.0 (0)

Developer: Mohamed Aouad (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 4
  • Monthly active users: 0
  • Last modified: 6 months ago

ArXiv Scraper & Semantic Search

Scrape academic papers from arXiv.org by category and perform semantic search on abstracts using AI-powered embeddings. Perfect for researchers, literature reviews, and building AI agents that need access to scientific papers.

Features

  • 📚 Scrape arXiv papers by category (quantum physics, AI, condensed matter, etc.)
  • 🔍 Semantic search using Sentence-BERT embeddings (384-dimensional vectors)
  • 🎯 Ranked results by cosine similarity to your query
  • 📅 Date filtering to get papers from specific time periods
  • ⚡ Fast & efficient - processes 50 papers in ~30 seconds
  • 🔄 Automatic retries with exponential backoff for API failures

Use Cases

  • Literature reviews - Find papers similar to your research topic
  • Research discovery - Explore related work in your field
  • AI agents - Build RAG systems with scientific knowledge
  • Citation analysis - Track papers in specific domains
  • Trend monitoring - Monitor new papers in your categories

How It Works

  1. Fetch papers from arXiv API using category filters
  2. Generate embeddings for paper abstracts using Sentence-BERT (all-MiniLM-L6-v2)
  3. Search semantically by comparing query embedding to paper embeddings
  4. Return ranked results sorted by similarity score
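
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the Actor's own code: it uses pure Python instead of the scikit-learn routine named under Technical Details, and the toy 3-dimensional vectors stand in for the 384-dimensional Sentence-BERT embeddings. The function names `cosine_similarity` and `rank_papers` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank_papers(query_embedding, papers, top_k=10):
    """Return the top_k papers sorted by descending similarity to the query."""
    scored = [
        {**paper, "similarity_score": cosine_similarity(query_embedding, paper["embedding"])}
        for paper in papers
    ]
    scored.sort(key=lambda p: p["similarity_score"], reverse=True)
    return scored[:top_k]


# Toy vectors; real embeddings would come from the all-MiniLM-L6-v2 model.
papers = [
    {"id": "a", "embedding": [1.0, 0.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0, 0.0]},
    {"id": "c", "embedding": [0.7, 0.7, 0.0]},
]
top = rank_papers([1.0, 0.1, 0.0], papers, top_k=2)
print([p["id"] for p in top])
```

Because cosine similarity ignores vector length, only the direction of the query embedding relative to each abstract embedding affects the ranking.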

🚀 Quick Start for Your Research Domain

New to arXiv categories? We've got you covered:

  • 📖 CATEGORY_GUIDE.md - Find categories for your field
  • 📋 QUICK_START_EXAMPLES.md - Copy-paste ready configurations
  • 📚 USER_GUIDE.md - Complete usage guide with workflows

Popular Research Domains

| Your Field | Categories to Use | Example |
|---|---|---|
| AI/Machine Learning | cs.AI, cs.LG, cs.CL | LLMs, transformers, neural networks |
| Physics | quant-ph, cond-mat | Quantum computing, semiconductors |
| Biology | q-bio.NC, q-bio.BM | Neuroscience, protein folding |
| Economics | econ.EM, stat.ME | Econometrics, causal inference |
| Math/Statistics | stat.ML, math.ST | Statistical methods, probability |

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| categories | array | ["cs.AI"] | arXiv categories (see CATEGORY_GUIDE.md) |
| maxPapers | integer | 100 | Papers per category (1-1000) |
| startDate | string | null | Start date (YYYY-MM-DD), e.g., "2024-01-01" |
| endDate | string | null | End date (YYYY-MM-DD), e.g., "2024-12-31" |
| enableSemanticSearch | boolean | false | Rank papers by relevance to your query |
| searchQuery | string | null | Plain English query, e.g., "quantum computing" |
| topK | integer | 10 | Number of top results (1-100) |

Full category list: arXiv category taxonomy or CATEGORY_GUIDE.md
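
The defaults and ranges in the table can be checked before a run with a small helper. The sketch below is hypothetical (`validate_input` is not part of the Actor). Two of its rules are assumptions rather than documented behavior: that `searchQuery` is required when `enableSemanticSearch` is true, and that `startDate` must not be later than `endDate`.

```python
from datetime import date


def validate_input(raw):
    """Apply the documented defaults and range checks to a raw input dict."""
    cfg = {
        "categories": raw.get("categories", ["cs.AI"]),
        "maxPapers": raw.get("maxPapers", 100),
        "startDate": raw.get("startDate"),
        "endDate": raw.get("endDate"),
        "enableSemanticSearch": raw.get("enableSemanticSearch", False),
        "searchQuery": raw.get("searchQuery"),
        "topK": raw.get("topK", 10),
    }
    if not cfg["categories"]:
        raise ValueError("categories must contain at least one arXiv category")
    if not 1 <= cfg["maxPapers"] <= 1000:
        raise ValueError("maxPapers must be between 1 and 1000")
    if not 1 <= cfg["topK"] <= 100:
        raise ValueError("topK must be between 1 and 100")

    parsed = {}
    for key in ("startDate", "endDate"):
        if cfg[key] is not None:
            # Raises ValueError if the string is not a valid ISO date.
            parsed[key] = date.fromisoformat(cfg[key])
    # Assumption: an inverted date range is treated as an error.
    if len(parsed) == 2 and parsed["startDate"] > parsed["endDate"]:
        raise ValueError("startDate must not be after endDate")
    # Assumption: semantic search needs a query to rank against.
    if cfg["enableSemanticSearch"] and not cfg["searchQuery"]:
        raise ValueError("searchQuery is required when enableSemanticSearch is true")
    return cfg


print(validate_input({"categories": ["quant-ph"], "maxPapers": 50}))
```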

Output Format

Each paper in the dataset contains:

{
  "id": "2512.05101v1",
  "title": "Decoy-state quantum key distribution over 227 km...",
  "abstract": "We demonstrate quantum key distribution using...",
  "authors": ["John Doe", "Jane Smith"],
  "published": "2024-12-05T10:30:00Z",
  "updated": "2024-12-05T10:30:00Z",
  "categories": ["quant-ph", "physics.optics"],
  "pdf_url": "https://arxiv.org/pdf/2512.05101v1",
  "arxiv_url": "http://arxiv.org/abs/2512.05101v1",
  "embedding": [0.123, -0.456, ...],  // 384-dim vector (if enableSemanticSearch=true)
  "similarity_score": 0.87  // Only present if searchQuery was provided
}
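
For RAG use, records in this shape can be turned into prompt snippets. The following is a sketch under assumptions: `build_context` is a hypothetical helper, and the 0.5 score threshold and 500-character cap are arbitrary illustrative values, not recommendations from the Actor.

```python
def build_context(items, min_score=0.5, max_chars=500):
    """Format papers scoring at least min_score as text snippets for a prompt."""
    snippets = []
    for item in items:
        score = item.get("similarity_score")
        if score is None or score < min_score:
            continue  # no score (plain scrape) or too weakly related
        abstract = item["abstract"][:max_chars]  # keep the prompt short
        snippets.append(f"[{item['id']}] {item['title']}\n{abstract}")
    return snippets


# Placeholder records using the field names from the output format above.
items = [
    {"id": "p1", "title": "First paper", "abstract": "Abstract one.", "similarity_score": 0.87},
    {"id": "p2", "title": "Second paper", "abstract": "Abstract two.", "similarity_score": 0.21},
    {"id": "p3", "title": "Third paper", "abstract": "Abstract three."},
]
for snippet in build_context(items):
    print(snippet)
```

Using `item.get("similarity_score")` keeps the helper safe for datasets produced without a search query, where that field is absent.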

Usage Examples

Example 1: Scrape Recent Quantum Physics Papers

{
  "categories": ["quant-ph"],
  "maxPapers": 50,
  "startDate": "2024-12-01",
  "endDate": "2024-12-05"
}

Example 2: Semantic Search for Topological Quantum Computing

{
  "categories": ["quant-ph", "cond-mat.mes-hall"],
  "maxPapers": 100,
  "enableSemanticSearch": true,
  "searchQuery": "topological quantum computing and anyons",
  "topK": 10
}

Example 3: Find AI Papers on Transformers

{
  "categories": ["cs.AI", "cs.LG", "cs.CL"],
  "maxPapers": 200,
  "enableSemanticSearch": true,
  "searchQuery": "transformer architecture attention mechanisms",
  "topK": 20
}
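
The same inputs can be sent programmatically. The sketch below assumes the third-party `apify-client` package and takes the Actor ID `draouadmohamed/arxiv-semantic-search` from this page's URL. `build_run_input` and `run_actor` are hypothetical helper names, and reading `run["defaultDatasetId"]` assumes a client version in which `call` returns a dict, so check the client documentation for the version you install.

```python
def build_run_input(categories, query=None, max_papers=100, top_k=10):
    """Assemble an Actor input dict using the documented parameter names."""
    run_input = {"categories": list(categories), "maxPapers": max_papers}
    if query:
        run_input.update(
            {"enableSemanticSearch": True, "searchQuery": query, "topK": top_k}
        )
    return run_input


def run_actor(token, run_input):
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so the input builder works without the dependency.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor("draouadmohamed/arxiv-semantic-search").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


example_input = build_run_input(
    ["quant-ph", "cond-mat.mes-hall"],
    query="topological quantum computing and anyons",
)
print(example_input)
# items = run_actor("<YOUR_APIFY_TOKEN>", example_input)  # needs a real token
```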

Getting Started

Run Locally

# Install Apify CLI
npm install -g apify-cli
# Clone or create the Actor, then run it locally from the project directory
apify run

Deploy to Apify

# Login to Apify
apify login
# Deploy the Actor
apify push

Performance

  • Scraping: ~100 papers/minute from arXiv API
  • Embeddings: ~30 seconds for 50 papers (first run downloads model)
  • Search: <1 second for 100 papers (in-memory cosine similarity)
  • Memory: ~500MB (includes PyTorch + Sentence-BERT model)

Technical Details

  • Embedding Model: sentence-transformers/all-MiniLM-L6-v2
    • Dimensions: 384
    • Speed: ~1000 sentences/second
    • Quality: High semantic similarity accuracy
  • Search Algorithm: Cosine similarity (scikit-learn)
  • API: arXiv Atom API (no authentication required)
  • Rate Limiting: Automatic retry with exponential backoff
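
To illustrate the API layer, here is a minimal sketch of building a category query for the arXiv Atom API and parsing the feed with the standard library. It is an assumption about how such a fetch could look, not the Actor's implementation; `build_query_url` and `parse_feed` are hypothetical names, and the sort options are one reasonable choice for getting recent papers first.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace prefix


def build_query_url(category, max_results=100, start=0):
    """Build an arXiv API URL listing the newest papers in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_text):
    """Extract a few fields from each <entry> of an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        papers.append({
            "arxiv_url": entry.findtext(f"{ATOM}id"),
            # Titles and abstracts contain line breaks; collapse whitespace.
            "title": " ".join(entry.findtext(f"{ATOM}title", "").split()),
            "abstract": " ".join(entry.findtext(f"{ATOM}summary", "").split()),
            "authors": [a.findtext(f"{ATOM}name") for a in entry.findall(f"{ATOM}author")],
            "published": entry.findtext(f"{ATOM}published"),
            "categories": [c.get("term") for c in entry.findall(f"{ATOM}category")],
        })
    return papers


print(build_query_url("quant-ph", max_results=50))
```

The response body from that URL (fetched, for example, with `urllib.request.urlopen`) can be passed to `parse_feed` as text.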

Error Handling

The Actor gracefully handles:

  • ✅ Network failures (3 retries with exponential backoff)
  • ✅ Invalid categories (warns but continues)
  • ✅ Empty results (returns empty dataset)
  • ✅ Missing embeddings (falls back to scraping only)
  • ✅ Invalid input parameters (clear error messages)
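
The retry behavior for network failures can be sketched as below. This is an illustration only: `with_retries` is a hypothetical helper, and the 1-second base delay and the choice to catch `OSError` are assumptions (the README states only "3 retries with exponential backoff").

```python
import time


def with_retries(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on a network error retry up to max_retries times,
    doubling the wait before each new attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except OSError:  # urllib's URLError and socket timeouts subclass OSError
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults


# Demonstration with a fake fetch that fails twice, then succeeds.
calls = {"n": 0}
delays = []


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("simulated network failure")
    return "ok"


result = with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.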

Limitations

  • arXiv API rate limit: ~1 request/3 seconds (handled automatically)
  • Maximum papers per request: 1000
  • Embedding generation requires ~500MB RAM
  • Search is in-memory (not suitable for >10,000 papers)
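
A client that calls the arXiv API directly would need similar spacing between requests. The sketch below shows one way to enforce a minimum interval. `Throttle` is a hypothetical class, and the 3-second default simply mirrors the limit quoted above.

```python
import time


class Throttle:
    """Block so successive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call throttle.wait() immediately before each API request.
throttle = Throttle(min_interval=3.0)
throttle.wait()  # the first call returns without sleeping
```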

License

Apache 2.0


Built with ❤️ for researchers and AI developers
