URL: https://apify.com/solutionssmart/scrapy-cloud-runner

Scrapy Cloud Runner – Run Python Scrapy Spiders on Apify · Apify


Pricing

from $2.00 / 1,000 results

Run Scrapy spiders on Apify with request queue, dataset export, proxy rotation, scheduling, and cloud-ready deployment.

Rating

0.0

(0)

Developer

๐Ÿ‘ Solutions Smart

Solutions Smart

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 23 days ago

Run Scrapy spiders on Apify with cloud scheduling, API access, dataset output, proxy support, and production-friendly crawl defaults.

What this actor does

Scrapy Cloud Runner is a Python Apify Actor that executes Scrapy spiders bundled with the actor codebase. It uses the official Apify Python SDK and Scrapy integration so you can run, schedule, and monitor Scrapy crawls in the Apify Console or through the API.

The actor:

  • runs a selected Scrapy spider by spiderName
  • reads input from the Apify input form or API
  • pushes scraped items to the default dataset
  • stores a crawl summary in the OUTPUT record of the default key-value store
  • supports Apify proxy configuration
  • exposes crawl controls for limits, retries, delays, cache, and robots.txt

Included spider

The actor includes one bundled example spider:

  • page_meta: crawls pages, extracts basic page metadata, and optionally follows links

The example spider is designed to be a solid starting point, not a universal website crawler. By default it stays on the same hostname as the start URLs to avoid drifting into subdomains with different blocking or rate-limit behavior.
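
The same-hostname rule can be pictured as an exact hostname comparison. The sketch below is only an illustration of that idea, not the actor's actual code; the function names `allowed_hostnames` and `should_follow` are hypothetical.

```python
from urllib.parse import urlsplit


def allowed_hostnames(start_urls):
    """Collect the exact hostnames of the start URLs (urlsplit lowercases them)."""
    hosts = set()
    for url in start_urls:
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return hosts


def should_follow(link, hostnames):
    """Return True only when the link's hostname exactly matches a start hostname."""
    host = urlsplit(link).hostname
    return host is not None and host in hostnames


hosts = allowed_hostnames(["https://apify.com"])
same_site = should_follow("https://apify.com/store", hosts)
subdomain = should_follow("https://docs.apify.com/", hosts)
```

Under this comparison a subdomain such as docs.apify.com counts as a different host, which matches the behavior described above.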

Why use it on Apify

Running Scrapy on Apify gives you:

  • scheduled runs
  • API-triggered runs
  • centralized logs
  • dataset export
  • proxy integration
  • managed cloud execution

You keep the Scrapy spider model, but you do not need to manage servers, deployment plumbing, or result storage yourself.

Quick start

  1. Open the actor in Apify Console.
  2. Set spiderName to page_meta or to your own bundled spider.
  3. Add one or more startUrls.
  4. Keep the default limits for the first run.
  5. Run the actor.
  6. Review results in the Dataset tab and the summary in the OUTPUT record.
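
The same steps can also be driven through the API. The following is a minimal sketch that assumes the `apify-client` Python package is installed and that the actor ID matches the Store URL (`solutionssmart/scrapy-cloud-runner`); the helper names `build_run_input` and `run_actor` are made up for illustration.

```python
def build_run_input(start_urls, spider_name="page_meta"):
    """Build a minimal input; omitted fields fall back to the actor defaults."""
    return {
        "spiderName": spider_name,
        "startUrls": [{"url": url} for url in start_urls],
    }


def run_actor(token, run_input):
    """Start the actor, wait for it to finish, and return (items, summary)."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor("solutionssmart/scrapy-cloud-runner").call(run_input=run_input)
    if run is None:
        raise RuntimeError("The actor run did not start.")

    # Scraped pages land in the run's default dataset.
    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

    # The crawl summary is stored under the OUTPUT key.
    record = client.key_value_store(run["defaultKeyValueStoreId"]).get_record("OUTPUT")
    summary = record["value"] if record else None
    return items, summary
```

A call such as `run_actor(token, build_run_input(["https://apify.com"]))` would then return the dataset items and the summary, where `token` is your own Apify API token.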

Example input

{
  "spiderName": "page_meta",
  "startUrls": [
    {"url": "https://apify.com"}
  ],
  "followLinks": true,
  "sameHostnameOnly": true,
  "includeHtml": false,
  "maxRequestsPerCrawl": 20,
  "maxDepth": 1,
  "maxConcurrency": 16,
  "requestTimeoutSecs": 30,
  "downloadDelaySecs": 1,
  "retryTimes": 2,
  "useAutoThrottle": true,
  "autoThrottleTargetConcurrency": 1,
  "autoThrottleStartDelaySecs": 1,
  "autoThrottleMaxDelaySecs": 15,
  "respectRobotsTxt": true,
  "useHttpCache": true,
  "httpCacheExpirationSecs": 7200,
  "spiderArgs": [
    {"key": "category", "value": "books"}
  ]
}
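
Most of the crawl-control fields have an obvious counterpart among Scrapy's built-in settings. The mapping below is an assumption about how such a translation could look, not a description of the actor's internals; the Scrapy setting names are real, while `scrapy_settings_from_input` is a hypothetical helper.

```python
# Assumed correspondence between actor input fields and Scrapy settings.
INPUT_TO_SETTING = {
    "maxDepth": "DEPTH_LIMIT",
    "maxConcurrency": "CONCURRENT_REQUESTS",
    "requestTimeoutSecs": "DOWNLOAD_TIMEOUT",
    "downloadDelaySecs": "DOWNLOAD_DELAY",
    "retryTimes": "RETRY_TIMES",
    "useAutoThrottle": "AUTOTHROTTLE_ENABLED",
    "autoThrottleTargetConcurrency": "AUTOTHROTTLE_TARGET_CONCURRENCY",
    "autoThrottleStartDelaySecs": "AUTOTHROTTLE_START_DELAY",
    "autoThrottleMaxDelaySecs": "AUTOTHROTTLE_MAX_DELAY",
    "respectRobotsTxt": "ROBOTSTXT_OBEY",
    "useHttpCache": "HTTPCACHE_ENABLED",
    "httpCacheExpirationSecs": "HTTPCACHE_EXPIRATION_SECS",
}


def scrapy_settings_from_input(actor_input):
    """Translate the fields present in the input; skip everything else."""
    return {
        setting: actor_input[field]
        for field, setting in INPUT_TO_SETTING.items()
        if field in actor_input
    }


settings = scrapy_settings_from_input(
    {"spiderName": "page_meta", "maxDepth": 1, "useAutoThrottle": True}
)
```

Fields without a direct setting, such as spiderName or startUrls, are left out of the result.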

Input settings

Input                          Type     Description
spiderName                     string   Name of the bundled Scrapy spider to run.
startUrls                      array    Starting URLs for the crawl.
allowedDomains                 array    Optional domain allowlist for Scrapy offsite filtering.
followLinks                    boolean  Follow links discovered on crawled pages.
sameHostnameOnly               boolean  Restrict followed links to the exact hostnames from startUrls. Recommended for focused crawls.
includeHtml                    boolean  Include raw HTML in dataset items.
maxRequestsPerCrawl            integer  Maximum number of scraped pages/items emitted by the bundled spider.
maxDepth                       integer  Maximum follow depth from the initial pages.
maxConcurrency                 integer  Maximum concurrent Scrapy requests.
requestTimeoutSecs             integer  Download timeout per request.
downloadDelaySecs              number   Base delay between requests to the same site.
retryTimes                     integer  Retry count for retryable failures.
useAutoThrottle                boolean  Enable Scrapy AutoThrottle.
autoThrottleTargetConcurrency  number   Target average concurrency per remote site.
autoThrottleStartDelaySecs     number   Initial AutoThrottle delay.
autoThrottleMaxDelaySecs       number   Maximum AutoThrottle delay.
respectRobotsTxt               boolean  Respect robots.txt.
useHttpCache                   boolean  Enable HTTP cache.
httpCacheExpirationSecs        integer  Cache expiration time in seconds.
proxyConfiguration             object   Apify proxy configuration.
spiderArgs                     array    Spider arguments entered as schema-based key/value rows in Apify Console.
spiderArgsJson                 object   Structured spider arguments for API callers. Merged over spiderArgs on duplicate keys.
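
The merge rule for the two spider-argument inputs can be sketched as follows. This only illustrates the documented precedence (spiderArgsJson wins on duplicate keys); `merge_spider_args` is a hypothetical name.

```python
def merge_spider_args(spider_args=None, spider_args_json=None):
    """Combine key/value rows with the JSON object; JSON values win on duplicates."""
    merged = {row["key"]: row["value"] for row in (spider_args or [])}
    merged.update(spider_args_json or {})
    return merged


merged = merge_spider_args(
    spider_args=[
        {"key": "category", "value": "books"},
        {"key": "lang", "value": "en"},
    ],
    spider_args_json={"category": "music", "limit": 10},
)
```

Here `category` ends up as "music" because the JSON object overrides the row with the same key, while `lang` and `limit` are both kept.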

Output

The default dataset contains one item per scraped page. For the bundled page_meta spider, each item includes fields such as:

{
  "url": "https://apify.com",
  "status": 200,
  "title": "Apify: Full-stack web scraping and data extraction platform",
  "metaDescription": "Cloud platform for web scraping, browser automation, AI agents, and data for AI.",
  "canonicalUrl": "https://apify.com",
  "h1": "Get real-time web data for your AI",
  "contentType": "text/html; charset=utf-8",
  "depth": 0,
  "referrer": null,
  "textLength": 125419,
  "crawledAt": "2026-05-16T11:21:11.435924+00:00",
  "html": null
}

The actor also stores a summary record in OUTPUT:

{
  "availableSpiders": ["page_meta"],
  "finishedAt": "2026-05-16T11:21:15.000000+00:00",
  "itemCount": 5,
  "requestCount": 12,
  "spiderName": "page_meta",
  "startedAt": "2026-05-16T11:21:10.000000+00:00",
  "stats": {}
}
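
The summary record lends itself to quick post-run checks. The sketch below derives the run duration and the items-per-request ratio from the fields shown above; the function name and the output keys `durationSecs` and `itemsPerRequest` are invented for this example.

```python
from datetime import datetime


def summarize_run(summary):
    """Derive the duration and items-per-request ratio from an OUTPUT record."""
    started = datetime.fromisoformat(summary["startedAt"])
    finished = datetime.fromisoformat(summary["finishedAt"])
    requests = summary["requestCount"]
    return {
        "durationSecs": (finished - started).total_seconds(),
        # Guard against division by zero when no requests were made.
        "itemsPerRequest": summary["itemCount"] / requests if requests else 0.0,
    }


report = summarize_run(
    {
        "startedAt": "2026-05-16T11:21:10.000000+00:00",
        "finishedAt": "2026-05-16T11:21:15.000000+00:00",
        "itemCount": 5,
        "requestCount": 12,
    }
)
```

For the sample record this gives a duration of 5 seconds and 5 items from 12 requests.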

Default crawl behavior

The bundled actor defaults are tuned for focused website crawls:

  • same-host following is enabled by default
  • AutoThrottle is enabled by default
  • HTTP cache uses RFC2616 policy
  • common blocked/error responses such as 403 and 429 are not cached
  • cookies are disabled
  • robots.txt is respected by default

These defaults favor politeness and stability over raw throughput, which makes them a safer production baseline than running Scrapy at high parallelism.
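
Expressed as Scrapy settings, the defaults listed above might look roughly like the dictionary below. This is a sketch under assumptions: the setting names are standard Scrapy settings, but the source does not say how the actor excludes 403 and 429 responses from the cache, so that rule is shown as a plain predicate with a hypothetical name rather than as a specific Scrapy mechanism.

```python
# Assumed Scrapy equivalents of the documented defaults.
DEFAULT_SETTINGS = {
    "AUTOTHROTTLE_ENABLED": True,
    "HTTPCACHE_POLICY": "scrapy.extensions.httpcache.RFC2616Policy",
    "COOKIES_ENABLED": False,
    "ROBOTSTXT_OBEY": True,
}

# Blocked or rate-limited responses that should never be served from cache.
UNCACHED_STATUSES = frozenset({403, 429})


def is_cacheable_status(status):
    """Illustrate the caching rule: skip responses that signal blocking."""
    return status not in UNCACHED_STATUSES
```

Keeping 403 and 429 out of the cache means a temporary block is retried on the next run instead of being replayed from stored responses.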

Add your own spiders

  1. Add a spider module under src/spiders/.
  2. Give the spider a unique Scrapy name.
  3. Read any custom runtime options from spider kwargs or spiderArgsJson.
  4. Deploy the updated actor.
  5. Run the actor with spiderName set to your spider's name.

The actor uses Scrapy's spider loader, so bundled spiders are discovered automatically from src.spiders.
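
Scrapy hands spider arguments to the spider's constructor as keyword arguments, and values that originate from text fields typically arrive as strings. A small helper like the one below can turn them into typed options; it is a sketch only, and the option names `category`, `max_items`, and `include_images` are hypothetical.

```python
TRUTHY = {"1", "true", "yes", "on"}


def parse_spider_options(**kwargs):
    """Coerce raw spider kwargs into typed options with defaults."""
    raw_flag = kwargs.get("include_images", False)
    if isinstance(raw_flag, bool):
        include_images = raw_flag
    else:
        # Accept common textual spellings of a boolean.
        include_images = str(raw_flag).strip().lower() in TRUTHY
    return {
        "category": str(kwargs.get("category", "all")),
        "max_items": int(kwargs.get("max_items", 50)),
        "include_images": include_images,
    }


options = parse_spider_options(category="books", max_items="10", include_images="true")
```

Inside a spider, the helper could be called from `__init__(self, *args, **kwargs)` before passing the remaining arguments on to the base class.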

Practical guidance

  • Start with one or two startUrls.
  • Keep sameHostnameOnly enabled unless you intentionally want cross-subdomain crawling.
  • Use proxy configuration for websites with blocking or rate limiting.
  • Keep includeHtml off unless you need full source in the dataset.
  • For broad or multi-domain crawling, create a dedicated spider with different settings instead of using the bundled example as-is.

Legal and operational note

You are responsible for using this actor in compliance with the target site's terms, applicable law, and reasonable load limits. Keep respectRobotsTxt enabled unless you have a clear reason not to.
