Scrapy Cloud Runner

Pricing

from $2.00 / 1,000 results

Run Scrapy spiders on Apify with request queue, dataset export, proxy rotation, scheduling, and cloud-ready deployment.

Developer

Solutions Smart

Maintained by Community

Last modified: 23 days ago

What this actor does

Scrapy Cloud Runner is a Python Apify Actor that executes Scrapy spiders bundled with the actor codebase. It uses the official Apify Python SDK and Scrapy integration so you can run, schedule, and monitor Scrapy crawls in the Apify Console or through the API.

The actor:

runs a selected Scrapy spider by spiderName
reads input from the Apify input form or API
pushes scraped items to the default dataset
stores a crawl summary in the OUTPUT key-value store record
supports Apify proxy configuration
exposes crawl controls for limits, retries, delays, cache, and robots.txt

Included spider

The actor includes one bundled example spider:

page_meta: crawls pages, extracts basic page metadata, and optionally follows links

The example spider is designed to be a solid starting point, not a universal website crawler. By default it stays on the same hostname as the start URLs to avoid drifting into subdomains with different blocking or rate-limit behavior.
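The same-hostname rule can be illustrated with a small standard-library sketch. This is not the actor's actual code; the function names `allowed_hostnames` and `same_hostname_links` are hypothetical, and the sketch only assumes that "same hostname" means an exact match on the host part of the URL, so `docs.apify.com` does not match `apify.com`.

```python
from urllib.parse import urljoin, urlsplit


def allowed_hostnames(start_urls):
    """Collect the exact hostnames of the start URLs (plain URL strings)."""
    hostnames = set()
    for url in start_urls:
        host = urlsplit(url).hostname  # already lowercased by urlsplit
        if host:
            hostnames.add(host)
    return hostnames


def same_hostname_links(page_url, hrefs, hostnames):
    """Resolve hrefs against page_url and keep only http(s) links on an allowed host."""
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.hostname in hostnames:
            kept.append(absolute)
    return kept


hosts = allowed_hostnames(["https://apify.com"])
links = same_hostname_links(
    "https://apify.com/store",
    ["/pricing", "https://docs.apify.com/platform", "mailto:hello@example.com"],
    hosts,
)
print(links)  # ['https://apify.com/pricing']
```

The subdomain link and the non-HTTP link are both dropped, which is the behavior the paragraph above describes for focused crawls.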

Why use it on Apify

Running Scrapy on Apify gives you:

scheduled runs
API-triggered runs
centralized logs
dataset export
proxy integration
managed cloud execution

You keep the Scrapy spider model, but you do not need to manage servers, deployment plumbing, or result storage yourself.

Quick start

Open the actor in Apify Console.
Set spiderName to page_meta or to your own bundled spider.
Add one or more startUrls.
Keep the default limits for the first run.
Run the actor.
Review results in the Dataset tab and the summary in the OUTPUT record.

Example input

{
  "spiderName": "page_meta",
  "startUrls": [
    {"url": "https://apify.com"}
  ],
  "followLinks": true,
  "sameHostnameOnly": true,
  "includeHtml": false,
  "maxRequestsPerCrawl": 20,
  "maxDepth": 1,
  "maxConcurrency": 16,
  "requestTimeoutSecs": 30,
  "downloadDelaySecs": 1,
  "retryTimes": 2,
  "useAutoThrottle": true,
  "autoThrottleTargetConcurrency": 1,
  "autoThrottleStartDelaySecs": 1,
  "autoThrottleMaxDelaySecs": 15,
  "respectRobotsTxt": true,
  "useHttpCache": true,
  "httpCacheExpirationSecs": 7200,
  "spiderArgs": [
    {"key": "category", "value": "books"}
  ]
}
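For API callers, an input like the one above could be submitted from Python roughly as follows. This is a sketch under assumptions: the helper `run_spider` is hypothetical, the `client` argument is expected to behave like `ApifyClient` from the `apify-client` package, and the run details are assumed to come back as a dictionary with a `defaultDatasetId` key. The actor ID and token are placeholders, not values taken from this page.

```python
def run_spider(client, actor_id, run_input):
    """Start the actor, wait for it to finish, and return its dataset items.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    # Items pushed by the spider land in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = {
    "spiderName": "page_meta",
    "startUrls": [{"url": "https://apify.com"}],
    "maxRequestsPerCrawl": 20,
    "maxDepth": 1,
}

# Usage with the real client (requires `pip install apify-client` and an API token):
# from apify_client import ApifyClient
# items = run_spider(ApifyClient("<APIFY_TOKEN>"), "<username>/<actor-name>", run_input)
```

Passing the client in as an argument keeps the helper easy to test with a stub and avoids hard-coding credentials.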

Input settings

Input	Type	Description
`spiderName`	string	Name of the bundled Scrapy spider to run.
`startUrls`	array	Starting URLs for the crawl.
`allowedDomains`	array	Optional domain allowlist for Scrapy offsite filtering.
`followLinks`	boolean	Follow links discovered on crawled pages.
`sameHostnameOnly`	boolean	Restrict followed links to the exact hostnames from `startUrls`. Recommended for focused crawls.
`includeHtml`	boolean	Include raw HTML in dataset items.
`maxRequestsPerCrawl`	integer	Maximum number of scraped pages/items emitted by the bundled spider.
`maxDepth`	integer	Maximum follow depth from the initial pages.
`maxConcurrency`	integer	Maximum concurrent Scrapy requests.
`requestTimeoutSecs`	integer	Download timeout per request.
`downloadDelaySecs`	number	Base delay between requests to the same site.
`retryTimes`	integer	Retry count for retryable failures.
`useAutoThrottle`	boolean	Enable Scrapy AutoThrottle.
`autoThrottleTargetConcurrency`	number	Target average concurrency per remote site.
`autoThrottleStartDelaySecs`	number	Initial AutoThrottle delay.
`autoThrottleMaxDelaySecs`	number	Maximum AutoThrottle delay.
`respectRobotsTxt`	boolean	Respect robots.txt.
`useHttpCache`	boolean	Enable HTTP cache.
`httpCacheExpirationSecs`	integer	Cache expiration time in seconds.
`proxyConfiguration`	object	Apify proxy configuration.
`spiderArgs`	array	Spider arguments entered as schema-based key/value rows in Apify Console.
`spiderArgsJson`	object	Structured spider arguments for API callers. Merged over `spiderArgs` on duplicate keys.
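The precedence between `spiderArgs` and `spiderArgsJson` can be pictured with the following sketch. The function `merge_spider_args` is hypothetical and only mirrors the rule stated in the table (values from `spiderArgsJson` win on duplicate keys); skipping rows with an empty key is an added assumption.

```python
def merge_spider_args(spider_args=None, spider_args_json=None):
    """Combine key/value rows with a JSON object; the JSON object wins on duplicates."""
    merged = {}
    for row in spider_args or []:
        key = str(row.get("key", "")).strip()
        if key:  # assumption: rows without a key are ignored
            merged[key] = row.get("value")
    merged.update(spider_args_json or {})
    return merged


args = merge_spider_args(
    [{"key": "category", "value": "books"}, {"key": "lang", "value": "en"}],
    {"category": "music", "maxPrice": 20},
)
print(args)  # {'category': 'music', 'lang': 'en', 'maxPrice': 20}
```

Note that values entered as rows in Apify Console are strings, while `spiderArgsJson` can carry numbers, booleans, and nested structures.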

Output

The default dataset contains one item per scraped page. For the bundled page_meta spider, each item includes fields such as:

{
  "url": "https://apify.com",
  "status": 200,
  "title": "Apify: Full-stack web scraping and data extraction platform",
  "metaDescription": "Cloud platform for web scraping, browser automation, AI agents, and data for AI.",
  "canonicalUrl": "https://apify.com",
  "h1": "Get real-time web data for your AI",
  "contentType": "text/html; charset=utf-8",
  "depth": 0,
  "referrer": null,
  "textLength": 125419,
  "crawledAt": "2026-05-16T11:21:11.435924+00:00",
  "html": null
}
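To make the metadata fields concrete, here is a rough standard-library sketch of how `title`, `metaDescription`, `canonicalUrl`, and `h1` might be pulled from a page. It is not the bundled spider's code, which presumably uses Scrapy selectors; the class `PageMetaParser` is hypothetical, and taking only the first `h1` and collapsing whitespace are assumptions.

```python
from html.parser import HTMLParser


class PageMetaParser(HTMLParser):
    """Collect title, meta description, canonical URL, and the first h1."""

    def __init__(self):
        super().__init__()
        self.meta = {"title": None, "metaDescription": None, "canonicalUrl": None, "h1": None}
        self._field = None   # field whose text is currently being collected
        self._chunks = []    # text fragments for that field

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            if self.meta[tag] is None:  # keep only the first occurrence
                self._field, self._chunks = tag, []
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta["metaDescription"] = attrs.get("content")
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.meta["canonicalUrl"] = attrs.get("href")

    def handle_data(self, data):
        if self._field:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._field:
            # Join fragments (including text of nested tags) and collapse whitespace.
            self.meta[tag] = " ".join("".join(self._chunks).split())
            self._field = None


html = """<html><head><title> Example   page </title>
<meta name="description" content="A short description.">
<link rel="canonical" href="https://example.com/"></head>
<body><h1>Hello <em>world</em></h1><h1>Second</h1></body></html>"""

parser = PageMetaParser()
parser.feed(html)
parser.close()
print(parser.meta)
```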

The actor also stores a summary record in OUTPUT:

{
  "availableSpiders": ["page_meta"],
  "finishedAt": "2026-05-16T11:21:15.000000+00:00",
  "itemCount": 5,
  "requestCount": 12,
  "spiderName": "page_meta",
  "startedAt": "2026-05-16T11:21:10.000000+00:00",
  "stats": {}
}
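A summary record like this is easy to post-process. The sketch below derives the crawl duration and the ratio of items to requests from the fields shown above; the helper `summarize` and its output keys `durationSecs` and `itemsPerRequest` are hypothetical names, not part of the actor.

```python
from datetime import datetime


def summarize(output):
    """Derive duration and items-per-request from an OUTPUT summary record."""
    started = datetime.fromisoformat(output["startedAt"])
    finished = datetime.fromisoformat(output["finishedAt"])
    requests = output["requestCount"]
    return {
        "durationSecs": (finished - started).total_seconds(),
        # Guard against division by zero when no requests were made.
        "itemsPerRequest": output["itemCount"] / requests if requests else 0.0,
    }


summary = summarize({
    "finishedAt": "2026-05-16T11:21:15.000000+00:00",
    "itemCount": 5,
    "requestCount": 12,
    "startedAt": "2026-05-16T11:21:10.000000+00:00",
})
print(summary["durationSecs"])  # 5.0
```

A low items-per-request ratio can hint at retries, redirects, or robots.txt requests, though the exact cause depends on what the `stats` object reports for a given run.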

Default crawl behavior

The bundled actor defaults are tuned for focused website crawls:

same-host following is enabled by default
AutoThrottle is enabled by default
HTTP cache uses RFC2616 policy
common blocked/error responses such as 403 and 429 are not cached
cookies are disabled
robots.txt is respected by default

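These defaults favor polite, predictable crawling over raw speed: they are more conservative than running Scrapy at high parallelism, and they lower the risk of overloading the target site or getting blocked by it.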
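One plausible way to picture these defaults is as a mapping from actor input to Scrapy settings. The sketch below is an assumption about how such a mapping could look, not the actor's implementation: the function `build_scrapy_settings` is hypothetical, the pairing of input fields with settings is inferred from their descriptions, and the default for `useHttpCache` is a guess. The setting names themselves are standard Scrapy settings. How the actor excludes 403 and 429 responses from the cache is not shown in the source, so that part is left out.

```python
def build_scrapy_settings(actor_input):
    """Translate actor input into a Scrapy settings dict (illustrative mapping)."""
    settings = {
        "ROBOTSTXT_OBEY": actor_input.get("respectRobotsTxt", True),
        "AUTOTHROTTLE_ENABLED": actor_input.get("useAutoThrottle", True),
        "COOKIES_ENABLED": False,
        # Assumption: the cache is opt-in; the source only states the policy.
        "HTTPCACHE_ENABLED": actor_input.get("useHttpCache", False),
        "HTTPCACHE_POLICY": "scrapy.extensions.httpcache.RFC2616Policy",
    }
    optional = {
        "maxConcurrency": "CONCURRENT_REQUESTS",
        "maxDepth": "DEPTH_LIMIT",
        "requestTimeoutSecs": "DOWNLOAD_TIMEOUT",
        "downloadDelaySecs": "DOWNLOAD_DELAY",
        "retryTimes": "RETRY_TIMES",
        "autoThrottleTargetConcurrency": "AUTOTHROTTLE_TARGET_CONCURRENCY",
        "autoThrottleStartDelaySecs": "AUTOTHROTTLE_START_DELAY",
        "autoThrottleMaxDelaySecs": "AUTOTHROTTLE_MAX_DELAY",
        "httpCacheExpirationSecs": "HTTPCACHE_EXPIRATION_SECS",
    }
    # Only set values the caller provided, so Scrapy's own defaults apply otherwise.
    for input_key, setting_name in optional.items():
        if actor_input.get(input_key) is not None:
            settings[setting_name] = actor_input[input_key]
    return settings


settings = build_scrapy_settings({"maxConcurrency": 16, "maxDepth": 1, "retryTimes": 2})
print(settings["CONCURRENT_REQUESTS"], settings["DEPTH_LIMIT"], settings["RETRY_TIMES"])  # 16 1 2
```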

Add your own spiders

Add a spider module under src/spiders/.
Give the spider a unique Scrapy name.
Read any custom runtime options from spider kwargs or spiderArgsJson.
Deploy the updated actor.
Run the actor with spiderName set to your spider's name.

The actor uses Scrapy's spider loader, so bundled spiders are discovered automatically from src.spiders.

Practical guidance

Start with one or two startUrls.
Keep sameHostnameOnly enabled unless you intentionally want cross-subdomain crawling.
Use proxy configuration for websites with blocking or rate limiting.
Keep includeHtml off unless you need full source in the dataset.
For broad or multi-domain crawling, create a dedicated spider with different settings instead of using the bundled example as-is.

Legal and operational note

You are responsible for using this actor in compliance with the target site's terms, applicable law, and reasonable load limits. Keep respectRobotsTxt enabled unless you have a clear reason not to.
