Pricing
from $1.00 / 1,000 results
Find Broken Links
Crawl a website (start URL plus same-host pages up to a configurable depth) and report every link that returns a 4xx / 5xx status, times out, or has a DNS error. HTTP-only: no browser needed, and no proxy needed for the crawl itself.
Crawl a website and report every link that returns a 4xx / 5xx status, times out, or fails DNS. Bounded by maxCrawlDepth and maxPages so it stays predictable on large sites. HTTP-only: no browser, and no proxy for the bulk crawl.
What it does
You give it a start URL; the actor crawls the start page (and optionally same-host internal links up to a depth N), gathers every <a href>, and probes each one with HEAD (falling back to GET when servers reject HEAD). Records are emitted only for links that fail.
The dataset is never empty: even a perfectly clean site gets a final summary record with run statistics.
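The link-gathering step described above can be sketched with the standard library alone. This is a minimal illustration, not the actor's code: the names AnchorCollector and extract_links are hypothetical, and the actor may use a different HTML parser.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects (absolute_url, anchor_text) pairs from <a href> elements."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None  # href of the <a> currently open
        self._text: List[str] = []        # text fragments inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative hrefs against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str, base_url: str) -> List[Tuple[str, str]]:
    parser = AnchorCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Anchors without an href are skipped, and nested markup inside a link (such as bold text) still contributes to the anchor text.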
Input
| Field | Type | Default | Description |
|---|---|---|---|
| startUrl | string (required) | https://apify.com | Page to start crawling from. Must be http:// or https://. |
| maxCrawlDepth | integer | 1 (range 0-5) | 0 = check links on the start URL only; 1 or more = also follow internal links that many levels deep and check the links on those pages. |
| maxPages | integer | 50 (range 1-5000) | Hard cap on pages crawled. |
| checkExternalLinks | boolean | true | Also probe links that leave the start URL's host. |
| verifyWithProxy | boolean | true | When a link returns 401 / 403 / 405 / 429 / 451 (typical anti-bot signals), retry once via Apify residential proxy. If the proxy retry succeeds, the link is treated as OK, which eliminates false positives from sites that block datacenter IPs (G2, Capterra, etc.). Turn off to skip the retry pass. |
| maxConcurrency | integer | 10 (range 1-50) | Concurrent HEAD/GET requests during the check phase. |
| userAgent | string (optional) | (Chrome 131) | Override only if a target server filters by UA. |
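If you build the input programmatically, a pre-flight check against the defaults and ranges in the table can catch mistakes before a run starts. The sketch below is an assumption about how such a check might look: normalize_input is a hypothetical helper, and whether the actor itself rejects or clamps out-of-range values is not stated here.

```python
from urllib.parse import urlparse

# Defaults and ranges copied from the input table above.
DEFAULTS = {
    "maxCrawlDepth": 1,
    "maxPages": 50,
    "checkExternalLinks": True,
    "verifyWithProxy": True,
    "maxConcurrency": 10,
}
BOUNDS = {
    "maxCrawlDepth": (0, 5),
    "maxPages": (1, 5000),
    "maxConcurrency": (1, 50),
}


def normalize_input(raw: dict) -> dict:
    """Fill in defaults and raise ValueError on invalid values (hypothetical helper)."""
    cfg = {**DEFAULTS, **raw}
    scheme = urlparse(str(cfg.get("startUrl", ""))).scheme
    if scheme not in ("http", "https"):
        raise ValueError("startUrl must be http:// or https://")
    for name, (low, high) in BOUNDS.items():
        value = cfg[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    return cfg
```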
Example input
{"startUrl":"https://apify.com","maxCrawlDepth":1,"maxPages":50,"checkExternalLinks":true,"maxConcurrency":10}
Output
Broken-link record (one per failure)
{"url":"https://example.com/old-blog-post","sourcePage":"https://apify.com/blog/index","anchorText":"Read more","linkType":"external","linkDomain":"example.com","isExternalLink":true,"httpStatus":404,"errorReason":"not_found","proxyRecheckStatus":404,"scrapedAt":"2024-12-16T14:23:11+00:00"}
Summary record (always emitted last)
{"_recordType":"summary","startUrl":"https://apify.com","pagesCrawled":12,"linksDiscovered":480,"linksChecked":480,"brokenCount":3,"okCount":477,"breakdown":{"not_found":2,"server_error":1},"maxCrawlDepth":1,"checkExternalLinks":true,"scrapedAt":"2024-12-16T14:23:18+00:00"}
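Because the summary shares the dataset with the broken-link records, downstream code usually needs to separate the two. A minimal sketch, assuming the dataset items have already been loaded into a list of dicts shaped like the examples above (split_dataset and breakdown are hypothetical names):

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def split_dataset(items: List[dict]) -> Tuple[List[dict], Optional[dict]]:
    """Separate broken-link records from the summary record."""
    broken = [item for item in items if item.get("_recordType") != "summary"]
    summary = next(
        (item for item in items if item.get("_recordType") == "summary"), None
    )
    return broken, summary


def breakdown(broken: List[dict]) -> Dict[str, int]:
    """Count broken links per errorReason."""
    return dict(Counter(record["errorReason"] for record in broken))


# Trimmed-down versions of the example records above.
items = [
    {"url": "https://example.com/old-blog-post", "httpStatus": 404,
     "errorReason": "not_found"},
    {"_recordType": "summary", "brokenCount": 1, "okCount": 479},
]
broken, summary = split_dataset(items)
print(len(broken), summary["brokenCount"], breakdown(broken))
```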
Output fields
- url: the broken link's absolute URL.
- sourcePage: page where the link was first discovered.
- anchorText: visible text of the <a> element (when present).
- linkType: "internal" (same host as start URL) or "external".
- linkDomain: derived hostname of the broken url (lowercase, includes any port).
- isExternalLink: derived boolean, true when the broken link's host differs from sourcePage's host.
- httpStatus: HTTP status code (omitted for network errors / timeouts).
- errorReason: one of:
  - HTTP failures: not_found (404), gone (410), forbidden (403), unauthorized (401), server_error (5xx), client_error_<NNN> (other 4xx)
  - Network failures: timeout, dns_error, connection_refused, tls_error, redirect_loop, network_error
- proxyRecheckStatus: only present when verifyWithProxy: true triggered a retry. Shows the status returned via residential proxy (use this to distinguish real broken links from anti-bot blocks).
- scrapedAt: ISO-8601 timestamp.
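The status-based errorReason values follow a simple mapping, which can be handy to reproduce when post-processing results. This sketch mirrors the list above; reason_for_status is a hypothetical name, and the network-level reasons (timeout, dns_error, and so on) are not covered because they have no status code.

```python
from typing import Optional

# Status codes that get a dedicated reason name.
NAMED_REASONS = {
    401: "unauthorized",
    403: "forbidden",
    404: "not_found",
    410: "gone",
}


def reason_for_status(status: int) -> Optional[str]:
    """Map an HTTP status to an errorReason, or None if the status is not an error."""
    if status in NAMED_REASONS:
        return NAMED_REASONS[status]
    if 500 <= status <= 599:
        return "server_error"
    if 400 <= status <= 499:
        return f"client_error_{status}"
    return None
```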
Use cases
- SEO audits: every broken link costs link equity and damages user trust.
- Site migration validation: after a CMS move, find the URLs that didn't get redirected.
- Editorial QA: catch dead links in blog content, reference pages, footer navigation.
- Internal-tools health: spot broken links to deprecated wikis, retired tools, expired SSO redirects.
FAQ
Does it need a proxy?
For the bulk crawl, no: the actor uses curl_cffi with a Chrome User-Agent from a datacenter IP. Optionally, when verifyWithProxy: true (default), any link that returns 401 / 403 / 405 / 429 / 451 is retried once via Apify residential proxy. If that retry succeeds, the link is treated as OK, which eliminates the false positives that used to surface from sites like G2, Capterra, or rate-limited APIs. The retried status is surfaced as proxyRecheckStatus so you can see both checks.
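The recheck decision can be sketched as a pure function. This is an illustration under stated assumptions: classify_with_recheck is a hypothetical name, the recheck argument stands in for a request sent through the residential proxy, and "succeeds" is taken to mean a status below 400.

```python
from typing import Callable, Optional, Tuple

# Statuses treated as possible anti-bot signals (from the answer above).
RECHECK_STATUSES = frozenset({401, 403, 405, 429, 451})


def classify_with_recheck(
    direct_status: int, recheck: Callable[[], int]
) -> Tuple[bool, Optional[int]]:
    """Return (is_broken, proxy_recheck_status).

    `recheck` is only called when the direct status looks like an anti-bot block.
    """
    if direct_status not in RECHECK_STATUSES:
        return direct_status >= 400, None
    proxy_status = recheck()
    # Assumption: a proxy response below 400 counts as the retry succeeding.
    return proxy_status >= 400, proxy_status
```

A 403 that becomes 200 via the proxy is reported as OK, while a 404 never triggers the retry at all.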
HEAD vs GET: which is used?
HEAD first (saves bandwidth). If a server returns 405 or 501, the actor falls back to GET and uses that status instead.
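The fallback rule is small enough to show in full. In this sketch, probe is a hypothetical name and send is an injected callable that performs the request and returns the status code; in practice it would wrap an HTTP client, which keeps the example independent of any particular library.

```python
from typing import Callable

# Statuses that indicate the server does not support HEAD for this URL.
FALLBACK_STATUSES = frozenset({405, 501})


def probe(url: str, send: Callable[[str, str], int]) -> int:
    """Try HEAD first; if the server rejects it, retry with GET.

    `send(method, url)` must return the HTTP status code of the response.
    """
    status = send("HEAD", url)
    if status in FALLBACK_STATUSES:
        status = send("GET", url)
    return status
```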
Will it follow redirects?
Yes: allow_redirects=True is set for both HEAD and GET. The final status is what gets recorded.
Can I limit it to internal links only?
Set checkExternalLinks: false. The actor still walks the same-host graph for discovery but only probes internal links.
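The internal/external split can be approximated by comparing hosts. A minimal sketch, assuming an exact, case-insensitive match on host and port (so www.example.com and example.com count as different hosts); link_type and should_probe are hypothetical names.

```python
from urllib.parse import urlparse


def link_type(url: str, start_url: str) -> str:
    """Return "internal" when url shares the start URL's host, else "external"."""
    same_host = urlparse(url).netloc.lower() == urlparse(start_url).netloc.lower()
    return "internal" if same_host else "external"


def should_probe(url: str, start_url: str, check_external_links: bool) -> bool:
    """External links are only probed when checkExternalLinks is enabled."""
    return check_external_links or link_type(url, start_url) == "internal"
```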
Why is the dataset never empty?
Even when no broken links are found, a _recordType: "summary" record is emitted with run stats. This keeps Apify's daily-test happy and gives you a quick health pulse for the site.
My start URL has thousands of pages. Will this finish in time?
Use maxPages and maxCrawlDepth to keep runs bounded. For large sites, consider running with maxCrawlDepth: 0 first to audit the start page's links, then expand outward.
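To see how the two limits interact, here is a sketch of a breadth-first crawl bounded by both. It is an illustration, not the actor's implementation: crawl is a hypothetical name, and get_links stands in for "fetch the page and return its absolute link URLs".

```python
from collections import deque
from typing import Callable, Dict, List
from urllib.parse import urlparse


def crawl(
    start_url: str,
    get_links: Callable[[str], List[str]],
    max_crawl_depth: int = 1,
    max_pages: int = 50,
) -> Dict[str, List[str]]:
    """Return {crawled_page: links_found_on_it}, bounded by depth and page count."""
    host = urlparse(start_url).netloc.lower()
    queue = deque([(start_url, 0)])
    seen = {start_url}
    pages: Dict[str, List[str]] = {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        links = get_links(url)
        pages[url] = links
        if depth >= max_crawl_depth:
            continue  # links on this page get checked, but not followed
        for link in links:
            if link not in seen and urlparse(link).netloc.lower() == host:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

With max_crawl_depth set to 0 only the start page is fetched, so the run cost is one page plus however many links it contains.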
The summary says brokenCount: 0 but I know some links are dead.
- The link may use a non-HTTP scheme (mailto:, javascript:, data:); those aren't checkable.
- The link may be JS-rendered (this scraper sees only server-rendered HTML).
- The target may serve a different status to a generic crawler than to its regular visitors; try a different User-Agent via userAgent.
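The first case is easy to test for yourself. A minimal sketch, assuming links have already been resolved to absolute URLs (is_checkable is a hypothetical name):

```python
from urllib.parse import urlparse


def is_checkable(url: str) -> bool:
    """Only http:// and https:// links can be probed with HEAD/GET."""
    return urlparse(url).scheme in ("http", "https")
```

For example, is_checkable("mailto:team@example.com") is False, while is_checkable("https://example.com/page") is True.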
