URL: https://apify.com/crawlerbros/find-broken-links

Find Broken Links · Apify


Pricing

from $1.00 / 1,000 results

Crawl a website (start URL + same-host pages up to a configurable depth) and report every link that returns a 4xx / 5xx status, times out, or has a DNS error. HTTP-only - no proxy or browser needed.

Rating

0.0

(0)

Developer

๐Ÿ‘ Crawler Bros

Crawler Bros

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

8 days ago

Last modified

Crawl a website and report every link that returns a 4xx / 5xx status, times out, or fails DNS. Bounded by maxCrawlDepth and maxPages so it stays predictable on large sites. HTTP-only - no proxy, no browser.

What it does

You give it a start URL; the actor crawls the start page (and optionally same-host internal links up to a depth N), gathers every <a href>, and probes each one with HEAD (falling back to GET when servers reject HEAD). Records are emitted only for links that fail.
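As an illustration of the "gathers every <a href>" step, here is a minimal sketch using only Python's standard library. The names AnchorCollector and extract_links are hypothetical and are not taken from the actor's source; the actor's real parser may behave differently (for example, in how it treats fragments or nested markup).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class AnchorCollector(HTMLParser):
    """Collects (absolute_url, anchor_text) pairs from <a href> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL and drop the #fragment.
                self._href = urldefrag(urljoin(self.base_url, href)).url
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Return every <a href> on the page as (absolute_url, anchor_text)."""
    parser = AnchorCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Each returned pair corresponds roughly to the url and anchorText fields of an output record; the page the HTML came from would be the sourcePage.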

The dataset is never empty: even a perfectly clean site gets a final summary record with run statistics.

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | string (required) | https://apify.com | Page to start crawling from. Must be http:// or https://. |
| maxCrawlDepth | integer | 1 (0–5) | 0 = check links on start URL only; 1+ = follow internal links one level and check theirs too. |
| maxPages | integer | 50 (1–5000) | Hard cap on pages crawled. |
| checkExternalLinks | boolean | true | Also probe links that leave the start URL's host. |
| verifyWithProxy | boolean | true | When a link returns 401 / 403 / 405 / 429 / 451 (typical anti-bot signals), retry once via Apify residential proxy. If the proxy retry succeeds, the link is treated as OK, which eliminates false positives from sites that block datacenter IPs (G2, Capterra, etc.). Turn off to skip the retry pass. |
| maxConcurrency | integer | 10 (1–50) | Concurrent HEAD/GET requests during the check phase. |
| userAgent | string (optional) | (Chrome 131) | Override only if a target server filters by UA. |

Example input

{
  "startUrl": "https://apify.com",
  "maxCrawlDepth": 1,
  "maxPages": 50,
  "checkExternalLinks": true,
  "maxConcurrency": 10
}
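Before starting a run, it can help to check an input object against the defaults and ranges listed in the Input section. The sketch below is a hypothetical client-side helper (normalize_input is not part of the actor or of any Apify library); it only mirrors the documented defaults and bounds.

```python
# Defaults and ranges copied from the Input section; the helper itself is illustrative.
DEFAULTS = {
    "maxCrawlDepth": 1,
    "maxPages": 50,
    "checkExternalLinks": True,
    "verifyWithProxy": True,
    "maxConcurrency": 10,
}
RANGES = {
    "maxCrawlDepth": (0, 5),
    "maxPages": (1, 5000),
    "maxConcurrency": (1, 50),
}


def normalize_input(raw):
    """Fill in documented defaults and reject out-of-range values."""
    start_url = raw.get("startUrl", "")
    if not start_url.startswith(("http://", "https://")):
        raise ValueError("startUrl must be http:// or https://")
    merged = {**DEFAULTS, **raw}
    for field, (low, high) in RANGES.items():
        value = merged[field]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{field} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{field} must be between {low} and {high}")
    return merged
```

With the example input above, the result would additionally carry verifyWithProxy set to its documented default of true.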

Output

Broken-link record (one per failure)

{
  "url": "https://example.com/old-blog-post",
  "sourcePage": "https://apify.com/blog/index",
  "anchorText": "Read more",
  "linkType": "external",
  "linkDomain": "example.com",
  "isExternalLink": true,
  "httpStatus": 404,
  "errorReason": "not_found",
  "proxyRecheckStatus": 404,
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}

Summary record (always emitted last)

{
  "_recordType": "summary",
  "startUrl": "https://apify.com",
  "pagesCrawled": 12,
  "linksDiscovered": 480,
  "linksChecked": 480,
  "brokenCount": 3,
  "okCount": 477,
  "breakdown": {"not_found": 2, "server_error": 1},
  "maxCrawlDepth": 1,
  "checkExternalLinks": true,
  "scrapedAt": "2024-12-16T14:23:18+00:00"
}
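Because the dataset mixes broken-link records with one summary record, downstream code usually needs to separate the two. The following sketch assumes the dataset items have already been loaded as a list of dicts; split_records and breakdown are hypothetical helper names.

```python
from collections import Counter


def split_records(items):
    """Separate broken-link records from the trailing summary record."""
    broken = [item for item in items if item.get("_recordType") != "summary"]
    summaries = [item for item in items if item.get("_recordType") == "summary"]
    # The summary is documented as being emitted last; take the final one if present.
    summary = summaries[-1] if summaries else None
    return broken, summary


def breakdown(broken):
    """Count broken links per errorReason, mirroring the summary's breakdown field."""
    return dict(Counter(record["errorReason"] for record in broken))
```

Comparing breakdown(broken) with summary["breakdown"] is a cheap sanity check that the whole dataset was downloaded.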

Output fields

  • url - the broken link's absolute URL.
  • sourcePage - page where the link was first discovered.
  • anchorText - visible text of the <a> element (when present).
  • linkType - "internal" (same host as start URL) or "external".
  • linkDomain - derived hostname of the broken url (lowercase, includes any port).
  • isExternalLink - derived boolean: true when the broken link's host differs from sourcePage's host.
  • httpStatus - HTTP status code (omitted for network errors / timeouts).
  • errorReason - one of:
    • not_found (404), gone (410), forbidden (403), unauthorized (401), server_error (5xx), client_error_<NNN> (other 4xx)
    • timeout, dns_error, connection_refused, tls_error, redirect_loop, network_error
  • proxyRecheckStatus - only present when verifyWithProxy: true triggered a retry. Shows the status returned via residential proxy (use this to distinguish real broken links from anti-bot blocks).
  • scrapedAt - ISO-8601 timestamp.
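The status-based errorReason values listed above can be reproduced with a small lookup. This is a sketch of the documented mapping only; the function name error_reason is hypothetical, and the network-level reasons (timeout, dns_error, and so on) are not covered because they do not come from a status code.

```python
# Status codes that have a dedicated errorReason name in the list above.
NAMED_STATUSES = {
    401: "unauthorized",
    403: "forbidden",
    404: "not_found",
    410: "gone",
}


def error_reason(status):
    """Map an HTTP status code to an errorReason, or None if the status is not a failure."""
    if status in NAMED_STATUSES:
        return NAMED_STATUSES[status]
    if 500 <= status <= 599:
        return "server_error"
    if 400 <= status <= 499:
        return f"client_error_{status}"
    return None
```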

Use cases

  • SEO audits - every broken link costs link equity and damages user trust.
  • Site migration validation - after a CMS move, find the URLs that didn't get redirected.
  • Editorial QA - catch dead links in blog content, reference pages, footer navigation.
  • Internal-tools health - spot broken links to deprecated wikis, retired tools, expired SSO redirects.

FAQ

Does it need a proxy? For the bulk crawl, no: the actor uses curl_cffi with a Chrome User-Agent from a datacenter IP. Optionally, when verifyWithProxy: true (default), any link that returns 401 / 403 / 405 / 429 / 451 is retried once via Apify residential proxy. If that retry succeeds, the link is treated as OK, which eliminates the false positives that used to surface from sites like G2, Capterra, or rate-limited APIs. The retried status is surfaced as proxyRecheckStatus so you can see both checks.
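The recheck rule can be summarized as a small decision function. This is a sketch under assumptions: "succeeds" is taken to mean a status below 400, the recheck callable stands in for the residential-proxy request, and the name verify_link is hypothetical.

```python
# Statuses the FAQ lists as triggering a proxy retry.
RECHECK_STATUSES = {401, 403, 405, 429, 451}


def verify_link(direct_status, recheck, verify_with_proxy=True):
    """Return (is_broken, proxy_recheck_status) for one probed link.

    recheck is a zero-argument callable returning the status seen via proxy;
    it is only called when the direct status looks like an anti-bot block.
    """
    if direct_status < 400:
        return False, None
    if not verify_with_proxy or direct_status not in RECHECK_STATUSES:
        return True, None
    proxy_status = recheck()
    # Assumption: a proxy status below 400 counts as a successful retry.
    return proxy_status >= 400, proxy_status
```

For example, a direct 403 followed by a proxied 200 would be reported as OK, while a direct 403 followed by a proxied 403 would stay broken with proxyRecheckStatus set to 403.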

HEAD vs GET: which is used? HEAD first (saves bandwidth). If a server returns 405 or 501, the actor falls back to GET and uses that status instead.
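A minimal sketch of the HEAD-then-GET fallback follows. To keep it independent of any HTTP library, the request is passed in as a callable taking (method, url) and returning a status code; that callable and the name probe are assumptions for illustration, not the actor's code.

```python
# Statuses that mean "this server does not support HEAD here".
HEAD_REJECTED = (405, 501)


def probe(url, send):
    """Probe a URL with HEAD, falling back to GET when HEAD is rejected.

    send is a callable (method, url) -> int status code.
    """
    status = send("HEAD", url)
    if status in HEAD_REJECTED:
        # The GET status replaces the HEAD status entirely.
        status = send("GET", url)
    return status
```

In practice send would wrap a real HTTP client call with a timeout; only the control flow is shown here.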

Will it follow redirects? Yes: allow_redirects=True for both HEAD and GET. The final status is what gets recorded.

Can I limit it to internal links only? Set checkExternalLinks: false. The actor still walks the same-host graph for discovery but only probes internal links.
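The internal/external split can be sketched with a host comparison. The helpers link_type and links_to_probe are hypothetical names; the comparison uses the lowercase network location (host plus any port), which matches how linkDomain is described, though the actor's exact matching rules are not documented here.

```python
from urllib.parse import urlsplit


def link_type(url, start_url):
    """Return "internal" when url shares the start URL's host (and port), else "external"."""
    same_host = urlsplit(url).netloc.lower() == urlsplit(start_url).netloc.lower()
    return "internal" if same_host else "external"


def links_to_probe(urls, start_url, check_external_links):
    """Apply the checkExternalLinks setting to a list of discovered URLs."""
    if check_external_links:
        return list(urls)
    return [u for u in urls if link_type(u, start_url) == "internal"]
```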

Why is the dataset never empty? Even when no broken links are found, a _recordType: "summary" record is emitted with run stats. This keeps Apify's daily-test happy and gives you a quick health pulse for the site.

My start URL has thousands of pages. Will this finish in time? Use maxPages and maxCrawlDepth to keep runs bounded. For large sites, consider running with maxCrawlDepth: 0 first to audit the start page's links, then expand outward.
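To see how the two limits interact, here is a sketch of a breadth-first crawl bounded by depth and page count. It assumes a get_links callable that returns the same-host links found on a page; both that callable and the name bounded_crawl are hypothetical, and the actor's actual traversal order is not documented here.

```python
from collections import deque


def bounded_crawl(start_url, get_links, max_crawl_depth, max_pages):
    """Return the pages visited, in order, under depth and page-count limits."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    crawled = []
    while queue and len(crawled) < max_pages:
        page, depth = queue.popleft()
        crawled.append(page)
        if depth >= max_crawl_depth:
            # At the depth limit: the page's links are still checked elsewhere,
            # but the crawl does not follow them to further pages.
            continue
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled
```

With max_crawl_depth set to 0 only the start page is visited, which is the quick first audit suggested above.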

The summary says brokenCount: 0 but I know some links are dead.

  • The link may use a non-HTTP scheme (mailto, javascript:, data:), which isn't checkable.
  • The link may be JS-rendered (this scraper sees only server-rendered HTML).
  • The target may serve different content / status to its own site than to a generic crawler; try with the site's own User-Agent via userAgent.
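The first point can be checked locally with a scheme filter. This is an illustrative sketch (is_checkable is a hypothetical name), and it expects absolute URLs: a relative href has no scheme and would need to be resolved against its page first.

```python
from urllib.parse import urlsplit


def is_checkable(url):
    """True only for absolute http:// or https:// URLs."""
    return urlsplit(url).scheme in ("http", "https")
```

Running your known-dead links through a filter like this is a quick way to tell whether they were skipped by design rather than missed.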
