URL: https://www.nuget.org/packages/Soenneker.Playwrights.Crawler/

NuGet Gallery | Soenneker.Playwrights.Crawler 4.0.101

Prefix Reserved

.NET CLI:
dotnet add package Soenneker.Playwrights.Crawler --version 4.0.101

Package Manager Console:
NuGet\Install-Package Soenneker.Playwrights.Crawler -Version 4.0.101
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

PackageReference:
<PackageReference Include="Soenneker.Playwrights.Crawler" Version="4.0.101" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.

Central Package Management (CPM):
For projects that support Central Package Management (CPM), copy the PackageVersion node into the solution Directory.Packages.props file to version the package.
Directory.Packages.props:
<PackageVersion Include="Soenneker.Playwrights.Crawler" Version="4.0.101" />
Project file:
<PackageReference Include="Soenneker.Playwrights.Crawler" />

Paket:
paket add Soenneker.Playwrights.Crawler --version 4.0.101
The NuGet Team does not provide support for this client. Please contact its maintainers for support.

F# Interactive and Polyglot Notebooks:
#r "nuget: Soenneker.Playwrights.Crawler, 4.0.101"
The #r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

C# file-based apps:
#:package Soenneker.Playwrights.Crawler@4.0.101
The #:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Cake:
#addin nuget:?package=Soenneker.Playwrights.Crawler&version=4.0.101 (install as a Cake Addin)
#tool nuget:?package=Soenneker.Playwrights.Crawler&version=4.0.101 (install as a Cake Tool)
The NuGet Team does not provide support for this client. Please contact its maintainers for support.

Soenneker.Playwrights.Crawler

A configurable Playwright crawler for mirroring sites to disk with support for:

  • HTML-only or full resource capture
  • crawl limits by depth, page count, duration, and storage
  • same-host restrictions with optional cross-origin asset capture
  • DOM attribute resource discovery for lazy widgets and deferred assets
  • throttling, retries, slow mode, and cooldown behavior
  • optional stealth launch/context settings

Related Repos

You might also be interested in:

Installation

dotnet add package Soenneker.Playwrights.Crawler

Register With DI

using Microsoft.Extensions.DependencyInjection;
using Soenneker.Playwrights.Crawler.Registrars;

var services = new ServiceCollection();

services.AddLogging();
services.AddPlaywrightCrawlerAsSingleton();

Use AddPlaywrightCrawlerAsScoped() if you prefer a scoped lifetime.

Basic Usage

using Microsoft.Extensions.DependencyInjection;
using Soenneker.Playwrights.Crawler.Abstract;
using Soenneker.Playwrights.Crawler.Dtos;
using Soenneker.Playwrights.Crawler.Enums;

// Build a provider from the service collection registered above, then resolve the crawler.
await using ServiceProvider serviceProvider = services.BuildServiceProvider();
IPlaywrightCrawler crawler = serviceProvider.GetRequiredService<IPlaywrightCrawler>();

PlaywrightCrawlResult result = await crawler.Crawl(new PlaywrightCrawlOptions
{
    Url = "https://example.com",
    SaveDirectory = @"C:\temp\example",
    Mode = PlaywrightCrawlMode.Full,
    MaxDepth = 2,
    ClearSaveDirectory = true,
    SameHostOnly = true
});

Advanced Example

using System;
using Microsoft.Playwright; // WaitUntilState
using Soenneker.Playwrights.Crawler.Abstract;
using Soenneker.Playwrights.Crawler.Dtos;
using Soenneker.Playwrights.Crawler.Enums;
using Soenneker.Playwrights.Extensions.Stealth.Options;

// crawler is an IPlaywrightCrawler resolved from DI, as in Basic Usage.
PlaywrightCrawlResult result = await crawler.Crawl(new PlaywrightCrawlOptions
{
    Url = "https://example.com",
    SaveDirectory = @"C:\temp\example",
    Mode = PlaywrightCrawlMode.Full,
    MaxDepth = 2,
    MaxPages = 50,
    MaxStorageBytes = 250_000_000,
    MaxDuration = TimeSpan.FromMinutes(10),
    SameHostOnly = true,
    IgnoreQueryStringsInDuplicateDetection = true,
    FormatHtml = true,
    IncludeCrossOriginAssets = true,
    RewriteCrossOriginAssetUrls = true,
    ClearSaveDirectory = true,
    OverwriteExistingFiles = true,
    Headless = true,
    UseStealth = true,
    ThrottleMode = PlaywrightCrawlThrottleMode.Automatic,
    NavigationTimeoutMs = 45_000,
    WaitUntil = WaitUntilState.NetworkIdle,
    PostNavigationDelayMs = 0,
    ContinueOnPageError = true,
    StealthLaunchOptions = new StealthLaunchOptions
    {
        IgnoreDetectableDefaultArguments = true
    },
    StealthContextOptions = new StealthContextOptions
    {
        NormalizeDocumentHeaders = true,
        EnableCdpDomainHardening = false
    },
    Policy = new PlaywrightCrawlPolicy
    {
        GlobalMaxConcurrency = 20,
        PerDomainMaxConcurrency = 2,
        PerIpMaxConcurrency = 2,
        MinimumDelayBetweenRequestsMs = 750,
        DelayJitterMaxMs = 500,
        RequestTimeoutMs = 30_000,
        MaxRetries = 4
    }
});

Modes

HtmlOnly

Saves only rendered HTML documents discovered during the crawl.
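
As a rough illustration of this mode, the sketch below captures only the starting page as rendered HTML. It assumes the enum member is named PlaywrightCrawlMode.HtmlOnly (taken from the heading above) and relies on the documented behavior that MaxDepth = 0 crawls only the starting page. The class name, method name, and save path are hypothetical.

```csharp
using System.Threading.Tasks;
using Soenneker.Playwrights.Crawler.Abstract;
using Soenneker.Playwrights.Crawler.Dtos;
using Soenneker.Playwrights.Crawler.Enums;

public static class HtmlOnlyExample
{
    // Hypothetical helper: saves the rendered HTML of the root page only.
    public static async Task<PlaywrightCrawlResult> CrawlStartPageOnly(IPlaywrightCrawler crawler)
    {
        return await crawler.Crawl(new PlaywrightCrawlOptions
        {
            Url = "https://example.com",
            SaveDirectory = @"C:\temp\example-html", // illustrative path
            Mode = PlaywrightCrawlMode.HtmlOnly,     // assumed member name
            MaxDepth = 0                             // 0 crawls only the starting page
        });
    }
}
```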

Full

Saves:

  • rendered HTML documents
  • same-origin network resources observed while pages load
  • resource URLs discovered in DOM attributes such as data-src and data-css-url
  • optional cross-origin assets under _external when IncludeCrossOriginAssets = true
  • optional rewriting of cross-origin asset URLs in saved HTML when RewriteCrossOriginAssetUrls = true
  • optional lazy-load scrolling to capture below-the-fold media
  • optional rewriting of same-origin absolute URLs in saved HTML and CSS to root-relative paths when RewriteSameOriginAbsoluteUrls = true

Key Options

Option | Description
Url | Required absolute http or https root URL.
SaveDirectory | Required output directory for mirrored content.
MaxDepth | Link depth to follow from the root page. 0 crawls only the starting page.
MaxPages | Optional hard cap on visited pages.
MaxStorageBytes | Optional hard cap on bytes written to disk.
MaxDuration | Optional maximum crawl duration.
SameHostOnly | Restricts queued pages to the same host as the root URL.
IgnoreQueryStringsInDuplicateDetection | Treats query-string variants as the same page when detecting duplicates.
FormatHtml | Formats saved HTML documents with Soenneker.Html.Formatter when true. Defaults to false.
IncludeCrossOriginAssets | In Full mode, saves cross-origin resources under _external.
RewriteCrossOriginAssetUrls | Rewrites saved HTML so captured cross-origin asset URLs point at the local _external copy. Requires IncludeCrossOriginAssets.
RewriteSameOriginAbsoluteUrls | Rewrites same-origin absolute URLs in saved HTML and CSS to root-relative paths, such as https://example.com/script.js to /script.js.
TriggerLazyLoading | In Full mode, scrolls pages after navigation to trigger lazy-loaded media before resources are saved. Defaults to true.
LazyLoadScrollStepPx | Pixel distance for each lazy-load scroll step.
LazyLoadScrollDelayMs | Delay in milliseconds after each lazy-load scroll step.
LazyLoadMaxScrolls | Maximum number of lazy-load scroll steps per page.
ClearSaveDirectory | Deletes the output directory before crawling.
OverwriteExistingFiles | Controls whether existing files can be replaced.
Headless | Runs Chromium headlessly when true.
UseStealth | Enables the Soenneker stealth Playwright extensions.
ThrottleMode | Controls automatic pacing and adaptive throttling. Defaults to Automatic; use Disabled to bypass automatic pacing, slow mode, cooldown waiting, and implicit post-navigation jitter.
NavigationTimeoutMs | Navigation timeout per page, in milliseconds.
WaitUntil | Playwright load state awaited during navigation. Defaults to NetworkIdle.
PostNavigationDelayMs | Extra delay in milliseconds after navigation to allow late assets to settle.
ContinueOnPageError | Continues crawling after an individual page fails.
Policy | Crawl throttling, retries, concurrency, slow mode, and cooldown configuration.
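
The lazy-load options can be combined roughly as follows. This is a sketch only: the option names come from the table above, but their types are assumed to be integers, and the numeric values are arbitrary illustrations rather than documented defaults. The class name, method name, and save path are hypothetical.

```csharp
using System.Threading.Tasks;
using Soenneker.Playwrights.Crawler.Abstract;
using Soenneker.Playwrights.Crawler.Dtos;
using Soenneker.Playwrights.Crawler.Enums;

public static class LazyLoadExample
{
    // Hypothetical helper: Full-mode crawl that scrolls each page to pull in lazy media.
    public static async Task<PlaywrightCrawlResult> CrawlWithLazyLoading(IPlaywrightCrawler crawler)
    {
        return await crawler.Crawl(new PlaywrightCrawlOptions
        {
            Url = "https://example.com",
            SaveDirectory = @"C:\temp\example-lazy", // illustrative path
            Mode = PlaywrightCrawlMode.Full,         // lazy-load scrolling applies in Full mode
            MaxDepth = 1,
            TriggerLazyLoading = true,               // documented default, set explicitly for clarity
            LazyLoadScrollStepPx = 800,              // illustrative value, integer type assumed
            LazyLoadScrollDelayMs = 250,             // illustrative value, integer type assumed
            LazyLoadMaxScrolls = 20                  // illustrative value, integer type assumed
        });
    }
}
```

Larger scroll steps and shorter delays finish sooner but may miss media that loads slowly, so these values are worth tuning per site.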

Result

Crawl() returns PlaywrightCrawlResult, which includes:

  • crawl timing (StartedAtUtc, CompletedAtUtc, Duration)
  • page counts (PagesDiscovered, PagesVisited)
  • file counts (HtmlFilesSaved, AssetFilesSaved)
  • total bytes written (BytesWritten)
  • stop reasons (StorageLimitReached, DurationLimitReached, PageLimitReached)
  • per-file details in Files
  • page-level failures in Errors
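
A minimal sketch of reporting on the result follows. It uses only the property names listed above and prints them through string interpolation, so it does not depend on their exact types, which are not stated here. Files and Errors are left out because their element types are not documented in this section. The class and method names are hypothetical.

```csharp
using System;
using Soenneker.Playwrights.Crawler.Dtos;

public static class CrawlResultReport
{
    // Hypothetical helper: writes a short crawl summary to the console.
    public static void Print(PlaywrightCrawlResult result)
    {
        Console.WriteLine($"Started {result.StartedAtUtc}, completed {result.CompletedAtUtc} ({result.Duration})");
        Console.WriteLine($"Pages: {result.PagesVisited} visited of {result.PagesDiscovered} discovered");
        Console.WriteLine($"Files: {result.HtmlFilesSaved} HTML, {result.AssetFilesSaved} assets, {result.BytesWritten} bytes");

        // Stop reasons indicate whether a configured limit ended the crawl early.
        Console.WriteLine($"Storage limit reached: {result.StorageLimitReached}");
        Console.WriteLine($"Duration limit reached: {result.DurationLimitReached}");
        Console.WriteLine($"Page limit reached: {result.PageLimitReached}");
    }
}
```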

Output Layout

Saved files preserve URL structure so the output can be served by a simple static web server.

Examples:

  • https://example.com/ -> index.html
  • https://example.com/docs/getting-started -> docs/getting-started/index.html
  • https://example.com/script.js -> /script.js inside saved HTML when same-origin URL rewriting is enabled
  • https://cdn.example.com/app.css -> _external/cdn.example.com/app.css when cross-origin asset capture is enabled
  • a saved page can reference that asset as ../../_external/cdn.example.com/app.css when URL rewriting is enabled
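
The mapping in these examples can be approximated with a small standalone function. This is not the package's implementation; it is a hypothetical sketch that reproduces the examples above under two assumptions: a URL whose last path segment has no file extension is saved as index.html in a matching directory, and any host other than the root host is placed under _external.

```csharp
using System;

Console.WriteLine(MirrorPathSketch.ToRelativePath(new Uri("https://example.com/"), "example.com"));
// prints "index.html"
Console.WriteLine(MirrorPathSketch.ToRelativePath(new Uri("https://example.com/docs/getting-started"), "example.com"));
// prints "docs/getting-started/index.html"
Console.WriteLine(MirrorPathSketch.ToRelativePath(new Uri("https://cdn.example.com/app.css"), "example.com"));
// prints "_external/cdn.example.com/app.css"

// Hypothetical helper, not part of Soenneker.Playwrights.Crawler.
public static class MirrorPathSketch
{
    // Maps an absolute URL to a relative file path under the save directory.
    public static string ToRelativePath(Uri url, string rootHost)
    {
        string path = url.AbsolutePath.Trim('/');

        int lastSlash = path.LastIndexOf('/');
        string lastSegment = lastSlash >= 0 ? path[(lastSlash + 1)..] : path;

        if (path.Length == 0)
            path = "index.html";            // site root
        else if (!lastSegment.Contains('.'))
            path += "/index.html";          // directory-style URL without a file extension

        bool sameHost = string.Equals(url.Host, rootHost, StringComparison.OrdinalIgnoreCase);

        // Cross-origin resources are grouped by host under _external.
        return sameHost ? path : $"_external/{url.Host}/{path}";
    }
}
```

Query strings and characters that are invalid in file names are ignored here; a real mirror has to handle both.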

Behavior Notes

  • Playwright browser installation is ensured automatically before the crawl starts.
  • Duplicate detection ignores query strings by default.
  • HTML formatting is opt-in and uses Soenneker.Html.Formatter when FormatHtml = true.
  • Challenge and captcha-like pages contribute to the crawler's blocking and slow-mode signals.
  • Setting ThrottleMode = PlaywrightCrawlThrottleMode.Disabled keeps configured concurrency limits and retries, but skips the crawler's automatic pacing and adaptive slowdown behavior.
  • Cross-origin URL rewriting only applies to captured cross-origin assets that are actually available on disk.
  • Full mode captures resources observed during page loads, but the rewrite pass is limited to captured cross-origin asset URLs rather than a full offline-mirroring transform.
  • Some response types are intentionally skipped, such as empty bodies and certain framework/internal fetch endpoints.
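
To illustrate the ThrottleMode note above, the sketch below disables automatic pacing while still setting concurrency and retry limits through Policy, which the note says remain in effect. Option and policy names are taken from this README; the values, class name, method name, and save path are illustrative assumptions.

```csharp
using System.Threading.Tasks;
using Soenneker.Playwrights.Crawler.Abstract;
using Soenneker.Playwrights.Crawler.Dtos;
using Soenneker.Playwrights.Crawler.Enums;

public static class UnthrottledExample
{
    // Hypothetical helper: crawl without adaptive slowdown, for example against a site you control.
    public static async Task<PlaywrightCrawlResult> CrawlWithoutPacing(IPlaywrightCrawler crawler)
    {
        return await crawler.Crawl(new PlaywrightCrawlOptions
        {
            Url = "https://example.com",
            SaveDirectory = @"C:\temp\example-fast", // illustrative path
            Mode = PlaywrightCrawlMode.Full,
            MaxDepth = 2,
            ThrottleMode = PlaywrightCrawlThrottleMode.Disabled, // skips automatic pacing and slow mode
            Policy = new PlaywrightCrawlPolicy
            {
                PerDomainMaxConcurrency = 2, // concurrency limits still apply
                MaxRetries = 4               // retries still apply
            }
        });
    }
}
```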

Compatible and additional computed target framework versions:

.NET: net10.0 is compatible. The following were computed: net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last Updated
4.0.101 0 6/19/2026
4.0.100 0 6/19/2026
4.0.99 19 6/18/2026
4.0.98 86 6/18/2026
4.0.97 98 6/18/2026
4.0.96 94 6/17/2026
4.0.94 114 6/17/2026
4.0.92 100 6/17/2026
4.0.91 103 6/16/2026
4.0.90 214 6/14/2026
4.0.89 105 6/13/2026
4.0.88 101 6/13/2026
4.0.87 240 6/11/2026
4.0.86 101 6/11/2026
4.0.85 131 6/10/2026
4.0.84 96 6/10/2026
4.0.83 107 6/10/2026
4.0.82 110 6/10/2026
4.0.81 144 6/10/2026
4.0.80 153 6/9/2026