dotnet add package ScrapeAAS.HttpClient --version 1.1.4
NuGet\Install-Package ScrapeAAS.HttpClient -Version 1.1.4
<PackageReference Include="ScrapeAAS.HttpClient" Version="1.1.4" />
<PackageVersion Include="ScrapeAAS.HttpClient" Version="1.1.4" /> (in Directory.Packages.props)
<PackageReference Include="ScrapeAAS.HttpClient" /> (in the project file)
paket add ScrapeAAS.HttpClient --version 1.1.4
#r "nuget: ScrapeAAS.HttpClient, 1.1.4"
#:package ScrapeAAS.HttpClient@1.1.4
#addin nuget:?package=ScrapeAAS.HttpClient&version=1.1.4 (install as a Cake Addin)
#tool nuget:?package=ScrapeAAS.HttpClient&version=1.1.4 (install as a Cake Tool)
ScrapeAAS integrates existing packages and ASP.NET features into a toolstack that lets you, the developer, design your scraping service in a familiar environment.
Add ASP.NET Hosting, ScrapeAAS, a validator of your choice (here Dawn.Guard, RIP), an object mapper of your choice (here AutoMapper), and the database or message queue you feel most comfortable with (here EF Core with SQLite).
dotnet add package Microsoft.Extensions.Hosting
dotnet add package ScrapeAAS
dotnet add package Dawn.Guard
dotnet add package AutoMapper.Extensions.Microsoft.DependencyInjection
The snippets below walk through an example of scraping the r/dotnet subreddit.
Create a crawler, a service that periodically triggers scraping.
var builder = Host.CreateApplicationBuilder(args);
builder.Services
    .AddAutoMapper()
    .AddScrapeAAS()
    .AddHostedService<RedditSubredditCrawler>()
    .AddDataflow<RedditPostSpider>()
    .AddDataflow<RedditSqliteSink>();
sealed class RedditSubredditCrawler : BackgroundService {
    private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
    private readonly IDataflowPublisher<RedditPost> _publisher;
    ...
    protected override async Task ExecuteAsync(CancellationToken stoppingToken) {
        // ... execute in a service scope periodically
    }
    private async Task CrawlAsync(IDataflowPublisher<RedditSubreddit> publisher, CancellationToken stoppingToken)
    {
        _logger.LogInformation("Crawling /r/dotnet");
        // Publishing the subreddit hands it to every handler of RedditSubreddit.
        await publisher.PublishAsync(new("dotnet", new("https://old.reddit.com/r/dotnet")), stoppingToken);
        _logger.LogInformation("Crawling complete");
    }
}
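The body of `ExecuteAsync` is elided in the crawler above. As a rough sketch of what "periodically" could look like, assuming .NET 6 or later, the loop below uses `PeriodicTimer`. The `PeriodicRunner` helper and its parameters are hypothetical names introduced for illustration and are not part of ScrapeAAS.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of ScrapeAAS.
public static class PeriodicRunner
{
    // Invokes `work` once right away, then again after each `interval`,
    // until `token` is cancelled.
    public static async Task RunAsync(
        Func<CancellationToken, Task> work,
        TimeSpan interval,
        CancellationToken token)
    {
        using var timer = new PeriodicTimer(interval);
        try
        {
            do
            {
                await work(token);
            }
            while (await timer.WaitForNextTickAsync(token));
        }
        catch (OperationCanceledException) when (token.IsCancellationRequested)
        {
            // Cancellation is the expected way to stop the loop.
        }
    }
}
```

A crawler could call this from `ExecuteAsync`, passing `stoppingToken` and a delegate that creates a service scope, resolves the publisher, and calls `CrawlAsync`. Creating one scope per run keeps scoped services, such as a database context, from living for the whole lifetime of the host.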
Implement your spiders, services that collect and normalize data.
sealed class RedditPostSpider : IDataflowHandler<RedditSubreddit> {
    private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
    private readonly IDataflowPublisher<RedditPost> _publisher;
    ...
    private async Task ParseRedditTopLevelPosts(RedditSubreddit subreddit, CancellationToken stoppingToken)
    {
        Url root = new("https://old.reddit.com/");
        _logger.LogInformation("Parsing top level posts from {RedditSubreddit}", subreddit);
        var document = await _browserPageLoader.LoadAsync(subreddit.Url, stoppingToken);
        _logger.LogInformation("Request complete");
        var queriedContent = document
            .QuerySelectorAll("div.thing")
            .AsParallel()
            // Collect the raw strings for each post.
            .Select(div => new
            {
                PostUrl = div.QuerySelector("a.title")?.GetAttribute("href"),
                Title = div.QuerySelector("a.title")?.TextContent,
                Upvotes = div.QuerySelector("div.score.unvoted")?.GetAttribute("title"),
                Comments = div.QuerySelector("a.comments")?.TextContent,
                CommentsUrl = div.QuerySelector("a.comments")?.GetAttribute("href"),
                PostedAt = div.QuerySelector("time")?.GetAttribute("datetime"),
                PostedBy = div.QuerySelector("a.author")?.TextContent,
            })
            // Validate and convert; parse failures are passed to the handler, which logs them.
            .Select(queried => new RedditPost(
                new(root, Guard.Argument(queried.PostUrl).NotEmpty()),
                Guard.Argument(queried.Title).NotEmpty(),
                long.Parse(queried.Upvotes.AsSpan()),
                Regex.Match(queried.Comments ?? "", "^\\d+") is { Success: true } commentCount ? long.Parse(commentCount.Value) : 0,
                new(queried.CommentsUrl),
                DateTimeOffset.Parse(queried.PostedAt.AsSpan()),
                new(Guard.Argument(queried.PostedBy).NotEmpty())
            ), IExceptionHandler.Handle((ex, item) => _logger.LogInformation(ex, "Failed to parse {RedditTopLevelPostBrief}", item)));
        foreach (var item in queriedContent)
        {
            await _publisher.PublishAsync(item, stoppingToken);
        }
        _logger.LogInformation("Parsing complete");
    }
}
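The comment count in the spider above is taken from the link text with a regular expression and falls back to 0 when the text has no leading number. The standalone sketch below isolates that expression so its behaviour is easier to see. The `CommentCount` class name and the sample labels in the note after it are made up for illustration.

```csharp
#nullable enable
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the expression used in the spider.
public static class CommentCount
{
    // Returns the leading integer of a label such as "12 comments",
    // or 0 when the label is null or does not start with digits.
    public static long Parse(string? label) =>
        Regex.Match(label ?? "", "^\\d+") is { Success: true } match
            ? long.Parse(match.Value)
            : 0;
}
```

With this helper, `CommentCount.Parse("12 comments")` yields 12, while `CommentCount.Parse("comment")` and `CommentCount.Parse(null)` both yield 0.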
Add a sink, a service that commits the scraped data to disk or the network.
sealed class RedditSqliteSink : IAsyncDisposable, IDataflowHandler<RedditSubreddit>, IDataflowHandler<RedditPost>
{
    private readonly RedditPostSqliteContext _context;
    private readonly IMapper _mapper;
    ...
    public async ValueTask DisposeAsync()
    {
        // Persist everything the handlers have added so far.
        await _context.Database.EnsureCreatedAsync();
        await _context.SaveChangesAsync();
    }
    public async ValueTask HandleAsync(RedditSubreddit message, CancellationToken cancellationToken = default)
    {
        var messageDto = _mapper.Map<RedditSubredditDto>(message);
        await _context.Database.EnsureCreatedAsync(cancellationToken);
        await _context.Subreddits.AddAsync(messageDto, cancellationToken);
    }
    public async ValueTask HandleAsync(RedditPost message, CancellationToken cancellationToken = default)
    {
        var messageDto = _mapper.Map<RedditPostDto>(message);
        // Reuse an already stored user instead of inserting a duplicate.
        if (await _context.Users.FindAsync(new object[] { message.PostedBy.Id }, cancellationToken) is { } existingUser)
        {
            messageDto.PostedById = existingUser.Id;
            messageDto.PostedBy = existingUser;
        }
        await _context.Database.EnsureCreatedAsync(cancellationToken);
        await _context.Posts.AddAsync(messageDto, cancellationToken);
    }
}
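The message types passed between crawler, spider, and sink are not shown in the snippets. Going only by how they are constructed and read above, they might look roughly like the records below. Every property name except `Url` on the subreddit and `Id` on the user is a guess, and `System.Uri` stands in for the `Url` type used in the snippets so that the sketch compiles without extra packages.

```csharp
using System;

// Hypothetical message shapes inferred from the snippets; not the library's definitions.
public sealed record RedditSubreddit(string Name, Uri Url);

public sealed record RedditUser(string Id);

public sealed record RedditPost(
    Uri PostUrl,
    string Title,
    long Upvotes,
    long Comments,
    Uri CommentsUrl,
    DateTimeOffset PostedAt,
    RedditUser PostedBy);
```

Immutable records suit this role because a published message may be delivered to several handlers, here both a spider and the sink, and none of them should be able to change what the others see.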
I have tried both WebReaper and DotnetSpider, and found them wanting. So I tried to make it better by delegating as much work as reasonable to existing projects.
Beyond my own goals, evaluating both libraries left me wanting to keep all their pros and discard all their cons. The verbosity of this library sits comfortably between WebReaper and DotnetSpider, closer to the DotnetSpider end of things.
The overall data flow in ScrapeAAS is adopted from DotnetSpider: Crawler --> Spider --> Sink.
dynamic-riddled design when storing to a database.
The Puppeteer browser handling is a mixture of the lifetime-tracking HTTP handler and the WebReaper Puppeteer integration.
Redis, MySql, and RabbitMq are always included in the package.

| Product | Versions (compatible and additional computed target framework versions) |
|---|---|
| .NET | net8.0 and net9.0 are compatible. Computed: net10.0, plus the android, browser, ios, maccatalyst, macos, tvos, and windows variants of net8.0, net9.0, and net10.0. |
This package is not used by any NuGet packages.
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 1.1.4 | 362 | 11/11/2025 |
| 1.1.3 | 358 | 11/11/2025 |
| 1.1.2 | 225 | 9/7/2025 |
| 1.1.1 | 227 | 9/6/2025 |
| 1.0.3 | 223 | 12/31/2023 |
| 1.0.2 | 151 | 12/31/2023 |
| 1.0.1 | 143 | 12/31/2023 |
| 1.0.0 | 147 | 12/21/2023 |
| 0.1.2 | 186 | 11/5/2023 |
| 0.1.1 | 127 | 10/15/2023 |
| 0.1.0 | 119 | 10/14/2023 |
| 0.1.0-hotfix.1 | 104 | 10/15/2023 |
| 0.1.0-alpha.3 | 105 | 10/14/2023 |
| 0.0.0-preview.0.71 | 117 | 10/14/2023 |