Exoscan 4.0.1


Exoscan


Please star this project if you find it useful!

Overview

Declarative, high-performance web scraper, crawler, and parser in C#. Designed as a simple, declarative, and scalable web scraping solution. Easily crawl any website, parse the data, and save the structured result to a file, a database, or pretty much anywhere you want.

It provides a simple yet extensible API to make web scraping a breeze.

Install

dotnet add package Exoscan

Requirements

.NET 6

📋 Example:

using Exoscan.Core.Builders;

_ = new ScraperEngineBuilder()
 .Get("https://www.reddit.com/r/dotnet/")
 .Follow("a.SQnoC3ObvgnGjWt90zD9Z._2INHSNB8V5eaWp4P0rY_mE")
 .Parse(new()
 {
 new("title", "._eYtD2XCVieq6emjKBH3m"),
 new("text", "._3xX726aBn29LDbsDtzr_6E._1Ap4F5maDtT1E1YuCiaO0r.D3IL3FD0RFy_mkKLPwL4")
 })
 .WriteToJsonFile("output.json")
 .LogToConsole()
 .Build()
 .Run();

Console.ReadLine();

Features:

⚡ It's extremely fast due to parallelism and asynchrony
🗒 Declarative parsing with a structured scheme
💾 Saving data to any sink, such as JSON or CSV files, MongoDB, CosmosDB, Redis, etc.
🌎 Distributed crawling support: run your web scraper on any cloud VMs, serverless functions, on-prem servers, etc.
🐙 Crawling and parsing Single Page Applications with Puppeteer
🖥 Proxy support
🌀 Automatic retries

Usage examples

Data mining
Gathering data for machine learning
Online price change monitoring and price comparison
News aggregation
Product review scraping (to watch the competition)
Gathering real estate listings
Tracking online presence and reputation
Web mashup and web data integration
MAP compliance
Lead generation

API overview

SPA parsing example

Parsing single page applications is simple: use the GetWithBrowser and/or FollowWithBrowser methods. In this case Puppeteer is used to load the pages.

_ = new ScraperEngineBuilder()
 .GetWithBrowser("https://www.reddit.com/r/dotnet/")
 .Follow("a.SQnoC3ObvgnGjWt90zD9Z._2INHSNB8V5eaWp4P0rY_mE")
 .Parse(new()
 {
 new("title", "._eYtD2XCVieq6emjKBH3m"),
 new("text", "._3xX726aBn29LDbsDtzr_6E._1Ap4F5maDtT1E1YuCiaO0r.D3IL3FD0RFy_mkKLPwL4")
 })
 .WriteToJsonFile("output.json")
 .LogToConsole()
 .Build()
 .Run(1);

Additionally, you can run any JavaScript on dynamic pages as they are loaded with the headless browser. To do that, add some page actions:

using Exoscan.Core.Builders;

_ = new ScraperEngineBuilder()
 .GetWithBrowser("https://www.reddit.com/r/dotnet/", actions => actions
 .ScrollToEnd()
 .Build())
 .Follow("a.SQnoC3ObvgnGjWt90zD9Z._2INHSNB8V5eaWp4P0rY_mE")
 .Parse(new()
 {
 new("title", "._eYtD2XCVieq6emjKBH3m"),
 new("text", "._3xX726aBn29LDbsDtzr_6E._1Ap4F5maDtT1E1YuCiaO0r.D3IL3FD0RFy_mkKLPwL4")
 })
 .WriteToJsonFile("output.json")
 .LogToConsole()
 .Build()
 .Run(1);

Console.ReadLine();

It can be helpful if the required content is loaded only after some user interactions such as clicks, scrolls, etc.

Persist the progress locally

If you want to persist the visited links and the job queue locally, so that you can resume crawling where you left off, use the ScheduleWithTextFile and TrackVisitedLinksInFile methods:

var engine = new ScraperEngineBuilder()
 .WithLogger(logger)
 .Get("https://rutracker.org/forum/index.php?c=33")
 .Follow("#cf-33 .forumlink>a")
 .Follow(".forumlink>a")
 .Paginate("a.torTopic", ".pg")
 .Parse(new()
 {
 new("name", "#topic-title"),
 new("category", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(3)"),
 new("subcategory", "td.nav.t-breadcrumb-top.w100.pad_2>a:nth-child(5)"),
 new("torrentSize", "div.attach_link.guest>ul>li:nth-child(2)"),
 new("torrentLink", ".magnet-link", "href"),
 new("coverImageUrl", ".postImg", "src")
 })
 .WriteToJsonFile("result.json")
 .IgnoreUrls(blackList)
 .ScheduleWithTextFile("jobs.txt", "progress.txt")
 .TrackVisitedLinksInFile("links.txt")
 .Build();
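
To make the idea of tracking visited links in a file more concrete, here is a minimal, self-contained sketch. It is not Exoscan's implementation: the class FileVisitedLinks and its method TryMarkVisited are hypothetical names used only for illustration, and the sketch assumes one URL per line in the file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Demo: a second instance simulates a restart and still remembers earlier links.
var path = Path.Combine(Path.GetTempPath(), $"links-{Guid.NewGuid():N}.txt");

var firstRun = new FileVisitedLinks(path);
Console.WriteLine(firstRun.TryMarkVisited("https://example.com/a")); // True: new link
Console.WriteLine(firstRun.TryMarkVisited("https://example.com/a")); // False: already seen

var secondRun = new FileVisitedLinks(path);
Console.WriteLine(secondRun.TryMarkVisited("https://example.com/a")); // False: restored from file
Console.WriteLine(secondRun.TryMarkVisited("https://example.com/b")); // True: new link

File.Delete(path);

// Hypothetical helper (not part of Exoscan): keeps visited URLs in memory
// and appends each newly seen URL to a text file, one URL per line.
public sealed class FileVisitedLinks
{
    private readonly string _path;
    private readonly HashSet<string> _visited = new(StringComparer.Ordinal);

    public FileVisitedLinks(string path)
    {
        _path = path;
        if (File.Exists(path))
        {
            // Reload links recorded by a previous run.
            foreach (var line in File.ReadLines(path))
            {
                if (line.Length > 0) _visited.Add(line);
            }
        }
    }

    // Returns true if the URL was not seen before, and records it on disk.
    public bool TryMarkVisited(string url)
    {
        if (!_visited.Add(url)) return false;
        File.AppendAllText(_path, url + Environment.NewLine);
        return true;
    }
}
```

Appending one line per link keeps writes cheap and lets a restarted process rebuild its state by reading the file once. This sketch is not thread-safe; a parallel crawler would need locking or a concurrent collection.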

Authorization

If you need to pass authorization before parsing the website, call the SetCookies method while building the scraper and fill the CookieContainer with all the cookies required for authorization. You are responsible for performing the login operation with your credentials; the scraper only uses the cookies that you provide.

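_ = new ScraperEngineBuilder()
    .WithLogger(logger)
    .Get("https://rutracker.org/forum/index.php?c=33")
    .SetCookies(cookies =>
    {
        // A cookie added to a CookieContainer needs a domain; the values here are placeholders.
        cookies.Add(new Cookie("AuthToken", "123", "/", "rutracker.org"));
    })
    // ...continue with Follow, Parse, and Build as in the examples above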
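
Since the login step is left to you, the following sketch shows one way it might be done with plain HttpClient and a CookieContainer from System.Net. Everything site-specific here is an assumption: the LoginHelper class is hypothetical, the form field names "username" and "password" are placeholders, and a real site may require extra fields such as a CSRF token.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

if (args.Length < 3)
{
    Console.WriteLine("usage: <loginUrl> <user> <password>");
    return;
}

var cookies = await LoginHelper.LoginAsync(new Uri(args[0]), args[1], args[2]);
foreach (Cookie cookie in cookies)
{
    // Print only names and domains; cookie values are secrets.
    Console.WriteLine($"{cookie.Name} (domain: {cookie.Domain})");
}

// Hypothetical helper, not part of Exoscan.
public static class LoginHelper
{
    // Posts a login form and returns the cookies the server set for that URL.
    public static async Task<CookieCollection> LoginAsync(Uri loginUri, string user, string password)
    {
        var container = new CookieContainer();
        using var handler = new HttpClientHandler { CookieContainer = container };
        using var client = new HttpClient(handler);

        // Field names are assumptions; inspect the real login form to find them.
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["username"] = user,
            ["password"] = password
        });

        using var response = await client.PostAsync(loginUri, form);
        response.EnsureSuccessStatusCode();

        // The handler stored any Set-Cookie headers in the container.
        return container.GetCookies(loginUri);
    }
}
```

The returned cookies already carry their domain, so they can be added to the CookieContainer passed to SetCookies with its Add method.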

Distributed web scraping with Serverless approach

In the Examples folder you can find the project called Exoscan.AzureFuncs. It demonstrates the use of Exoscan with Azure Functions. It consists of two serverless functions:

StartScrapting

First, this function uses ScraperConfigBuilder to build the scraper configuration.

Second, it writes the first web scraping job, with startUrl, to the Azure Service Bus queue.

ExoscanSpider

This Azure function is triggered by messages sent to the Azure Service Bus queue. Each message represents a web scraping job.

First, this function builds the spider that is going to execute the job from the queue.

Second, it executes the job by loading the page, parsing content, saving to the database, etc.

Finally, it iterates through the new jobs produced by that step and sends them to the job queue.
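
The flow described above can be simulated locally. The sketch below is only an illustration under stated assumptions: an in-memory Queue stands in for the Azure Service Bus queue, a dictionary stands in for the website, and CrawlJob and QueueDemo are hypothetical names, not Exoscan types.

```csharp
using System;
using System.Collections.Generic;

// Fake site: each URL maps to the links found on that page.
var site = new Dictionary<string, string[]>
{
    ["/start"] = new[] { "/a", "/b" },
    ["/a"] = new[] { "/b" },
    ["/b"] = Array.Empty<string>()
};

var processed = QueueDemo.Crawl("/start", url => site[url]);
Console.WriteLine(string.Join(", ", processed)); // prints "/start, /a, /b"

// Hypothetical job message; the real library has its own Job record.
public record CrawlJob(string Url);

public static class QueueDemo
{
    public static List<string> Crawl(string startUrl, Func<string, IEnumerable<string>> loadLinks)
    {
        var queue = new Queue<CrawlJob>();        // stand-in for the Service Bus queue
        var seen = new HashSet<string> { startUrl };
        var processed = new List<string>();

        // Role of the start function: enqueue the first job.
        queue.Enqueue(new CrawlJob(startUrl));

        // Each iteration plays the role of one queue-triggered spider invocation.
        while (queue.Count > 0)
        {
            var job = queue.Dequeue();
            processed.Add(job.Url);               // loading, parsing, and saving would happen here

            // Send newly discovered links back to the queue as new jobs.
            foreach (var link in loadLinks(job.Url))
            {
                if (seen.Add(link)) queue.Enqueue(new CrawlJob(link));
            }
        }

        return processed;
    }
}
```

In a real deployment the loop disappears: the queue trigger invokes the function once per message, which is what lets many instances work on the same crawl in parallel.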

Extensibility

Adding a new sink to persist your data

Out of the box there are 4 sinks you can send your parsed data to: ConsoleSink, CsvFileSink, JsonFileSink, CosmosSink (Azure Cosmos database).

You can easily add your own by implementing the IScraperSink interface:

public interface IScraperSink
{
 public Task EmitAsync(ParsedData data);
}

Here is an example of the Console sink:

public class ConsoleSink : IScraperSink
{
    // Writes each parsed item to standard output.
    public Task EmitAsync(ParsedData parsedItem)
    {
        Console.WriteLine(parsedItem.Data.ToString());
        return Task.CompletedTask;
    }
}

Adding your sink to the scraper is simple: call the AddSink method on the builder:

_ = new ScraperEngineBuilder()
    .AddSink(new ConsoleSink())
    .Get("https://rutracker.org/forum/index.php?c=33")
    .Follow("#cf-33 .forumlink>a")
    .Follow(".forumlink>a")
    .Paginate("a.torTopic", ".pg")
    .Parse(new() {
        new("name", "#topic-title"),
    });
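
As a further illustration of the sink contract, here is a sketch of a sink that collects results in memory, which can be handy in tests. To keep the example compilable on its own, it declares stand-ins for ParsedData and IScraperSink; the (Url, Data) shape of the stand-in ParsedData is an assumption, and InMemorySink is a hypothetical class. In a real project you would use the library's own types instead.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

var sink = new InMemorySink();
await sink.EmitAsync(new ParsedData("https://example.com/1", "{\"title\":\"Hello\"}"));
Console.WriteLine(sink.Items.Count); // prints "1"

// Stand-ins so the sketch compiles alone; the property set of ParsedData is assumed.
public record ParsedData(string Url, object Data);

public interface IScraperSink
{
    Task EmitAsync(ParsedData data);
}

// Hypothetical sink that stores every emitted item in memory.
public sealed class InMemorySink : IScraperSink
{
    // ConcurrentQueue because a parallel crawler may emit from several threads at once.
    public ConcurrentQueue<ParsedData> Items { get; } = new();

    public Task EmitAsync(ParsedData data)
    {
        Items.Enqueue(data);
        return Task.CompletedTask;
    }
}
```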

For other ways to extend your functionality see the next section.

Interfaces

Interface	Description
IScheduler	Reading from and writing to the job queue. By default, the in-memory queue is used, but you can provide your own implementation
IVisitedLinkTracker	Tracker of visited links. A default implementation is an in-memory tracker. You can provide your own for Redis, MongoDB, etc.
IPageLoader	Loader that takes URL and returns HTML of the page as a string
IContentParser	Takes HTML and schema and returns JSON representation (JObject).
ILinkParser	Takes HTML as a string and returns page links
IScraperSink	Represents a data store for writing the results of web scraping. Takes the JObject as parameter
ISpider	A spider that does the crawling, parsing, and saving of the data

Main entities

Job - a record that represents a job for the spider
LinkPathSelector - represents a selector for links to be crawled

Repository structure

Project	Description
Exoscan	Library for web scraping
Exoscan.ScraperWorkerService	Example of using Exoscan library in a Worker Service .NET project.
Exoscan.DistributedScraperWorkerService	Example of using Exoscan library in a distributed way with Azure Service Bus
Exoscan.AzureFuncs	Example of using Exoscan library with serverless approach using Azure Functions
Exoscan.ConsoleApplication	Example of using Exoscan library in a console application

Coming soon:

Nuget package
Azure functions for the distributed crawling
Loading pages with headless browser and flexible SPA page manipulations (clicks, scrolls, etc)
Proxy support
Add caching of html pages
Add flexible conditions for ignoring or allowing certain pages
Breadth first traversal with priority channels
Save auth cookies to redis, mongo, etc.
Separate NuGet packages for MongoDB, Cosmos DB, Redis, etc.
Sitemap crawling support
Add LogTo method with Console and File support

Features under consideration

Request auto throttling
Add bloom filter for revisiting same urls
Simplify ExoscanSpider class
Subscribe to logs with lambda expression

See the LICENSE file for license rights and limitations (GNU GPLv3).

Product	Compatible and additional computed target framework versions
.NET	Compatible: net6.0. Computed: net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows.


net6.0
- Azure.Messaging.ServiceBus (>= 7.11.0)
- Fizzler.Systems.HtmlAgilityPack (>= 1.2.1)
- Microsoft.Azure.Cosmos (>= 3.31.1)
- Microsoft.Extensions.Http (>= 6.0.0)
- Microsoft.Extensions.Logging.Abstractions (>= 6.0.2)
- MongoDB.Driver (>= 2.18.0)
- Newtonsoft.Json (>= 13.0.1)
- Polly (>= 7.2.3)
- PuppeteerExtraSharp (>= 1.3.2)
- PuppeteerSharp (>= 7.1.0)
- StackExchange.Redis (>= 2.6.70)
- System.Text.Encoding.CodePages (>= 6.0.0)

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
4.0.1	490	11/18/2022
4.0.0	467	11/18/2022

URL: https://www.nuget.org/packages/Exoscan/
