URL: https://apify.com/apify/page-analyzer

Pricing

Pay per usage

Page Scraping Analyzer

Analyzes a web page to find the best way to scrape its data. Provide a URL and the data points to look for, and get back a detailed dashboard showing how the data can be scraped. Works with initial and rendered HTML, JavaScript variables, and dynamically loaded data.

Rating

4.7

(5)

Developer

Apify

Maintained by Apify

Actor stats

19

Bookmarked

1.3K

Total users

4

Monthly active users

7 months ago

Last modified

Page Scraping Analyzer is an actor that helps its users find data sources on a website. Its main purpose is to help a user quickly analyze their options for extracting data from a website and provide CSS selectors, JavaScript code and HTTP requests that can be used to extract the data.

When to use Page Scraping Analyzer

Page Scraping Analyzer can be used as the first step in web scraper development. Its goal is to automate the process of manually analyzing a website with tools such as browser developer tools or Postman in order to:

  1. Analyze the structure of the website
  2. Find the CSS selectors of HTML elements containing a keyword
  3. Find keywords in additional sources that might not be visible on the screen, such as JSON-LD, metadata, and schema.org data
  4. Observe and replicate XHR requests that might contain the data a user wants to scrape
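
As an illustration of item 2, the keyword-to-selector search could be sketched roughly as below. This is not the actor's implementation: the class name KeywordSelectorFinder and the helper find_selectors are hypothetical, and the selectors it builds use only tag names, ids and classes, so they are not guaranteed to be unique on a real page.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class KeywordSelectorFinder(HTMLParser):
    """Records a simple CSS selector path for each element whose text contains the keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.open_elements = []  # (tag, selector fragment) for every open element
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        fragment = tag
        if attributes.get("id"):
            fragment += "#" + attributes["id"]
        elif attributes.get("class"):
            fragment += "." + ".".join(attributes["class"].split())
        self.open_elements.append((tag, fragment))

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name.
        for index in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[index][0] == tag:
                del self.open_elements[index:]
                break

    def handle_data(self, data):
        if self.keyword in data.lower() and self.open_elements:
            selector = " > ".join(fragment for _, fragment in self.open_elements)
            if selector not in self.matches:
                self.matches.append(selector)


def find_selectors(html, keyword):
    """Return selector paths of all elements whose text contains the keyword."""
    finder = KeywordSelectorFinder(keyword)
    finder.feed(html)
    finder.close()
    return finder.matches
```

Calling find_selectors(page_html, "19.99") on a downloaded page returns a list of candidate selector paths. Text inside <script> tags is also searched, so a match ending in "script" hints that the value is embedded as data rather than rendered as an element.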

Where is data stored on a website?

A website has many sources of data, and some of them are not even visible on the screen. The same data point can be present in more than one source.

Here are some examples of where data can be stored on a website:

  • Initial HTML response (can be scraped by HTTP-only scrapers like Cheerio)
    • HTML elements rendered on the server
    • Rich JSON data inside <script> tags (JSON-LD, schema.org, Next.js data)
  • Rendered HTML (can only be scraped with a browser)
    • HTML elements rendered on the client
    • JavaScript variables available on the window object. The data can come from either:
      • Initial HTML response - can be parsed from the script tags with HTTP only
      • XHR responses - Loaded later after the initial HTML response
  • XHR responses (can be scraped with HTTP-only scrapers like Cheerio)
    • Usually come as JSON data loaded from an internal API. Common formats are:
      • REST API
      • GraphQL API
      • WebSocket connections
    • Can also be in other formats, such as HTML snippets
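
To make the "rich JSON data inside <script> tags" case concrete, here is a minimal sketch that pulls JSON documents out of an initial HTML response using only the Python standard library. The names JsonScriptExtractor and extract_json_scripts are hypothetical, and the sketch assumes the data sits in scripts typed application/ld+json or application/json. Data assigned to JavaScript variables in ordinary scripts would need different parsing.

```python
import json
from html.parser import HTMLParser

# Script types assumed to hold plain JSON (an assumption of this sketch).
JSON_SCRIPT_TYPES = {"application/ld+json", "application/json"}


class JsonScriptExtractor(HTMLParser):
    """Collects the parsed contents of JSON-typed <script> tags."""

    def __init__(self):
        super().__init__()
        self._buffer = None   # list of text chunks while inside a JSON script
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") in JSON_SCRIPT_TYPES:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            self._buffer = None
            try:
                self.documents.append(json.loads(text))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_json_scripts(html):
    """Return every JSON document embedded in JSON-typed script tags."""
    extractor = JsonScriptExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.documents
```

Because this needs only the initial HTML response, it fits the HTTP-only branch of the list above and does not require a browser.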

How Page Scraping Analyzer works

The Page Scraping Analyzer works in multiple steps looking for data sources. For every step, it stores the sources and provides a CSS selector, JavaScript code or an HTTP request that can be used to extract the data.

It uses both a browser and plain HTTP requests to cover all the options for scraping the available data.

With browser:

  1. Opens the page and records the initial HTML response. Finds all HTML elements and <script> tags containing the keywords.
  2. Waits for the page to render. Finds all HTML elements and JavaScript variables containing the keywords. Stores a diff between the initial HTML response and the rendered HTML.
  3. Waits for the page to load all XHR requests. Finds all XHR responses containing the keywords.
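
The diff mentioned in step 2 can be approximated with a line-based comparison. The sketch below is only an illustration under that assumption: the function names are hypothetical, and the actor's own diffing method is not described in this document.

```python
import difflib


def added_lines(initial_html, rendered_html):
    """Return the lines that appear only in the rendered HTML."""
    diff = difflib.ndiff(initial_html.splitlines(), rendered_html.splitlines())
    # ndiff marks lines that exist only in the second input with "+ ".
    return [line[2:] for line in diff if line.startswith("+ ")]


def keyword_needs_rendering(keyword, initial_html, rendered_html):
    """True when the keyword is absent from the initial HTML but present in rendered-only lines."""
    if keyword in initial_html:
        return False
    return any(keyword in line for line in added_lines(initial_html, rendered_html))
```

If keyword_needs_rendering returns True, the data point was added by client-side code, so it has to come from either a JavaScript variable, an XHR response, or the rendered DOM.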

With HTTP:

  1. The same as step 1 with the browser. It is useful to know when the same data can be scraped with HTTP-only scrapers like Cheerio, because that is much faster and cheaper.
  2. Tries to replicate the XHR requests recorded by the browser to see if they can be scraped only with HTTP:
    • First tries to use only generic HTTP headers
    • If that fails, it tries the headers recorded by the browser without cookies
    • If that fails, it tries the headers recorded by the browser with cookies (this can still be automated, but requires getting cookies from the browser and then reusing them for a number of HTTP requests)
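
The escalating header strategy in step 2 could look roughly like the sketch below. Everything here is an assumption made for illustration: the GENERIC_HEADERS values, the function names, and the success check (the keyword appearing in the response body) are not taken from the actor itself. The send parameter is any callable that takes a URL and a headers dict and returns the response body as text, and send_with_urllib is one possible implementation for GET requests.

```python
import urllib.request

# Placeholder generic headers (an assumption of this sketch).
GENERIC_HEADERS = {"User-Agent": "Mozilla/5.0", "Accept": "*/*"}


def header_strategies(recorded_headers):
    """Yield (name, headers) pairs from the cheapest strategy to the most demanding one."""
    without_cookies = {key: value for key, value in recorded_headers.items()
                       if key.lower() != "cookie"}
    return [
        ("generic headers", dict(GENERIC_HEADERS)),
        ("recorded headers without cookies", without_cookies),
        ("recorded headers with cookies", dict(recorded_headers)),
    ]


def send_with_urllib(url, headers):
    """Send a GET request and return the decoded body; raises OSError subclasses on failure."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def find_minimal_headers(url, recorded_headers, keyword, send=send_with_urllib):
    """Return the first (name, headers) strategy whose response contains the keyword, else None."""
    for name, headers in header_strategies(recorded_headers):
        try:
            body = send(url, headers)
        except OSError:
            continue  # network or HTTP error: escalate to the next strategy
        if keyword in body:
            return name, headers
    return None
```

Stopping at the first strategy that works keeps the resulting scraper as simple as possible: cookies are only brought in when nothing lighter returns the data.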

What scraping methods to choose after analysis

Some websites require combining multiple sources of data. Some sources are faster and cheaper to use; others come in nicer formats. Generally, it is best to try them in this order:

  1. XHR requests with HTTP - extremely fast and cheap, and usually in a nice format like JSON. Might require combining multiple requests to get all the data. Might require replicating complex headers and request bodies.
  2. <script> tags from the initial HTML response - often contain all the data in a nice JSON format. Requires parsing the JSON out of the script text.
  3. HTML elements from the initial HTML response - requires using multiple CSS selectors to get all the data.
  4. JavaScript variables from the rendered HTML - usually contain all the data in nice JavaScript objects.
  5. HTML elements from the rendered HTML - requires using multiple CSS selectors to get all the data.
  6. Intercepting XHR requests with a browser - requires waiting and sometimes interaction with the page. Might require combining multiple requests to get all the data.
  7. HTML elements rendered after all XHR responses were processed - requires long waiting and sometimes interaction with the page. Requires using multiple CSS selectors to get all the data.
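
The ordering above can be encoded as a simple ranking, for example to sort the sources found for one data point. The labels below are hypothetical names invented for this sketch; they are not the analyzer's output format.

```python
# Hypothetical labels for the seven methods, listed in the recommended order.
PREFERENCE_ORDER = [
    "xhr_http",              # 1. XHR requests replicated with HTTP
    "initial_script",        # 2. <script> tags in the initial HTML
    "initial_html",          # 3. HTML elements in the initial HTML
    "rendered_js_variable",  # 4. JavaScript variables after rendering
    "rendered_html",         # 5. HTML elements after rendering
    "xhr_browser",           # 6. XHR requests intercepted in a browser
    "html_after_xhr",        # 7. HTML elements after all XHR responses
]


def rank_sources(found_sources):
    """Sort source labels by the recommended order; unknown labels go last."""
    rank = {label: position for position, label in enumerate(PREFERENCE_ORDER)}
    return sorted(found_sources, key=lambda label: rank.get(label, len(rank)))
```

For a data point found in three places, rank_sources(["rendered_html", "initial_script", "xhr_http"]) puts the HTTP-replicated XHR request first.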
