URL: https://apify.com/jan_hilgard/validated-jobs-scraper

Validated Jobs Scraper: Dedup, No Ghost Jobs, Confidence Scored

Pricing

$2.50 / 1,000 job results

Job data with a correctness guarantee: per-field confidence, ghost-job filtering and cross-source dedup, so rows are never silently wrong, duplicated or expired. Reaches LinkedIn and ATS boards cookieless, gets through Cloudflare/DataDome, self-healing on layout shifts. Built on the data.hilgard.cz engine.

Rating: 0.0 (0)

Developer: Jan Hilgard (Maintained by Community)

Actor stats: 0 bookmarked, 3 total users, 2 monthly active users, last modified a day ago

Validated Jobs Scraper

Job data that is deduplicated, ghost-filtered and never silently wrong.

You give it a source and a query (an ATS company, or keywords for LinkedIn / Indeed). You get back clean job records (title, company, location, salary, employment and workplace type) and, on top of every record, a correctness layer: a confidence on each field, a single reliable flag, a ghost-job score, and cross-source dedup. When a field does not clear the bar you get a null with a stated reason (status: "absent"), not a guessed value. When the same opening sits on several boards you get it once, with the others listed under also_at. That guarantee is the product; the engine underneath is only how it is kept.

Most job scrapers optimise for how many fields they return. That is the easy part. The hard part, and the expensive one when it goes wrong, is a job that is silently duplicated, already filled, reposted for the tenth time, or quietly mis-parsed. This one is built around that: it scores its own confidence per field, flags ghost jobs, and collapses duplicates across sources, so what you load is what is real.

Keywords: no silent errors, per-field confidence, validated jobs, ghost job detection, deduplicated jobs, hiring intent, cookieless jobs scraper, linkedin jobs scraper, indeed jobs scraper, ats greenhouse lever ashby, job posting api, labor market data, hiring intent signals, sales prospecting jobs.


What a real run looks like

One query in, one row per job out, each in the engine's snake_case schema, with the per-field confidence and the enrichment block attached. Below is a real record from a Greenhouse run (the two description_* fields are large and omitted here for brevity; every enrichment field is wrapped as { value, status, score, source, model }).

(a) A reliable record: where the page has no salary, it says so instead of guessing.

{
  "source":"greenhouse",
  "canonical_url":"https://job-boards.greenhouse.io/gitlab/jobs/8565469002",
  "apply_url":"https://job-boards.greenhouse.io/gitlab/jobs/8565469002",
  "title":"AI Engineer",
  "company":{"name":"GitLab","url":null},
  "location":{"raw":"Remote, US","city":null,"region":null,"country":null,"workplace_type":"remote"},
  "employment_type":null,
  "date_posted":"2026-05-29",
  "salary":{"min":null,"max":null,"currency":null,"period":null,"source":"absent"},
  "seniority":null,
  "skills":[],
  "reliable":true,
  "overall_confidence":0.95,
  "fields":{
    "title":{"value":"AI Engineer","status":"confirmed","score":0.95,"source":"ats"},
    "company":{"value":"GitLab","status":"confirmed","score":0.95,"source":"ats"},
    "location":{"value":"Remote, US","status":"confirmed","score":0.95,"source":"ats"},
    "salary":{"value":null,"status":"absent","score":null,"source":null},
    "employment_type":{"value":null,"status":"absent","score":null,"source":null}
  },
  "also_at":[],
  "enrichment":{
    "quality":{
      "ghost_job_score":{"value":0.1,"status":"low","score":0.5,"source":"inferred","model":"qwen3.6-35b@1"},
      "flags":{"value":[],"status":"absent","score":null,"source":"inferred","model":null},
      "is_real":{"value":true,"status":"low","score":0.5,"source":"inferred","model":"qwen3.6-35b@1"}
    },
    "dedup":{
      "is_duplicate_of":{"value":null,"status":"absent","score":null,"source":"inferred","model":null},
      "sources_seen":{"value":["greenhouse"],"status":"confirmed","score":0.95,"source":"inferred","model":null}
    },
    "normalized":{
      "role_normalized":{"value":"AI Engineer","status":"low","score":null,"source":"inferred","model":"qwen3.6-35b@1"},
      "seniority_normalized":{"value":"Mid","status":"low","score":null,"source":"inferred","model":"qwen3.6-35b@1"},
      "skills":{"value":[{"name":"Python","type":"nice_to_have","score":0.9},{"name":"TypeScript","type":"nice_to_have","score":0.9},{"name":"LLMs","type":"nice_to_have","score":0.9}],"status":"high","score":0.9,"source":"inferred","model":"qwen3.6-35b@1"}
    },
    "hiring_intent":{
      "buying_signal_score":{"value":0.11,"status":"low","score":0.45,"source":"inferred","model":null},
      "company_signals":{"value":{"open_roles_count":1,"role_velocity":0.03,"expanding_departments":[]},"status":"low","score":0.45,"source":"inferred","model":null}
    }
  },
  "company_name":"GitLab",
  "location_raw":"Remote, US",
  "ghost_job_score":0.1,
  "success":true,
  "error":null
}

Note the salary and employment_type: the page did not state them, so they come back status: "absent" with a null value, not a guessed band. The ghost_job_score is low (0.1), so the record is trusted. company_name, location_raw and ghost_job_score at the bottom are flat copies the actor lifts out of the nested objects for the dataset table.

(b) A ghost-suspect job with thin data. It does NOT guess: it flags and fails loud. (Illustrative record in the same real shape; a live ghost can't be produced on demand.)
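A minimal sketch of how a consumer might read the fields map from a record like the one above, returning a value only when it is present and scored high enough. The helper name trusted_value and the 0.8 cut-off are assumptions for illustration, not part of the actor.

```python
from typing import Any, Optional


def trusted_value(record: dict, field: str, min_score: float = 0.8) -> Optional[Any]:
    """Return a field's value only if it is present and its score clears min_score.

    min_score is a hypothetical consumer-side cut-off, not an engine setting.
    """
    entry = record.get("fields", {}).get(field)
    if not entry or entry.get("status") == "absent":
        return None  # the page did not state it; do not guess
    score = entry.get("score")
    if score is None or score < min_score:
        return None  # present but too shaky for this consumer
    return entry.get("value")


# Trimmed copy of the record above, keeping two of its tracked fields.
record = {
    "fields": {
        "title": {"value": "AI Engineer", "status": "confirmed", "score": 0.95, "source": "ats"},
        "salary": {"value": None, "status": "absent", "score": None, "source": None},
    }
}

print(trusted_value(record, "title"))
print(trusted_value(record, "salary"))
```

Keeping the cut-off on the consumer side lets each pipeline choose how strict to be without changing the run.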

{
  "source":"linkedin",
  "canonical_url":"https://www.linkedin.com/jobs/view/...",
  "title":"Marketing Manager",
  "company":{"name":"Stealth Startup","url":null},
  "location":{"raw":null,"city":null,"region":null,"country":null,"workplace_type":null},
  "employment_type":null,
  "salary":{"min":null,"max":null,"currency":null,"period":null,"source":"absent"},
  "reliable":false,
  "overall_confidence":0.39,
  "fields":{
    "title":{"value":"Marketing Manager","status":"high","score":0.9,"source":"html"},
    "company":{"value":"Stealth Startup","status":"low","score":0.41,"source":"html"},
    "location":{"value":null,"status":"absent","score":null,"source":null},
    "salary":{"value":null,"status":"absent","score":null,"source":null}
  },
  "also_at":[],
  "enrichment":{
    "quality":{
      "ghost_job_score":{"value":0.78,"status":"low","score":0.6,"source":"inferred","model":"qwen3.6-35b@1"},
      "flags":{"value":["reposted","vague_jd"],"status":"high","score":0.7,"source":"inferred","model":"qwen3.6-35b@1"},
      "is_real":{"value":false,"status":"low","score":0.6,"source":"inferred","model":"qwen3.6-35b@1"}
    }
  },
  "ghost_job_score":0.78,
  "success":true,
  "error":null
}

The value of this tool is in the second row. A cheaper tool would have returned this as just another clean-looking hit. Here the ghost_job_score is high (0.78), the flags say why (reposted, vague_jd), is_real is false, the weak company field is low, and the absent ones are marked absent, not filled with a guess. (Note: on LinkedIn an empty apply link is normal, so no apply-related flag fires; the no_real_apply flag only fires on ATS boards where a real apply path is expected.) Turn drop_ghost on and a row like this is removed before it reaches you, and is not charged.
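If you prefer to keep ghost suspects in the dataset and review them yourself, the same cut can be made client-side on the flat ghost_job_score helper. This is a sketch under assumptions: the function name is hypothetical, and treating a score equal to the threshold as a suspect (>=) is a guess, since the engine's exact comparison is not stated here.

```python
def split_by_ghost_score(rows, threshold=0.7):
    """Split dataset rows into (kept, suspects) using the flat ghost_job_score.

    Rows without a score are kept, because there is no evidence against them.
    The >= comparison is an assumption about how a threshold would be applied.
    """
    kept, suspects = [], []
    for row in rows:
        score = row.get("ghost_job_score")
        if score is not None and score >= threshold:
            suspects.append(row)
        else:
            kept.append(row)
    return kept, suspects


# The two example records, reduced to the keys this sketch reads.
rows = [
    {"title": "AI Engineer", "ghost_job_score": 0.1},
    {"title": "Marketing Manager", "ghost_job_score": 0.78},
]
kept, suspects = split_by_ghost_score(rows)
print([r["title"] for r in kept])
print([r["title"] for r in suspects])
```

Unlike drop_ghost, this keeps the suspect rows in the run output, so they are still returned and charged.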


Why this beats LinkedIn-only and AI scrapers

Adapting to a layout and scraping a lot of fields is table stakes now, and this does both. But scraping is not the same as being right. A scraper can return a job and still hand you one that is duplicated three times, already filled, reposted for months, or quietly mis-parsed, and say nothing. That silent bad row is the one that costs you, because you act on it. The difference here is the guarantee, not the scraping: every field carries its own confidence, every record carries a ghost-job score, duplicates are collapsed across sources, and a cheap enrichment runs on every record, not just a sampled few.

Two things others quietly skip:

  • They return rich fields but zero confidence, and their high success rate needs your cookies. This runs cookieless and still attaches a confidence to every field, so you can tell a solid row from a shaky one without logging anything in.
  • Their AI enrichment is shallow and expensive because it is API-bound. Ours runs on cheap local inference, so the quality / dedup / normalize / hiring-intent layers run on every record by default, not as a costly add-on on a handful.

Why it is different

  • No silent errors. Every field carries a confidence, the whole record carries a reliable flag. Below the bar a field is returned null with status: "absent" and reliable: false, never a confident-looking wrong value.
  • Ghost-job filtering and cross-source dedup. Each record gets a ghost_job_score with the flags behind it (reposted, evergreen, vague_jd, staffing_agency, perpetual_req, and no_real_apply on ATS), so stale and fake openings are visible, or dropped with drop_ghost. The same opening seen on several sources is collapsed into one record, the rest listed under also_at. You load real, distinct openings, not reposts and duplicates.
  • Cookieless reach / anti-bot. It reaches LinkedIn, Indeed and the major ATS boards without cookies, and gets through heavy protections (Cloudflare, DataDome and similar) that return a challenge page to a plain fetch. (Anti-bot is an arms race, so this is a capability, not a guarantee against any one named vendor.)
  • Self-healing: a mechanism that serves the correctness above, not the headline. When a board changes its markup, the engine re-finds fields by meaning instead of silently breaking on a selector.
  • Hiring-intent signals. Enrichment normalises each role and adds hiring-intent signals (buying_signal_score, company_signals), so the data is usable for prospecting, labor analytics and recruiting research, not just a flat list of postings.
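To make the cross-source collapse concrete, here is a deliberately naive sketch of the idea: group rows by a normalised (company, title, location) key, keep the first row seen, and list the others under also_at. The engine's real matching logic is not described here and is certainly more careful; the key choice, the function name, and storing URLs in also_at are all assumptions for illustration.

```python
def collapse_duplicates(rows):
    """Collapse rows that look like the same opening into one row each.

    Naive illustration only: the key is lowercased company + title + location,
    and also_at collects the canonical_url of every later duplicate.
    """
    merged = {}
    for row in rows:
        key = tuple(
            (row.get(name) or "").strip().lower()
            for name in ("company_name", "title", "location_raw")
        )
        if key in merged:
            merged[key]["also_at"].append(row.get("canonical_url"))
        else:
            # Copy the row so the caller's input is not mutated.
            merged[key] = {**row, "also_at": list(row.get("also_at", []))}
    return list(merged.values())


# Hypothetical rows; the example.com URLs are placeholders.
rows = [
    {"source": "greenhouse", "company_name": "GitLab", "title": "AI Engineer",
     "location_raw": "Remote, US", "canonical_url": "https://example.com/greenhouse/1"},
    {"source": "linkedin", "company_name": "gitlab", "title": "ai engineer ",
     "location_raw": "remote, us", "canonical_url": "https://example.com/linkedin/1"},
    {"source": "linkedin", "company_name": "GitLab", "title": "Data Engineer",
     "location_raw": "Remote, US", "canonical_url": "https://example.com/linkedin/2"},
]
for job in collapse_duplicates(rows):
    print(job["title"], job["also_at"])
```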

Supported sources

Live-verified sources only. This list is what is actually tested, not a wishlist:

  • Greenhouse
  • Ashby
  • Lever
  • SmartRecruiters
  • RemoteOK
  • LinkedIn
  • Indeed

ATS sources (Greenhouse / Ashby / Lever / SmartRecruiters) take a company slug; LinkedIn, Indeed and RemoteOK take keywords (+ location). A concrete source is required. Indeed sits behind Cloudflare; it is reached through the same anti-bot stack, cookieless. Indeed pay is usually an estimate, so it comes back as salary.source: "inferred" (never passed off as an employer-stated figure).
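Because salary.source separates stated pay from estimates, a consumer that only wants employer-stated figures can filter on it. A small sketch, with a hypothetical helper name and made-up salary numbers:

```python
from typing import Optional


def employer_stated_salary(row: dict) -> Optional[dict]:
    """Return the salary object only when the employer stated it.

    Rows whose salary.source is "inferred" or "absent" yield None.
    """
    salary = row.get("salary") or {}
    if salary.get("source") != "explicit":
        return None
    return salary


# Hypothetical rows; the figures are invented for the example.
stated = {"salary": {"min": 100000, "max": 120000, "currency": "USD",
                     "period": "year", "source": "explicit"}}
estimated = {"salary": {"min": 90000, "max": 110000, "currency": "USD",
                        "period": "year", "source": "inferred"}}
missing = {"salary": {"min": None, "max": None, "currency": None,
                      "period": None, "source": "absent"}}

for row in (stated, estimated, missing):
    print(employer_stated_salary(row))
```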

Input

{
  "source": "greenhouse",          // required: linkedin | indeed | greenhouse | lever | ashby | smartrecruiters | remoteok
  "company": "gitlab",             // ATS slug, for greenhouse/lever/ashby/smartrecruiters
  "keywords": "backend engineer",  // for LinkedIn / Indeed / RemoteOK search
  "location": "Berlin",
  "title_include": [], "title_exclude": [],
  "employment_type": [], "workplace_type": [],
  "country": [], "language": [],   // arrays, e.g. ["DE"], ["en"]
  "posted_within_days": 30,
  "drop_expired": true, "drop_ghost": false,
  "enrich": true,
  "enrich_layers": ["quality", "dedup", "normalize", "hiring_intent"],
  "dedup_across_sources": true,
  "ghost_threshold": 0.7,
  "start": 0, "limit": 25, "max_results": 100, "fetch_all": false,
  "include_description": true      // engine fetches full JD HTML/text (on by default; LinkedIn fetch is per-posting)
}

What a source needs depends on the source: ATS boards (Greenhouse / Lever / Ashby / SmartRecruiters) require company; LinkedIn and Indeed require keywords (or location); RemoteOK needs neither (it lists the feed). For an ATS source, the actor fails loud and early if company is missing, instead of forwarding a request the engine would reject. Filters and enrichment all run on the engine; the actor just forwards them. max_results caps how many jobs come back, and since you pay per returned job, it also caps spend.
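The per-source requirements above can be checked before a run is started. The sketch below builds an input object and raises early when a required key is missing; it mirrors the rules as stated in this README, but build_input is a hypothetical helper, not the actor's own validation code.

```python
ATS_SOURCES = {"greenhouse", "lever", "ashby", "smartrecruiters"}
KEYWORD_SOURCES = {"linkedin", "indeed"}
ALL_SOURCES = ATS_SOURCES | KEYWORD_SOURCES | {"remoteok"}


def build_input(source, company=None, keywords=None, location=None, max_results=100):
    """Build a run input dict, failing loud on a missing required key."""
    if source not in ALL_SOURCES:
        raise ValueError(f"unsupported source: {source!r}")
    if source in ATS_SOURCES and not company:
        raise ValueError(f"{source} requires a company slug")
    if source in KEYWORD_SOURCES and not (keywords or location):
        raise ValueError(f"{source} requires keywords or a location")
    run_input = {"source": source, "max_results": max_results}
    # Only forward the optional keys that were actually supplied.
    for key, value in (("company", company), ("keywords", keywords), ("location", location)):
        if value:
            run_input[key] = value
    return run_input


print(build_input("greenhouse", company="gitlab"))
print(build_input("linkedin", keywords="backend engineer", location="Berlin"))
try:
    build_input("lever")
except ValueError as exc:
    print("rejected:", exc)
```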

Output

One dataset row per job, in the engine's snake_case JobPosting schema. Every tracked field is always present: a missing one comes back with status: "absent" and a null value, never silently dropped. Highlights:

  • Core fields: title, company ({name, url}), location ({raw, city, region, country, workplace_type}), employment_type (FULL_TIME / PART_TIME / CONTRACT / INTERN / TEMP), date_posted, salary ({min, max, currency, period, source}; source is explicit / inferred / absent, never guessed), seniority, skills, canonical_url, apply_url.
  • Trust layer: reliable (bool), overall_confidence (0–1), and fields, a map where each tracked field carries { value, status, score, source }, with status ∈ confirmed | high | low | absent.
  • Dedup: top-level also_at[] lists the same opening on other portals ([] if unique).
  • Enrichment (present when enrich is on) under enrichment, each field wrapped as { value, status, score, source: "inferred", model }: quality (ghost_job_score, flags, is_real), dedup, normalized, hiring_intent.
  • Flat helpers the actor adds for the dataset table: company_name, location_raw, and ghost_job_score (lifted from enrichment.quality.ghost_job_score.value).

success says the engine produced a job record; reliable says that record cleared the trust threshold. They diverge when a job is extracted but the engine is not confident: then success is true, reliable is false, and the per-field scores stay low, so a hedge never reads as a clean hit. Each row is the engine's record passed through verbatim; the actor never drops or rewrites a field. The full job description (description_html / description_text) is included; include_description (on by default) controls whether the engine fetches it for sources fetched per-posting (e.g. LinkedIn). Descriptions can be large.
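The success / reliable split suggests a simple three-way triage on the consumer side. A sketch, assuming hypothetical bucket names ("failed", "hedged", "reliable") and a helper that lists which tracked fields are weak:

```python
def triage(row: dict) -> str:
    """Classify a row as 'failed', 'hedged' or 'reliable' from its two flags."""
    if not row.get("success"):
        return "failed"
    return "reliable" if row.get("reliable") else "hedged"


def weak_fields(row: dict) -> list:
    """Names of tracked fields whose status is 'low' or 'absent', sorted."""
    fields = row.get("fields", {})
    return sorted(
        name for name, entry in fields.items()
        if entry.get("status") in ("low", "absent")
    )


# Example (b) from above, reduced to the keys this sketch reads.
row = {
    "success": True,
    "reliable": False,
    "fields": {
        "title": {"value": "Marketing Manager", "status": "high", "score": 0.9},
        "company": {"value": "Stealth Startup", "status": "low", "score": 0.41},
        "location": {"value": None, "status": "absent", "score": None},
        "salary": {"value": None, "status": "absent", "score": None},
    },
}
print(triage(row), weak_fields(row))
```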


Pricing

This actor uses Pay Per Event, with a single event:

  • job-result: one flat fee per job returned.

You are charged once per job the engine returns, whether it comes back reliable: true or, honestly, reliable: false, because an honest fail is still a result you can act on (you learn the field is shaky instead of trusting a guess). Jobs the engine drops before returning (expired, ghost with drop_ghost on, or duplicates collapsed across sources) are not returned and not charged. A run that fails before returning any results is not charged. The current price of the event is in the Apify Console pricing tab.

No flat monthly fee, no per-seat pricing. One predictable price per validated job.
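As a worked example of the arithmetic, using the $2.50 / 1,000 job results shown on the listing (the Console price is authoritative and may differ): cost scales with returned jobs, so max_results gives a worst-case bound. The function name is hypothetical.

```python
def estimate_cost(returned_jobs: int, price_per_1000: float = 2.50) -> float:
    """Estimated charge in USD for a number of returned job results."""
    if returned_jobs < 0:
        raise ValueError("returned_jobs must be non-negative")
    return round(returned_jobs * price_per_1000 / 1000, 4)


# Worst case for the default max_results of 100: every slot returns a job.
print(estimate_cost(100))
# Dropped rows (expired, ghost, collapsed duplicates) are not charged,
# so 100 found with 20 dropped is billed as 80.
print(estimate_cost(100 - 20))
```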


A note on data

This actor returns job and company data only โ€” title, company, location, salary, employment terms, and signals derived from the posting. It does not collect or return personal data of applicants or named recruiters.


About

Built on the data.hilgard.cz engine: the same self-healing stack that does cookieless anti-bot fetching, extraction by meaning, and independent verification, here applied to jobs with ghost-job scoring and cross-source dedup on top.

By Jan Hilgard, founder of Hosting90 (built 2002, exited 2020) and contributor to vllm-mlx. The precision-first stance is deliberate: I would rather return an honest "not sure" than a confident wrong row you build a pipeline on.

Development

npm install
npm run build       # tsc → dist/
npm run start:dev   # tsx src/main.ts, reads .actor/INPUT.json
