URL: https://apify.com/lukaskrivka/dedup-datasets

Merge, Dedup & Transform Datasets

Pricing

Pay per usage

The ultimate dataset processor. Extremely fast merging, deduplications & transformations all in a single run.

Rating

5.0

(1)

Developer

Lukáš Křivka

Maintained by Community

Actor stats

  • Bookmarked: 97
  • Total users: 5.1K
  • Monthly active users: 83
  • Last modified: 8 days ago

The ultimate dataset processing actor - merge, dedup & transform

A refined and optimized dataset processing actor for large-scale merging, deduplication, and transformation.

Why use this actor

  • Extremely fast data processing thanks to parallelized workloads (easily 20x faster than the default loading and pushing of datasets)
  • Reads from multiple datasets simultaneously, ideal for merging results after scraping with many runs
  • Actor migration proof: all steps that can be persisted are persisted, so work is not repeated and no duplicate data is pushed
  • The Dedup as loading mode allows near-constant memory processing even for huge datasets (think 10M+ items)
  • Deduplication supports combinations of many fields, including nested objects and arrays (those are JSON.stringified for a deep equality check)
  • Allows storing the output into key-value store records
  • Allows super fast blank runs that only count duplicates

Merging

You can provide more than one dataset. In that case, all items are merged into a single dataset or key-value store output. If you use the Dedup after load mode, the items keep the order of the datasets provided.

Deduplication

If you provide deduplication fields (optional), this actor will deduplicate the dataset items. The deduplication process checks the value of each field for equality and returns only the first item with a given value for that field; later items with the same value are dropped.

You can provide more than one field. In that case, a combined string of those fields is checked, e.g. "name": "Adidas Shoes", "id": "12345" gets converted into "Adidas Shoes12345" for checking purposes. So only items that have the same values in both fields are considered duplicates. This means the more fields you add, the fewer duplicates will be found.
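The combined-field check described above can be sketched in a few lines. This is only an illustration of the idea, not the actor's actual code: buildDedupKey and dedupItems are hypothetical helper names, and the plain concatenation simply mirrors the "Adidas Shoes12345" example.

```javascript
// Hypothetical helper (not part of the actor): builds the comparison key
// by concatenating the chosen field values, as in "Adidas Shoes12345".
function buildDedupKey(item, fields) {
    return fields
        .map((field) => {
            const value = item[field];
            // Objects and arrays are stringified so they can be compared deeply.
            return typeof value === 'object' && value !== null
                ? JSON.stringify(value)
                : String(value);
        })
        .join('');
}

// Hypothetical helper: keeps only the first item seen for each key.
function dedupItems(items, fields) {
    const seenKeys = new Set();
    const unique = [];
    for (const item of items) {
        const key = buildDedupKey(item, fields);
        if (seenKeys.has(key)) continue;
        seenKeys.add(key);
        unique.push(item);
    }
    return unique;
}

const items = [
    { name: 'Adidas Shoes', id: '12345', price: 80 },
    { name: 'Adidas Shoes', id: '12345', price: 75 },
    { name: 'Adidas Shoes', id: '67890', price: 80 },
];

const byName = dedupItems(items, ['name']);
const byNameAndId = dedupItems(items, ['name', 'id']);
console.log(byName.length, byNameAndId.length); // prints "1 2"
```

Deduplicating by name alone keeps one item, while adding id keeps two, which shows why more fields lead to fewer duplicates found.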

Fields that are objects or arrays are also deeply compared via JSON.stringify. Just be aware that doing this for very large structures might have performance implications.
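A short sketch of what comparison via JSON.stringify implies in plain JavaScript. The sample objects are made up for illustration, and the source does not say whether the actor normalizes key order before stringifying, so treat the last two results as behavior of JSON.stringify itself rather than a statement about the actor.

```javascript
// Made-up sample items with nested fields.
const first = { name: 'Shoes', sizes: [40, 41], seller: { id: 1, country: 'CZ' } };
const second = { name: 'Shoes', sizes: [40, 41], seller: { id: 1, country: 'CZ' } };
const reordered = { name: 'Shoes', sizes: [41, 40], seller: { country: 'CZ', id: 1 } };

// Separate object instances with identical content stringify to the same text.
const sameSeller = JSON.stringify(first.seller) === JSON.stringify(second.seller);

// JSON.stringify follows property insertion order, so reordered keys differ.
const sameSellerReordered = JSON.stringify(first.seller) === JSON.stringify(reordered.seller);

// Array order is significant as well.
const sameSizesReordered = JSON.stringify(first.sizes) === JSON.stringify(reordered.sizes);

console.log(sameSeller, sameSellerReordered, sameSizesReordered); // prints "true false false"
```

If your nested fields can arrive with a different key or element order, normalizing them in a pre-dedup transformation is one way to make them compare as equal.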

Transformation

This actor enables you to do arbitrary data transformations before and after deduplication via preDedupTransformFunction and postDedupTransformFunction.

These functions take the array of items and should return an array of items. You don't necessarily need to return the same number of items (you can filter some out or add new ones).

As the second argument, you can access an object with helper variables, currently containing the Apify SDK reference (Apify) and customInputData.

The default transformation does nothing with the items:

(items, { Apify, customInputData }) => {
    return items;
}
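As a sketch of a non-trivial transformation, the function below filters out some items and normalizes a field before deduplication. The item fields (name, price) and the minPrice key inside customInputData are assumptions made for this example; only the function signature comes from the description above.

```javascript
// Example pre-dedup transformation: drop items without a usable price
// and trim the name so that "Adidas Shoes " and "Adidas Shoes" dedup together.
// minPrice is a hypothetical key that you would pass yourself in customInputData.
const preDedupTransformFunction = (items, { Apify, customInputData }) => {
    const minPrice = (customInputData && customInputData.minPrice) || 0;
    return items
        .filter((item) => typeof item.price === 'number' && item.price >= minPrice)
        .map((item) => ({ ...item, name: String(item.name).trim() }));
};

// Local dry run with made-up items; the Apify helper is not needed here.
const sampleItems = [
    { name: ' Adidas Shoes ', price: 80 },
    { name: 'Cheap Laces', price: 2 },
    { name: 'No Price Item' },
];
const transformed = preDedupTransformFunction(sampleItems, {
    Apify: null,
    customInputData: { minPrice: 10 },
});
console.log(transformed); // prints [ { name: 'Adidas Shoes', price: 80 } ]
```

Because the function returns a new array, returning fewer items than it received is fine, as noted above.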

In dedup-as-loading mode, you only have access to the items of the current batch. However, you can also access the datasetId and datasetOffset parameters, since each batch comes from only one dataset.

(items, { Apify, datasetId, datasetOffset, customInputData }) => {
    return items;
}
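One possible use of these parameters is tagging every item with its origin. The sketch below assumes that datasetOffset is the position of the batch's first item within its dataset, which the text above does not state explicitly; sourceDatasetId and sourceIndex are made-up output field names.

```javascript
// Example transformation for dedup-as-loading mode: record where each item came from.
// Assumption: datasetOffset is the index of the first item of this batch in its dataset.
const tagWithSource = (items, { Apify, datasetId, datasetOffset, customInputData }) => {
    return items.map((item, indexInBatch) => ({
        ...item,
        sourceDatasetId: datasetId,
        sourceIndex: datasetOffset + indexInBatch,
    }));
};

// Local dry run with a made-up batch and a placeholder dataset ID.
const batch = [{ id: 'a' }, { id: 'b' }];
const tagged = tagWithSource(batch, {
    Apify: null,
    datasetId: 'exampleDatasetId',
    datasetOffset: 1000,
    customInputData: {},
});
console.log(tagged[1]); // prints { id: 'b', sourceDatasetId: 'exampleDatasetId', sourceIndex: 1001 }
```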

Input

A detailed input table with descriptions can be found on the actor's public page.

Changelog

Check the list of past updates here
