Data Deduplicator

Pricing

from $1.49 / 1,000 items processed

Developer: ParseBird (maintained by Community)

Last modified: 19 days ago

Combine multiple Apify datasets and remove duplicates by URL, email, name + company, or any field combination. Case-insensitive matching and whitespace trimming built in.

ParseBird Infra Suite • Utility tools for data pipelines
🔗 HTTP Request: Send API calls from the cloud
📚 Data Deduplicator: You are here
🗡 Data Cleaner: Clean nulls, normalize case, format phones & emails

Copy to your AI assistant

Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this actor.

parsebird/dataset-deduplicator on Apify. Call: ApifyClient("TOKEN").actor("parsebird/dataset-deduplicator").call(run_input={...}), then client.dataset(run["defaultDatasetId"]).list_items().items for deduplicated results. Key inputs: datasetIds (array of strings: Apify dataset IDs to merge), jsonData (array of objects: direct JSON input, alternative to datasetIds), fields (array of strings, required: field names for dedup key). Matching is case-insensitive with whitespace trimming. First occurrence is kept. Full actor spec: fetch build via GET https://api.apify.com/v2/acts/parsebird~dataset-deduplicator (Bearer TOKEN). Get token: https://console.apify.com/account/integrations

What does Data Deduplicator do?

This Actor merges one or more Apify datasets and removes duplicate rows based on the fields you specify. It is a quick way to clean up scraped data before analysis or export.

Single-field dedup — deduplicate by url, email, phone, or any single field
Composite key dedup — combine multiple fields like firstName + lastName + company to identify unique records
Smart matching — case-insensitive comparison with automatic whitespace trimming
Multi-dataset merge — combine items from multiple dataset IDs before deduplication
Direct JSON input — pass data directly as a JSON array instead of referencing datasets
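
The matching rules listed above can be sketched locally. This is a minimal illustration, not the Actor's actual implementation: it assumes the key is built by trimming and lowercasing each field's string value, that a missing field counts as an empty string, and that the first occurrence wins. The helper names `dedup_key` and `deduplicate` are hypothetical.

```python
from typing import Any, Iterable


def dedup_key(item: dict[str, Any], fields: list[str]) -> tuple[str, ...]:
    """Build a comparison key: one trimmed, lowercased string per field."""
    # Assumption: a missing field is treated as an empty string.
    return tuple(str(item.get(f, "")).strip().lower() for f in fields)


def deduplicate(items: Iterable[dict[str, Any]], fields: list[str]) -> list[dict[str, Any]]:
    """Keep the first item seen for each key; items are returned unmodified."""
    seen: set[tuple[str, ...]] = set()
    unique: list[dict[str, Any]] = []
    for item in items:
        key = dedup_key(item, fields)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


rows = [
    {"name": "John Doe", "email": "john@example.com"},
    {"name": "J. Doe", "email": "  JOHN@Example.com "},
    {"name": "Jane Smith", "email": "jane@example.com"},
]
result = deduplicate(rows, ["email"])
print([r["name"] for r in result])  # ['John Doe', 'Jane Smith']
```

The second row differs from the first only in case and surrounding whitespace, so it is dropped and the earlier row is kept with its original field values.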

How to use it (5 steps)

1. Run your scraper(s): collect data into one or more Apify datasets
2. Copy the dataset ID(s): find them in the Apify Console under your run's Storage tab
3. Choose your dedup fields: pick the field(s) that uniquely identify each record
4. Run this Actor: pass the dataset IDs and field names as input
5. Get clean data: deduplicated items appear in the output dataset

Input parameters

Parameter	Type	Required	Default	Description
`datasetIds`	string[]	No*	—	Apify dataset IDs to merge and deduplicate
`jsonData`	array	No*	—	Direct JSON array of objects to deduplicate
`fields`	string[]	Yes	—	Field names for the dedup key

*Provide either datasetIds or jsonData (or both).
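
The input rules in the table can be checked before a run is started. The sketch below is a hypothetical client-side helper (`build_run_input` is not part of the Actor or of the Apify client); it only assembles the input object and assumes that an empty `fields` list should be rejected.

```python
from typing import Any, Optional


def build_run_input(
    fields: list[str],
    dataset_ids: Optional[list[str]] = None,
    json_data: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Assemble the Actor input, enforcing the rules from the parameter table."""
    if not fields:
        raise ValueError("fields is required and must not be empty")
    if not dataset_ids and not json_data:
        raise ValueError("provide datasetIds, jsonData, or both")
    run_input: dict[str, Any] = {"fields": list(fields)}
    if dataset_ids:
        run_input["datasetIds"] = list(dataset_ids)
    if json_data:
        run_input["jsonData"] = list(json_data)
    return run_input


example = build_run_input(["email"], json_data=[{"email": "john@example.com"}])
print(example)
```

The returned dictionary can be passed as `run_input` to the Python client call shown in the API section.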

Composite key examples

Use case	Fields	Effect
Unique URLs	`["url"]`	One row per URL
Unique emails	`["email"]`	One row per email address
Unique people	`["firstName", "lastName", "company"]`	One row per person at each company
Unique products	`["sku", "marketplace"]`	One row per SKU per marketplace

Output example

Deduplicated items retain their original structure — no fields are added or removed:

[
{"name":"John Doe","email":"john@example.com","company":"Acme"},
{"name":"Jane Smith","email":"jane@example.com","company":"Beta"},
{"name":"Bob Wilson","email":"bob@example.com","company":"Gamma"}
]

Run statistics are stored under a stats key in the key-value store:

{
"totalLoaded":5000,
"uniqueKept":3200,
"duplicatesRemoved":1800,
"datasetsProcessed":3
}
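
As a quick sanity check on these numbers, the sketch below computes the share of duplicates from the example stats. It assumes that every loaded item is either kept or removed, which holds for the figures shown (3,200 + 1,800 = 5,000); the function name `duplicate_rate` is hypothetical.

```python
stats = {
    "totalLoaded": 5000,
    "uniqueKept": 3200,
    "duplicatesRemoved": 1800,
    "datasetsProcessed": 3,
}


def duplicate_rate(stats: dict[str, int]) -> float:
    """Fraction of loaded items that were dropped as duplicates."""
    if stats["totalLoaded"] == 0:
        return 0.0
    return stats["duplicatesRemoved"] / stats["totalLoaded"]


# Assumed invariant: every loaded item is either kept or removed.
assert stats["uniqueKept"] + stats["duplicatesRemoved"] == stats["totalLoaded"]
print(f"{duplicate_rate(stats):.0%} of loaded items were duplicates")
```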

How to use via API

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish
run = client.actor("parsebird/dataset-deduplicator").call(run_input={
    "datasetIds": ["DATASET_ID_1", "DATASET_ID_2"],
    "fields": ["email"],
})

# Read the deduplicated items from the run's default dataset
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Unique items: {len(items)}")

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('parsebird/dataset-deduplicator').call({
    datasetIds: ['DATASET_ID_1', 'DATASET_ID_2'],
    fields: ['firstName', 'lastName', 'company'],
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Unique items: ${items.length}`);

cURL

curl -X POST "https://api.apify.com/v2/acts/parsebird~dataset-deduplicator/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetIds": ["DATASET_ID_1"],
    "fields": ["url"]
  }'

Tips and best practices

Start with a single field — url or email usually covers most use cases
Use composite keys carefully — the more fields, the stricter the matching (fewer duplicates found)
Matching is always case-insensitive with whitespace trimming — no configuration needed
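
To see why adding fields makes matching stricter, the following sketch counts distinct keys for the same records under progressively larger field lists. It is an illustration under the same assumptions as the matching description (trimmed, lowercased values); `count_unique` is a hypothetical helper and the sample records are made up.

```python
def count_unique(items, fields):
    """Count distinct keys, comparing trimmed, lowercased field values."""
    keys = {
        tuple(str(item.get(f, "")).strip().lower() for f in fields)
        for item in items
    }
    return len(keys)


people = [
    {"firstName": "John", "lastName": "Doe", "company": "Acme"},
    {"firstName": "john", "lastName": "DOE ", "company": "Beta"},
    {"firstName": "Jane", "lastName": "Doe", "company": "Acme"},
]

print(count_unique(people, ["lastName"]))                          # 1
print(count_unique(people, ["firstName", "lastName"]))             # 2
print(count_unique(people, ["firstName", "lastName", "company"]))  # 3
```

With one field, all three records collapse into a single row; with all three fields, none of them is treated as a duplicate.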

Pricing

This Actor uses a pay-per-event pricing model.

Event	Price per event	Price per 1,000
`items-processed`	$0.00149	$1.49

Charged per 1,000 items loaded (not per unique item). Platform compute costs are additional.
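
A rough cost estimate follows directly from the per-event price. The sketch below assumes the charge scales linearly with the number of items loaded and ignores platform compute costs; whether billing is rounded to whole blocks of 1,000 is not stated here, so treat the result as an approximation. `estimate_event_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Price of one items-processed event, taken from the pricing table.
PRICE_PER_ITEM = Decimal("0.00149")


def estimate_event_cost(items_loaded: int) -> Decimal:
    """Estimated event charge in USD, excluding platform compute costs."""
    if items_loaded < 0:
        raise ValueError("items_loaded must be non-negative")
    return (PRICE_PER_ITEM * items_loaded).quantize(Decimal("0.01"))


print(estimate_event_cost(5000))  # 7.45
```

Because the charge is based on items loaded, a run that loads 5,000 items and keeps 3,200 is estimated on all 5,000.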


URL: https://apify.com/parsebird/dataset-deduplicator
