Pricing
from $1.49 / 1,000 items cleaned
Data Cleaner
Clean messy data: remove nulls, normalize case, trim whitespace, format phone numbers and emails, extract domains, convert types, and more. Works with Apify datasets or direct JSON input.
Clean messy scraped data in one step: trim whitespace, normalize casing, format phone numbers to E.164, lowercase emails, extract domains from URLs, convert strings to numbers, remove null rows, and deduplicate. The first general-purpose data cleaner on Apify.
Copy to your AI assistant
Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this actor.
parsebird/data-cleaner on Apify. Call: ApifyClient("TOKEN").actor("parsebird/data-cleaner").call(run_input={...}), then client.dataset(run["defaultDatasetId"]).list_items().items for cleaned results. Key inputs: datasetId (string, Apify dataset ID), jsonData (array of objects, direct JSON input), operations (array of {field, action, options}, required), outputDatasetId (string, optional), maxItems (integer, default 1000000). Actions: trim_whitespace, normalize_case (options: {case: "lower"|"upper"|"title"}), format_email, format_phone (options: {countryCode: "US"}), extract_domain, to_number, to_date, fill_nulls (options: {value: "..."}), remove_nulls, remove_duplicates, replace_value (options: {find, replace}). Full actor spec: fetch build via GET https://api.apify.com/v2/acts/parsebird~data-cleaner (Bearer TOKEN). Get token: https://console.apify.com/account/integrations
What does Data Cleaner do?
This Actor takes messy scraped or imported data and applies a configurable pipeline of cleaning operations. Each operation targets a specific field and transforms its values: trimming whitespace, normalizing case, formatting phone numbers, and more.
Use cases:
- CRM cleanup: normalize names, emails, and phone numbers before import
- Lead list hygiene: remove rows with missing emails, deduplicate by company
- Post-scrape processing: extract domains from URLs, convert price strings to numbers
- Data pipeline prep: standardize data format before analysis or export
Supported operations
| Action | Description | Options | Before | After |
|---|---|---|---|---|
| trim_whitespace | Remove leading/trailing spaces | - | " John Doe " | "John Doe" |
| normalize_case | Convert to lower/upper/title case | {"case": "title"} | "john doe" | "John Doe" |
| format_email | Lowercase and trim emails | - | " JOHN@CO.COM " | "john@co.com" |
| format_phone | Normalize to E.164 format | {"countryCode": "US"} | "(555) 123-4567" | "+15551234567" |
| extract_domain | Extract domain from URL or email | - | "https://www.example.com/page" | "example.com" |
| to_number | Convert string to number | - | "$1,234,567" | 1234567 |
| to_date | Parse date to ISO 8601 | - | "March 15, 2024" | "2024-03-15T00:00:00" |
| fill_nulls | Replace null/empty with default | {"value": "N/A"} | null | "N/A" |
| remove_nulls | Remove rows where field is null/empty | - | (row removed) | - |
| remove_duplicates | Deduplicate by this field | - | (duplicate removed) | - |
| replace_value | Find and replace text | {"find": "Inc.", "replace": "Inc"} | "Acme Inc." | "Acme Inc" |
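
To make the Before/After columns concrete, here is a minimal local sketch of four of the field-level actions in plain Python. It is not the Actor's implementation: the function bodies are assumptions written only to reproduce the table's examples, and the real Actor may treat edge cases (ports, subdomains, non-string values) differently.

```python
from urllib.parse import urlparse


def trim_whitespace(value):
    """Strip leading/trailing whitespace; non-strings pass through unchanged."""
    return value.strip() if isinstance(value, str) else value


def normalize_case(value, case="lower"):
    """Convert a string to lower, upper, or title case."""
    if not isinstance(value, str):
        return value
    converters = {"lower": str.lower, "upper": str.upper, "title": str.title}
    return converters[case](value)


def format_email(value):
    """Trim and lowercase an email address."""
    return value.strip().lower() if isinstance(value, str) else value


def extract_domain(value):
    """Return the bare domain of a URL, an email address, or a hostname."""
    if not isinstance(value, str) or not value.strip():
        return value
    text = value.strip().lower()
    if "@" in text and "://" not in text:
        # Email address: keep everything after the last "@".
        host = text.rsplit("@", 1)[1]
    else:
        # urlparse only fills netloc when a scheme or "//" prefix is present.
        host = urlparse(text if "://" in text else "//" + text).netloc
    # Assumption of this sketch: only a leading "www." is stripped.
    return host[4:] if host.startswith("www.") else host


print(normalize_case(trim_whitespace(" john doe "), case="title"))
print(format_email(" JOHN@CO.COM "))
print(extract_domain("https://www.example.com/page"))
```

Chaining `trim_whitespace` and `normalize_case` by hand, as in the first `print`, mirrors how two operations on the same field would be listed one after the other.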
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| datasetId | string | No* | - | Apify dataset ID to clean |
| jsonData | array | No* | - | Direct JSON array of objects to clean |
| operations | array | Yes | - | List of {field, action, options} cleaning operations |
| outputDatasetId | string | No | - | Named output dataset (defaults to run dataset) |
| maxItems | integer | No | 1000000 | Max items to process |
*Provide either datasetId or jsonData (or both).
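
The rules in the table can be checked on the client side before a run is started. The helper below is a hypothetical convenience, not part of the Actor or of `apify_client`; it only assembles a `run_input` dictionary from the documented parameter names. Rejecting an empty `operations` list is an assumption of this sketch.

```python
def build_run_input(operations, dataset_id=None, json_data=None,
                    output_dataset_id=None, max_items=None):
    """Assemble a run_input dict using the documented parameter names."""
    # "operations" is the only required parameter. Treating an empty list
    # as invalid is an assumption of this sketch.
    if not operations:
        raise ValueError("operations is required")
    # At least one data source must be given (datasetId, jsonData, or both).
    if dataset_id is None and json_data is None:
        raise ValueError("provide datasetId, jsonData, or both")

    run_input = {"operations": list(operations)}
    if dataset_id is not None:
        run_input["datasetId"] = dataset_id
    if json_data is not None:
        run_input["jsonData"] = list(json_data)
    if output_dataset_id is not None:
        run_input["outputDatasetId"] = output_dataset_id
    if max_items is not None:
        # Omitting maxItems leaves the documented default (1000000) in effect.
        run_input["maxItems"] = int(max_items)
    return run_input


run_input = build_run_input(
    operations=[{"field": "email", "action": "format_email"}],
    json_data=[{"email": " JOHN@CO.COM "}],
    max_items=500,
)
print(run_input)
```

The resulting dictionary can be passed as `run_input` to the Python client call or serialized as the JSON body of the API request.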
Operations format
Each operation is a JSON object with:
{"field":"email","action":"format_email","options":{}}
Operations are applied in order. You can chain multiple operations on the same field:

    [
      {"field": "name", "action": "trim_whitespace"},
      {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
      {"field": "email", "action": "format_email"},
      {"field": "phone", "action": "format_phone", "options": {"countryCode": "US"}},
      {"field": "website", "action": "extract_domain"},
      {"field": "revenue", "action": "to_number"},
      {"field": "email", "action": "remove_nulls"}
    ]

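
The "applied in order" behavior can be pictured as a loop over the operations list, where each step sees the output of the previous one. The following is a rough local model, assuming hypothetical handlers for two field-level actions plus `remove_nulls` as a row-level filter; it does not reflect how the Actor is actually built.

```python
def _normalize_case(value, case="lower"):
    # Only the three documented cases are handled.
    converters = {"lower": str.lower, "upper": str.upper, "title": str.title}
    return converters[case](value) if isinstance(value, str) else value


# Hypothetical local handlers for two of the documented actions.
HANDLERS = {
    "trim_whitespace": lambda value: value.strip() if isinstance(value, str) else value,
    "normalize_case": _normalize_case,
}


def apply_operations(rows, operations):
    """Apply each operation to every row, strictly in list order."""
    rows = [dict(row) for row in rows]  # copy so the input is not mutated
    for op in operations:
        field, action = op["field"], op["action"]
        options = op.get("options") or {}
        if action == "remove_nulls":
            # Row-level action: drop rows whose field is missing, null, or empty.
            rows = [row for row in rows if row.get(field) not in (None, "")]
            continue
        handler = HANDLERS[action]
        for row in rows:
            if field in row:
                row[field] = handler(row[field], **options)
    return rows


rows = [
    {"name": "  john doe ", "email": "john@example.com"},
    {"name": "JANE SMITH", "email": None},
]
operations = [
    {"field": "name", "action": "trim_whitespace"},
    {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
    {"field": "email", "action": "remove_nulls"},
]
cleaned = apply_operations(rows, operations)
print(cleaned)
```

In this sketch, order can change the result: trimming a whitespace-only value before `remove_nulls` drops the row, while the reverse order keeps it. Whether the Actor treats whitespace-only values the same way is not stated in this README.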
Before and after example
Input (dirty data)

    [
      {"name": " john doe ", "email": " JOHN@EXAMPLE.COM ", "phone": "(555) 123-4567", "website": "https://www.example.com/about", "revenue": "$1,234,567"},
      {"name": "JANE SMITH", "email": "Jane.Smith@Company.IO", "phone": "555.987.6543", "website": "info@company.io", "revenue": "2345678"},
      {"name": "", "email": null, "phone": "1-800-555-0199", "website": "company.io", "revenue": "$99.99"},
      {"name": "bob wilson", "email": "bob@test.com", "phone": "+14155550100", "website": "https://test.com/page?id=1", "revenue": "not a number"}
    ]

Output (cleaned data)

    [
      {"name": "John Doe", "email": "john@example.com", "phone": "+15551234567", "website": "example.com", "revenue": 1234567},
      {"name": "Jane Smith", "email": "jane.smith@company.io", "phone": "+15559876543", "website": "company.io", "revenue": 2345678},
      {"name": "Bob Wilson", "email": "bob@test.com", "phone": "+14155550100", "website": "test.com", "revenue": "not a number"}
    ]

Row 3 was removed because its email is null (remove_nulls). Names are title-cased, emails lowercased, phones formatted as E.164, and domains extracted. Revenues are converted to numbers where they can be parsed; the value "not a number" in the last row is left unchanged.
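
The last row of the output keeps "not a number" as a string, which suggests that to_number leaves unparseable values alone. One way to model that behavior locally is sketched below. The stripping rule (drop everything except digits, dots, and minus signs) is an assumption that happens to match the examples; the Actor's actual parsing rules are not documented here.

```python
import re


def to_number(value):
    """Convert strings like "$1,234,567" to numbers; return the input unchanged on failure."""
    if not isinstance(value, str):
        return value
    # Assumption: currency symbols, thousands separators, and spaces are dropped.
    cleaned = re.sub(r"[^0-9.\-]", "", value)
    # Try an integer first so whole amounts do not become floats.
    for parse in (int, float):
        try:
            return parse(cleaned)
        except ValueError:
            continue
    return value  # nothing numeric could be recovered


for raw in ["$1,234,567", "2345678", "$99.99", "not a number"]:
    print(repr(raw), "->", repr(to_number(raw)))
```

Note that this sketch is lenient: a string such as "abc1" would become 1, which may or may not match what the Actor does.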
How to use via API
Python

    from apify_client import ApifyClient

    # Authenticate with your Apify API token.
    client = ApifyClient("YOUR_API_TOKEN")

    # Start the Actor and wait for the run to finish.
    run = client.actor("parsebird/data-cleaner").call(
        run_input={
            "datasetId": "YOUR_DATASET_ID",
            "operations": [
                {"field": "email", "action": "format_email"},
                {"field": "name", "action": "trim_whitespace"},
                {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
                {"field": "phone", "action": "format_phone", "options": {"countryCode": "US"}},
            ],
        }
    )

    # Fetch the cleaned items from the run's default dataset.
    items = client.dataset(run["defaultDatasetId"]).list_items().items
    print(f"Cleaned items: {len(items)}")

cURL

    curl -X POST "https://api.apify.com/v2/acts/parsebird~data-cleaner/runs?token=YOUR_API_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "jsonData": [{"name": " JOHN DOE ", "email": " JOHN@CO.COM "}],
        "operations": [
          {"field": "name", "action": "trim_whitespace"},
          {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
          {"field": "email", "action": "format_email"}
        ]
      }'

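
For environments without the `apify_client` package, the same POST request can be prepared with Python's standard library. This sketch only builds the request object and does not send it. The helper name `build_run_request` is hypothetical; the URL and header are taken from the cURL command.

```python
import json
import urllib.parse
import urllib.request

RUNS_URL = "https://api.apify.com/v2/acts/parsebird~data-cleaner/runs"


def build_run_request(token, run_input):
    """Build (but do not send) the POST request that starts a run."""
    query = urllib.parse.urlencode({"token": token})
    body = json.dumps(run_input).encode("utf-8")
    return urllib.request.Request(
        f"{RUNS_URL}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "YOUR_API_TOKEN",
    {
        "jsonData": [{"name": " JOHN DOE ", "email": " JOHN@CO.COM "}],
        "operations": [
            {"field": "name", "action": "trim_whitespace"},
            {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
            {"field": "email", "action": "format_email"},
        ],
    },
)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it, which requires a valid token and starts a billable run.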
Output
Cleaned items retain their original structure. A stats key is stored in the key-value store:
{"totalLoaded":5000,"totalCleaned":4800,"operationsApplied":7,"fieldsCleaned":5,"totalChanges":15200}
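
A small helper can turn the stats record into a quick sanity check after a run. The interpretation below, that totalLoaded minus totalCleaned equals the number of rows dropped, is an assumption based on the field names; `summarize_stats` is a hypothetical helper, not something the Actor provides.

```python
def summarize_stats(stats):
    """Derive a few ratios from the stats record shown above."""
    loaded = stats["totalLoaded"]
    cleaned = stats["totalCleaned"]
    # Assumption: the gap between loaded and cleaned items is rows removed
    # by row-level actions such as remove_nulls or remove_duplicates.
    removed = loaded - cleaned
    return {
        "rowsRemoved": removed,
        "removedShare": removed / loaded if loaded else 0.0,
        "changesPerCleanedItem": stats["totalChanges"] / cleaned if cleaned else 0.0,
    }


stats = {
    "totalLoaded": 5000,
    "totalCleaned": 4800,
    "operationsApplied": 7,
    "fieldsCleaned": 5,
    "totalChanges": 15200,
}
summary = summarize_stats(stats)
print(summary["rowsRemoved"], round(summary["removedShare"], 4))
```

With the Python client, the record itself can presumably be read through `client.key_value_store(run["defaultKeyValueStoreId"]).get_record(...)`; the exact key name and whether the run's default store is used are not spelled out here, so verify both against a real run.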
Pricing
This Actor uses a pay-per-event pricing model.
| Event | Price per event | Price per 1,000 |
|---|---|---|
| items-cleaned | $0.00149 | $1.49 |
Charged per 1,000 items loaded. Platform compute costs are additional.
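
As a rough budgeting aid, the event price can be multiplied out for a given item count. The sketch below assumes linear per-item billing at the listed event price. Whether charges are rounded up to whole blocks of 1,000 is not stated, and platform compute costs are not included.

```python
from decimal import Decimal, ROUND_HALF_UP

# items-cleaned event price from the pricing table, in USD.
PRICE_PER_ITEM = Decimal("0.00149")


def estimate_event_cost(item_count):
    """Estimate the pay-per-event charge in USD, rounded to cents."""
    if item_count < 0:
        raise ValueError("item_count must be non-negative")
    cost = PRICE_PER_ITEM * item_count
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for count in (1000, 5000, 250000):
    print(count, "items ->", estimate_event_cost(count), "USD")
```

`Decimal` is used instead of `float` so that small per-item prices add up without binary rounding noise.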
