URL: https://apify.com/moving_beacon-owner1/my-actor-66


Data Cleaning & Transformation Toolkit

Pricing

from $10.00 / 1,000 results

A powerful, multi-mode Apify actor that transforms messy, unstructured data into clean, structured JSON, ready for APIs, databases, or downstream processing.

Rating

0.0

(0)

Developer

Jamshaid Arif

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

2 months ago

Last modified

🧹 Data Cleaning & Transformation Toolkit β€” Apify Actor


🎯 What It Does

ModeInputOutput
Messy TextInconsistent text with mixed delimitersClean JSON records
Excel / CSV.xlsx or .csv file URLAPI-ready JSON with metadata
HTML ScrapeRaw HTML or live URLsStructured dataset (tables, elements, links)
Key-Value.env, .ini, logs, YAML-like textParsed JSON object or records
URL FetchAny webpage URLAuto-extracted structured data

🚀 Quick Start Examples

1. Messy Text → JSON

{
  "mode": "messy_text",
  "inputText": "Name: John Doe | Age: 29 | City: New York\nName: Jane Smith | Age: 32 | City: LA",
  "textParseStrategy": "auto",
  "outputFormat": "records"
}

Output:

[
  {"name": "John Doe", "age": 29, "city": "New York"},
  {"name": "Jane Smith", "age": 32, "city": "LA"}
]
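The transformation above can be approximated in a few lines of Python. This is a minimal sketch and not the actor's actual parser: the function name `parse_delimited_pairs`, the delimiter set (pipe, semicolon, tab), and the digits-only integer cast are all assumptions made for illustration.

```python
import re


def parse_delimited_pairs(text):
    """Parse lines like 'Name: John | Age: 29' into a list of dicts.

    Hypothetical helper: the delimiter set and casting rule are assumptions.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = {}
        # Split on pipes, semicolons, or tabs; commas are left alone
        # because values such as "Austin, TX" may contain them.
        for part in re.split(r"\s*[|;\t]\s*", line):
            if ":" not in part:
                continue
            key, value = part.split(":", 1)
            key = key.strip().lower().replace(" ", "_")
            value = value.strip()
            # Cast digit-only strings to int, keep everything else as text.
            record[key] = int(value) if value.isdigit() else value
        if record:
            records.append(record)
    return records


sample = "Name: John Doe | Age: 29 | City: New York\nName: Jane Smith | Age: 32 | City: LA"
records = parse_delimited_pairs(sample)
print(records)
```

For the sample input this yields the same two records as the documented output, with `age` as an integer.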

2. Excel / CSV → API-Ready JSON

{
  "mode": "excel_csv",
  "fileUrl": "https://example.com/data/sales_report.xlsx",
  "sheetName": "Q1",
  "skipEmptyRows": true,
  "outputFormat": "wrapped"
}

Output:

{
  "meta": {
    "source": "sales_report.xlsx",
    "sheet": "Q1",
    "total_records": 150,
    "columns": ["id", "product", "revenue"],
    "generated_at": "2026-04-01T12:00:00"
  },
  "data": [
    {"id": 1, "product": "Widget A", "revenue": 9999.50}
  ]
}

3. Scrape a Website

{
  "mode": "html_scrape",
  "urls": [{"url": "https://books.toscrape.com/"}],
  "htmlExtractMode": "elements",
  "cssSelector": "article.product_pod",
  "fieldMap": "{\"title\": \"h3 a\", \"price\": \".price_color\"}"
}
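Because `fieldMap` is passed as a JSON string rather than a nested object, hand-escaping the inner quotes is easy to get wrong. One way to avoid that, sketched here with only the standard library, is to build the map as a dict and serialize it twice; the variable names `field_map`, `actor_input`, and `payload` are illustrative.

```python
import json

# Output keys mapped to CSS selectors, evaluated relative to each
# element matched by cssSelector.
field_map = {"title": "h3 a", "price": ".price_color"}

actor_input = {
    "mode": "html_scrape",
    "urls": [{"url": "https://books.toscrape.com/"}],
    "htmlExtractMode": "elements",
    "cssSelector": "article.product_pod",
    # fieldMap is documented as a JSON string, so serialize the dict first.
    "fieldMap": json.dumps(field_map),
}

# Serializing the whole input escapes the inner quotes automatically.
payload = json.dumps(actor_input, indent=2)
print(payload)
```

The printed payload can be pasted into the actor's JSON input editor or saved to a file as-is.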

4. Parse Config Files

{
  "mode": "key_value",
  "inputText": "[database]\nhost = localhost\nport = 5432\n\n[cache]\ndriver = redis\nttl = 3600",
  "kvFormat": "auto",
  "outputFormat": "flat"
}
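The README does not show the output for this example. As a rough idea of what parsing INI-style text into a single object involves, here is a sketch built on Python's standard `configparser`. The helper name `parse_ini` and the nested section-to-keys shape are assumptions; the actor may structure its result differently.

```python
import configparser


def parse_ini(text):
    """Parse INI text into {section: {key: value}} (assumed shape)."""
    # interpolation=None keeps literal '%' characters in values intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve key case instead of lowercasing
    parser.read_string(text)
    result = {}
    for section in parser.sections():
        result[section] = {}
        for key, value in parser.items(section):
            # Cast digit-only strings to int, mirroring the type-casting feature.
            result[section][key] = int(value) if value.isdigit() else value
    return result


ini_text = "[database]\nhost = localhost\nport = 5432\n\n[cache]\ndriver = redis\nttl = 3600"
config = parse_ini(ini_text)
print(config)
```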

5. Auto-Extract from URL

{
  "mode": "url_fetch",
  "urls": [{"url": "https://en.wikipedia.org/wiki/Web_scraping"}],
  "outputFormat": "records"
}

⚙️ Input Schema Reference

Core Settings

| Field | Type | Default | Description |
|---|---|---|---|
| mode | enum | messy_text | Transformation mode |
| inputText | string | (sample data) | Raw text input |
| fileUrl | string | "" | URL to download a file from |
| urls | array | [{url: "https://books.toscrape.com/"}] | URLs to scrape |
| outputFormat | enum | records | Output structure: records, wrapped, or flat |

Text Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| textParseStrategy | enum | auto | auto, delimited, key_value, or block |

Key-Value Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| kvFormat | enum | auto | auto, env, ini, log, or yaml |

HTML Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| htmlExtractMode | enum | auto | tables, elements, links, text, or auto |
| cssSelector | string | article.product_pod | CSS selector for repeating elements |
| fieldMap | JSON string | (book fields) | Maps output keys to CSS selectors |

Excel Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| sheetName | string | "" | Specific sheet (empty = all) |
| skipEmptyRows | boolean | true | Remove blank rows |
| forwardFillColumns | string | "" | Comma-separated columns to forward-fill |
| pageSize | integer | 0 | Records per page (0 = no pagination) |
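To make `forwardFillColumns` and `pageSize` concrete, the sketch below shows what forward-filling and paginating a list of row dicts could look like. The helpers `forward_fill` and `paginate`, the 1-based `page` argument, and the treatment of `None` and empty strings as blank cells are assumptions for illustration, not the actor's confirmed behavior.

```python
def forward_fill(rows, columns):
    """Copy the last non-blank value down each listed column.

    Typical use: spreadsheets where a merged cell leaves later rows empty.
    """
    last_seen = {}
    filled = []
    for row in rows:
        new_row = dict(row)  # do not mutate the caller's rows
        for col in columns:
            if new_row.get(col) in (None, ""):
                if col in last_seen:
                    new_row[col] = last_seen[col]
            else:
                last_seen[col] = new_row[col]
        filled.append(new_row)
    return filled


def paginate(rows, page_size, page=1):
    """Return one page of rows; page_size 0 means no pagination."""
    if page_size <= 0:
        return rows
    start = (page - 1) * page_size
    return rows[start:start + page_size]


rows = [
    {"region": "North", "product": "Widget A"},
    {"region": None, "product": "Widget B"},
    {"region": "South", "product": "Widget C"},
    {"region": "", "product": "Widget D"},
]
# forwardFillColumns is a comma-separated string in the input schema.
columns = [c.strip() for c in "region".split(",") if c.strip()]
filled = forward_fill(rows, columns)
second_page = paginate(filled, page_size=2, page=2)
print(second_page)
```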

Network Options

| Field | Type | Default | Description |
|---|---|---|---|
| proxyConfiguration | object | {useApifyProxy: true} | Proxy settings |
| maxRequestRetries | integer | 3 | Max retries for HTTP requests |

📤 Output Formats

records (default)

Each extracted record becomes its own row in the Apify dataset. Best for large datasets and downstream processing.

wrapped

A single dataset entry with meta (source info, column names, timestamps) and data (array of records). Best for API responses.

flat

Outputs the parsed object directly. Ideal for config file parsing where you want a single JSON object.
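The three formats differ mainly in how many dataset items are produced and what each item contains. The sketch below illustrates that difference under stated assumptions: `shape_output` is a hypothetical helper, the `meta` keys follow the wrapped example shown earlier (without `sheet`), and the handling of a list passed to `flat` is a guess.

```python
from datetime import datetime


def shape_output(parsed, output_format, source=""):
    """Return the list of dataset items for a given outputFormat (sketch)."""
    records = parsed if isinstance(parsed, list) else [parsed]
    if output_format == "records":
        # One dataset item per extracted record.
        return list(records)
    if output_format == "wrapped":
        # Collect column names in first-seen order.
        columns = []
        for record in records:
            for key in record:
                if key not in columns:
                    columns.append(key)
        return [{
            "meta": {
                "source": source,
                "total_records": len(records),
                "columns": columns,
                "generated_at": datetime.now().isoformat(timespec="seconds"),
            },
            "data": list(records),
        }]
    if output_format == "flat":
        # The parsed object itself becomes the single dataset item.
        return [parsed] if isinstance(parsed, dict) else list(records)
    raise ValueError(f"Unknown outputFormat: {output_format}")


rows = [
    {"id": 1, "product": "Widget A", "revenue": 9999.50},
    {"id": 2, "product": "Widget B", "revenue": 120.0},
]
wrapped = shape_output(rows, "wrapped", source="sales_report.xlsx")
print(wrapped[0]["meta"]["columns"])
```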


🧠 Smart Features

  • Auto-Detection: Every mode has an auto strategy that detects the input format
  • Type Casting: Strings like "42", "true", "null" are automatically cast to native types
  • Key Normalization: All field names are converted to snake_case
  • Merged Cell Handling: Forward-fill support for Excel files with merged cells
  • Pagination: Built-in page/pageSize support for large Excel datasets
  • Metadata: Wrapped output includes source file, column names, timestamps, and record counts
  • Error Resilience: Failed URLs are logged with _error fields instead of crashing
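Type casting and key normalization are the two features most likely to change how downstream code reads the output, so a small illustration may help. The rules below (which strings count as booleans or null, how camelCase is split) are assumptions chosen for the sketch; the actor's exact rules are not documented here.

```python
import re


def cast_value(value):
    """Cast common string literals to native Python types (assumed rules)."""
    if not isinstance(value, str):
        return value
    text = value.strip()
    lowered = text.lower()
    if lowered == "null":
        return None
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    if re.fullmatch(r"[+-]?\d+", text):
        return int(text)
    if re.fullmatch(r"[+-]?\d*\.\d+", text):
        return float(text)
    return text


def to_snake_case(name):
    """Convert 'First Name', 'totalRevenue', or 'Order-ID' to snake_case."""
    # Insert an underscore at lower-to-upper boundaries (camelCase).
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name.strip())
    # Collapse any run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


raw = {"First Name": "Alice", "totalRevenue": "9999.50", "Order-ID": "42",
       "Active": "true", "Notes": "null"}
cleaned = {to_snake_case(k): cast_value(v) for k, v in raw.items()}
print(cleaned)
```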

📂 Project Structure

apify-actor/
├── .actor/
│   ├── actor.json          # Actor configuration
│   └── input_schema.json   # Input schema with defaults
├── main.py                 # Actor entry point
├── data_transformer.py     # Core transformation engine
├── requirements.txt        # Python dependencies
├── Dockerfile              # Container build instructions
└── README.md               # This file

πŸ—οΈ Local Development

# Install dependencies
pip install -r requirements.txt
# Run locally with Apify CLI
apify run --input='{"mode": "messy_text", "inputText": "Name: Alice | Age: 30"}'

📜 License

ISC
