Advanced Product Matcher Pro

Pricing

$0.10 / 1,000 results

Developer

Whisperers

Maintained by Community

Last modified: 4 months ago

AI Product Matcher Actor

A powerful Apify Actor that intelligently matches products between two datasets using advanced machine learning algorithms and configurable similarity scoring. Perfect for e-commerce catalog matching, product deduplication and inventory reconciliation.

Features

Multi-format Support: Works with both CSV files (KeyValueStore) and JSON datasets
Flexible Data Sources: Load data directly from Apify Datasets or KeyValueStore
Intelligent Matching: Uses Sentence Transformers and cosine similarity for semantic product matching
Configurable Attributes: Weight different product attributes based on importance
Text Preprocessing: Built-in word removal, replacement, regex cleaning, and normalization
Performance Optimization: Group products by categories or other attributes for faster processing
Multilingual Support: Supports English, Spanish, French, German, Italian, Portuguese, Dutch, and multilingual models
Flexible Output: Customizable match results with similarity scores, original values, and additional output fields
Error Reporting: Structured error types for input validation, data loading, attribute configuration, model loading, and processing errors
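
The semantic matching listed above rests on cosine similarity between embedding vectors. The sketch below shows only that comparison step in plain Python. The three-dimensional vectors are made-up stand-ins for real Sentence Transformers embeddings, and `cosine_similarity` is an illustrative helper, not the Actor's own code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as "no similarity"
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of three product titles (hypothetical values).
iphone_a = [0.9, 0.1, 0.3]
iphone_b = [0.8, 0.2, 0.3]
blender = [0.1, 0.9, 0.2]

print(round(cosine_similarity(iphone_a, iphone_b), 3))
print(round(cosine_similarity(iphone_a, blender), 3))
```

The two similar vectors score close to 1.0, while the unrelated one scores much lower; a threshold separates the two cases.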

Quick Start

Basic Configuration Example

{
"dataFormat":"csv",
"dataSource":"datasets",
"dataset1":"catalog_products",
"dataset1Name":"Catalog",
"dataset1PrimaryKey":"ProductId",
"dataset2":"retailer_products",
"dataset2Name":"Retailer",
"dataset2PrimaryKey":"ProductId",
"threshold":0.7,
"maxMatches":2,
"language":"en",
"groupByAttribute":"category",
"csvSeparator":",",
"includeOriginalValues":true,
"attributes":[
{
"name":"title",
"weight":1.0,
"useForMatching":true
},
{
"name":"brand",
"weight":0.8,
"useForMatching":true
},
{
"name":"price",
"weight":0.3,
"useForMatching":false
}
]
}

Core Input Parameters

Parameter	Type	Description	Default
`dataFormat`	string	Data format: `"csv"` or `"json"`	`"json"`
`dataSource`	string	Source type: `"datasets"` or `"keyvaluestore"`	`"datasets"`
`keyValuestoreNameOrId`	string	Name or ID of KeyValueStore (if `dataSource: keyvaluestore`)	none
`dataset1`	string	First dataset key/ID (CSV filename or Dataset ID)	required
`dataset1Name`	string	Friendly name for dataset 1	`"Dataset1"`
`dataset1PrimaryKey`	string	Primary key field name in dataset 1	`"ProductId"`
`dataset2`	string	Second dataset key/ID	required
`dataset2Name`	string	Friendly name for dataset 2	`"Dataset2"`
`dataset2PrimaryKey`	string	Primary key field name in dataset 2	`"ProductId"`
`threshold`	number	Minimum overall similarity score for matches (0.0–1.0)	`0.5`
`maxMatches`	integer	Maximum number of matches returned per item	`2`
`language`	string	Embedding model selection: `"en"`, `"multilingual"`, `"es"`, `"fr"`, `"de"`, `"it"`, `"pt"`, `"nl"`	`"en"`
`groupByAttribute`	string	Attribute name to group by for efficient matching (optional)	none
`csvSeparator`	string	CSV delimiter (only when `dataFormat: csv`)	`","`
`includeOriginalValues`	boolean	Include original attribute values in the output records	`true`
`dataset1OutputFields`	array	Include specific attribute values in the output records from dataset 1	`["Field1"]`
`dataset2OutputFields`	array	Include specific attribute values in the output records from dataset 2	`["Field1", "Field2"]`
`attributes`	array	List of attribute configurations (see Attribute Configuration below)	required
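
Before starting a run, it can help to check an input object against the table above. The following is a minimal sketch that assumes only what the table states (required keys, allowed values, the 0.0–1.0 threshold range). `validate_input` is a hypothetical helper, and the Actor's own validation may be stricter or differ in detail.

```python
ALLOWED = {
    "dataFormat": {"csv", "json"},
    "dataSource": {"datasets", "keyvaluestore"},
    "language": {"en", "multilingual", "es", "fr", "de", "it", "pt", "nl"},
}
DEFAULTS = {"dataFormat": "json", "dataSource": "datasets", "language": "en"}

def validate_input(cfg):
    """Return a list of problems found in an input dict; an empty list means none found."""
    problems = []
    for key in ("dataset1", "dataset2", "attributes"):
        if not cfg.get(key):
            problems.append(f"{key} is required")
    for key, allowed in ALLOWED.items():
        value = cfg.get(key, DEFAULTS[key])
        if value not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}, got {value!r}")
    threshold = cfg.get("threshold", 0.5)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        problems.append("threshold must be a number, not " + type(threshold).__name__)
    elif not 0.0 <= threshold <= 1.0:
        problems.append("threshold must be between 0.0 and 1.0")
    for i, attr in enumerate(cfg.get("attributes") or []):
        if not isinstance(attr, dict) or not attr.get("name"):
            problems.append(f"attributes[{i}] needs a name")
    return problems

# A deliberately broken input: misspelled dataSource, string threshold, unnamed attribute.
bad = {"dataset1": "a", "dataset2": "b", "dataSource": "dataset",
       "threshold": "0.6", "attributes": [{"weight": 1.0}]}
for problem in validate_input(bad):
    print(problem)
```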

Attribute Configuration

Each entry in the `attributes` array supports:

name (string, required) — Column name (CSV) or attribute key (JSON)
weight (number) — Importance weight for matching (higher = more important)
useForMatching (boolean) — Whether to include in similarity calculation
jsonPath (string) — JSON path expression for nested data
wordsToRemove (array) — List of words to strip before matching
wordReplacements (object) — Mapping of terms to replace prior to matching
regex (string) — Regex to apply during preprocessing
normalizationRegex (string) — Regex applied before similarity calculation
normalizationReplacement (string) — Replacement for normalization regex

Text Preprocessing Example

{
"name":"brand",
"weight":0.8,
"useForMatching":true,
"wordsToRemove":["inc","llc","ltd","corp"],
"wordReplacements":{
"apple":"apple inc",
"samsung":"samsung electronics"
},
"regex":"\\b(inc|llc|ltd|corp)\\b",
"normalizationRegex":"[^a-zA-Z0-9\\s]",
"normalizationReplacement":""
}

Property	Type	Description
`wordsToRemove`	array	Words to remove from text
`wordReplacements`	object	Word substitution mapping
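`regex`	string	Regex pattern for text cleaning; matches are removed during preprocessing
`normalizationRegex`	string	Regex applied just before similarity calculation; matches are replaced with `normalizationReplacement`
`normalizationReplacement`	string	Replacement text for `normalizationRegex` matches (often an empty string)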
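
To make the effect of these properties concrete, here is a rough Python sketch of a preprocessing pass. The order of the steps, the lowercasing, and the whole-token handling of `wordReplacements` are assumptions of this sketch; the source does not document how the Actor sequences them. The `title_attr` configuration and the sample title are invented for illustration.

```python
import re

def preprocess(raw, attr):
    """Clean one attribute value. Step order and lowercasing are assumptions of this sketch."""
    tokens = str(raw).lower().split()
    replacements = attr.get("wordReplacements", {})
    tokens = [replacements.get(tok, tok) for tok in tokens]   # swap whole tokens
    stop = set(attr.get("wordsToRemove", []))
    tokens = [tok for tok in tokens if tok not in stop]       # drop unwanted words
    value = " ".join(tokens)
    if attr.get("regex"):
        value = re.sub(attr["regex"], "", value)              # delete regex matches
    return " ".join(value.split())                            # collapse leftover whitespace

def normalize(value, attr):
    """Apply normalizationRegex / normalizationReplacement just before comparison."""
    pattern = attr.get("normalizationRegex")
    if not pattern:
        return value
    return re.sub(pattern, attr.get("normalizationReplacement", ""), value)

# Hypothetical attribute configuration, shaped like the entries in `attributes`.
title_attr = {
    "name": "title",
    "wordsToRemove": ["new", "original"],
    "wordReplacements": {"w/": "with", "gray": "grey"},
    "normalizationRegex": "[^a-z0-9\\s]",
    "normalizationReplacement": "",
}

cleaned = preprocess("NEW Gray Backpack w/ Laptop-Sleeve", title_attr)
print(cleaned)                         # grey backpack with laptop-sleeve
print(normalize(cleaned, title_attr))  # grey backpack with laptopsleeve
```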

Real-World Examples

1. E-commerce Catalog Matching

{
"dataFormat":"csv",
"dataSource":"datasets",
"dataset1":"manufacturer_catalog.csv",
"dataset1Name":"Manufacturer",
"dataset1PrimaryKey":"ProductId",
"dataset2":"retailer_inventory.csv",
"dataset2Name":"Retailer",
"dataset2PrimaryKey":"ProductId",
"threshold":0.75,
"maxMatches":3,
"language":"en",
"groupByAttribute":"category",
"attributes":[
{
"name":"product_name",
"weight":1.5,
"useForMatching":true,
"wordsToRemove":["new","original","authentic"],
"wordReplacements":{"&":"and","w/":"with"}
},
{
"name":"brand",
"weight":1.2,
"useForMatching":true,
"wordsToRemove":["inc","llc","corp"],
"wordReplacements":{"apple":"apple inc","hp":"hewlett packard"}
},
{
"name":"model_number",
"weight":1.8,
"useForMatching":true,
"normalizationRegex":"[^A-Za-z0-9]",
"normalizationReplacement":""
},
{
"name":"price",
"weight":0.3,
"useForMatching":false,
"regex":"\\D"
}
]
}

2. Fashion Product Matching with Complex JSON

Matching fashion products from different suppliers with nested JSON data:

{
"dataFormat":"json",
"dataSource":"datasets",
"dataset1":"fashion_supplier_a",
"dataset1Name":"SupplierA",
"dataset1PrimaryKey":"ID",
"dataset2":"fashion_supplier_b",
"dataset2Name":"SupplierB",
"dataset2PrimaryKey":"ID",
"threshold":0.65,
"language":"multilingual",
"maxMatches":2,
"attributes":[
{
"name":"Color",
"jsonPath":"ProductAttributes[Type=Color].Value",
"weight":1.5,
"useForMatching":true,
"wordReplacements":{"gray":"grey","navy":"navy blue"}
},
{
"name":"Size",
"jsonPath":"ProductAttributes[Type=Size].Value",
"weight":1.8,
"useForMatching":true,
"wordsToRemove":["size","us","eu"],
"normalizationRegex":"[^0-9XLS]",
"normalizationReplacement":""
},
{
"name":"Material",
"jsonPath":"Details.Fabric.Primary",
"weight":1.2,
"useForMatching":true
}
],
"includeOriginalValues":false
}

3. Home & Garden Products

{
"dataFormat":"json",
"dataSource":"datasets",
"dataset1":"bedbath",
"dataset1Name":"BedBath",
"dataset1PrimaryKey":"ProductId",
"dataset2":"overstock",
"dataset2Name":"Overstock",
"dataset2PrimaryKey":"ProductId",
"threshold":0.6,
"language":"en",
"csvSeparator":",",
"groupByAttribute":"Model",
"maxMatches":3,
"attributes":[
{
"name":"Model",
"jsonPath":"AdhocDataAttributes[Name=Model].value",
"weight":1,
"useForMatching":false
},
{
"name":"Color",
"jsonPath":"AdhocDataAttributes[Name=Color].value",
"weight":2,
"useForMatching":true,
"wordReplacements":{
"gray":"grey",
"/":" "
}
},
{
"name":"Size",
"jsonPath":"AdhocDataAttributes[Name=Size].value",
"weight":3,
"useForMatching":true,
"regex":"\\D"
},
{
"name":"Shape",
"jsonPath":"AdhocDataAttributes[Name=Shape].value",
"weight":1,
"useForMatching":true
}
],
"dataset1OutputFields":[
"Address",
"ProductName"
]
}

Advanced Configuration

JSON Path Expressions

Dot notation: "product.details.name"
Array search: "Attributes[Name=Color].Value"
Nested arrays/objects for complex structures

Complex Nested Structures

{
"ProductAttributes":[
{"Type":"Color","Value":"Red"},
{"Type":"Size","Value":"Large"},
{"Type":"Material","Value":"Cotton"}
],
"Details":{
"Pricing":{"MSRP":29.99,"Sale":19.99},
"Specifications":{"Weight":"2.5 lbs"}
}
}

Corresponding JSON paths:

Color: "ProductAttributes[Type=Color].Value"
Size: "ProductAttributes[Type=Size].Value"
MSRP: "Details.Pricing.MSRP"
Weight: "Details.Specifications.Weight"
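
The path syntax above can be read as "follow each dot-separated key, and for `Key[Field=Value]` pick the list item whose `Field` equals `Value`". The resolver below is a sketch of that reading, not the Actor's parser: `resolve_path` is a hypothetical name, it returns the first matching list item, and it does not handle filter values that contain a dot.

```python
import re

# One path segment: a key, optionally followed by a [Field=Value] filter.
SEGMENT = re.compile(r"^(?P<key>[^.\[\]]+)(?:\[(?P<field>[^=\]]+)=(?P<value>[^\]]*)\])?$")

def resolve_path(record, path):
    """Follow a dotted path; return None when any step cannot be resolved."""
    current = record
    for part in path.split("."):
        match = SEGMENT.match(part)
        if match is None or not isinstance(current, dict):
            return None
        current = current.get(match["key"])
        if match["field"] is not None:
            if not isinstance(current, list):
                return None
            # Pick the first list item whose field equals the requested value.
            current = next(
                (item for item in current
                 if isinstance(item, dict)
                 and str(item.get(match["field"])) == match["value"]),
                None,
            )
        if current is None:
            return None
    return current

# The sample record from the section above.
product = {
    "ProductAttributes": [
        {"Type": "Color", "Value": "Red"},
        {"Type": "Size", "Value": "Large"},
        {"Type": "Material", "Value": "Cotton"},
    ],
    "Details": {
        "Pricing": {"MSRP": 29.99, "Sale": 19.99},
        "Specifications": {"Weight": "2.5 lbs"},
    },
}

print(resolve_path(product, "ProductAttributes[Type=Color].Value"))  # Red
print(resolve_path(product, "Details.Pricing.MSRP"))                 # 29.99
```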

Regular Expression Patterns

Size cleaning: remove non-digits {"regex": "\\D"}
Model normalization: keep alphanumeric {"normalizationRegex": "[^A-Za-z0-9]", "normalizationReplacement": ""}
Price extraction: strip currency symbols {"regex": "[^0-9.]"}
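
A quick way to check what these patterns do is to try them with Python's `re` module before putting them in the input. The sample strings are made up. Note that JSON needs the backslash doubled (`"\\D"`), while a Python raw string takes it once (`r"\D"`).

```python
import re

# Patterns copied from the list above; each match is replaced with an empty string.
size = re.sub(r"\D", "", "EU 42")                    # keep digits only
model = re.sub(r"[^A-Za-z0-9]", "", "WH-1000XM4/B")  # keep letters and digits
price = re.sub(r"[^0-9.]", "", "$1,299.99")          # keep digits and the decimal point

print(size)   # 42
print(model)  # WH1000XM4B
print(price)  # 1299.99
```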

Size Normalization

{
"name":"size",
"regex":"\\D",
"normalizationRegex":"[^0-9XLS]",
"normalizationReplacement":""
}

regex: Removes all non-digit characters during preprocessing
normalizationRegex: Before similarity calculation, removes everything except digits and the letters X, L, and S

Model Number Cleaning

{
"name":"model",
"regex":"\\b(model|version|v\\d+)\\b",
"normalizationRegex":"[^a-zA-Z0-9]",
"normalizationReplacement":""
}

Removes common model prefixes
Normalizes to alphanumeric only for comparison

Price Extraction

{
"name":"price",
"regex":"[^0-9.]",
"normalizationRegex":"\\$|,",
"normalizationReplacement":""
}

Extracts numeric price values
Removes currency symbols and commas

Brand Standardization

{
"name":"brand",
"regex":"\\b(inc|llc|ltd|corp|company)\\b",
"wordReplacements":{
"apple":"apple inc",
"hp":"hewlett packard",
"ms":"microsoft"
}
}

Performance Optimization

Grouping by an attribute reduces the full N×M comparison to smaller per-group subsets
- Note: if the group-by field lives in nested JSON, make sure it is also listed in attributes
Use the English model (all-MiniLM-L6-v2) for English-only data to speed up processing
Limit maxMatches for large catalogs
Disable matching (useForMatching: false) on grouping fields

Grouping Strategy

Use groupByAttribute to partition products into smaller groups:

{
"groupByAttribute":"category",
"attributes":[
{
"name":"category",
"weight":0.5,
"useForMatching":false
}
]
}

Benefits:

Reduces comparison matrix size from N×M to smaller subsets
Improves processing speed significantly for large datasets
More accurate matches within similar product categories
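
The size of the saving can be estimated with a few lines of Python. This sketch assumes that products are only compared when their group values are identical, which is how the description above reads. The two toy catalogs and the helper names are invented for illustration.

```python
from collections import defaultdict

def group_by(items, key):
    """Bucket items by the value of `key`."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get(key)].append(item)
    return groups

def comparison_count(left, right, key=None):
    """Number of pairwise comparisons, with or without grouping on `key`."""
    if key is None:
        return len(left) * len(right)
    left_groups = group_by(left, key)
    right_groups = group_by(right, key)
    return sum(
        len(items) * len(right_groups.get(group, []))
        for group, items in left_groups.items()
    )

# Toy catalogs: 100 products each, spread over a few categories.
catalog = [{"category": "phones"}] * 60 + [{"category": "laptops"}] * 40
retailer = ([{"category": "phones"}] * 50 + [{"category": "laptops"}] * 30
            + [{"category": "tablets"}] * 20)

print(comparison_count(catalog, retailer))              # 10000
print(comparison_count(catalog, retailer, "category"))  # 4200
```

Under this assumption, items whose group value appears in only one dataset (the tablets here) are never compared at all, so a noisy grouping field can hide valid matches.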

Language Model Selection

Choose appropriate models based on your data:

English: "en" - Fastest, best for English-only data
Multilingual: "multilingual" - Slower but handles mixed languages
Specific Languages: "es", "fr", "de", "it", "pt", "nl" - Optimized for specific languages

Output Format

The Actor generates matches with the following structure:

{
"Dataset1ProductId":"PROD123",
"Dataset2ProductId":"SKU456",
"overallSimilarity":0.85,
"titleSimilarity":0.92,
"brandSimilarity":1.0,
"colorSimilarity":0.75,
"Dataset1Title":"Apple iPhone 13 Pro",
"Dataset2Title":"iPhone 13 Pro - Apple",
"Dataset1Brand":"Apple",
"Dataset2Brand":"Apple Inc"
}
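
The record above reports per-attribute scores next to `overallSimilarity`, but the source does not state how the two relate. One common choice is a weighted average of the per-attribute scores, followed by the `threshold` and `maxMatches` cut-offs. The sketch below implements that assumption; the formula, the helper names, and the candidate data are illustrative, and are not claimed to reproduce the Actor's numbers.

```python
def overall_similarity(scores, weights):
    """Weighted average of per-attribute similarity scores (assumed formula)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[name] * weights[name] for name in scores) / total_weight

def top_matches(candidates, weights, threshold=0.5, max_matches=2):
    """Keep candidates at or above the threshold, best first, capped at max_matches."""
    scored = [(overall_similarity(c["scores"], weights), c["id"]) for c in candidates]
    kept = sorted((s for s in scored if s[0] >= threshold), reverse=True)
    return [(cid, round(score, 3)) for score, cid in kept[:max_matches]]

# Hypothetical weights and per-attribute scores for three candidate matches.
weights = {"title": 1.0, "brand": 0.8}
candidates = [
    {"id": "SKU456", "scores": {"title": 0.92, "brand": 1.0}},
    {"id": "SKU789", "scores": {"title": 0.60, "brand": 1.0}},
    {"id": "SKU111", "scores": {"title": 0.20, "brand": 0.1}},
]

print(top_matches(candidates, weights, threshold=0.7, max_matches=2))
```

Raising a weight pulls the overall score toward that attribute's score, which is why unique identifiers such as model numbers are usually weighted highest.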

Reading the SUMMARY

After execution, a SUMMARY record is saved to the KeyValueStore. It contains:

Total products per dataset
Number of matches and unique matches
Match rate
Model and data format used
Any collected errors with type, code, message, and suggestions

Review this summary to diagnose configuration or data issues quickly.

Best Practices

Attribute Weighting:
- High Weight (1.5-2.0): Unique identifiers (model numbers, SKUs)
- Medium Weight (0.8-1.2): Important descriptors (brand, title)
- Low Weight (0.3-0.7): Secondary attributes (color, price)
Threshold Selection:
- High Precision (0.8-0.9): Few false positives, may miss some matches
- Balanced (0.6-0.8): Good balance of precision and recall
- High Recall (0.4-0.6): Catches more matches, requires manual review
Text Preprocessing:
- Start with simple wordReplacements
- Add regex for cleaning patterns
- Use normalizationRegex only for similarity calculation
- Validate on sample data

Scaling to Large Datasets:
- Always use groupByAttribute when a dataset has more than 10,000 items
- Lower maxMatches and set includeOriginalValues to false to reduce the size of the output dataset

Troubleshooting & Error Handling

Common Issues

No matches found
- Lower the threshold value
- Verify attribute names and JSON paths
- Adjust text preprocessing rules
Too many false positives
- Increase threshold to 0.8–0.9
- Add stricter wordsToRemove or regex
- Increase weights for unique identifiers
Performance bottlenecks
- Enable groupByAttribute for large datasets
- Use the English model for English-only data
- Reduce maxMatches

Error Types

This Actor uses structured error classes to surface actionable messages and suggestions. All errors are collected in the final SUMMARY.

Error Class	Code	Description
InputValidationError	PME-100	Schema or type validation failed for actor input
DataLoadingError	PME-200	CSV/JSON file not found, unreadable, or unparseable
AttributeConfigError	PME-300	Issues in the `attributes` section (missing columns, bad JSON paths, invalid weights)
ModelLoadingError	PME-400	Sentence-Transformer model fetch or cache failure
ProcessingError	PME-500	Failures during matching workflow (e.g., zero vectors, similarity computation errors)
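
When a SUMMARY contains several errors, grouping them by stage shows where to look first. The sketch below uses the codes from the table. The layout of the summary object (an `errors` list whose entries have `code` and `message` keys) is an assumption based on the fields named in this document, and the sample data is invented.

```python
# Codes from the table above, mapped to the stage they point at.
ERROR_STAGES = {
    "PME-100": "input validation",
    "PME-200": "data loading",
    "PME-300": "attribute configuration",
    "PME-400": "model loading",
    "PME-500": "processing",
}

def triage(summary):
    """Group error messages from a SUMMARY-like dict by pipeline stage."""
    by_stage = {}
    for err in summary.get("errors", []):
        stage = ERROR_STAGES.get(err.get("code"), "unknown")
        by_stage.setdefault(stage, []).append(err.get("message", ""))
    return by_stage

# Hypothetical SUMMARY content; the real key names may differ.
summary = {
    "errors": [
        {"code": "PME-300", "message": "column 'brand' not found in dataset 2"},
        {"code": "PME-300", "message": "weight must be a number"},
        {"code": "PME-200", "message": "file not found"},
    ]
}

for stage, messages in triage(summary).items():
    print(f"{stage}: {len(messages)} error(s)")
```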


URL: https://apify.com/datawhisperers/advanced-product-matcher-pro