Pricing
$0.10 / 1,000 results
Advanced Product Matcher Pro
An Apify Actor that matches products between two datasets using sentence-embedding similarity (Sentence Transformers) and configurable, weighted attribute scoring. Suited to e-commerce catalog matching, product deduplication, and inventory reconciliation.
Features
- Multi-format Support: Works with both CSV files (KeyValueStore) and JSON datasets
- Flexible Data Sources: Load data directly from Apify Datasets or KeyValueStore
- Intelligent Matching: Uses Sentence Transformers and cosine similarity for semantic product matching
- Configurable Attributes: Weight different product attributes based on importance
- Text Preprocessing: Built-in word removal, replacement, regex cleaning, and normalization
- Performance Optimization: Group products by categories or other attributes for faster processing
- Multilingual Support: Supports English, Spanish, French, German, Italian, Portuguese, Dutch, and multilingual models
- Flexible Output: Customizable match results with similarity scores, original values, and additional output fields
- Error Reporting: Structured error types for input validation, data loading, attribute configuration, model loading, and processing errors
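As a rough illustration of the cosine-similarity step named in the features above, the sketch below compares toy vectors in plain Python. The three-dimensional vectors are invented stand-ins for sentence embeddings, and the function name `cosine_similarity` is hypothetical; the Actor's own implementation is not shown in this README and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made up); real sentence embeddings
# have hundreds of dimensions.
phone_a = [0.9, 0.1, 0.3]
phone_b = [0.8, 0.2, 0.3]
blender = [0.1, 0.9, 0.2]

similar = cosine_similarity(phone_a, phone_b)
different = cosine_similarity(phone_a, blender)
print(similar > different)
```

Vectors pointing in nearly the same direction score close to 1.0, which is why two differently worded titles for the same product can still match.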
Quick Start
Basic Configuration Example
{"dataFormat":"csv","dataSource":"datasets","dataset1":"catalog_products","dataset1Name":"Catalog","dataset1PrimaryKey":"ProductId","dataset2":"retailer_products","dataset2Name":"Retailer","dataset2PrimaryKey":"ProductId","threshold":0.7,"maxMatches":2,"language":"en","groupByAttribute":"category","csvSeparator":",","includeOriginalValues":true,"attributes":[{"name":"title","weight":1.0,"useForMatching":true},{"name":"brand","weight":0.8,"useForMatching":true},{"name":"price","weight":0.3,"useForMatching":false}]}
Core Input Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| dataFormat | string | Data format: "csv" or "json" | "json" |
| dataSource | string | Source type: "datasets" or "keyvaluestore" | "datasets" |
| keyValuestoreNameOrId | string | Name or ID of KeyValueStore (if dataSource: keyvaluestore) | none |
| dataset1 | string | First dataset key/ID (CSV filename or Dataset ID) | required |
| dataset1Name | string | Friendly name for dataset 1 | "Dataset1" |
| dataset1PrimaryKey | string | Primary key field name in dataset 1 | "ProductId" |
| dataset2 | string | Second dataset key/ID | required |
| dataset2Name | string | Friendly name for dataset 2 | "Dataset2" |
| dataset2PrimaryKey | string | Primary key field name in dataset 2 | "ProductId" |
| threshold | number | Minimum overall similarity score for matches (0.0–1.0) | 0.5 |
| maxMatches | integer | Maximum number of matches returned per item | 2 |
| language | string | Embedding model selection: "en", "multilingual", "es", "fr", "de", "it", "pt", "nl" | "en" |
| groupByAttribute | string | Attribute name to group by for efficient matching (optional) | none |
| csvSeparator | string | CSV delimiter (only when dataFormat: csv) | "," |
| includeOriginalValues | boolean | Include original attribute values in the output records | true |
| dataset1OutputFields | array | Include specific attribute values in the output records from dataset 1 | ["Field1"] |
| dataset2OutputFields | array | Include specific attribute values in the output records from dataset 2 | ["Field1", "Field2"] |
| attributes | array | Required. List of attribute configurations (see below) | required |
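To show how threshold and maxMatches interact, here is a minimal sketch that filters and caps candidate matches for a single item. The function `select_matches` and the `id` field are hypothetical, and treating the threshold as inclusive is an assumption; only the `overallSimilarity` key comes from the output format documented in this README.

```python
def select_matches(candidates, threshold=0.5, max_matches=2):
    """Keep candidates at or above `threshold`, best first, capped at `max_matches`."""
    # Assumption: a score exactly equal to the threshold counts as a match.
    kept = [c for c in candidates if c["overallSimilarity"] >= threshold]
    kept.sort(key=lambda c: c["overallSimilarity"], reverse=True)
    return kept[:max_matches]


# Made-up candidate scores for one product from dataset 1.
candidates = [
    {"id": "C", "overallSimilarity": 0.55},
    {"id": "A", "overallSimilarity": 0.91},
    {"id": "D", "overallSimilarity": 0.40},
    {"id": "B", "overallSimilarity": 0.72},
]

for match in select_matches(candidates, threshold=0.5, max_matches=2):
    print(match["id"], match["overallSimilarity"])
```

With these values, candidate D falls below the threshold and candidate C is dropped by the maxMatches cap, so only A and B remain.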
Attribute Configuration
Each attribute in attributes supports:
- name (string, required): Column name (CSV) or attribute key (JSON)
- weight (number): Importance weight for matching (higher = more important)
- useForMatching (boolean): Whether to include in similarity calculation
- jsonPath (string): JSON path expression for nested data
- wordsToRemove (array): List of words to strip before matching
- wordReplacements (object): Mapping of terms to replace prior to matching
- regex (string): Regex to apply during preprocessing
- normalizationRegex (string): Regex applied before similarity calculation
- normalizationReplacement (string): Replacement for normalization regex
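The README does not spell out how per-attribute scores are combined into an overall score. One plausible reading of weight and useForMatching is a weighted average over the attributes that take part in matching, sketched below. The formula, the function name `overall_similarity`, and the fallback defaults are assumptions, not the Actor's confirmed behaviour.

```python
def overall_similarity(attributes, similarities):
    """Weighted average of per-attribute similarities (assumed formula)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for attr in attributes:
        # Assumption: attributes with useForMatching=false do not affect the score.
        if not attr.get("useForMatching", True):
            continue
        weight = attr.get("weight", 1.0)
        weighted_sum += weight * similarities[attr["name"]]
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no attribute with a positive weight is used for matching")
    return weighted_sum / total_weight


attributes = [
    {"name": "title", "weight": 1.0, "useForMatching": True},
    {"name": "brand", "weight": 0.8, "useForMatching": True},
    {"name": "price", "weight": 0.3, "useForMatching": False},
]
# Made-up per-attribute similarity scores for one candidate pair.
similarities = {"title": 0.92, "brand": 1.0, "price": 0.10}

score = overall_similarity(attributes, similarities)
print(round(score, 4))
```

Under this assumption, raising an attribute's weight pulls the overall score toward that attribute's similarity, which is the intuition behind giving identifiers such as model numbers the highest weights.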
Text Preprocessing example
{"name":"brand","weight":0.8,"useForMatching":true,"wordsToRemove":["inc","llc","ltd","corp"],"wordReplacements":{"apple":"apple inc","samsung":"samsung electronics"},"regex":"\\b(inc|llc|ltd|corp)\\b","normalizationRegex":"[^a-zA-Z0-9\\s]","normalizationReplacement":""}
| Property | Type | Description |
|---|---|---|
| wordsToRemove | array | Words to remove from text |
| wordReplacements | object | Word substitution mapping |
| regex | string | Regex pattern for text cleaning |
| normalizationRegex | string | Regex for similarity calculation normalization |
| normalizationReplacement | string | Replacement for normalization regex |
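The sketch below shows one way the preprocessing properties in the table could be applied to a single value. The order of the steps, the lowercasing, the whole-token replacement, and the function names `preprocess` and `normalize` are all assumptions made for illustration; the Actor may apply these rules differently.

```python
import re


def preprocess(text, config):
    """Apply wordReplacements, wordsToRemove, then regex (assumed order)."""
    tokens = text.lower().split()
    replacements = config.get("wordReplacements", {})
    # Replace whole tokens only, so "hp" does not rewrite part of another word.
    tokens = " ".join(replacements.get(t, t) for t in tokens).split()
    removed = set(config.get("wordsToRemove", []))
    text = " ".join(t for t in tokens if t not in removed)
    if config.get("regex"):
        # Assumption: text matched by "regex" is deleted.
        text = re.sub(config["regex"], "", text)
    return " ".join(text.split())


def normalize(text, config):
    """Apply normalizationRegex just before similarity is computed."""
    pattern = config.get("normalizationRegex")
    if pattern:
        text = re.sub(pattern, config.get("normalizationReplacement", ""), text)
    return " ".join(text.split())


config = {
    "wordsToRemove": ["new", "original"],
    "wordReplacements": {"hp": "hewlett packard"},
    "regex": r"\b(inc|llc|ltd|corp)\b",
    "normalizationRegex": r"[^a-zA-Z0-9\s]",
    "normalizationReplacement": "",
}

cleaned = preprocess("New HP Inc. LaserJet-Pro", config)
print(cleaned)
print(normalize(cleaned, config))
```

Keeping `normalize` separate mirrors the distinction in the table: regex cleans the stored text, while normalizationRegex only shapes what is compared.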
Real-World Examples
1. E-commerce Catalog Matching
{"dataFormat":"csv","dataSource":"datasets","dataset1":"manufacturer_catalog.csv","dataset1Name":"Manufacturer","dataset1PrimaryKey":"ProductId","dataset2":"retailer_inventory.csv","dataset2Name":"Retailer","dataset2PrimaryKey":"ProductId","threshold":0.75,"maxMatches":3,"language":"en","groupByAttribute":"category","attributes":[{"name":"product_name","weight":1.5,"useForMatching":true,"wordsToRemove":["new","original","authentic"],"wordReplacements":{"&":"and","w/":"with"}},{"name":"brand","weight":1.2,"useForMatching":true,"wordsToRemove":["inc","llc","corp"],"wordReplacements":{"apple":"apple inc","hp":"hewlett packard"}},{"name":"model_number","weight":1.8,"useForMatching":true,"normalizationRegex":"[^A-Za-z0-9]","normalizationReplacement":""},{"name":"price","weight":0.3,"useForMatching":false,"regex":"\\D"}]}
2. Fashion Product Matching with Complex JSON
Matching fashion products from different suppliers with nested JSON data:
{"dataFormat":"json","dataSource":"datasets","dataset1":"fashion_supplier_a","dataset1Name":"SupplierA","dataset1PrimaryKey":"ID","dataset2":"fashion_supplier_b","dataset2Name":"SupplierB","dataset2PrimaryKey":"ID","threshold":0.65,"language":"multilingual","maxMatches":2,"attributes":[{"name":"Color","jsonPath":"ProductAttributes[Type=Color].Value","weight":1.5,"useForMatching":true,"wordReplacements":{"gray":"grey","navy":"navy blue"}},{"name":"Size","jsonPath":"ProductAttributes[Type=Size].Value","weight":1.8,"useForMatching":true,"wordsToRemove":["size","us","eu"],"normalizationRegex":"[^0-9XLS]","normalizationReplacement":""},{"name":"Material","jsonPath":"Details.Fabric.Primary","weight":1.2,"useForMatching":true}],"includeOriginalValues":false}
3. Home & Garden Products
{"dataFormat":"json","dataSource":"datasets","dataset1":"bedbath","dataset1Name":"BedBath","dataset1PrimaryKey":"ProductId","dataset2":"overstock","dataset2Name":"Overstock","dataset2PrimaryKey":"ProductId","threshold":0.6,"language":"en","csvSeparator":",","groupByAttribute":"Model","maxMatches":3,"attributes":[{"name":"Model","jsonPath":"AdhocDataAttributes[Name=Model].value","weight":1,"useForMatching":false},{"name":"Color","jsonPath":"AdhocDataAttributes[Name=Color].value","weight":2,"useForMatching":true,"wordReplacements":{"gray":"grey","/":" "}},{"name":"Size","jsonPath":"AdhocDataAttributes[Name=Size].value","weight":3,"useForMatching":true,"regex":"\\D"},{"name":"Shape","jsonPath":"AdhocDataAttributes[Name=Shape].value","weight":1,"useForMatching":true}],"dataset1OutputFields":["Address","ProductName"]}
Advanced Configuration
JSON Path Expressions
- Dot notation: "product.details.name"
- Array search: "Attributes[Name=Color].Value"
- Nested arrays/objects for complex structures
Complex Nested Structures
{"ProductAttributes":[{"Type":"Color","Value":"Red"},{"Type":"Size","Value":"Large"},{"Type":"Material","Value":"Cotton"}],"Details":{"Pricing":{"MSRP":29.99,"Sale":19.99},"Specifications":{"Weight":"2.5 lbs"}}}
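To make the path syntax concrete for a record shaped like the one above, here is a small resolver for dot notation and the `Key[Field=Value]` array search. It is a sketch of the idea only: `resolve_path` is a hypothetical helper, it handles one filter per segment, and it assumes filter values contain no dots. The Actor's actual path handling may support more.

```python
import re

# One path segment: a key, optionally followed by [Field=Value].
_SEGMENT = re.compile(r"^(\w+)(?:\[(\w+)=([^\]]+)\])?$")


def resolve_path(record, path):
    """Walk `record` along `path`; return None if any step is missing."""
    current = record
    for segment in path.split("."):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"unsupported path segment: {segment!r}")
        key, filter_key, filter_value = match.groups()
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
        if filter_key is not None:
            if not isinstance(current, list):
                return None
            # Pick the first array element whose field equals the filter value.
            current = next(
                (item for item in current
                 if isinstance(item, dict)
                 and str(item.get(filter_key)) == filter_value),
                None,
            )
            if current is None:
                return None
    return current


product = {
    "ProductAttributes": [
        {"Type": "Color", "Value": "Red"},
        {"Type": "Size", "Value": "Large"},
        {"Type": "Material", "Value": "Cotton"},
    ],
    "Details": {
        "Pricing": {"MSRP": 29.99, "Sale": 19.99},
        "Specifications": {"Weight": "2.5 lbs"},
    },
}

print(resolve_path(product, "ProductAttributes[Type=Color].Value"))
print(resolve_path(product, "Details.Pricing.MSRP"))
```

Returning None for a missing step, rather than raising, is a design choice for the sketch so that records lacking an attribute can simply be skipped.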
Corresponding JSON paths:
- Color: "ProductAttributes[Type=Color].Value"
- Size: "ProductAttributes[Type=Size].Value"
- MSRP: "Details.Pricing.MSRP"
- Weight: "Details.Specifications.Weight"
Regular Expression Patterns
- Size cleaning (remove non-digits): {"regex": "\\D"}
- Model normalization (keep alphanumeric): {"normalizationRegex": "[^A-Za-z0-9]", "normalizationReplacement": ""}
- Price extraction (strip currency symbols): {"regex": "[^0-9.]"}
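The effect of these three patterns can be checked quickly with Python's `re` module. The sketch assumes that matched text is deleted, which fits the descriptions above but is not confirmed by the README. The sample strings are invented, and `strip_matches` is a hypothetical helper. Note that the JSON string "\\D" denotes the regex `\D`.

```python
import re


def strip_matches(pattern, text):
    """Delete every substring of `text` that matches `pattern`."""
    return re.sub(pattern, "", text)


size = strip_matches(r"\D", "Size 42 EU")               # non-digits removed
model = strip_matches(r"[^A-Za-z0-9]", "WH-1000/XM4")   # punctuation removed
price = strip_matches(r"[^0-9.]", "$1,299.99")          # currency symbol and comma removed

print(size, model, price)
```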
Size Normalization
{"name":"size","regex":"\\D","normalizationRegex":"[^0-9XLS]","normalizationReplacement":""}
- regex: Removes all non-digit characters during preprocessing
- normalizationRegex: For similarity calculation, keeps only numbers and X, L, S
Model Number Cleaning
{"name":"model","regex":"\\b(model|version|v\\d+)\\b","normalizationRegex":"[^a-zA-Z0-9]","normalizationReplacement":""}
- Removes common model prefixes
- Normalizes to alphanumeric only for comparison
Price Extraction
{"name":"price","regex":"[^0-9.]","normalizationRegex":"\\$|,","normalizationReplacement":""}
- Extracts numeric price values
- Removes currency symbols and commas
Brand Standardization
{"name":"brand","regex":"\\b(inc|llc|ltd|corp|company)\\b","wordReplacements":{"apple":"apple inc","hp":"hewlett packard","ms":"microsoft"}}
Performance Optimization
- Grouping by attribute reduces N×M comparisons to smaller subsets
- Note: If the group-by field lives in nested JSON, make sure it is also listed in attributes
- Use the English model (all-MiniLM-L6-v2) for English-only data to speed up processing
- Limit maxMatches for large catalogs
- Disable matching (useForMatching: false) on grouping fields
Grouping Strategy
Use groupByAttribute to partition products into smaller groups:
{"groupByAttribute":"category","attributes":[{"name":"category","weight":0.5,"useForMatching":false}]}
Benefits:
- Reduces comparison matrix size from N×M to smaller subsets
- Improves processing speed significantly for large datasets
- More accurate matches within similar product categories
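The size of the saving from grouping can be estimated by counting pairwise comparisons with and without a group key. The sketch below uses invented product lists and hypothetical helpers (`group_by`, `comparison_count`). It assumes products are only compared when their group values are equal, which is how the grouping description above reads.

```python
from collections import defaultdict


def group_by(products, attribute):
    """Bucket products by the value of `attribute`."""
    groups = defaultdict(list)
    for product in products:
        groups[product.get(attribute)].append(product)
    return groups


def comparison_count(dataset1, dataset2, attribute=None):
    """Number of pairwise comparisons, optionally restricted to equal group values."""
    if attribute is None:
        return len(dataset1) * len(dataset2)
    groups1 = group_by(dataset1, attribute)
    groups2 = group_by(dataset2, attribute)
    return sum(len(items) * len(groups2.get(key, []))
               for key, items in groups1.items())


# Made-up datasets: 5 products on each side, split across two categories.
catalog = [{"category": "phone"}] * 3 + [{"category": "laptop"}] * 2
retailer = [{"category": "phone"}] * 2 + [{"category": "laptop"}] * 3

print(comparison_count(catalog, retailer))               # all pairs
print(comparison_count(catalog, retailer, "category"))   # same-category pairs only
```

Here grouping cuts the work from 25 comparisons to 12. The saving grows with the number of groups, at the cost of never matching products whose group values differ.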
Language Model Selection
Choose appropriate models based on your data:
- English: "en" - Fastest, best for English-only data
- Multilingual: "multilingual" - Slower but handles mixed languages
- Specific Languages: "es", "fr", "de" - Optimized for specific languages
Output Format
The Actor generates matches with the following structure:
{"Dataset1ProductId":"PROD123","Dataset2ProductId":"SKU456","overallSimilarity":0.85,"titleSimilarity":0.92,"brandSimilarity":1.0,"colorSimilarity":0.75,"Dataset1Title":"Apple iPhone 13 Pro","Dataset2Title":"iPhone 13 Pro - Apple","Dataset1Brand":"Apple","Dataset2Brand":"Apple Inc"}
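Because maxMatches can yield several records per item, a common post-processing step is to keep only the strongest match for each product in dataset 1. The sketch below does this for records shaped like the example above. The record values are invented, `best_match_per_item` is a hypothetical helper, and the key names follow the example and will differ if other dataset names or primary keys are configured.

```python
def best_match_per_item(records, key="Dataset1ProductId"):
    """Return the highest-scoring record for each distinct value of `key`."""
    best = {}
    for record in records:
        current = best.get(record[key])
        if current is None or record["overallSimilarity"] > current["overallSimilarity"]:
            best[record[key]] = record
    return best


# Made-up match records using the field names from the example output.
records = [
    {"Dataset1ProductId": "PROD123", "Dataset2ProductId": "SKU456", "overallSimilarity": 0.85},
    {"Dataset1ProductId": "PROD123", "Dataset2ProductId": "SKU789", "overallSimilarity": 0.71},
    {"Dataset1ProductId": "PROD200", "Dataset2ProductId": "SKU111", "overallSimilarity": 0.66},
]

best = best_match_per_item(records)
for product_id, record in best.items():
    print(product_id, "->", record["Dataset2ProductId"], record["overallSimilarity"])
```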
Reading the SUMMARY
After execution, a SUMMARY record is saved to KeyValueStore containing:
- Total products per dataset
- Number of matches and unique matches
- Match rate
- Model and data format used
- Any collected errors with type, code, message, and suggestions
Review this summary to diagnose configuration or data issues quickly.
Best Practices
- Attribute Weighting:
  - High Weight (1.5-2.0): Unique identifiers (model numbers, SKUs)
  - Medium Weight (0.8-1.2): Important descriptors (brand, title)
  - Low Weight (0.3-0.7): Secondary attributes (color, price)
- Threshold Selection:
  - High Precision (0.8-0.9): Few false positives, may miss some matches
  - Balanced (0.6-0.8): Good balance of precision and recall
  - High Recall (0.4-0.6): Catches more matches, requires manual review
- Text Preprocessing:
  - Start with simple wordReplacements
  - Add regex for cleaning patterns
  - Use normalizationRegex only for similarity calculation
  - Validate on sample data
- Scaling to Large Datasets:
  - Always use groupByAttribute when > 10,000 items
  - Adjust maxMatches and disable output of original values to reduce output dataset size
Troubleshooting & Error Handling
Common Issues
- No matches found
  - Lower the threshold value
  - Verify attribute names and JSON paths
  - Adjust text preprocessing rules
- Too many false positives
  - Increase threshold to 0.8–0.9
  - Add stricter wordsToRemove or regex
  - Increase weights for unique identifiers
- Performance bottlenecks
  - Enable groupByAttribute for large datasets
  - Use the English model for English-only data
  - Reduce maxMatches
Error Types
This Actor uses structured error classes to surface actionable messages and suggestions. All errors are collected in the final SUMMARY.
| Error Class | Code | Description |
|---|---|---|
| InputValidationError | PME-100 | Schema or type validation failed for actor input |
| DataLoadingError | PME-200 | CSV/JSON file not found, unreadable, or unparseable |
| AttributeConfigError | PME-300 | Issues in the attributes section (missing columns, bad JSON paths, invalid weights) |
| ModelLoadingError | PME-400 | Sentence-Transformer model fetch or cache failure |
| ProcessingError | PME-500 | Failures during matching workflow (e.g., zero vectors, similarity computation errors) |
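Once the SUMMARY record has been downloaded as JSON, its errors can be tallied by class to see where a run went wrong. The sketch below assumes the errors sit in a list under a key named `errors` and that `suggestions` is a list of strings. Both are guesses, since the README only names the per-error fields type, code, message, and suggestions. The sample messages are invented.

```python
from collections import Counter


def count_errors_by_type(summary):
    """Count collected errors per error class (the `type` field)."""
    return Counter(error["type"] for error in summary.get("errors", []))


# Hypothetical SUMMARY content; only the per-error field names come from the README.
summary = {
    "errors": [
        {"type": "DataLoadingError", "code": "PME-200",
         "message": "example message", "suggestions": ["example suggestion"]},
        {"type": "AttributeConfigError", "code": "PME-300",
         "message": "example message", "suggestions": ["example suggestion"]},
        {"type": "AttributeConfigError", "code": "PME-300",
         "message": "example message", "suggestions": []},
    ]
}

counts = count_errors_by_type(summary)
for error_type, count in counts.most_common():
    print(error_type, count)
```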
