URL: https://apify.com/datawhisperers/advanced-product-matcher-pro

Pricing

$0.10 / 1,000 results

Advanced Product Matcher Pro

A powerful AI Apify Actor that intelligently matches products between two datasets using advanced machine learning algorithms and configurable similarity scoring. Perfect for e-commerce catalog matching, product deduplication, and inventory reconciliation.

Rating: 5.0 (1)

Developer: Whisperers (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 0 monthly active users, last modified 4 months ago

AI Product Matcher Actor

Features

  • Multi-format Support: Works with both CSV files (KeyValueStore) and JSON datasets
  • Flexible Data Sources: Load data directly from Apify Datasets or KeyValueStore
  • Intelligent Matching: Uses Sentence Transformers and cosine similarity for semantic product matching
  • Configurable Attributes: Weight different product attributes based on importance
  • Text Preprocessing: Built-in word removal, replacement, regex cleaning, and normalization
  • Performance Optimization: Group products by categories or other attributes for faster processing
  • Multilingual Support: Supports English, Spanish, French, German, Italian, Portuguese, Dutch, and multilingual models
  • Flexible Output: Customizable match results with similarity scores, original values, and additional output fields
  • Error Reporting: Structured error types for input validation, data loading, attribute configuration, model loading, and processing errors
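
As a rough illustration of the cosine similarity mentioned under Intelligent Matching, the sketch below compares small hand-made vectors in plain Python. The vectors are made up for illustration; in the Actor they would come from a Sentence Transformers model, and returning 0.0 for a zero vector is an assumption of this sketch, not documented Actor behavior.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector has no direction, so report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (hypothetical values, not real model output).
phone_a = [0.9, 0.1, 0.3]
phone_b = [0.8, 0.2, 0.3]
blender = [0.1, 0.9, 0.2]

# The two phone vectors point in nearly the same direction, the blender does not.
print(cosine_similarity(phone_a, phone_b) > cosine_similarity(phone_a, blender))  # True
```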

Quick Start

Basic Configuration Example

{
  "dataFormat": "csv",
  "dataSource": "datasets",
  "dataset1": "catalog_products",
  "dataset1Name": "Catalog",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "retailer_products",
  "dataset2Name": "Retailer",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.7,
  "maxMatches": 2,
  "language": "en",
  "groupByAttribute": "category",
  "csvSeparator": ",",
  "includeOriginalValues": true,
  "attributes": [
    {
      "name": "title",
      "weight": 1.0,
      "useForMatching": true
    },
    {
      "name": "brand",
      "weight": 0.8,
      "useForMatching": true
    },
    {
      "name": "price",
      "weight": 0.3,
      "useForMatching": false
    }
  ]
}

Core Input Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| dataFormat | string | Data format: "csv" or "json" | "json" |
| dataSource | string | Source type: "datasets" or "keyvaluestore" | "datasets" |
| keyValuestoreNameOrId | string | Name or ID of KeyValueStore (if dataSource: keyvaluestore) | none |
| dataset1 | string | First dataset key/ID (CSV filename or Dataset ID) | required |
| dataset1Name | string | Friendly name for dataset 1 | "Dataset1" |
| dataset1PrimaryKey | string | Primary key field name in dataset 1 | "ProductId" |
| dataset2 | string | Second dataset key/ID | required |
| dataset2Name | string | Friendly name for dataset 2 | "Dataset2" |
| dataset2PrimaryKey | string | Primary key field name in dataset 2 | "ProductId" |
| threshold | number | Minimum overall similarity score for matches (0.0–1.0) | 0.5 |
| maxMatches | integer | Maximum number of matches returned per item | 2 |
| language | string | Embedding model selection: "en", "multilingual", "es", "fr", "de", "it", "pt", "nl" | "en" |
| groupByAttribute | string | Attribute name to group by for efficient matching (optional) | none |
| csvSeparator | string | CSV delimiter (only when dataFormat: csv) | "," |
| includeOriginalValues | boolean | Include original attribute values in the output records | true |
| dataset1OutputFields | array | Include specific attribute values in the output records from dataset 1 | ["Field1"] |
| dataset2OutputFields | array | Include specific attribute values in the output records from dataset 2 | ["Field1", "Field2"] |
| attributes | array | List of attribute configurations (see below) | required |
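
To make the interaction of threshold and maxMatches concrete, here is a minimal sketch of how candidate scores for one item might be filtered and capped. The function name select_matches and the candidate keys are hypothetical, and treating the threshold as inclusive is an assumption; the Actor's internal selection logic is not documented here.

```python
from typing import Dict, List, Tuple


def select_matches(
    scores: Dict[str, float], threshold: float = 0.5, max_matches: int = 2
) -> List[Tuple[str, float]]:
    """Keep candidates at or above the threshold, best first, capped at max_matches."""
    # Assumption: a score exactly equal to the threshold still counts as a match.
    kept = [(key, score) for key, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_matches]


# Hypothetical overall similarity scores of one product against four candidates.
candidates = {"SKU1": 0.91, "SKU2": 0.74, "SKU3": 0.42, "SKU4": 0.66}
print(select_matches(candidates, threshold=0.6, max_matches=2))
# [('SKU1', 0.91), ('SKU2', 0.74)]
```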

Attribute Configuration

Each attribute in attributes supports:

  • name (string, required): Column name (CSV) or attribute key (JSON)
  • weight (number): Importance weight for matching (higher = more important)
  • useForMatching (boolean): Whether to include in similarity calculation
  • jsonPath (string): JSON path expression for nested data
  • wordsToRemove (array): List of words to strip before matching
  • wordReplacements (object): Mapping of terms to replace prior to matching
  • regex (string): Regex to apply during preprocessing
  • normalizationRegex (string): Regex applied before similarity calculation
  • normalizationReplacement (string): Replacement for normalization regex
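
One plausible way weight and useForMatching could combine per-attribute scores into an overall score is a weighted average, sketched below. This formula is an assumption made for illustration; the Actor's actual scoring formula is not stated in this document, and the defaults used here (weight 1.0, useForMatching true) are likewise assumed.

```python
from typing import Dict, List


def overall_similarity(attributes: List[dict], per_attribute: Dict[str, float]) -> float:
    """Weighted average of per-attribute scores, skipping useForMatching: false."""
    weighted_sum = 0.0
    total_weight = 0.0
    for attr in attributes:
        if not attr.get("useForMatching", True):  # assumed default: True
            continue
        weight = float(attr.get("weight", 1.0))   # assumed default: 1.0
        weighted_sum += weight * per_attribute.get(attr["name"], 0.0)
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else 0.0


attributes = [
    {"name": "title", "weight": 1.0, "useForMatching": True},
    {"name": "brand", "weight": 0.8, "useForMatching": True},
    {"name": "price", "weight": 0.3, "useForMatching": False},
]
# Hypothetical per-attribute similarity scores for one candidate pair.
scores = {"title": 0.9, "brand": 1.0, "price": 0.1}
print(round(overall_similarity(attributes, scores), 3))  # 0.944
```

Under this assumption, price contributes nothing to the score because it is excluded from matching, regardless of its weight.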

Text Preprocessing example

{
  "name": "brand",
  "weight": 0.8,
  "useForMatching": true,
  "wordsToRemove": ["inc", "llc", "ltd", "corp"],
  "wordReplacements": {
    "apple": "apple inc",
    "samsung": "samsung electronics"
  },
  "regex": "\\b(inc|llc|ltd|corp)\\b",
  "normalizationRegex": "[^a-zA-Z0-9\\s]",
  "normalizationReplacement": ""
}

| Property | Type | Description |
|---|---|---|
| wordsToRemove | array | Words to remove from text |
| wordReplacements | object | Word substitution mapping |
| regex | string | Regex pattern for text cleaning |
| normalizationRegex | string | Regex for similarity calculation normalization |
| normalizationReplacement | string | Replacement for normalization regex |
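
The following sketch shows one way these preprocessing properties could be applied to a single value. The order of operations (word removal, then replacement, then regex, with normalization as a separate step), lowercasing, and whitespace-token matching are all assumptions of this sketch; the Actor's actual pipeline may differ.

```python
import re
from typing import Dict, List, Optional


def preprocess(
    text: str,
    words_to_remove: Optional[List[str]] = None,
    word_replacements: Optional[Dict[str, str]] = None,
    regex: Optional[str] = None,
) -> str:
    """Drop unwanted words, substitute terms, then delete anything the regex matches."""
    removals = {word.lower() for word in (words_to_remove or [])}
    replacements = {k.lower(): v.lower() for k, v in (word_replacements or {}).items()}
    tokens = []
    for token in text.lower().split():  # assumption: whitespace-delimited tokens
        if token in removals:
            continue
        tokens.append(replacements.get(token, token))
    cleaned = " ".join(tokens)
    if regex:
        cleaned = re.sub(regex, "", cleaned)
    return " ".join(cleaned.split())  # collapse leftover whitespace


def normalize(text: str, normalization_regex: Optional[str] = None, replacement: str = "") -> str:
    """Apply normalizationRegex/normalizationReplacement just before comparison."""
    if not normalization_regex:
        return text
    return re.sub(normalization_regex, replacement, text)


print(preprocess(
    "NEW Salt & Pepper Grinder w/ Stand",
    words_to_remove=["new", "original", "authentic"],
    word_replacements={"&": "and", "w/": "with"},
))  # salt and pepper grinder with stand
print(normalize("WH-1000XM4", "[^A-Za-z0-9]", ""))  # WH1000XM4
```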

Real-World Examples

1. E-commerce Catalog Matching

{
  "dataFormat": "csv",
  "dataSource": "datasets",
  "dataset1": "manufacturer_catalog.csv",
  "dataset1Name": "Manufacturer",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "retailer_inventory.csv",
  "dataset2Name": "Retailer",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.75,
  "maxMatches": 3,
  "language": "en",
  "groupByAttribute": "category",
  "attributes": [
    {
      "name": "product_name",
      "weight": 1.5,
      "useForMatching": true,
      "wordsToRemove": ["new", "original", "authentic"],
      "wordReplacements": {"&": "and", "w/": "with"}
    },
    {
      "name": "brand",
      "weight": 1.2,
      "useForMatching": true,
      "wordsToRemove": ["inc", "llc", "corp"],
      "wordReplacements": {"apple": "apple inc", "hp": "hewlett packard"}
    },
    {
      "name": "model_number",
      "weight": 1.8,
      "useForMatching": true,
      "normalizationRegex": "[^A-Za-z0-9]",
      "normalizationReplacement": ""
    },
    {
      "name": "price",
      "weight": 0.3,
      "useForMatching": false,
      "regex": "\\D"
    }
  ]
}

2. Fashion Product Matching with Complex JSON

Matching fashion products from different suppliers with nested JSON data:

{
  "dataFormat": "json",
  "dataSource": "datasets",
  "dataset1": "fashion_supplier_a",
  "dataset1Name": "SupplierA",
  "dataset1PrimaryKey": "ID",
  "dataset2": "fashion_supplier_b",
  "dataset2Name": "SupplierB",
  "dataset2PrimaryKey": "ID",
  "threshold": 0.65,
  "language": "multilingual",
  "maxMatches": 2,
  "attributes": [
    {
      "name": "Color",
      "jsonPath": "ProductAttributes[Type=Color].Value",
      "weight": 1.5,
      "useForMatching": true,
      "wordReplacements": {"gray": "grey", "navy": "navy blue"}
    },
    {
      "name": "Size",
      "jsonPath": "ProductAttributes[Type=Size].Value",
      "weight": 1.8,
      "useForMatching": true,
      "wordsToRemove": ["size", "us", "eu"],
      "normalizationRegex": "[^0-9XLS]",
      "normalizationReplacement": ""
    },
    {
      "name": "Material",
      "jsonPath": "Details.Fabric.Primary",
      "weight": 1.2,
      "useForMatching": true
    }
  ],
  "includeOriginalValues": false
}

3. Home & Garden Products

{
  "dataFormat": "json",
  "dataSource": "datasets",
  "dataset1": "bedbath",
  "dataset1Name": "BedBath",
  "dataset1PrimaryKey": "ProductId",
  "dataset2": "overstock",
  "dataset2Name": "Overstock",
  "dataset2PrimaryKey": "ProductId",
  "threshold": 0.6,
  "language": "en",
  "csvSeparator": ",",
  "groupByAttribute": "Model",
  "maxMatches": 3,
  "attributes": [
    {
      "name": "Model",
      "jsonPath": "AdhocDataAttributes[Name=Model].value",
      "weight": 1,
      "useForMatching": false
    },
    {
      "name": "Color",
      "jsonPath": "AdhocDataAttributes[Name=Color].value",
      "weight": 2,
      "useForMatching": true,
      "wordReplacements": {
        "gray": "grey",
        "/": " "
      }
    },
    {
      "name": "Size",
      "jsonPath": "AdhocDataAttributes[Name=Size].value",
      "weight": 3,
      "useForMatching": true,
      "regex": "\\D"
    },
    {
      "name": "Shape",
      "jsonPath": "AdhocDataAttributes[Name=Shape].value",
      "weight": 1,
      "useForMatching": true
    }
  ],
  "dataset1OutputFields": [
    "Address",
    "ProductName"
  ]
}

Advanced Configuration

JSON Path Expressions

  • Dot notation: "product.details.name"
  • Array search: "Attributes[Name=Color].Value"
  • Nested arrays/objects for complex structures

Complex Nested Structures

{
  "ProductAttributes": [
    {"Type": "Color", "Value": "Red"},
    {"Type": "Size", "Value": "Large"},
    {"Type": "Material", "Value": "Cotton"}
  ],
  "Details": {
    "Pricing": {"MSRP": 29.99, "Sale": 19.99},
    "Specifications": {"Weight": "2.5 lbs"}
  }
}

Corresponding JSON paths:

  • Color: "ProductAttributes[Type=Color].Value"
  • Size: "ProductAttributes[Type=Size].Value"
  • MSRP: "Details.Pricing.MSRP"
  • Weight: "Details.Specifications.Weight"
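
To show how paths like these can be read, here is a small resolver for the two forms listed above: dot notation and the Key[Field=Value] array search. It is a sketch of the syntax as described in this section, not the Actor's own parser; resolve_path is a hypothetical helper, and returning None for anything that cannot be resolved is an assumption.

```python
import re
from typing import Any

# One path segment: a key, optionally followed by [Field=Value].
_SEGMENT = re.compile(r"^(?P<key>[^\[\]]+)(?:\[(?P<field>[^=\]]+)=(?P<value>[^\]]+)\])?$")


def resolve_path(record: Any, path: str) -> Any:
    """Walk a nested dict/list structure; return None if any step is missing."""
    current = record
    for segment in path.split("."):
        match = _SEGMENT.match(segment)
        if match is None or not isinstance(current, dict):
            return None
        current = current.get(match["key"])
        if match["field"] is not None:
            if not isinstance(current, list):
                return None
            # Array search: first element whose field equals the requested value.
            current = next(
                (
                    item
                    for item in current
                    if isinstance(item, dict)
                    and str(item.get(match["field"])) == match["value"]
                ),
                None,
            )
        if current is None:
            return None
    return current


product = {
    "ProductAttributes": [
        {"Type": "Color", "Value": "Red"},
        {"Type": "Size", "Value": "Large"},
    ],
    "Details": {"Pricing": {"MSRP": 29.99}},
}
print(resolve_path(product, "ProductAttributes[Type=Color].Value"))  # Red
print(resolve_path(product, "Details.Pricing.MSRP"))                 # 29.99
```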

Regular Expression Patterns

  • Size cleaning: remove non-digits {"regex": "\\D"}
  • Model normalization: keep alphanumeric {"normalizationRegex": "[^A-Za-z0-9]", "normalizationReplacement": ""}
  • Price extraction: strip currency symbols {"regex": "[^0-9.]"}
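
The effect of these three patterns can be checked with Python's re module, as in the sketch below. It assumes each match is simply replaced (with an empty string here), and the sample values are made up. Note that "\\D" in the JSON config decodes to the two-character regex \D.

```python
import json
import re


def apply_pattern(pattern: str, text: str, replacement: str = "") -> str:
    """Replace every match of pattern in text (assumed preprocessing behavior)."""
    return re.sub(pattern, replacement, text)


# JSON escaping: the config string "\\D" becomes the regex \D after parsing.
size_regex = json.loads(r'{"regex": "\\D"}')["regex"]

print(apply_pattern(size_regex, "Size 42 EU"))            # 42
print(apply_pattern(r"[^A-Za-z0-9]", "WH-1000 XM4"))      # WH1000XM4
print(apply_pattern(r"[^0-9.]", "$1,299.99"))             # 1299.99
```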

Size Normalization

{
  "name": "size",
  "regex": "\\D",
  "normalizationRegex": "[^0-9XLS]",
  "normalizationReplacement": ""
}
  • regex: Removes all non-digit characters during preprocessing
  • normalizationRegex: For similarity calculation, keeps only numbers and X, L, S

Model Number Cleaning

{
  "name": "model",
  "regex": "\\b(model|version|v\\d+)\\b",
  "normalizationRegex": "[^a-zA-Z0-9]",
  "normalizationReplacement": ""
}
  • Removes common model prefixes
  • Normalizes to alphanumeric only for comparison

Price Extraction

{
  "name": "price",
  "regex": "[^0-9.]",
  "normalizationRegex": "\\$|,",
  "normalizationReplacement": ""
}
  • Extracts numeric price values
  • Removes currency symbols and commas

Brand Standardization

{
  "name": "brand",
  "regex": "\\b(inc|llc|ltd|corp|company)\\b",
  "wordReplacements": {
    "apple": "apple inc",
    "hp": "hewlett packard",
    "ms": "microsoft"
  }
}

Performance Optimization

  • Grouping by attribute reduces the N×M comparisons to smaller subsets
    • Note: If the group-by field lives in nested JSON, make sure it is also included in attributes
  • Use the English model (all-MiniLM-L6-v2) for English-only data to speed up processing
  • Limit maxMatches for large catalogs
  • Disable matching (useForMatching: false) on grouping fields

Grouping Strategy

Use groupByAttribute to partition products into smaller groups:

{
  "groupByAttribute": "category",
  "attributes": [
    {
      "name": "category",
      "weight": 0.5,
      "useForMatching": false
    }
  ]
}

Benefits:

  • Reduces comparison matrix size from N×M to smaller subsets
  • Improves processing speed significantly for large datasets
  • More accurate matches within similar product categories
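
The size of the saving can be estimated by counting pairwise comparisons with and without grouping, as in this sketch. It assumes items are only compared when their group values are equal; the helper names and the tiny datasets are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional


def group_by(items: List[dict], attribute: str) -> Dict[Any, List[dict]]:
    """Partition items by the value of one attribute."""
    groups: Dict[Any, List[dict]] = defaultdict(list)
    for item in items:
        groups[item.get(attribute)].append(item)
    return dict(groups)


def comparison_count(left: List[dict], right: List[dict], attribute: Optional[str] = None) -> int:
    """Number of pairs compared: N*M without grouping, sum of per-group products with it."""
    if attribute is None:
        return len(left) * len(right)
    left_groups = group_by(left, attribute)
    right_groups = group_by(right, attribute)
    # Assumption: only items sharing the same group value are compared.
    return sum(len(items) * len(right_groups.get(key, [])) for key, items in left_groups.items())


catalog = [{"ProductId": i, "category": c}
           for i, c in enumerate(["phones", "phones", "laptops", "laptops"])]
retailer = [{"ProductId": i, "category": c}
            for i, c in enumerate(["phones", "phones", "phones", "laptops"])]

print(comparison_count(catalog, retailer))              # 16
print(comparison_count(catalog, retailer, "category"))  # 8
```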

Language Model Selection

Choose appropriate models based on your data:

  • English: "en" - Fastest, best for English-only data
  • Multilingual: "multilingual" - Slower but handles mixed languages
  • Specific Languages: "es", "fr", "de", "it", "pt", "nl" - Optimized for specific languages

Output Format

The Actor generates matches with the following structure:

{
  "Dataset1ProductId": "PROD123",
  "Dataset2ProductId": "SKU456",
  "overallSimilarity": 0.85,
  "titleSimilarity": 0.92,
  "brandSimilarity": 1.0,
  "colorSimilarity": 0.75,
  "Dataset1Title": "Apple iPhone 13 Pro",
  "Dataset2Title": "iPhone 13 Pro - Apple",
  "Dataset1Brand": "Apple",
  "Dataset2Brand": "Apple Inc"
}

Reading the SUMMARY

After execution, a SUMMARY record is saved to KeyValueStore containing:

  • Total products per dataset
  • Number of matches and unique matches
  • Match rate
  • Model and data format used
  • Any collected errors with type, code, message, and suggestions

Review this summary to diagnose configuration or data issues quickly.

Best Practices

  • Attribute Weighting:
    • High Weight (1.5-2.0): Unique identifiers (model numbers, SKUs)
    • Medium Weight (0.8-1.2): Important descriptors (brand, title)
    • Low Weight (0.3-0.7): Secondary attributes (color, price)
  • Threshold Selection:
    • High Precision (0.8-0.9): Few false positives, may miss some matches
    • Balanced (0.6-0.8): Good balance of precision and recall
    • High Recall (0.4-0.6): Catches more matches, requires manual review
  • Text Preprocessing:
    1. Start with simple wordReplacements
    2. Add regex for cleaning patterns
    3. Use normalizationRegex only for similarity calculation
    4. Validate on sample data
  • Scaling to Large Datasets:
    • Always use groupByAttribute for datasets with more than 10,000 items
    • Adjust maxMatches and disable output of original values to reduce output dataset size
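
For the threshold selection and "validate on sample data" advice, one practical approach is to score a small hand-labeled sample and compare precision and recall at a few thresholds. The sketch below does this with made-up scores and labels; the pair IDs and the convention of reporting 0.0 precision when nothing is predicted are assumptions.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def precision_recall(
    scored_pairs: Iterable[Tuple[Pair, float]], true_pairs: Set[Pair], threshold: float
) -> Tuple[float, float]:
    """Precision and recall of the pairs whose score reaches the threshold."""
    predicted = {pair for pair, score in scored_pairs if score >= threshold}
    true_positives = len(predicted & true_pairs)
    # Convention in this sketch: empty denominators yield 0.0.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Hypothetical scored candidate pairs and hand-verified true matches.
scored = [
    (("A1", "B1"), 0.92),
    (("A2", "B2"), 0.71),
    (("A3", "B9"), 0.65),  # a false positive at lower thresholds
    (("A4", "B4"), 0.48),
]
truth = {("A1", "B1"), ("A2", "B2"), ("A4", "B4")}

for threshold in (0.8, 0.6, 0.4):
    precision, recall = precision_recall(scored, truth, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy sample, raising the threshold improves precision and lowers recall, which matches the trade-off described in the Threshold Selection bullets.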

Troubleshooting & Error Handling

Common Issues

  • No matches found
    • Lower the threshold value
    • Verify attribute names and JSON paths
    • Adjust text preprocessing rules
  • Too many false positives
    • Increase threshold to 0.8–0.9
    • Add stricter wordsToRemove or regex
    • Increase weights for unique identifiers
  • Performance bottlenecks
    • Enable groupByAttribute for large datasets
    • Use the English model for English-only data
    • Reduce maxMatches

Error Types

This Actor uses structured error classes to surface actionable messages and suggestions. All errors are collected in the final SUMMARY.

| Error Class | Code | Description |
|---|---|---|
| InputValidationError | PME-100 | Schema or type validation failed for actor input |
| DataLoadingError | PME-200 | CSV/JSON file not found, unreadable, or unparseable |
| AttributeConfigError | PME-300 | Issues in the attributes section (missing columns, bad JSON paths, invalid weights) |
| ModelLoadingError | PME-400 | Sentence-Transformer model fetch or cache failure |
| ProcessingError | PME-500 | Failures during matching workflow (e.g., zero vectors, similarity computation errors) |
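
Many PME-100 style problems can be caught locally before a run. The sketch below is a hypothetical pre-flight check based only on the Core Input Parameters table; the class here merely reuses the InputValidationError name and PME-100 code from the table above and is not the Actor's own implementation, and the specific checks and messages are assumptions.

```python
from typing import List


class InputValidationError(Exception):
    """Local stand-in that reuses the documented class name and code."""

    code = "PME-100"

    def __init__(self, message: str, suggestion: str = "") -> None:
        super().__init__(message)
        self.message = message
        self.suggestion = suggestion


def validate_input(config: dict) -> List[InputValidationError]:
    """Collect (rather than raise) problems, mirroring how the SUMMARY lists errors."""
    errors: List[InputValidationError] = []
    for key in ("dataset1", "dataset2", "attributes"):
        if key not in config:
            errors.append(InputValidationError(f"missing required field: {key}"))
    if config.get("dataFormat", "json") not in ("csv", "json"):
        errors.append(InputValidationError("dataFormat must be 'csv' or 'json'"))
    if config.get("dataSource", "datasets") not in ("datasets", "keyvaluestore"):
        errors.append(InputValidationError(
            "dataSource must be 'datasets' or 'keyvaluestore'",
            suggestion="check for typos such as 'dataset'",
        ))
    threshold = config.get("threshold", 0.5)
    is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
    if not is_number or not 0.0 <= threshold <= 1.0:
        errors.append(InputValidationError(
            "threshold must be a number between 0.0 and 1.0",
            suggestion="use 0.6 rather than the string '0.6'",
        ))
    return errors


bad_config = {"dataset1": "a", "dataset2": "b", "attributes": [],
              "dataSource": "dataset", "threshold": "0.6"}
for error in validate_input(bad_config):
    print(error.code, error.message)
```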
