Pricing
from $0.01 / 1,000 results
Content Similarity Finder
Find duplicate and similar content with advanced fuzzy matching algorithms. Perfect for data cleaning and deduplication.
Last modified: 7 months ago
Content Similarity & Duplicate Finder
What It Does
Content Similarity Finder detects duplicate and near-duplicate content using multiple similarity algorithms: cosine similarity, Levenshtein distance, fuzzy matching, and Jaccard similarity.
Key Features
- Multiple Algorithms: Cosine, Levenshtein, Fuzzy, Jaccard
- Configurable Threshold: Set minimum similarity from 0 to 1 (for example, 0.8 = 80%)
- Smart Normalization: Case-insensitive, whitespace handling
- Duplicate Grouping: Cluster similar items together
- Fast Processing: Optimized for large datasets
Quick Start
{
  "content": [
    {"id": "1", "text": "The quick brown fox jumps"},
    {"id": "2", "text": "A quick brown fox jumps"},
    {"id": "3", "text": "Completely different text"}
  ],
  "similarityThreshold": 0.8,
  "algorithms": {"cosine": true, "levenshtein": true, "fuzzy": true, "jaccard": true}
}
Input
- content: Array of items with id and text fields
- similarityThreshold: 0-1 (0.8 = 80% similar minimum)
- algorithms: Enable/disable cosine, levenshtein, fuzzy, jaccard
- caseSensitive: Treat case as significant (default: false)
- ignoreWhitespace: Normalize whitespace (default: true)
- minLength: Skip texts shorter than this
- groupByDuplicate: Cluster similar items (default: true)
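The normalization options above can be sketched in a few lines of Python. This is only an illustration, not the actor's own code: the helper names `normalize` and `prepare` are hypothetical, and it assumes that "normalize whitespace" means collapsing runs of whitespace to a single space and that minLength is counted in characters after normalization.

```python
import re


def normalize(text, case_sensitive=False, ignore_whitespace=True):
    """Apply the caseSensitive / ignoreWhitespace options to one text."""
    if ignore_whitespace:
        # Assumption: collapse whitespace runs to one space and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    if not case_sensitive:
        text = text.casefold()
    return text


def prepare(items, min_length=0, **options):
    """Normalize every item and skip texts shorter than min_length."""
    prepared = []
    for item in items:
        text = normalize(item["text"], **options)
        # Assumption: length is measured in characters, after normalization.
        if len(text) >= min_length:
            prepared.append({"id": item["id"], "text": text})
    return prepared
```

With the defaults, `normalize("  The  Quick\tBrown fox ")` yields `"the quick brown fox"`, so texts that differ only in case or spacing compare as identical.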
Output
Similarity Matches
{"item1":"1","item2":"2","text1":"The quick brown fox","text2":"A quick brown fox","similarity":0.89,"algorithm":"cosine"}
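To show how records of this shape could arise, here is a minimal sketch that compares every pair of items and keeps the pairs at or above the threshold. It is an assumption-laden illustration rather than the actor's implementation: `find_matches` is a hypothetical name, Python's `difflib.SequenceMatcher` stands in for the "fuzzy" algorithm, and texts are lowercased before comparison.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_matches(items, threshold=0.8):
    """Compare all pairs of items and return those scoring >= threshold."""
    matches = []
    for a, b in combinations(items, 2):
        # Assumption: difflib's ratio is used as a stand-in "fuzzy" score.
        score = SequenceMatcher(None, a["text"].lower(), b["text"].lower()).ratio()
        if score >= threshold:
            matches.append({
                "item1": a["id"],
                "item2": b["id"],
                "text1": a["text"],
                "text2": b["text"],
                "similarity": round(score, 2),
                "algorithm": "fuzzy",
            })
    return matches
```

Comparing all pairs takes quadratic time in the number of items, so a sketch like this only suits small inputs.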
Duplicate Groups (if groupByDuplicate: true)
{"totalGroups":1,"groups":[{"groupId":"group_1","members":["1","2"],"size":2}]}
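One plausible way to derive such groups from the pairwise matches is a union-find pass, sketched below. The function name `group_matches` is hypothetical, and the sketch assumes grouping is transitive (if 1 matches 2 and 2 matches 3, all three share a group), which the description above does not confirm.

```python
def group_matches(matches):
    """Cluster ids that are connected by any match into duplicate groups."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union the two ids of every match.
    for match in matches:
        root_a, root_b = find(match["item1"]), find(match["item2"])
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members under each root.
    clusters = {}
    for item_id in list(parent):
        clusters.setdefault(find(item_id), []).append(item_id)

    groups = [
        {"groupId": f"group_{n}", "members": sorted(members), "size": len(members)}
        for n, members in enumerate(clusters.values(), 1)
    ]
    return {"totalGroups": len(groups), "groups": groups}
```

Calling `group_matches([{"item1": "1", "item2": "2"}])` reproduces the example output shown above.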
Use Cases
- Data Deduplication: Remove duplicate entries from databases
- Plagiarism Detection: Find copied content
- Content Moderation: Detect spam or repeated messages
- SEO Analysis: Find duplicate website content
- Data Cleaning: Merge similar records
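For the deduplication use case, the duplicate-group output can be applied back to the original records. The sketch below is a hypothetical post-processing step, not part of the actor: `deduplicate` is an invented helper, and keeping the first-listed member of each group is an arbitrary choice made for illustration.

```python
def deduplicate(items, groups_output):
    """Keep the first-listed member of each duplicate group, drop the rest."""
    drop = set()
    for group in groups_output["groups"]:
        # Assumption: the first member is the one worth keeping.
        drop.update(group["members"][1:])
    return [item for item in items if item["id"] not in drop]
```

Items that belong to no group are left untouched, so unique records always survive.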
Algorithms
- Cosine Similarity: Best for topical similarity based on weighted term overlap (TF-IDF based)
- Levenshtein Distance: Best for typos, minor edits
- Fuzzy Matching: Best for approximate string matching
- Jaccard Similarity: Best for word overlap comparison
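To make the differences between the four measures concrete, here is a rough standard-library sketch of each one. These are textbook formulations and may differ from what the actor runs: the cosine version uses raw term counts rather than TF-IDF weights, `difflib.SequenceMatcher` is swapped in as the fuzzy matcher, and the Levenshtein distance is scaled to a 0-1 score by dividing by the longer string's length. All function names are illustrative.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tokens(text):
    return text.lower().split()


def cosine(a, b):
    """Cosine between raw term-count vectors (no IDF weighting)."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def levenshtein(a, b):
    """Edit distance scaled to 0..1 as 1 - distance / longer length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))


def fuzzy(a, b):
    """Stand-in fuzzy score: difflib's matching-blocks ratio."""
    return SequenceMatcher(None, a, b).ratio()


def jaccard(a, b):
    """Shared distinct words divided by all distinct words."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

For the two fox sentences from the Quick Start, the word-level measures react differently to the single changed word: four of six distinct words are shared, so Jaccard gives about 0.67, while the count-based cosine gives 0.8.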
License
MIT License
Clean data, better insights
