Support and confidence are two important metrics in data mining because they tell us how strong the patterns and trends we identify in data are. In this article we will learn about them.
Support refers to the relative frequency of an itemset in a dataset. It is used to identify frequent itemsets, which can then be used to generate association rules. For example, if we set the support threshold to 5%, then any itemset that occurs in more than 5% of the transactions in the dataset will be considered a frequent itemset.
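$$\text{Support}(X) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}}$$

Where $X$ is an itemset (one or more items), and a transaction contains $X$ if it includes every item in $X$.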
Consider a store's dataset of 100 transactions. If 30 of these transactions include both bread and butter, then the support for the rule "bread → butter" would be:

$$\text{Support}(\{\text{bread}, \text{butter}\}) = \frac{30}{100} = 0.30 = 30\%$$
This means that 30% of the transactions in the dataset contain both bread and butter.
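The bread and butter calculation above can be reproduced with a short Python sketch. The function name `support` and the toy transaction list are assumptions made for illustration; only the counts (100 transactions, 30 containing both items) come from the example.

```python
def support(transactions, itemset):
    """Return the fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    if not transactions:
        return 0.0
    # A transaction "contains" the itemset when the itemset is a subset of it.
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Hypothetical data matching the example: 100 transactions,
# 30 of which contain both bread and butter.
transactions = (
    [{"bread", "butter"}] * 30
    + [{"bread"}] * 10
    + [{"milk"}] * 60
)

print(support(transactions, {"bread", "butter"}))  # 0.3
```

Treating each transaction as a set keeps the containment check to a single subset comparison.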
Confidence is a measure that indicates how likely it is that item Y will appear in a transaction given that item X is already in the transaction. It is a way of evaluating the strength of association between two items.
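$$\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$

Where $\text{Support}(X \cup Y)$ is the fraction of transactions containing both $X$ and $Y$, and $\text{Support}(X)$ is the fraction of transactions containing $X$.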
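In a dataset with 100 transactions, if 40 transactions contain bread and 20 transactions contain both bread and butter, then the confidence for the rule "bread → butter" would be:

$$\text{Confidence}(\text{bread} \rightarrow \text{butter}) = \frac{20 / 100}{40 / 100} = \frac{20}{40} = 0.50 = 50\%$$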
This means that when bread is bought there is a 50% chance that butter will be bought as well.
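As a minimal sketch, the same confidence value can be computed directly from transaction counts. The function name `confidence` and the toy data are illustrative assumptions; the counts (40 transactions with bread, 20 with both bread and butter, 100 in total) follow the example above.

```python
def confidence(transactions, antecedent, consequent):
    """Return the share of transactions containing `antecedent` that also contain `consequent`."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    if n_antecedent == 0:
        # The rule never applies, so report 0.0 instead of dividing by zero.
        return 0.0
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / n_antecedent


# Hypothetical data matching the example: 40 transactions contain bread,
# and 20 of those also contain butter.
transactions = (
    [{"bread", "butter"}] * 20
    + [{"bread"}] * 20
    + [{"milk"}] * 60
)

print(confidence(transactions, {"bread"}, {"butter"}))  # 0.5
```

Dividing the two counts gives the same result as dividing the two support values, because the total number of transactions cancels out.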
Support and confidence work together to show how strong and useful a rule or pattern is in data analysis.
But just because something has high support doesn't mean it will have high confidence, and vice versa. For example, an item may appear a lot (high support), but the link between items might not be strong (low confidence).
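To make this contrast concrete, here is a small sketch with made-up counts (all numbers and item names are hypothetical). Bread appears in 80 of 100 transactions but is paired with butter in only 20 of them, while caviar appears in just 2 transactions and both include champagne.

```python
# Hypothetical counts out of 100 transactions.
total = 100
n_bread = 80            # transactions containing bread
n_bread_butter = 20     # transactions containing bread and butter
n_caviar = 2            # transactions containing caviar
n_caviar_champagne = 2  # transactions containing caviar and champagne

# Frequent item, weak rule: bread is common, but butter rarely follows.
support_bread = n_bread / total                  # 0.8
conf_bread_butter = n_bread_butter / n_bread     # 0.25

# Rare itemset, strong rule: few transactions, but the link always holds.
support_caviar_champagne = n_caviar_champagne / total     # 0.02
conf_caviar_champagne = n_caviar_champagne / n_caviar     # 1.0

print(support_bread, conf_bread_butter)
print(support_caviar_champagne, conf_caviar_champagne)
```

Looking at either metric alone would be misleading here, which is why the two are usually checked together.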
The table below summarizes the key differences between Support and Confidence:
| Aspect | Support | Confidence |
|---|---|---|
| Definition | Measures how often an itemset appears in a dataset. | Measures the likelihood that an itemset will appear if another itemset appears. |
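| Formula | $\text{Support}(X) = \frac{\text{Transactions containing } X}{\text{Total transactions}}$ | $\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$ |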
| Purpose | Identifies itemsets that occur frequently in the dataset. | Evaluates the strength of an association rule. |
| Threshold Usage | Often used with a threshold to identify itemsets that occur frequently enough to be of interest. | Often used with a threshold to identify rules that are strong enough to be of interest. |
| Interpretation | Interpreted as the percentage of transactions in which an itemset appears. | Interpreted as the percentage of transactions where the second itemset appears, given that the first itemset appears. |
| Usage in Data Mining | Used for identifying frequent itemsets. | Used for evaluating association rules. |
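The threshold usage described in the table can be sketched as a brute-force search over item pairs. This is an illustrative simplification, not a full algorithm such as Apriori: the function name `pair_rules`, the thresholds, and the five sample transactions are all assumptions, and only single-item rules of the form x → y are considered.

```python
from itertools import combinations


def pair_rules(transactions, min_support, min_confidence):
    """Return (x, y, support, confidence) for single-item rules x -> y passing both thresholds."""
    sets = [set(t) for t in transactions]
    n = len(sets)
    items = sorted(set().union(*sets))
    rules = []
    for a, b in combinations(items, 2):
        n_both = sum(1 for t in sets if a in t and b in t)
        rule_support = n_both / n
        # Step 1: discard pairs that are not frequent enough.
        if rule_support < min_support:
            continue
        # Step 2: test both rule directions against the confidence threshold.
        for x, y in ((a, b), (b, a)):
            n_x = sum(1 for t in sets if x in t)
            rule_confidence = n_both / n_x
            if rule_confidence >= min_confidence:
                rules.append((x, y, rule_support, rule_confidence))
    return rules


# Hypothetical sample data.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter"},
]

rules = pair_rules(transactions, min_support=0.5, min_confidence=0.7)
for x, y, s, c in rules:
    print(f"{x} -> {y}: support={s:.2f}, confidence={c:.2f}")
```

With these thresholds only the bread and butter pair is frequent enough, and both directions of the rule pass the confidence check (0.75 for bread → butter, 1.00 for butter → bread).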