URL: https://www.analyticsvidhya.com/blog/2021/04/proximity-measures-in-data-mining-and-machine-learning/

Proximity measures in Data Mining and Machine Learning



Chirag Goyal Last Updated : 19 Apr, 2021
This article was published as a part of the Data Science Blogathon.

Introduction

Data mining is the process of finding interesting patterns in large quantities of data. Clustering algorithms, in particular, need a way to quantify how close objects are to one another. Proximity measures are mathematical techniques that calculate the similarity or dissimilarity of data points, i.e., how alike or how different two objects are.

Real-Life Example Use-case : Predicting COVID-19 patients on the basis of their symptoms

With the rise in COVID-19 cases, many people are unable to get proper medical advice because of shortages of both staff and infrastructure. As engineers, we can contribute by providing a basic diagnostic aid that helps identify people suffering from COVID-19. Machine Learning algorithms can ease this task, and clustering algorithms are particularly handy.

For this, we form two clusters from the symptoms of patients who are COVID-19 positive and COVID-19 negative. We then predict whether a new patient has COVID-19 by measuring the similarity or dissimilarity between the observed symptoms (features) and the symptoms of infected patients.

Proximity measures are different for different types of attributes. 

Similarity measure:

- Numerical measure of how alike two data objects are.
- Is higher when objects are more alike.
- Often falls in the range [0,1].

Dissimilarity measure:

- Numerical measure of how different two data objects are.
- Is lower when objects are more alike.
- Minimum dissimilarity is often 0.
- Upper limit varies.

Dissimilarity Matrix

A dissimilarity matrix is a matrix of pairwise dissimilarities among the data points. It is often desirable to keep only the lower triangle or the upper triangle of a dissimilarity matrix to reduce space and time complexity. This is possible because the matrix has two properties:

1. It is square and symmetric (A^T = A, where A^T is the transpose of the square matrix A).

2. Its diagonal entries are zero, meaning that the dissimilarity between an element and itself is zero.
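As a rough illustration of the storage saving, here is a minimal sketch that keeps only the lower triangle. The helper names `lower_triangle` and `lookup`, the sample points, and the absolute-difference measure are all assumptions made for this example, not part of any library.

```python
from itertools import combinations

def lower_triangle(points, dissimilarity):
    """Store d(i, j) only for i > j; the rest follows from symmetry."""
    return {(i, j): dissimilarity(points[i], points[j])
            for j, i in combinations(range(len(points)), 2)}

def lookup(tri, i, j):
    """Read any entry of the full matrix from the stored triangle."""
    if i == j:
        return 0.0  # diagonal: an object compared with itself
    return tri[(i, j)] if i > j else tri[(j, i)]

# Hypothetical one-dimensional data, compared by absolute difference.
points = [1.0, 4.0, 6.0]
tri = lower_triangle(points, lambda a, b: abs(a - b))

print(len(tri))           # 3 stored entries instead of 9
print(lookup(tri, 0, 2))  # same value as lookup(tri, 2, 0)
```

For n objects this stores n(n-1)/2 values instead of n^2.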

Proximity measures for Nominal Attributes

Nominal attributes can have two or more different states, e.g., an attribute 'color' can have values like 'Red', 'Green', 'Yellow', 'Blue', etc. Dissimilarity for nominal attributes is calculated as the ratio of the number of mismatches between two data points to the total number of attributes.

Nominal means "relating to names." The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical.

Examples: ID numbers, eye color, zip codes.

Let M be the total number of states of a nominal attribute. The states can then be numbered from 1 to M. However, the numbering does not denote any kind of ordering and cannot be used for mathematical operations.

Let m be the number of attributes on which two data points i and j match, and let p be the total number of attributes. The dissimilarity can then be calculated as

    d(i, j) = (p - m) / p

and the similarity as

    s(i, j) = 1 - d(i, j)
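The two formulas translate almost directly into code. The sketch below is one possible way to write them; the function names and the sample records (color, shape, size) are hypothetical and chosen only for illustration.

```python
def nominal_dissimilarity(x, y):
    """d(i, j) = (p - m) / p for two records over the same nominal attributes."""
    if len(x) != len(y) or not x:
        raise ValueError("records must be non-empty and of equal length")
    p = len(x)                             # total number of attributes
    m = sum(a == b for a, b in zip(x, y))  # number of matching attributes
    return (p - m) / p

def nominal_similarity(x, y):
    """s(i, j) = 1 - d(i, j)."""
    return 1 - nominal_dissimilarity(x, y)

# Hypothetical records with attributes (color, shape, size).
a = ("Red", "Round", "Small")
b = ("Red", "Square", "Small")

print(nominal_dissimilarity(a, b))  # one mismatch out of three attributes
print(nominal_similarity(a, b))
```

Values are compared only for equality, which respects the fact that nominal states have no order.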

Example:

Roll No    Marks    Grades
1          96       A
2          87       B
3          83       B
4          96       A

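In this example, we have four objects, identified by Roll No 1 to 4. Roll No is only an identifier, so the p = 2 attributes being compared are Marks and Grades, both treated as nominal here (two values either match or they do not).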

Now, we apply the formula described above to find the proximity for nominal attributes:

- d(1,1) = (p-m)/p = (2-2)/2 = 0
- d(2,1) = (p-m)/p = (2-0)/2 = 1
- d(2,2) = (p-m)/p = (2-2)/2 = 0
- d(3,1) = (p-m)/p = (2-0)/2 = 1
- d(3,2) = (p-m)/p = (2-1)/2 = 0.5
- d(3,3) = (p-m)/p = (2-2)/2 = 0
- d(4,1) = (p-m)/p = (2-2)/2 = 0
- d(4,2) = (p-m)/p = (2-0)/2 = 1
- d(4,3) = (p-m)/p = (2-0)/2 = 1
- d(4,4) = (p-m)/p = (2-2)/2 = 0

As the calculation shows, the dissimilarity between an object and itself is 0, so its similarity with itself is s(i, i) = 1 - 0 = 1, which is intuitively correct.
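The worked example can be checked with a few lines of code. This is a minimal sketch that builds the lower triangle of the dissimilarity matrix for the table above; the variable names are my own, and Marks and Grades are compared purely for equality, as in the hand calculation.

```python
# Roll No -> (Marks, Grades), taken from the example table.
records = {1: (96, "A"), 2: (87, "B"), 3: (83, "B"), 4: (96, "A")}

def d(i, j):
    """Nominal dissimilarity (p - m) / p between objects i and j."""
    x, y = records[i], records[j]
    p = len(x)
    m = sum(a == b for a, b in zip(x, y))
    return (p - m) / p

ids = sorted(records)
# Row i holds d(i, 1), d(i, 2), ..., d(i, i): the lower triangle only.
matrix = [[d(i, j) for j in ids if j <= i] for i in ids]

for i, row in zip(ids, matrix):
    print(i, row)
```

Objects 1 and 4 have identical marks and grades, so their dissimilarity is 0 even though they are different students.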

Proximity measures for ordinal attributes

An ordinal attribute is an attribute whose possible values have a meaningful order or ranking among them, but the magnitude of the difference between successive values is not known. To measure proximity, the states must first be converted to numbers: each state of an ordinal attribute is assigned a rank corresponding to its position in the order of attribute values.

Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}.

Since the number of states can differ between ordinal attributes, the ranks need to be scaled to a common range, e.g., [0,1]. This can be done using the formula

    z_{if} = (r_{if} - 1) / (M_f - 1)

where M_f is the number of states of attribute f (the largest rank assigned) and r_{if} is the rank (numeric value) of object i for attribute f.
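To see how the scaling behaves, here is a small sketch of the normalization step. The function name `normalize_rank` and the five-level rating scale are assumptions made for illustration.

```python
def normalize_rank(rank, num_states):
    """z_if = (r_if - 1) / (M_f - 1): map ranks 1..M_f onto [0, 1]."""
    if num_states < 2:
        raise ValueError("an ordinal attribute needs at least two states")
    if not 1 <= rank <= num_states:
        raise ValueError("rank must lie between 1 and num_states")
    return (rank - 1) / (num_states - 1)

# Hypothetical five-level rating scale with ranks 1 to 5.
scaled = [normalize_rank(r, 5) for r in range(1, 6)]
print(scaled)
```

The lowest rank always maps to 0 and the highest to 1, so attributes with different numbers of states become comparable.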

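The dissimilarity is then computed on the normalized values as for a numeric attribute, e.g., d(i, j) = |z_{if} - z_{jf}| for a single attribute, and the similarity can be calculated as:

    s(i, j) = 1 - d(i, j)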

Example:

Object ID    Attribute
1            High
2            Low
3            Medium
4            High

In this example, we have four objects with IDs from 1 to 4.

To encode the attribute column, we take High=1, Medium=2, and Low=3, so M_f = 3 (since there are three states).

Now, we normalize the ranks to the range 0 to 1 using the above formula:

High = (1-1)/(3-1) = 0,  Medium = (2-1)/(3-1) = 0.5,  Low = (3-1)/(3-1) = 1.

Finally, we calculate the dissimilarity as the absolute difference between the normalized values of the attribute (object 1 = 0, object 2 = 1, object 3 = 0.5, object 4 = 0):

- d(1,1) = |0-0| = 0
- d(2,1) = |1-0| = 1
- d(2,2) = |1-1| = 0
- d(3,1) = |0.5-0| = 0.5
- d(3,2) = |0.5-1| = 0.5
- d(3,3) = |0.5-0.5| = 0
- d(4,1) = |0-0| = 0
- d(4,2) = |0-1| = 1
- d(4,3) = |0-0.5| = 0.5
- d(4,4) = |0-0| = 0
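The same example can be reproduced in code. The sketch below uses the encoding High=1, Medium=2, Low=3 from the text and takes the absolute difference of the normalized values as the dissimilarity; the variable names are my own.

```python
# Encoding and objects taken from the example.
ranks = {"High": 1, "Medium": 2, "Low": 3}
objects = {1: "High", 2: "Low", 3: "Medium", 4: "High"}

M_f = len(ranks)  # number of states of the attribute
# Normalized value z_if = (r_if - 1) / (M_f - 1) for every object.
z = {oid: (ranks[state] - 1) / (M_f - 1) for oid, state in objects.items()}

def d(i, j):
    """Dissimilarity as the absolute difference of normalized ranks."""
    return abs(z[i] - z[j])

ids = sorted(objects)
# Lower triangle: row i holds d(i, 1), ..., d(i, i).
matrix = [[d(i, j) for j in ids if j <= i] for i in ids]

for i, row in zip(ids, matrix):
    print(i, row)
```

Reversing the encoding (Low=1, Medium=2, High=3) would give the same dissimilarities, since only the distance between ranks matters.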

End Notes

Thanks for reading!

This brings us to the end of our article on proximity measures for nominal and ordinal attributes. I hope you liked it. As for proximity measures for binary and numeric attributes, well, that's another blog post for another time.

If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the Link

Please feel free to contact me on Linkedin, Email.

Something not mentioned, or want to share your thoughts? Feel free to comment below and I'll get back to you.

Till then Stay Home, Stay Safe to prevent the spread of COVID-19 and Keep Learning!

Chirag Goyal

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering at the Indian Institute of Technology Jodhpur (IITJ). I am very enthusiastic about Machine Learning, Deep Learning, and Artificial Intelligence.

๐Ÿ‘ Proximity meaures Chirag Goyal - Student - Indian Institute of Technology , Jodhpur | LinkedIn

The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.

I am a B.Tech. student (Computer Science major) currently in the pre-final year of my undergrad. My interest lies in the field of Data Science and Machine Learning. I have been pursuing this interest and am eager to work more in these directions. I feel proud to share that I am one of the best students in my class who has a desire to learn many new things in my field.
