Personality Prediction Project using ML

Last Updated : 29 Aug, 2025

The Myers-Briggs Type Indicator (MBTI) framework classifies personalities into 16 distinct types based on four dimensions that describe how people perceive the world and make decisions. In this project we predict a person's MBTI type from their text and their answers to an MBTI-style survey. Let's build a machine learning model in which:

A dataset of social media posts labeled with MBTI types is used for training.
The textual data is converted into numerical features using TF-IDF vectorization, which captures the importance of words.
The text features are combined with simulated or collected questionnaire answers representing preferences in social behavior, information processing, decision making, work style and values.
A Random Forest classifier is trained on this hybrid data to predict the personality type.

Step-by-Step Implementation

Let's build our prediction model step by step and use it to predict our personality type:

Step 1: Install dependencies

We install the required packages:

sentence-transformers to generate embeddings for semantic similarity and search.
chromadb for vector database storage of user profiles.
joblib for saving and loading models.
pandas and numpy for data manipulation and numerical operations.
scikit-learn and scipy for the machine learning modules and sparse matrices.
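Before moving on, it can help to confirm that every package is importable. The snippet below is only a sketch: the helper name `missing_packages` is hypothetical, and the mapping lists the usual pip name next to the name used in import statements (for example, scikit-learn is imported as sklearn).

```python
import importlib.util

# pip package name -> name used in `import` statements
REQUIRED = {
    "sentence-transformers": "sentence_transformers",
    "chromadb": "chromadb",
    "joblib": "joblib",
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be imported."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```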

Step 2: Import Libraries and Load Data

We import the required libraries for our model and load the MBTI dataset, which contains user posts and their MBTI labels.

pandas: Used for data manipulation and loading CSV files.
LabelEncoder: Converts MBTI personality type labels (strings) into numeric codes for classification.
train_test_split: Splits dataset into training and testing subsets.
TfidfVectorizer: Converts user text data (posts) into numerical vectors using TF-IDF vectorization.

The MBTI dataset can be downloaded from here.
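A minimal sketch of the imports and the loading step is shown below. It assumes the CSV has a `type` column with the MBTI label and a `posts` column with the user's text; the article does not state the column names or the file name, so `mbti_1.csv` in the comment is a placeholder. A tiny inline sample replaces the real file so the snippet runs on its own.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Tiny inline stand-in for the real CSV so the snippet runs on its own.
# Assumption: a `type` column (MBTI label) and a `posts` column (user text).
SAMPLE_CSV = io.StringIO(
    "type,posts\n"
    'INFP,"I enjoy quiet evenings and writing poetry"\n'
    'ENTJ,"Planning the next product launch with my team"\n'
    'ISTP,"Fixed my motorbike engine over the weekend"\n'
)

# For the real data this would be something like pd.read_csv("mbti_1.csv").
df = pd.read_csv(SAMPLE_CSV)

print(df.shape)
print(df["type"].value_counts())
```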

Step 3: Encode Personality Labels and Split Dataset

We encode the labels and split the dataset for training and testing:

LabelEncoder transforms MBTI labels into integers in alphabetical order (e.g., 'INFP' -> 9 when all 16 types are present).
Posts (X_text) are separated from label codes (y).
Split: 80% training data, 20% testing to evaluate model generalization.
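The encoding and split can be sketched as follows. The toy corpus is a stand-in for the real posts, and `random_state=42` in the split is an assumption made here for reproducibility.

```python
from itertools import product

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Build all 16 MBTI type strings, e.g. "INFP", from the four letter pairs.
mbti_types = ["".join(letters) for letters in product("EI", "NS", "FT", "JP")]

# Toy corpus: five placeholder posts per type (stand-in for the real dataset).
labels = [t for t in mbti_types for _ in range(5)]
posts = [f"placeholder post {i} written by type {t}" for i, t in enumerate(labels)]

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)  # codes follow sorted label order
X_text = posts

# 80% of the samples for training, 20% held out for testing.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42
)

print(label_encoder.transform(["INFP"])[0])  # 9 when all 16 types are present
print(len(X_train_text), len(X_test_text))
```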

Step 4: TF-IDF Vectorization of Text Data

Now we:

Convert raw text posts into sparse matrices of TF-IDF features.
Limit the vocabulary to the 3000 most frequent words for tractability.
Remove common English stop words to reduce noise.
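A short sketch of this step, using a few made-up posts. Fitting the vectorizer on the training posts only and reusing it for the test posts is a common choice and an assumption here; the article does not say how the original notebook handles it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

X_train_text = [
    "I love the quiet of the library and reading novels",
    "We are planning a big party for the whole team",
    "Debugging the engine was the best part of my day",
]
X_test_text = ["Reading about engines in the library"]

# Keep at most 3000 terms and drop common English stop words.
vectorizer = TfidfVectorizer(max_features=3000, stop_words="english")

# Learn vocabulary and IDF weights from the training posts only,
# then reuse them for the test posts.
X_train_tfidf = vectorizer.fit_transform(X_train_text)
X_test_tfidf = vectorizer.transform(X_test_text)

print(X_train_tfidf.shape, X_test_tfidf.shape)
print("the" in vectorizer.vocabulary_)
```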

Step 5: Simulate Questionnaire Data for Training

The dataset contains only posts and labels, so we simulate questionnaire answers for the training samples.
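One simple way to do this, shown here as an assumption rather than the original notebook's method, is to draw a random 0/1 answer for each of the five questions. The helper name `simulate_answers` is hypothetical. Purely random answers are independent of the label, so they mainly keep the feature layout identical to the one used later for real user input.

```python
import numpy as np

N_QUESTIONS = 5  # the questionnaire in this article has five binary questions


def simulate_answers(n_samples, n_questions=N_QUESTIONS, seed=42):
    """Draw one random 0/1 answer per question for every sample."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(n_samples, n_questions))


train_answers = simulate_answers(n_samples=8)
print(train_answers.shape)
```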

Step 6: Combine Text and Questionnaire Features

Now we:

Horizontally stack the TF-IDF vectors and the questionnaire answer vectors.
Combine text content and survey responses into one feature matrix.
Use hstack, which efficiently handles sparse text vectors combined with dense questionnaire data.
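The stacking can be sketched with `scipy.sparse.hstack` as below. The posts and answers are made up, and wrapping the answers in `csr_matrix` is one way (an assumption here) to keep the combined matrix sparse.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

posts = [
    "I recharge by spending time alone with a book",
    "Meeting new people at events gives me energy",
]
answers = np.array([[0, 1, 0, 1, 1], [1, 0, 1, 0, 1]])  # one row per post

X_tfidf = TfidfVectorizer(stop_words="english").fit_transform(posts)

# Wrap the dense answers as sparse so the stacked result stays sparse.
X_combined = hstack([X_tfidf, csr_matrix(answers)]).tocsr()

print(X_tfidf.shape, answers.shape, X_combined.shape)
```

The last five columns of `X_combined` hold the questionnaire answers; every column before them is a TF-IDF feature.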

Step 7: Train Random Forest Model and Evaluate Performance

RandomForestClassifier: Random Forest classifier is an ensemble tree-based model that combines many decision trees to improve accuracy and reduce overfitting.
n_estimators=100 specifies 100 trees in the forest.
random_state=42 ensures results can be reproduced.
After training on both text features and questionnaire answers, it predicts on the unseen test set.
accuracy_score: Shows overall proportion of correctly predicted instances.
classification_report: Provides detailed metrics per MBTI category for a nuanced evaluation.
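Put together, training and evaluation might look like the sketch below. It uses a toy two-class corpus and random questionnaire answers in place of the real data, so the printed scores say nothing about performance on the MBTI dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Toy two-class corpus standing in for the MBTI posts.
posts = [f"quiet reading alone book {i}" for i in range(20)] + [
    f"party friends crowd music {i}" for i in range(20)
]
y = np.array([0] * 20 + [1] * 20)
answers = np.random.default_rng(42).integers(0, 2, size=(len(posts), 5))

text_train, text_test, ans_train, ans_test, y_train, y_test = train_test_split(
    posts, answers, y, test_size=0.2, random_state=42
)

# Build the hybrid feature matrices (TF-IDF columns + answer columns).
vectorizer = TfidfVectorizer(max_features=3000, stop_words="english")
X_train = hstack([vectorizer.fit_transform(text_train), csr_matrix(ans_train)]).tocsr()
X_test = hstack([vectorizer.transform(text_test), csr_matrix(ans_test)]).tocsr()

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
```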

Output:

[Screenshot: Training and Testing]

Step 8: Save Trained Model and Vectorizer for Use

Now we save the trained Random Forest model and all encoders/vectorizers to disk. These files are loaded later for interactive prediction after deployment.

To know more about saving and reusing the model we can refer to: Save and Load Machine Learning Models.
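A minimal sketch of the saving step with `joblib.dump` follows. The file names are illustrative, not taken from the original notebook, and small fitted objects stand in for the real trained artifacts.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Small fitted objects standing in for the real trained artifacts.
posts = ["quiet book alone", "party crowd music"]
label_encoder = LabelEncoder().fit(["INFP", "ESFP"])
vectorizer = TfidfVectorizer().fit(posts)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(
    vectorizer.transform(posts), label_encoder.transform(["INFP", "ESFP"])
)

# File names are illustrative; a temporary folder keeps the example tidy.
out_dir = tempfile.mkdtemp()
artifacts = {
    "mbti_model.pkl": model,
    "tfidf_vectorizer.pkl": vectorizer,
    "label_encoder.pkl": label_encoder,
}
for file_name, obj in artifacts.items():
    joblib.dump(obj, os.path.join(out_dir, file_name))

print(sorted(os.listdir(out_dir)))
```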

Step 9: Load Saved Models and Personality Description File

Here we:

Load the trained classifier, vectorizer and label encoder for inference.
Load a JSON file with textual personality descriptions for each MBTI type.
This allows showing detailed feedback on predictions.

The JSON file with personality descriptions can be downloaded from here.
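The loading step could look like the sketch below. It assumes the JSON file maps each MBTI type to a description string; the article does not show the file's structure, and the file names and the `describe` helper are hypothetical. Stand-in files are written first so the snippet runs on its own.

```python
import json
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Create stand-in files first so the loading code below has something to read.
work_dir = tempfile.mkdtemp()
encoder_path = os.path.join(work_dir, "label_encoder.pkl")
descriptions_path = os.path.join(work_dir, "mbti_descriptions.json")

joblib.dump(LabelEncoder().fit(["ENTJ", "INFP"]), encoder_path)
with open(descriptions_path, "w", encoding="utf-8") as f:
    json.dump({"INFP": "Placeholder description for INFP."}, f)

# Loading step: the classifier and vectorizer would be loaded the same way.
label_encoder = joblib.load(encoder_path)
with open(descriptions_path, encoding="utf-8") as f:
    descriptions = json.load(f)


def describe(mbti_type):
    """Return the stored description, or a fallback if the type is missing."""
    return descriptions.get(mbti_type, "No description available.")


print(list(label_encoder.classes_))
print(describe("INFP"))
```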

Step 10: Questionnaire Setup and Interactive User Input

Now we:

Define the 5 MBTI survey questions with two answer options each.
Get a freeform self-description from the user.
Ask each MBTI question in sequence and collect the responses as binary 0/1 values.
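The interactive part can be sketched as below. The question wording is invented for illustration and only follows the five areas named in the introduction. The function names and the `input_fn` parameter are also hypothetical; `input_fn` lets the prompts be driven by a script as well as by the keyboard.

```python
# Question wording is illustrative; the topics follow the five areas named
# in the introduction. Option 1 is stored as 0 and option 2 as 1.
QUESTIONS = [
    ("Social behavior: at a gathering you usually",
     "meet as many people as possible", "talk with a few close friends"),
    ("Information processing: you trust more",
     "concrete facts", "patterns and possibilities"),
    ("Decision making: you decide mostly with",
     "logic", "personal values"),
    ("Work style: you prefer",
     "a fixed plan", "keeping options open"),
    ("Values: you care more about",
     "being fair", "being compassionate"),
]


def ask_questions(questions=QUESTIONS, input_fn=input):
    """Ask each question and return the answers as a list of 0/1 integers."""
    answers = []
    for prompt, option_1, option_2 in questions:
        choice = ""
        # Repeat the question until the reply is a valid option number.
        while choice not in ("1", "2"):
            choice = input_fn(f"{prompt}\n  1) {option_1}\n  2) {option_2}\n> ").strip()
        answers.append(int(choice) - 1)
    return answers


def collect_profile(input_fn=input):
    """Collect the free-form self-description, then the questionnaire answers."""
    user_text = input_fn("Describe yourself in a few sentences:\n> ").strip()
    return user_text, ask_questions(input_fn=input_fn)
```

In an interactive session, `user_text, answers = collect_profile()` prompts for the description and then for each question in turn.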

Output:

[Screenshot: Questions]

Step 11: Vectorize Input and Combine Features

Converts the user’s text into a TF-IDF vector (same space as training).
Formats questionnaire answers as a numeric feature vector.
Stacks both into one hybrid vector for prediction.
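A sketch of building the hybrid vector for one user is shown below. A small vectorizer fitted on two made-up posts stands in for the one saved after training.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the vectorizer fitted during training.
vectorizer = TfidfVectorizer(stop_words="english").fit(
    ["I enjoy quiet evenings with a book", "I love loud parties and crowds"]
)

user_text = "I am a calm person who enjoys a quiet book"
answers = [0, 1, 0, 1, 1]

# transform (not fit_transform) keeps the vocabulary learned in training.
user_tfidf = vectorizer.transform([user_text])
user_answers = csr_matrix(np.array(answers).reshape(1, -1))
user_features = hstack([user_tfidf, user_answers]).tocsr()

print(user_features.shape)
```

The column order (TF-IDF features first, answers last) has to match the order used when the model was trained.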

Step 12: Make Personality Prediction and Output Description

Now we:

Pass the combined features through the trained model to predict the MBTI label code.
Convert the numeric MBTI code back to its string label.
Retrieve and print the detailed MBTI description for user clarity.
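The prediction step can be sketched as follows. The tiny two-type model and the placeholder descriptions are stand-ins for the saved artifacts, so the predicted type here is only illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Stand-ins for the artifacts loaded in Step 9.
train_posts = ["quiet book alone reading", "party crowd music dancing"] * 5
train_types = ["INFP", "ESFP"] * 5
train_answers = np.tile([[0, 1, 0, 1, 1], [1, 0, 1, 0, 0]], (5, 1))
descriptions = {"INFP": "Placeholder INFP text.", "ESFP": "Placeholder ESFP text."}

label_encoder = LabelEncoder().fit(train_types)
vectorizer = TfidfVectorizer().fit(train_posts)
X_train = hstack([vectorizer.transform(train_posts), csr_matrix(train_answers)]).tocsr()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, label_encoder.transform(train_types))

# Features for one user, built the same way as in Step 11.
user_features = hstack(
    [vectorizer.transform(["I like reading a quiet book"]),
     csr_matrix([[0, 1, 0, 1, 1]])]
).tocsr()

predicted_code = model.predict(user_features)  # array of label codes
predicted_type = label_encoder.inverse_transform(predicted_code)[0]

print("Predicted type:", predicted_type)
print(descriptions.get(predicted_type, "No description available."))
```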

Output:

[Screenshot: Personality Predicted by Model]

As the output shows, the model predicts a person's personality type from their self-description and questionnaire answers.

Step 13: Store the Profile in ChromaDB Vector Database

Here we:

Connect to ChromaDB (a local vector database) to store user profile embeddings.
Store the MBTI type, answers and user text as metadata for rich querying.
Use a unique UUID string as the identifier for each stored profile.
Persist the profile for future user comparisons, recommendations or analytics.
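A sketch of the storage step is given below. Several details are assumptions rather than the original notebook's choices: the database path, the collection name, the sentence-transformers model name `all-MiniLM-L6-v2`, and the use of the user's text as the input for the embedding. All helper names are hypothetical. `chromadb` and `sentence_transformers` are imported inside the functions so that the record-building helpers work without them.

```python
import json
import uuid


def open_collection(path="./personality_db", name="personality_profiles"):
    """Open (or create) a persistent local ChromaDB collection.

    The path and collection name are hypothetical defaults.
    """
    import chromadb  # imported lazily so the helpers below work without it

    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(name=name)


def embed_text(text, model_name="all-MiniLM-L6-v2"):
    """Embed text with sentence-transformers (model name is an assumption)."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).encode(text).tolist()


def build_record(user_text, mbti_type, answers, embedding):
    """Assemble keyword arguments for `collection.add`."""
    return {
        "ids": [str(uuid.uuid4())],
        "embeddings": [list(embedding)],
        # Metadata values must be scalars, so the answer list is stored as text.
        "metadatas": [{
            "user_text": user_text,
            "mbti_type": mbti_type,
            "answers": json.dumps(answers),
        }],
    }


def save_profile(collection, user_text, mbti_type, answers, embedding):
    """Store one profile and return its generated id."""
    record = build_record(user_text, mbti_type, answers, embedding)
    collection.add(**record)
    print("Your profile has been saved to the personality database.")
    return record["ids"][0]
```

With the packages installed, a call such as `save_profile(open_collection(), user_text, predicted_type, answers, embed_text(user_text))` would persist one profile.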

Output:

Your profile has been saved to the personality database.

Step 14: Access the Database

We can query the ChromaDB database to retrieve the IDs and metadata of all saved profiles. The metadata holds each user's text, MBTI type and answers.
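Reading the profiles back can be sketched as below. The helper name `list_profiles` is hypothetical, and the function accepts any object that behaves like a ChromaDB collection. Converting the stored answers string back into a list is an optional extra added here.

```python
import json


def list_profiles(collection):
    """Return (ids, metadatas) for every stored profile.

    `collection.get()` with no filters returns all records as a dict that
    includes the keys "ids" and "metadatas".
    """
    result = collection.get()
    ids = result["ids"]
    metadatas = []
    for meta in result["metadatas"]:
        meta = dict(meta)
        # Answers were stored as a JSON string; turn them back into a list.
        meta["answers"] = json.loads(meta["answers"])
        metadatas.append(meta)
    return ids, metadatas
```

For example, `ids, metadatas = list_profiles(collection)` followed by `print("Stored profile IDs:", ids)` lists the saved identifiers.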

Output:

Stored profile IDs: ['ff6ea2d8-0b78-47ea-b125-0d9baec116a2', '3665925b-1b07-489b-9108-7f4ad3914618']
Stored metadata example: [{'user_text': 'I am a calm person and an extrovert. I love to to explore things', 'mbti_type': 'INFP', 'answers': '[0, 1, 0, 1, 1]'},
{'mbti_type': 'INFP', 'answers': '[1, 0, 1, 0, 1]', 'user_text': 'I am a sad person'}]

The complete notebook can be downloaded from here.

URL: https://www.geeksforgeeks.org/machine-learning/overview-of-personality-prediction-project-using-ml/