This article will analyze the Pokemon dataset and predict whether a Pokemon is legendary based on the features provided. We will cover everything from scratch, going from CSV to model building with a line-by-line explanation of the code. Let’s get started.

👁 Pokemon Prediction using Random Forest

Image source: Pokejungle

Takeaways

Understand how to analyze the dataset before carrying forward to the model building phase.
Getting the insights from the data.
Visualization of the dataset.
Model building
Saving model.

About the dataset

This dataset has 721 rows, i.e. it describes the features of 721 unique Pokemon; for further details, visit this link.

👁 Dataset

Image source: Kaggle

Importing necessary libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier

Reading the dataset

pokemon_data = pd.read_csv('Pokemon Data.csv')

Now, let’s see what our dataset has in it!

poke = pd.DataFrame(pokemon_data)
poke.head()

Output:

👁 Importing necessary libraries | Output

Checking for null values

poke.isnull().sum()

Output:

Number 0
Name 0
Type_1 0
Type_2 371
Total 0
HP 0
Attack 0
Defense 0
Sp_Atk 0
Sp_Def 0
Speed 0
Generation 0
isLegendary 0
Color 0
hasGender 0
Pr_Male 77
Egg_Group_1 0
Egg_Group_2 530
hasMegaEvolution 0
Height_m 0
Weight_kg 0
Catch_Rate 0
Body_Style 0
dtype: int64
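Raw counts are easier to judge as a share of the rows. With 721 rows, the counts above work out to roughly 51.5% missing for Type_2, 73.5% for Egg_Group_2, and 10.7% for Pr_Male. The sketch below shows one way to compute such percentages; the `toy` frame and its values are made up for illustration and are not the article's data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Pokemon table (values are made up for illustration)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Type_2": ["Poison", np.nan, np.nan, "Flying"],
    "Pr_Male": [0.5, 0.875, np.nan, 0.5],
})

# isnull() gives a True/False table; the column mean of that table
# is the fraction of missing entries in each column
missing_pct = (toy.isnull().mean() * 100).round(1)

# Show the columns with the most missing data first
print(missing_pct.sort_values(ascending=False))
```

On the real data, the same expression applied to `poke` should rank Egg_Group_2 first.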

We have seen the null values as counts; now let’s visualize them using a heatmap.

plt.figure(figsize=(10,7))
sns.heatmap(poke.isnull(), cbar=False)

Output:

👁 Checking for null values | Output

Here it’s visible that Type_2, Pr_Male, and Egg_Group_2 are the columns that contain null values.

We have visualized the null values using the heatmap, but in that kind of visualization we can’t read off what share of each column is null, so we use seaborn’s displot.

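# displot is a figure-level function and creates its own figure,
# so a preceding plt.figure() call would only produce an empty figure
sns.displot(
 data=poke.isna().melt(value_name="missing"),
 y="variable",
 hue="missing",
 multiple="fill",
 aspect=2
)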

Output:

👁 Pokemon Prediction using Random Forest | Output

Let’s check the dimensions of our dataset.

poke.shape

Output:

(721, 23)

From the shape, it is clear that the dataset is small, so we can drop the columns that are mostly null, since filling in that many values could bias the dataset.

We have seen that Type_2, Egg_Group_2, and Pr_Male have null values.

poke['Pr_Male'].value_counts()

Output:

0.500 458
0.875 101
0.000 23
0.250 22
0.750 19
1.000 19
0.125 2
Name: Pr_Male, dtype: int64

Since the Type_2 and Egg_Group_2 columns have so many null values, we will remove those columns. You could impute them with other methods, but for simplicity we won’t do that here. We only fill the Pr_Male column, since it has only 77 missing values, using its most frequent value, 0.5.

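# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
poke['Pr_Male'] = poke['Pr_Male'].fillna(0.500)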
poke['Pr_Male'].isnull().sum()

Output:

0 # as we can see that there are no null values now.
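The fill value 0.5 was read off the value counts by eye. As a minimal sketch, the most frequent value can also be computed with `Series.mode()`; the `pr_male` series below is a made-up stand-in for the real column.

```python
import numpy as np
import pandas as pd

# Toy column standing in for Pr_Male (values are made up)
pr_male = pd.Series([0.5, 0.5, 0.875, np.nan, 0.5, np.nan], name="Pr_Male")

# mode() returns a Series because ties are possible; take the first entry
most_common = pr_male.mode().iloc[0]

# fillna returns a new Series, so assign the result to a name
filled = pr_male.fillna(most_common)

print(most_common, filled.isnull().sum())
```

Computing the value this way keeps the imputation correct if the data changes.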

Dropping unnecessary columns

new_poke = poke.drop(['Type_2', 'Egg_Group_2'], axis=1)

Now let’s look at the summary statistics of each numeric column to understand the range of its values.

new_poke.describe()

Note: min-max scaling maps a value x to (x - min) / (max - min). For a feature ranging from 20 to 20000, x = 300 becomes (300 - 20) / 19980 ≈ 0.014, a very small value.

Output:

👁 Image
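The min-max note above can be checked with a few lines of arithmetic. The helper name `min_max_scale` is hypothetical and only used for this illustration.

```python
def min_max_scale(x, lo, hi):
    """Map x from the range [lo, hi] onto [0, 1]."""
    return (x - lo) / (hi - lo)


# The numbers from the note: a feature ranging from 20 to 20000
scaled = min_max_scale(300, 20, 20000)
print(round(scaled, 4))  # 0.014

# The endpoints of the range map to 0 and 1
print(min_max_scale(20, 20, 20000), min_max_scale(20000, 20, 20000))
```

A value of 300 looks large on its own but lands close to 0 once the widest value in the column is 20000, so a few extreme values can compress the rest of the column.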

plt.figure(figsize=(10,10))
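# numeric_only=True skips text columns such as Name, which would otherwise raise an error in pandas 2.x
sns.heatmap(new_poke.corr(numeric_only=True),annot=True,cmap='viridis',linewidths=.5)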

Output:

👁 Pokemon Prediction using Random Forest | Output

The above is a correlation heatmap, which shows how strongly each feature is correlated with every other feature. A high correlation between two features means that one of them adds little information for the model when predicting.
It is usually up to you to decide which correlation value counts as high and to remove one feature from each such pair.
From the summary table above, it is also clear that the features have very different value ranges. This can make learning harder for scale-sensitive models, so such features are usually rescaled with scikit-learn's StandardScaler class; tree-based models such as the random forest used below do not depend on feature scale, so that step is not applied in this walkthrough.
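Picking out highly correlated pairs by eye gets tedious as the number of columns grows. The following is a minimal sketch of doing it programmatically. The `toy` table is made up, and the 0.9 cut-off is an assumption rather than a value used in this article.

```python
import numpy as np
import pandas as pd

# Toy numeric table (made-up values). Height_cm is Height_m in other
# units, so the two columns are perfectly correlated and one is redundant.
toy = pd.DataFrame({
    "Height_m": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Height_cm": [100.0, 200.0, 300.0, 400.0, 500.0],
    "Speed": [52, 49, 50, 49, 52],
})

THRESHOLD = 0.9  # hypothetical cut-off; choose one that suits your data

corr = toy.corr().abs()

# Keep only the upper triangle so every pair is looked at once
# and the diagonal (each feature with itself) is ignored
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Turn the matrix into (feature_a, feature_b) -> correlation entries
pairs = upper.stack()
high_pairs = pairs[pairs > THRESHOLD]

# Drop the second feature of every highly correlated pair
to_drop = sorted({second for _, second in high_pairs.index})
reduced = toy.drop(columns=to_drop)

print(high_pairs)
print(reduced.columns.tolist())
```

Which feature of a pair to drop is a judgment call; this sketch simply drops the one that appears later in the column order.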

new_poke['Type_1'].value_counts()

Output:

Water 105
Normal 93
Grass 66
Bug 63
Psychic 47
Fire 47
Rock 41
Electric 36
Ground 30
Poison 28
Dark 28
Fighting 25
Dragon 24
Ice 23
Ghost 23
Steel 22
Fairy 17
Flying 3
Name: Type_1, dtype: int64

Value counts of all the generations

new_poke['Generation'].value_counts()

Output:

5 156
1 151
3 135
4 107
2 100
6 72
Name: Generation, dtype: int64

Visualizing the categorical values

Here, for visualizing the categorical data, I’m using seaborn’s catplot() function. One can call the individual plotting functions (scatter-style, box, bar, and so on) separately, but catplot gives a unified interface to all of them, so I preferred it over the separate version of each plot.

Here, to count the Pokemon in each of the six generations, I’m using the count kind in catplot.

sns.catplot(x="Generation",kind="count",palette="ch:.25", data=poke)

Output:

👁 Visualizing Categorial Values | Pokemon Prediction using Random Forest

Inference: In the above graph, the 5th generation has the most Pokemon.

Here we use catplot’s default kind, a strip plot (a categorical scatter plot), to plot Generation against Defense so that we can examine the relationship between a Pokemon’s generation and its defence power.

sns.catplot(x="Generation", y="Defense", data=poke)

Output:

👁 Output 3| Pokemon Prediction using Random Forest

Inference: Here, we can see that only two Pokemon in generation 2 have the highest defence values. We can’t conclude from this that generation 2 has the highest defence overall, since those two points are outliers. Looking at the bulk of the points, generations 6 and 4 appear to have the highest defence capabilities.

Here we use a boxen plot (an enhanced box plot), because it helps us understand the variation in a larger dataset better; it also shows the outliers more clearly.

sns.catplot(x="Generation", y="Attack",kind="boxen", data=poke)

Output:

👁 Output 2 | Pokemon Prediction using Random Forest

In the above plot, we can see that there are a lot of outliers in generation 4 and generation 1 when it comes to attacking capabilities.
Also, generation 4 has a higher median attack value than all the other generations.

Now we use the bar kind of catplot to compare attack across generations, split by the hasGender column. Note that hasGender records whether a Pokemon has a gender at all, not whether it is male or female, so the two bars in each generation compare gendered and genderless Pokemon. For example, in generation 1 the ordering of the two groups differs from that in the other generations; that generation also has the lowest attacking power of all the generations.

sns.catplot(x="Generation", y="Attack",kind='bar',hue='hasGender', data=poke)

Output:

👁 Output | Pokemon Prediction using Random Forest

From the above graph, we can conclude that,

In generation 1 only, the relationship between the two hasGender groups (Pokemon with and without a gender) is reversed compared with the other generations.
Generation 6 has the highest attacking power, while generation 1 has the lowest attacking power.

new_poke['Color'].value_counts()

Output:

Blue 134
Brown 110
Green 79
Red 75
Grey 69
Purple 65
Yellow 64
White 52
Pink 41
Black 32
Name: Color, dtype: int64

new_poke['Egg_Group_1'].value_counts()

Output:

Field 169
Monster 74
Water_1 74
Undiscovered 73
Bug 66
Mineral 46
Flying 44
Amorphous 41
Human-Like 37
Fairy 30
Grass 27
Water_2 15
Water_3 14
Dragon 10
Ditto 1
Name: Egg_Group_1, dtype: int64

Let’s also consider the number of values in our target column

new_poke['isLegendary'].value_counts()

Output:

False 675
True 46
Name: isLegendary, dtype: int64
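These counts show a strongly imbalanced target, which matters when reading accuracy figures later. As a quick worked example, the accuracy of a trivial model that always answers "not legendary" follows directly from the two counts above.

```python
# Class counts taken from the value_counts() output above
n_not_legendary = 675
n_legendary = 46

total = n_not_legendary + n_legendary

# Accuracy of a trivial model that always predicts "not legendary"
baseline_accuracy = n_not_legendary / total

print(total)
print(f"{baseline_accuracy:.2%}")  # 93.62%
```

Any model trained on this data should therefore be judged against roughly 93.6% accuracy, not against 50%.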

Feature Engineering

Here we create new categories by merging related ones, so that the data is easier to work with afterwards.
This may seem odd to some, but the reason will become clear as we go.

# Merge related primary types. The replacement is restricted to Type_1 so that
# identically named values in other columns (e.g. 'Bug' in Egg_Group_1) stay untouched.
poke_type1 = new_poke.copy()
poke_type1['Type_1'] = poke_type1['Type_1'].replace({
    'Ice': 'Water',
    'Bug': 'Grass',
    'Ground': 'Rock',
    'Psychic': 'Dark', 'Ghost': 'Dark', 'Fairy': 'Dark',
    'Steel': 'Electric',
})

poke_type1['Type_1'].value_counts()

Output:

Grass 129
Water 128
Dark 115
Normal 93
Rock 71
Electric 58
Fire 47
Poison 28
Fighting 25
Dragon 24
Flying 3
Name: Type_1, dtype: int64

ref1 = dict(poke_type1['Body_Style'].value_counts())

poke_type1['Body_Style_new'] = poke_type1['Body_Style'].map(ref1)

You may be wondering what I did: I took the value counts of each body type and replaced each body type with its count (a frequency encoding); see below.

poke_type1['Body_Style_new'].head()

Output:

0 135
1 135
2 135
3 158
4 158
Name: Body_Style_new, dtype: int64

Let’s compare that with the original Body_Style column

poke_type1['Body_Style'].head()

Output:

0 quadruped
1 quadruped
2 quadruped
3 bipedal_tailed
4 bipedal_tailed
Name: Body_Style, dtype: object
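Comparing the two outputs, quadruped maps to 135 and bipedal_tailed to 158, i.e. each label is replaced by how often it occurs. The same idea on a made-up six-row table (the `toy` frame is an illustration, not the article's data):

```python
import pandas as pd

# Toy table with made-up rows
toy = pd.DataFrame({
    "Body_Style": [
        "quadruped", "quadruped", "bipedal_tailed",
        "quadruped", "bipedal_tailed", "serpentine_body",
    ]
})

# How often each category occurs
counts = toy["Body_Style"].value_counts()

# Replace every label with its frequency
toy["Body_Style_new"] = toy["Body_Style"].map(counts)

print(toy)
```

One caveat of this encoding is that two categories with the same count become indistinguishable to the model.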

Encoding data – features like Type_1 and Color

types_poke = pd.get_dummies(poke_type1['Type_1'])
color_poke = pd.get_dummies(poke_type1['Color'])

X = pd.concat([poke_type1, types_poke], axis=1)
X = pd.concat([X, color_poke], axis=1)

X.head()

Output:

👁 Encoding data | Pokemon Prediction using Random Forest
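An alternative sketch of the same one-hot encoding passes the frame and a `columns` list to `pd.get_dummies`, which prefixes each new column with its source column and drops the originals in one step. The `toy` frame is made up, and `dtype=int` is a choice made here to get 0/1 instead of True/False.

```python
import pandas as pd

# Toy table with made-up rows
toy = pd.DataFrame({
    "Type_1": ["Grass", "Fire", "Water"],
    "Color": ["Green", "Red", "Blue"],
    "HP": [45, 39, 44],
})

# Encode both categorical columns at once; the originals are replaced
# by prefixed indicator columns such as Type_1_Grass and Color_Green
encoded = pd.get_dummies(toy, columns=["Type_1", "Color"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The prefixes avoid ambiguity if a type and a colour ever share a name, and keep track of where each indicator column came from.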

Now that we have built some new features and encoded the categorical ones, what’s left is to remove the redundant columns.

X.columns

Output:

Index(['Number', 'Name', 'Type_1', 'Total', 'HP', 'Attack', 'Defense',
 'Sp_Atk', 'Sp_Def', 'Speed', 'Generation', 'isLegendary', 'Color',
 'hasGender', 'Pr_Male', 'Egg_Group_1', 'hasMegaEvolution', 'Height_m',
 'Weight_kg', 'Catch_Rate', 'Body_Style', 'Body_Style_new', 'Dark',
 'Dragon', 'Electric', 'Fighting', 'Fire', 'Flying', 'Grass', 'Normal',
 'Poison', 'Rock', 'Water', 'Black', 'Blue', 'Brown', 'Green', 'Grey',
 'Pink', 'Purple', 'Red', 'White', 'Yellow'],
 dtype='object')

X_ = X.drop(['Number', 'Name', 'Type_1', 'Color', 'Egg_Group_1'], axis=1)
X_.shape

Output:

(721, 38)

For comparison, let’s see the shape of the table before those columns were dropped

X.shape

Lastly, we define our target variable and set it into a variable called y

y = X_['isLegendary']
X_final = X_.drop(['isLegendary', 'Body_Style'], axis = 1)
X_final.columns

Output:

Index(['Total', 'HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed',
 'Generation', 'hasGender', 'Pr_Male', 'hasMegaEvolution', 'Height_m',
 'Weight_kg', 'Catch_Rate', 'Body_Style_new', 'Dark', 'Dragon',
 'Electric', 'Fighting', 'Fire', 'Flying', 'Grass', 'Normal', 'Poison',
 'Rock', 'Water', 'Black', 'Blue', 'Brown', 'Green', 'Grey', 'Pink',
 'Purple', 'Red', 'White', 'Yellow'],
 dtype='object')

X_final.head()

Output:

👁 Encoding Data | Pokemon Prediction using Random Forest

Creating and training our model

Splitting the dataset into training and testing sets

Xtrain, Xtest, ytrain, ytest = train_test_split(X_final, y, test_size=0.2)
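With only 46 legendary Pokemon among 721 rows, a purely random split can leave the test set with very few positives, and without a seed the split changes on every run. A minimal sketch of a stratified, reproducible split follows. The data is synthetic, and `stratify` and `random_state` are additions that the article's own split does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class balance as the article's target:
# 675 non-legendary (False) and 46 legendary (True) rows
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(721, 3))
y_demo = np.array([False] * 675 + [True] * 46)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    stratify=y_demo,   # keep the legendary ratio similar in both splits
    random_state=42,   # make the split reproducible
)

print(len(y_tr), len(y_te))
print(int(y_tr.sum()), int(y_te.sum()))
```

Using these arguments on the real data would likely change the reported scores slightly, since the rows in each split would differ.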

Using random forest classifier for training our model

random_model = RandomForestClassifier(n_estimators=500, random_state = 42)

Fitting the model

model_final = random_model.fit(Xtrain, ytrain)
y_pred = model_final.predict(Xtest)

Checking the accuracy on the training set

random_model_accuracy = round(model_final.score(Xtrain, ytrain)*100,2)
print(round(random_model_accuracy, 2), '%')

Output:

100.0 %

Getting the accuracy of the model on the test set

random_model_accuracy1 = round(random_model.score(Xtest, ytest)*100,2)
print(round(random_model_accuracy1, 2), '%')

Output:

99.31 %
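Accuracy alone can hide mistakes on the rare class, and the `classification_report` imported at the top is well suited to exposing them. The sketch below uses ten made-up labels rather than the article's predictions, purely to show the mechanics.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for 10 test Pokemon (made up for illustration)
y_true = [False] * 8 + [True] * 2
y_hat = [False] * 8 + [True, False]  # one legendary Pokemon is missed

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Recall for the legendary class: found legendaries / all legendaries
legendary_recall = cm[1, 1] / cm[1].sum()
print(legendary_recall)

print(classification_report(
    y_true, y_hat, target_names=["regular", "legendary"]
))
```

In this toy case the accuracy is 90%, yet half of the legendary Pokemon are missed. On the real model the same calls would take `ytest` and `y_pred`.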

Saving the model to disk

import pickle
filename = 'pokemon_model.pickle'
# Use a context manager so the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(model_final, f)

Load the model from the disk

filename = 'pokemon_model.pickle'
# Only unpickle files you trust; loading a pickle can execute arbitrary code
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(Xtest, ytest)

result*100

Output:

99.3103448275862

Conclusion

Here I conclude the legendary Pokemon prediction with 99% accuracy on the test set. This might be an overfit model; having said that, the dataset was not so complex that it would lead to such a situation. All suggestions and improvements are always welcome.
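One common way to probe the overfitting question is cross-validation, which scores the model on several different train/test partitions instead of one. The following is a minimal sketch on synthetic data generated with `make_classification`. The data, the fold count, and the `balanced_accuracy` metric are all assumptions made for illustration, not part of the article's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the Pokemon features and target
X_demo, y_demo = make_classification(
    n_samples=721,
    n_features=10,
    n_informative=5,
    weights=[0.94, 0.06],  # roughly the legendary / non-legendary ratio
    random_state=0,
)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Balanced accuracy averages the recall of both classes, so it is
# not inflated by the large majority class
scores = cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring="balanced_accuracy"
)

print(scores.round(3))
print(round(scores.mean(), 3))
```

If the fold scores on the real data were all close to the single test score, that would support the view that the 99% figure is not an artifact of one lucky split.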

Here’s the repo link to this article.

Here you can access my other articles, which are published on Analytics Vidhya as a part of the Blogathon (link)

If you have any queries, you can connect with me on LinkedIn; refer to this link

About me

Greetings to everyone! I’m currently working at TCS, and previously I worked as a Data Science Associate Analyst at Zorba Consulting India. Along with my full-time work, I have an immense interest in the same field, i.e. Data Science, along with other subsets of Artificial Intelligence such as Computer Vision, Machine Learning, and Deep Learning. Feel free to collaborate with me on any project in the domains mentioned above (LinkedIn).

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Aman Preet
