
Practical use: predict_proba
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from catboost.datasets import titanic
train_df, test_df = titanic()
#train_df.head()
df = train_df
print(df.shape)
df.head(3)
Using the classification model, let's check what happened to three randomly chosen passengers of the ill-fated cruise.
We select the passengers:
df.loc[df['PassengerId']==422]
df.loc[df['PassengerId']==20]
df.loc[df['PassengerId']==42]
Display variable types and unique values
a, b = df.shape  # <- how many columns we have
b
for i in range(1, b):
    col = df.columns[i]
    h = df[col].nunique()
    f = df[col].dtypes
    print(f, "---", h, "---", col)
Showing missing values
print('NUMBER OF EMPTY RECORDS vs. FULL RECORDS')
print('----------------------------------------')
for i in range(1, b):
    col = df.columns[i]
    r = df[col].isnull().sum()
    h = df[col].count()
    if r > 0:
        print(col, "--------", r, "--------", h)
We fill missing values with an out-of-range sentinel
df.fillna(-777, inplace=True)
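The -777 sentinel is a simple trick that tree-based models tolerate well. A common alternative, not used in this notebook, would be statistical imputation (Age and Embarked are the standard Titanic columns with gaps):
# alternative sketch: impute Age with the median, Embarked with the mode
# df['Age'] = df['Age'].fillna(df['Age'].median())
# df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])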
Model results
H0: the passenger survives (labeled 1)
H1: the passenger does not survive (labeled 0)
A label of 0 means the passenger drowned and died in the catastrophe.
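Before modeling it is worth checking how balanced the two classes are; a quick sketch:
# share of survivors (1) vs. non-survivors (0) in the training data
print(df['Survived'].value_counts(normalize=True))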
We encode the existing Sex column numerically so the model knows the passenger's sex.
df['Sex'] = df.Sex.map({'female':0, 'male':1})
Display DISCRETE variables
a, b = df.shape  # <- how many columns we have
b
print('DISCRETE FEATURES ONLY')
print('----------------------')
for i in range(1, b):
    col = df.columns[i]
    f = df[col].dtypes
    if f == object:  # np.object is deprecated; plain `object` works everywhere
        print(col, f)
6. Encoding discrete (categorical) variables
a, b = df.shape  # <- how many columns we have
b
print('DISCRETE FEATURES CODED')
print('------------------------')
for i in range(1, b):
    col = df.columns[i]
    f = df[col].dtypes
    if f == object:
        print(col, "---", f)
        df[col] = pd.Categorical(df[col]).codes
df.head()
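A quick sanity check, as a minimal sketch, confirms that no object columns remain after encoding:
# every column should now be numeric
print(df.dtypes)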
Splitting into training and test sets
We have a DataFrame X with variables in numeric formats (e.g. int64) and the target y kept separately, outside the DataFrame; we split both into training and test parts. The target could just as well have stayed inside the DataFrame as an additional column.
X = df.drop('Survived', axis=1)
y = df['Survived']
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)
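A quick look at the split sizes (train_test_split defaults to test_size=0.25, i.e. roughly a 3:1 split):
# shapes of the training and test parts
print(train_X.shape, test_X.shape)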
Forecasting the probability of surviving the Titanic
We know that surviving the Titanic catastrophe depended on a handful of features, such as sex and age.
We select the passenger from record no. 422:
pasażer_422 = test_X.loc[test_X['PassengerId']==422]
pasażer_422
model_RFC1 = RandomForestClassifier(random_state=0).fit(train_X, train_y)
model_RFC1.predict_proba(pasażer_422)
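The columns returned by predict_proba follow the order of model_RFC1.classes_, so it is worth printing that order once:
# for this target classes_ is [0 1]: first column = P(did not survive), second = P(survived)
print(model_RFC1.classes_)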
Passenger no. 20, Masselmani, Mrs. Fatima
A woman of unknown age, travelling in third class. The algorithm gave her an 80% chance of survival, and she did in fact survive the catastrophe.
pasażer_20 = test_X.loc[test_X['PassengerId']==20]  # select passenger no. 20 first
model_RFC1.predict_proba(pasażer_20)
passenger = 5
random_passenger = train_X.iloc[passenger]  # a single row, returned as a Series
random_passenger
data_array = random_passenger.values.reshape(1, -1)  # reshape into a 2D array with one sample
data_array
model_RFC1.predict_proba(data_array)
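For context, the overall quality of the model on the held-out test set can be checked with score; a hypothetical check that is not part of the original walkthrough:
# mean accuracy on the test set
print(model_RFC1.score(test_X, test_y))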
import shap
expl = shap.TreeExplainer(model_RFC1)
shap_values = expl.shap_values(pasażer_422)
shap.initjs()  # load the JavaScript needed to render force plots
shap.force_plot(expl.expected_value[1], shap_values[1], pasażer_422)  # index 1 = the "survived" class
Interpretation
base value – the average model output over the training data. output value – the model's output for this particular prediction. Features that push the prediction higher are shown in red, features that push it lower in blue; the width of each feature on the axis shows how much it moved the prediction above (red) or below (blue) the base value. For passenger no. 422 the prediction ended up below the base value, mainly because of sex. Sadly, the model correctly predicted this passenger's death.
shap_values = expl.shap_values(pasażer_20)
shap.initjs()
shap.force_plot(expl.expected_value[1], shap_values[1], pasażer_20)
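Beyond single passengers, a global view can be sketched with shap.summary_plot. This reuses the explainer on the whole test set; the [1] index again assumes the list-per-class output of the shap version used above:
# global feature-importance sketch for the "survived" class over the test set
shap_values_all = expl.shap_values(test_X)
shap.summary_plot(shap_values_all[1], test_X)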