How to calculate the probability of surviving the Titanic catastrophe


Practical use: predict_proba

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
In [2]:
from catboost.datasets import titanic

train_df, test_df = titanic()

#train_df.head()

df = train_df 
print(df.shape)
df.head(3)
(891, 12)
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

Using the classification model, let's check what happened to three randomly chosen passengers of the ill-fated cruise.

We choose passengers:

In [3]:
df.loc[df['PassengerId']==422]
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
421 422 0 3 Charters, Mr. David male 21.0 0 0 A/5. 13032 7.7333 NaN Q
In [4]:
df.loc[df['PassengerId']==20]
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.225 NaN C
In [5]:
df.loc[df['PassengerId']==42]
Out[5]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
41 42 0 2 Turpin, Mrs. William John Robert (Dorothy Ann … female 27.0 1 0 11668 21.0 NaN S
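The passenger lookups above use `df.loc` with a boolean mask. A minimal sketch of the same pattern on a toy frame (hypothetical values, not the Titanic data):

```python
import pandas as pd

# Toy frame standing in for the Titanic data (hypothetical values)
df = pd.DataFrame({"PassengerId": [1, 2, 3],
                   "Name": ["A", "B", "C"]})

# df.loc with a boolean mask returns every matching row as a DataFrame
row = df.loc[df["PassengerId"] == 2]
print(row["Name"].iloc[0])  # B
```

Because the mask can match several rows, the result is always a DataFrame, even for a single hit.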

Display variable types and unique values

In [6]:
import numpy as np
a, b = df.shape     # <- number of columns

for col in df.columns[1:]:
    h = df[col].nunique()   # number of unique values
    f = df[col].dtypes      # column dtype

    print(f, "---", h, "---", col)
int64 --- 2 --- Survived
int64 --- 3 --- Pclass
object --- 891 --- Name
object --- 2 --- Sex
float64 --- 88 --- Age
int64 --- 7 --- SibSp
int64 --- 7 --- Parch
object --- 681 --- Ticket
float64 --- 248 --- Fare
object --- 147 --- Cabin
object --- 3 --- Embarked

Showing missing values

In [7]:
print('NUMBER OF EMPTY RECORDS vs. FULL RECORDS')
print('----------------------------------------')
for col in df.columns[1:]:
    r = df[col].isnull().sum()   # missing values
    h = df[col].count()          # non-missing values

    if r > 0:
        print(col, "--------", r, "--------", h)
NUMBER OF EMPTY RECORDS vs. FULL RECORDS
----------------------------------------
Age -------- 177 -------- 714
Cabin -------- 687 -------- 204
Embarked -------- 2 -------- 889

We replace missing values with an out-of-range sentinel value

In [8]:
df.fillna(-777, inplace=True)
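A quick sanity check that the sentinel fill really removes every NaN, sketched on a toy frame (illustrative values, not the actual Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame with missing entries (illustrative values)
df = pd.DataFrame({"Age": [22.0, np.nan, 30.0],
                   "Cabin": ["C85", None, None]})

df.fillna(-777, inplace=True)   # out-of-range sentinel, as in the notebook

print(int(df.isnull().sum().sum()))  # 0 - no missing values remain
print(df["Age"].iloc[1])             # -777.0
```

The sentinel keeps the rows usable for tree-based models, which can split the out-of-range value away from real ages.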

Model results

H0: the passenger survived (marked as: 1)
H1: the passenger did not survive (marked as: 0)

0 means the passenger died in the disaster.

We encode the existing Sex column numerically so the model knows the passenger's sex.

In [9]:
df['Sex'] = df.Sex.map({'female':0, 'male':1})
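One caveat with `Series.map` worth knowing: any value absent from the mapping dict silently becomes NaN. A small sketch on toy data:

```python
import pandas as pd

s = pd.Series(["female", "male", "male", "unknown"])
codes = s.map({"female": 0, "male": 1})

# Values absent from the mapping dict become NaN (and the dtype turns float)
print(codes.tolist())
```

For the Titanic Sex column this is safe, since only 'female' and 'male' occur, but it is worth checking on messier data.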

Display DISCRETE variables

In [10]:
a, b = df.shape     # <- number of columns

print('ONLY DISCRETE FUNCTION')
print('----------------------')
for col in df.columns[1:]:
    f = df[col].dtypes
    print(col, f)

    if f == object:
        # encode the first object column found, then stop
        df[col] = pd.Categorical(df[col]).codes
        break
ONLY DISCRETE FUNCTION
----------------------
Survived int64
Pclass int64
Name object

Digitizing and encoding discrete, categorical variables

In [11]:
a, b = df.shape     # <- number of columns

print('DISCRETE FUNCTIONS CODED')
print('------------------------')
for col in df.columns[1:]:
    f = df[col].dtypes
    if f == object:
        print(col, "---", f)
        # integer-encode each remaining object column
        df[col] = pd.Categorical(df[col]).codes


df.head()
DISCRETE FUNCTIONS CODED
------------------------
Ticket --- object
Cabin --- object
Embarked --- object
Out[11]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 108 1 22.0 1 0 523 7.2500 0 3
1 2 1 1 190 0 38.0 1 0 596 71.2833 82 1
2 3 1 3 353 0 26.0 0 0 669 7.9250 0 3
3 4 1 1 272 0 35.0 1 0 49 53.1000 56 3
4 5 0 3 15 1 35.0 0 0 472 8.0500 0 3
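The integer codes produced by `pd.Categorical(...).codes` follow the alphabetically sorted categories, and identical strings always share a code. A toy sketch (illustrative ticket strings):

```python
import pandas as pd

tickets = pd.Series(["PC 17599", "A/5 21171", "PC 17599"])
codes = pd.Categorical(tickets).codes

# Categories are sorted alphabetically; identical strings share a code
print(list(codes))  # [1, 0, 1]
```

Note that these codes are arbitrary labels, not ordered quantities; tree models handle them reasonably, but linear models would read a false ordering into them.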

Splitting into training and test sets

The feature matrix X now holds only numeric columns (e.g. int64), and the target y is kept as a separate Series; we split both into training and test sets. The target could just as well have remained in the DataFrame as an extra column.

In [12]:
X = df.drop('Survived', axis=1) 
y = df['Survived']  


train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)
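By default `train_test_split` holds out 25% of the rows for testing. A minimal sketch on toy data showing the resulting shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 toy samples, 2 features
y = np.arange(50) % 2

# Default split holds out 25% of the rows for testing
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)
print(train_X.shape, test_X.shape)  # (37, 2) (13, 2)
```

Fixing `random_state` makes the split reproducible, which is why the passenger lookups later in the notebook land in the same subset every run.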

Forecast of the probability of survival on the Titanic

We know that survival of the Titanic catastrophe was determined by several features, such as sex and age.
We select a random passenger, record no. 422.

In [13]:
pasażer_422 = test_X.loc[test_X['PassengerId']==422] 
pasażer_422
Out[13]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
421 422 3 155 1 21.0 0 0 534 7.7333 0 2

What is the probability that the passenger will survive?

We will use: predict_proba

Passenger no. 422 Charters, Mr. David

A young man of 21, traveling in third class. The model gave him a 30% chance of survival; unfortunately, the man died in the disaster.

In [14]:
model_RFC1 = RandomForestClassifier(random_state=0).fit(train_X, train_y)
model_RFC1.predict_proba(pasażer_422)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
Out[14]:
array([[0.7, 0.3]])
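The two columns of `predict_proba` follow the order of `model.classes_`: here column 0 is the probability of class 0 (died) and column 1 of class 1 (survived), and each row sums to one. A hedged sketch on a tiny separable toy problem (not the Titanic features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny separable toy problem (illustrative, not the Titanic features)
X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = np.array([0, 0, 1, 1])

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

print(clf.classes_)                 # [0 1] - column order of predict_proba
proba = clf.predict_proba([[1.05]])
print(proba.shape)                  # (1, 2)
print(float(proba.sum()))           # 1.0 - probabilities sum to one
```

So in the output above, `0.3` in the second column is the predicted survival probability for passenger 422.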

Passenger no. 20 Masselmani, Mrs. Fatima

A woman of unknown age, traveling in third class. The algorithm gave her an 80% chance of survival. The woman survived the catastrophe.

In [15]:

pasażer_20 = test_X.loc[test_X['PassengerId']==20]  
pasażer_20
Out[15]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
19 20 3 512 0 -777.0 0 0 184 7.225 0 1
In [16]:
model_RFC1.predict_proba(pasażer_20)
Out[16]:
array([[0.2, 0.8]])

We draw another passenger

Our randomly drawn passenger, Mrs. Turpin (Mrs. William John Robert Turpin),

traveling in second class at the age of 27, also did not survive the disaster. The model gave her a 10% chance of survival.

In [17]:
passenger = 5
random_passenger = train_X.iloc[passenger] 
random_passenger
Out[17]:
PassengerId     42.0
Pclass           2.0
Name           827.0
Sex              0.0
Age             27.0
SibSp            1.0
Parch            0.0
Ticket          53.0
Fare            21.0
Cabin            0.0
Embarked         3.0
Name: 41, dtype: float64
In [18]:
data_array = random_passenger.values.reshape(1, -1)
data_array
Out[18]:
array([[ 42.,   2., 827.,   0.,  27.,   1.,   0.,  53.,  21.,   0.,   3.]])
In [19]:
model_RFC1.predict_proba(data_array)
Out[19]:
array([[0.9, 0.1]])
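The `reshape(1, -1)` step above is needed because scikit-learn estimators predict on 2-D `(n_samples, n_features)` arrays, while `iloc` on a single row returns a 1-D vector. A minimal sketch (toy values):

```python
import numpy as np

row = np.array([42.0, 2.0, 27.0])  # one passenger as a 1-D vector (toy values)

# scikit-learn predicts on 2-D (n_samples, n_features) arrays,
# so a single sample must be reshaped into one row
batch = row.reshape(1, -1)
print(batch.shape)  # (1, 3)
```

The `-1` lets NumPy infer the number of columns, so the same call works for any feature count.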

What characteristics determined the model’s classification?

Passenger no. 422 Charters, Mr. David

We will use the SHAP tool. SHAP stands for SHapley Additive exPlanations.

In [22]:
import shap
expl = shap.TreeExplainer(model_RFC1)
In [27]:
shap_values = expl.shap_values(pasażer_422)
shap.initjs()
shap.force_plot(expl.expected_value[1], shap_values[1],pasażer_422)

Interpretation

base value – the average model output over the training data set.
output value – the model's prediction for this particular passenger.
Features pushing the prediction higher are shown in red, those pushing it lower in blue. Each feature is shown separately; its size on the axis shows how much it raised (red) or lowered (blue) the prediction relative to the base value. For passenger no. 422 the predicted survival probability was below average, mainly because of sex. Sadly, the model correctly predicted the passenger's death.

Passenger no. 20 Masselmani, Mrs. Fatima

In [29]:
shap_values = expl.shap_values(pasażer_20)
shap.initjs()
shap.force_plot(expl.expected_value[1], shap_values[1],pasażer_20)