How to survive on the Titanic?
For a long time I thought this dataset was just a synthetic sample, something like railway statistics or another theoretical exercise. When I realised that it is a real list of Titanic passengers, investigating it became far more exciting.
Let’s see once again what the data can tell us about this apocalyptic catastrophe.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('c:/1/kaggletrain.csv', skipinitialspace=True, index_col=False)
df.head(3)
del df['Cabin']
del df['Ticket']
del df['Unnamed: 0']
del df['PassengerId']
I removed the Cabin and Ticket columns; I don’t think the ticket number affected anyone’s survival of the voyage. I also dropped the leftover index column 'Unnamed: 0' and PassengerId, which carry no predictive information.
Now I will delete the records with missing values in the 'Embarked’ and 'Age’ columns.
df = df.dropna(how='any')
df.isnull().sum()
The incomplete records have been deleted. Now let’s examine the other independent variables.
df.Embarked.value_counts(normalize=True)
df.Sex.value_counts(normalize=True)
The SibSp variable (the number of siblings and spouses a passenger had aboard) is quite enigmatic, so let’s check it out!
df.SibSp.value_counts()
df.SibSp.value_counts(normalize=True).plot(kind='bar')
As we can see, 469 passengers travelled alone and 183 travelled with one family member.
Checking whether the outcome variable is balanced
I will not examine the features any further; I will only check whether the dependent variable is balanced, i.e. whether the number of rescued passengers is similar to the number of victims. The classes should be reasonably balanced for classification to work well. Fortunately for us as researchers, and unfortunately for all mankind, the number of people who died on the Titanic is similar to the number of people who were saved.
df.Survived.value_counts(normalize=True).plot(kind='bar')
Sklearn logistic regression
When building any model, you need to consider all available data, including (and sometimes above all) textual (categorical) data. To use textual data, it must first be converted into numeric form.
We divide the independent variables into categorical (text) and numeric ones
df.dtypes
del df['Name']
categorical = df.describe(include=["object"]).columns
continuous = df.describe().columns
categorical
continuous
We convert the categorical variables into numerically encoded ones
from sklearn.preprocessing import LabelEncoder
df[categorical] = df[categorical].apply(LabelEncoder().fit_transform)
df[categorical].sample(6)
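A side note: LabelEncoder imposes an arbitrary integer order on the categories (for Embarked, e.g. C < Q < S), which a linear model will treat as a real ordering. One-hot encoding avoids this; a minimal sketch of that alternative with pandas (not used in the rest of this walkthrough, and best applied to the original string columns before label encoding):
# One-hot encode the categorical columns instead of label-encoding them;
# drop_first=True avoids perfectly collinear dummy columns
df_onehot = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
df_onehot.head(3)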
We split the data into training and test sets
y = df['Survived']
X = df.drop('Survived' , axis=1)
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, test_size=0.33, stratify = y, random_state = 148)
print('Training X set: ', Xtrain.shape)
print('Test X set: ', Xtest.shape)
print('Training y set: ', ytrain.shape)
print('Test y set: ', ytest.shape)
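Because the split was stratified on y, the proportion of survivors should be almost identical in both parts; a quick sanity check:
# Stratified splitting should preserve the class proportions in both sets
print(ytrain.value_counts(normalize=True))
print(ytest.value_counts(normalize=True))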
Logistic Regression model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
parameters = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start=True)
LR_Grid = GridSearchCV(LR, param_grid=parameters, scoring='roc_auc', n_jobs=5, cv=2)
LR_Grid.fit(Xtrain, ytrain)
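Before evaluating on the test set, it is worth checking which regularisation strength the grid search selected; best_params_ and best_score_ are standard GridSearchCV attributes:
# The C value chosen by the grid search and its cross-validated ROC AUC
print('Best C:', LR_Grid.best_params_['C'])
print('Best CV ROC AUC:', LR_Grid.best_score_)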
Evaluation of the logistic regression model
ypred = LR_Grid.predict(Xtest)
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics
co_matrix = metrics.confusion_matrix(ytest, ypred)
co_matrix
print(classification_report(ytest, ypred))
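Since the grid search optimised ROC AUC, it is natural to report the same metric on the test set; roc_auc_score expects the predicted probability of the positive class:
# Probability of class 1 (survived) for each test passenger
yprob = LR_Grid.predict_proba(Xtest)[:, 1]
print('Test ROC AUC:', metrics.roc_auc_score(ytest, yprob))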
TensorFlow classification
import tensorflow as tf
df.head(5)
Step 1) Split the data into a training set and a test set.
df_train=df.sample(frac=0.8,random_state=200)
df_test=df.drop(df_train.index)
print(df_train.shape, df_test.shape)
Step 2) Convert the continuous variables into TensorFlow feature columns with the function:
tf.feature_column.numeric_column
Every column the model uses should be converted this way.
You can convert a single column, for example:
Age = tf.feature_column.numeric_column('Age')
Age
You can also convert all the columns to TensorFlow feature columns at once.
FEATURES = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
continuous_features = [tf.feature_column.numeric_column(k) for k in FEATURES]
continuous_features
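Treating the label-encoded Sex and Embarked as continuous numbers is a simplification. TensorFlow also offers proper categorical feature columns; a hedged sketch of that alternative (the bucket sizes assume the encoded values 0–1 for Sex and 0–2 for Embarked):
# Treat the label-encoded integers as categories rather than continuous values;
# a LinearClassifier can consume categorical columns directly
sex_cat = tf.feature_column.categorical_column_with_identity('Sex', num_buckets=2)
embarked_cat = tf.feature_column.categorical_column_with_identity('Embarked', num_buckets=3)
categorical_features = [sex_cat, embarked_cat]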
Model creation
The syntax of LinearClassifier is the same as for the TensorFlow linear regressor, except for one argument: n_classes. You define the feature columns and the model directory as before; compared with the linear regressor, you additionally specify the number of classes. For logistic regression the number of classes is 2.
1. Define the classifier
model = tf.estimator.LinearClassifier(
    n_classes=2,
    model_dir="ongoing/train7",
    feature_columns=continuous_features)
2. Create the input function
After defining the classifier, you can create an input function. The method is the same as for linear regression. Here we use a batch size of 128, and the data can optionally be shuffled.
FEATURES = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
LABEL = 'Survived'

def get_input_fn(data_set, num_epochs=None, n_batch=128, shuffle=True):
    # Wrap the pandas DataFrame in an input function that feeds the estimator
    return tf.estimator.inputs.pandas_input_fn(
        x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),
        y=pd.Series(data_set[LABEL].values),
        batch_size=n_batch,
        num_epochs=num_epochs,
        shuffle=shuffle)
3. Train the model
Let’s train the model with the model.train method. The input function defined earlier supplies the model with the appropriate values. Remember that we set the batch size to 128 and the number of epochs to None. The model will be trained for a thousand steps.
model.train(input_fn=get_input_fn(df_train,
                                  num_epochs=None,
                                  n_batch=128,
                                  shuffle=False),
            steps=1000)
4. Evaluate the performance of the model
The final loss after a thousand iterations was 60.84248. You can now evaluate the model on the test set to see its performance. Use the estimator’s evaluate method for this: feed the model the test set with the number of epochs set to 1, so that the data passes through the model only once.
model.evaluate(input_fn=get_input_fn(df_test,
                                     num_epochs=1,
                                     n_batch=128,
                                     shuffle=False),
               steps=1000)
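Finally, the trained estimator can also produce predictions. model.predict returns a generator of dictionaries; for a classifier the predicted class sits under the 'class_ids' key:
# Predict survival for the test passengers; one unshuffled pass over the data
pred_gen = model.predict(input_fn=get_input_fn(df_test, num_epochs=1, shuffle=False))
predicted_classes = [int(p['class_ids'][0]) for p in pred_gen]
predicted_classes[:10]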
