import time
start_time = time.time() ## timing: start of the time measurement
print(time.ctime())
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
I import data¶
import pandas as pd
df = pd.read_csv('/home/wojciech/Pulpit/1/kaggletrain.csv')
df.head(3)
df.shape
df.Survived.value_counts().plot(kind='pie', autopct='%1.1f%%')
I check data completeness and delete records with missing (NaN) values¶
df.isnull().sum()
import seaborn as sns
print('Observation of variables: ',df.shape)
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='BuPu')
I delete several columns and the records with empty values in the Age and Embarked columns¶
del df['Cabin']
del df['Ticket']
del df['Name']
del df['Unnamed: 0']
df=df.dropna(how='any')
df.isnull().sum()
Separating the columns with categorical and continuous data¶
df.columns
df.describe(include=["object"]).columns
df.describe(include=[np.number]).columns
categorical_columns = ['Sex', 'Embarked']
numerical_columns = ['PassengerId', 'Age', 'SibSp','Parch', 'Fare','Pclass']
df['Parch'].value_counts()
We determine that the output variable is the 'Survived' column¶
outputs = ['Survived']
Encoding text variables¶
df.dtypes
We need to convert the types of the qualitative columns to category. We can do this with the astype() function, as shown below:
Introducing a new data type: 'category'¶
for category in categorical_columns:
    df[category] = df[category].astype('category')
df.dtypes
df['Sex'].cat.categories
df['Embarked'].cat.categories
Encoding the data¶
df.dtypes
Why did we encode the data in the 'category' format?¶
The basic purpose of separating categorical columns from numeric columns is that values in numeric columns can be fed directly into a neural network, whereas values in categorical columns must first be converted to numeric types.
categorical_columns
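Before stacking the codes, it can help to see which integer pandas assigned to each category; for string categories the codes follow alphabetical order. A quick optional check (not part of the original notebook):
for col in categorical_columns:
    mapping = dict(enumerate(df[col].cat.categories))  # code -> category label
    print(col, mapping)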
Conversion of categorical variables to a NumPy matrix¶
p1 = df['Sex'].cat.codes.values
p2 = df['Embarked'].cat.codes.values
NumP_matrix = np.stack([p1, p2], 1)
NumP_matrix[:10]
Creating a PyTorch tensor from the NumPy matrix¶
categorical_data = torch.tensor(NumP_matrix, dtype=torch.int64)
categorical_data[:10]
Conversion of the DataFrame's numeric columns to a PyTorch tensor¶
numerical_data = np.stack([df[col].values for col in numerical_columns], 1)
numerical_data = torch.tensor(numerical_data, dtype=torch.float)
numerical_data[:5]
Converting the output variable to a PyTorch tensor¶
outputs = torch.tensor(df[outputs].values).flatten()
outputs[:5]
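Note that nn.CrossEntropyLoss, which we use later, expects its targets to be integer class indices of dtype torch.int64 (a LongTensor); because 'Survived' is an integer column, torch.tensor() should already produce exactly that.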
Let's summarize the tensor shapes¶
print('categorical_data: ',categorical_data.shape)
print('numerical_data: ',numerical_data.shape)
print('outputs: ',outputs.shape)
Summary¶
We have transformed our categorical columns into numerical ones, in which each unique value is represented by a single integer (encoding – e.g. 'male' becomes 1). We could train the model on such columns, but there is a better way …
A better way is to represent each value in a categorical column as an N-dimensional vector instead of a single integer. This process is called embedding. A vector can capture more information and can find relationships between different categorical values in a more appropriate way. Therefore, we will represent the values in the categorical columns as N-dimensional vectors.
We need to define the embedding size (the vector dimensions) for all qualitative columns. There is no hard and fast rule for the number of dimensions. A good rule of thumb is to divide the number of unique values in the column by 2 (but not to exceed 50).
The script below creates a list of tuples containing the number of unique values and the embedding dimension for every qualitative column.
The rule is simple: the embedding matrix must always have more rows than the largest category code it will be asked to look up; that is why I added col_size + 2, which leaves a generous margin.
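To see what a single embedding layer does in isolation (a minimal sketch, separate from the model defined later): an nn.Embedding(4, 3) holds a learnable table with 4 rows and maps each integer code to its 3-dimensional row.
emb = nn.Embedding(4, 3)            # 4 possible codes, 3-dimensional vectors
codes = torch.tensor([0, 1, 1, 2])  # a column of encoded category values
print(emb(codes).shape)             # torch.Size([4, 3]) – one vector per code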
categorical_column_sizes = [len(df[column].cat.categories) for column in categorical_columns]
categorical_embedding_sizes = [(col_size+2, min(50, (col_size+5)//2)) for col_size in categorical_column_sizes]
print(categorical_embedding_sizes)
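Since 'Sex' has 2 categories and 'Embarked' has 3, the formula above works out to (2+2, min(50, (2+5)//2)) = (4, 3) and (3+2, min(50, (3+5)//2)) = (5, 4), so the printed list should be [(4, 3), (5, 4)].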
Splitting the set into training and test sets¶
total_records = df['Age'].count()
test_records = int(total_records * .2)
categorical_train_data = categorical_data[:total_records-test_records]
categorical_test_data = categorical_data[total_records-test_records:total_records]
numerical_train_data = numerical_data[:total_records-test_records]
numerical_test_data = numerical_data[total_records-test_records:total_records]
train_outputs = outputs[:total_records-test_records]
test_outputs = outputs[total_records-test_records:total_records]
To check that we have split the data into training and test sets correctly, let's print the lengths of the training and test records:
print('categorical_train_data: ',categorical_train_data.shape)
print('numerical_train_data: ',numerical_train_data.shape)
print('train_outputs: ', train_outputs.shape)
print('----------------------------------------------------')
print('categorical_test_data: ',categorical_test_data.shape)
print('numerical_test_data: ',numerical_test_data.shape)
print('test_outputs: ',test_outputs.shape)
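Note that the split above is sequential: the first 80% of rows become the training set and the last 20% the test set. If the file happened to be ordered (for example by class), such a split would be biased; shuffling first is safer. A minimal sketch of one way to do this (not part of the original notebook), to be run before the slicing above:
torch.manual_seed(0)                               # for reproducibility
perm = torch.randperm(categorical_data.shape[0])   # a random row order
categorical_data = categorical_data[perm]
numerical_data = numerical_data[perm]
outputs = outputs[perm]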
Creating the PyTorch classification model¶
class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        # One embedding layer per categorical column: ni rows, nf-dimensional vectors
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        all_layers = []
        num_categorical_cols = sum((nf for ni, nf in embedding_size))
        input_size = num_categorical_cols + num_numerical_cols

        # Hidden blocks: Linear -> ReLU -> BatchNorm -> Dropout for each size in `layers`
        for i in layers:
            all_layers.append(nn.Linear(input_size, i))
            all_layers.append(nn.ReLU(inplace=True))
            all_layers.append(nn.BatchNorm1d(i))
            all_layers.append(nn.Dropout(p))
            input_size = i

        # Output layer produces raw logits (no activation; CrossEntropyLoss handles softmax)
        all_layers.append(nn.Linear(layers[-1], output_size))
        self.layers = nn.Sequential(*all_layers)

    def forward(self, x_categorical, x_numerical):
        # Look up the embedding vector for each categorical column and concatenate them
        embeddings = []
        for i, e in enumerate(self.all_embeddings):
            embeddings.append(e(x_categorical[:, i]))
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)

        # Normalize the numeric features and append them to the embedded categoricals
        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        x = self.layers(x)
        return x
print('categorical_embedding_sizes: ',categorical_embedding_sizes)
print(numerical_data.shape[1])
model = Model(categorical_embedding_sizes, numerical_data.shape[1], 2, [200,100,50], p=0.4)
print(model)
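With the embedding sizes printed above, the categorical part of the input contributes 3 + 4 = 7 features and the numeric part another 6, so the first Linear layer in the printout should show in_features=13.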
Creating a loss function¶
#loss_function = torch.nn.MSELoss(reduction='sum')
loss_function = nn.CrossEntropyLoss()
#loss_function = nn.BCEWithLogitsLoss()
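nn.CrossEntropyLoss expects raw, unnormalized logits of shape (N, num_classes) together with integer class labels of shape (N,); it applies log-softmax internally, which is why the model's last layer has no activation. A tiny self-contained check (illustration only):
logits = torch.tensor([[2.0, 0.5], [0.1, 1.5]])  # raw scores for 2 samples, 2 classes
labels = torch.tensor([0, 1])                    # class indices, dtype int64
print(nn.CrossEntropyLoss()(logits, labels))     # a single scalar loss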
Defining the optimizer¶
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
#optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
#optimizer = torch.optim.Rprop(model.parameters(), lr=0.001, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
print('categorical_embedding_sizes: ',categorical_embedding_sizes)
print(numerical_data.shape[1])
print('categorical_train_data: ',categorical_train_data.shape)
print('numerical_train_data: ',numerical_train_data.shape)
print('outputs: ',train_outputs.shape)
y_pred = model(categorical_train_data, numerical_train_data)  # a single forward pass to check that the tensors fit the model
epochs = 1600
aggregated_losses = []

for i in range(epochs):
    i += 1
    y_pred = model(categorical_train_data, numerical_train_data)
    single_loss = loss_function(y_pred, train_outputs)
    aggregated_losses.append(single_loss.item())

    if i % 25 == 1:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')

    optimizer.zero_grad()
    single_loss.backward()
    optimizer.step()

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')
plt.plot(range(epochs), aggregated_losses, color='r')
plt.ylabel('Loss')
plt.xlabel('epoch');
Forecast based on the model¶
with torch.no_grad():
    y_val_train = model(categorical_train_data, numerical_train_data)
    loss = loss_function(y_val_train, train_outputs)
print(f'Loss train_set: {loss:.8f}')
with torch.no_grad():
    y_val = model(categorical_test_data, numerical_test_data)
    loss = loss_function(y_val, test_outputs)
print(f'Loss: {loss:.8f}')
Because we have determined that our output layer contains 2 neurons, each forecast contains 2 values. For example, the first 5 predicted values are:
print(y_val[:5])
The idea behind these predictions is that if the actual result is 0, the value at index 0 should be higher than the value at index 1, and vice versa. We can get the index of the largest value with the following script:
y_val = np.argmax(y_val, axis=1)
The call above returns the indices of the maximum values along the given axis.
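For example, np.argmax([[2.0, 0.5], [0.1, 1.5]], axis=1) returns array([0, 1]), because 2.0 is the largest value in the first row and 1.5 in the second.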
print(y_val[:195])
Because, in the originally predicted results for the first five records, the values at index 0 are greater than the values at index 1, we see 0 in the first five rows of the processed output.
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(test_outputs,y_val))
print(classification_report(test_outputs,y_val))
print(accuracy_score(test_outputs, y_val))
We save the whole model¶
torch.save(model,'/home/wojciech/Pulpit/3/byk2.pb')
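torch.save(model, ...) pickles the entire Module object, which ties the saved file to the exact class definition. A common, more portable alternative (a sketch only – the notebook itself saves the whole model; the file name below is hypothetical) is to save just the learned parameters:
torch.save(model.state_dict(), '/home/wojciech/Pulpit/3/byk2_state.pt')  # weights only
model2 = Model(categorical_embedding_sizes, numerical_data.shape[1], 2, [200,100,50], p=0.4)
model2.load_state_dict(torch.load('/home/wojciech/Pulpit/3/byk2_state.pt'))
model2.eval()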
We load the whole model back¶
KOT = torch.load('/home/wojciech/Pulpit/3/byk2.pb')
KOT.eval()
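Calling eval() matters here: it switches the Dropout layers (p=0.4) off and makes the BatchNorm layers use their running statistics instead of per-batch statistics, so the predictions become deterministic.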
By substituting other independent variables, we can obtain a vector of output predictions¶
A = categorical_train_data[::5]
A
B = numerical_train_data[::5]
B
y =train_outputs[::5]
y_pred_AB = KOT(A, B)
y_pred_AB[:10]
with torch.no_grad():
    y_val_AB = KOT(A, B)
    loss = loss_function(y_val_AB, y)
print(f'Loss train_set: {loss:.8f}')
y_val = np.argmax(y_val_AB, axis=1)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y,y_val))
print(classification_report(y,y_val))
print(accuracy_score(y, y_val))
print('Measuring the time to complete this task:')
print((time.time() - start_time)/60) ## timing: end of the time measurement (minutes)
