# Feature Selection Techniques (by filter methods): categorical input, categorical output

## categorical input – categorical output

260320201223

In this case, statistical methods are used:
We always have continuous and discrete variables in the data set.
This procedure applies to the relations of discrete independent variables in relation to discrete result variables.
Below I show the analysis of discrete variables when the resulting value is discrete.

Preparation of discrete data for the procedure

data from the Titanic disaster

In [1]:
import pandas as pd

print()
print(df.shape)
print()
print(df.columns)

(891, 13)

Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')


### We choose only discrete variables¶

Because we are testing discrete data with discrete data, there must be only discrete variables of type ‘object’ in the test set.
The following code throws out all independent variables identified by Pandas as continuous. This must be verified and possibly converted to a discrete format.
The ‘object’ format is the most economical of formats in terms of memory usage.

In [2]:
import numpy as np

continuous_vars = df.describe().columns
df[continuous_vars].agg(['nunique','dtypes','min','max','median'])

Out[2]:
Unnamed: 0 PassengerId Survived Pclass Age SibSp Parch Fare
nunique 891 891 2 3 88 7 7 248
dtypes int64 int64 int64 int64 float64 int64 int64 float64
min 0 1 0 1 0.42 0 0 0
max 890 891 1 3 80 8 6 512.329
median 445 446 0 3 28 0 0 14.4542

### We are replacing detected discrete variables with the ‘object’ format¶

In [3]:
df['Survived'] = df['Survived'].astype(object)
df['Pclass'] = df['Pclass'].astype(object)
df['SibSp'] = df['SibSp'].astype(object)
df['Parch'] = df['Parch'].astype(object)


### I can make a discrete variable out of the Age variable¶

In [4]:
Ewa = ['dziecko', 'młody','średni','starszy','stary']

df['Age2'] = pd.qcut(df['Age'],5, labels=Ewa)
df['Age2'] = df['Age2'].astype('object')

In [5]:
df[['Age','Age2']].sample(3)

Out[5]:
Age Age2
311 18.0 dziecko
712 48.0 stary
17 NaN NaN

Once again we check what could have been wrong.

In [6]:
import numpy as np
continuous_vars = df.describe().columns
df[continuous_vars].agg(['nunique','dtypes','min','max','median'])

Out[6]:
Unnamed: 0 PassengerId Age Fare
nunique 891 891 88 248
dtypes int64 int64 float64 float64
min 0 1 0.42 0
max 890 891 80 512.329
median 445 446 28 14.4542

## Checking categorical variables¶

In [7]:
categorical_vars = df.describe(include=["object"]).columns

import numpy as np

df[categorical_vars].agg(['nunique','dtypes','min','max','median'])

Out[7]:
Survived Pclass Name Sex SibSp Parch Ticket Cabin Embarked Age2
nunique 2 3 891 2 7 7 681 147 3 5
dtypes object object object object object object object object object object
min 0 1 Abbing, Mr. Anthony female 0 0 110152 NaN NaN NaN
max 1 3 van Melkebeke, Mr. Philemon male 8 6 WE/P 5735 NaN NaN NaN
median 0 3 NaN NaN 0 0 NaN NaN NaN NaN

## Checking and clearing empty records in discrete variables¶

In [8]:
df[categorical_vars].isnull().sum()

Out[8]:
Survived      0
Pclass        0
Name          0
Sex           0
SibSp         0
Parch         0
Ticket        0
Cabin       687
Embarked      2
Age2        177
dtype: int64

### Deleting unnecessary columns¶

In [9]:
del df['Cabin']
del df['Name']
del df['Ticket']

In [10]:
categorical_vars = df.describe(include=["object"]).columns
df[categorical_vars].isnull().sum()

Out[10]:
Survived      0
Pclass        0
Sex           0
SibSp         0
Parch         0
Embarked      2
Age2        177
dtype: int64
In [11]:
print(df.shape)
df=df.dropna(how='any')
print(df.shape)
df[categorical_vars].isnull().sum()

(891, 11)
(712, 11)

Out[11]:
Survived    0
Pclass      0
Sex         0
SibSp       0
Parch       0
Embarked    0
Age2        0
dtype: int64

## Last variable should be y¶

In [12]:
#del df['id']
categorical_vars

Out[12]:
Index(['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked', 'Age2'], dtype='object')
In [13]:
categorical_vars = ['Pclass','Sex','SibSp','Parch','Embarked','Age2','Survived']

In [14]:
print(df.info(memory_usage='deep'))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 11 columns):
Unnamed: 0     712 non-null int64
PassengerId    712 non-null int64
Survived       712 non-null object
Pclass         712 non-null object
Sex            712 non-null object
Age            712 non-null float64
SibSp          712 non-null object
Parch          712 non-null object
Fare           712 non-null float64
Embarked       712 non-null object
Age2           712 non-null object
dtypes: float64(2), int64(2), object(7)
memory usage: 284.4 KB
None

Out[14]:
Pclass Sex SibSp Parch Embarked Age2 Survived
0 3 male 1 0 S młody 0
1 1 female 1 0 C starszy 1
2 3 female 0 0 S średni 1

## Coding of discrete variables¶

In [15]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder

df2 = df[categorical_vars].apply(LabelEncoder().fit_transform)

Out[15]:
Pclass Sex SibSp Parch Embarked Age2 Survived
0 2 1 1 0 2 1 0
1 0 0 1 0 0 2 1
2 2 0 0 0 2 4 1
3 0 0 1 0 2 2 1

## I’m creating an array with dataframe¶

In [16]:
## tworzę array z dataframe
dataset = df2.values
dataset[:3]

Out[16]:
array([[2, 1, 1, 0, 2, 1, 0],
[0, 0, 1, 0, 0, 2, 1],
[2, 0, 0, 0, 2, 4, 1]])

## Divides data into describing variables and a result variable¶

In [17]:
X = dataset[:, :-1]
y = dataset[:,-1]
y[:25]

Out[17]:
array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0,
1, 0, 0])

## We divide the data into a training and test set¶

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)

Train (477, 6) (477,)
Test (235, 6) (235,)

In [19]:
X_train[:5]

Out[19]:
array([[2, 1, 0, 0, 0, 4],
[0, 0, 1, 0, 2, 2],
[1, 0, 0, 0, 2, 4],
[2, 0, 1, 1, 2, 0],
[2, 1, 0, 0, 2, 0]])
In [20]:
y_train[:35]

Out[20]:
array([0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0,
0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0])

## Definition¶

In [43]:
# Classification Assessment
def Classification_Assessment(model ,Xtrain, ytrain, Xtest, ytest, y_pred):
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, model.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, model.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, model.predict(Xtest)), decimals=4))
print("Precision Test data:      ", np.round(precision_score(ytest, model.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, model.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, model.predict(Xtest)))

y_pred_proba = model.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred)
auc = metrics.roc_auc_score(ytest, y_pred)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.2f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.legend(loc="lower right")
plt.legend(loc=4)
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)


# Model Logistic Regression on ordinary data¶

In [44]:
import numpy as np
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid.fit(X_train, y_train)
y_pred_LRC = LR_Grid.predict(X_test)

/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)

In [45]:
Classification_Assessment(LR_Grid ,X_train, y_train, X_test, y_test, y_pred_LRC)

Recall Training data:      0.6719
Precision Training data:   0.7771
----------------------------------------------------------------------
Recall Test data:          0.7396
Precision Test data:       0.7474
----------------------------------------------------------------------
Confusion Matrix Test data
[[115  24]
[ 25  71]]
----------------------------------------------------------------------
precision    recall  f1-score   support

0       0.82      0.83      0.82       139
1       0.75      0.74      0.74        96

accuracy                           0.79       235
macro avg       0.78      0.78      0.78       235
weighted avg       0.79      0.79      0.79       235


auc 0.7834607314148682


# Mutual Information Feature Selection: Mutual_info_classif ()¶

Mmutual Information is usually used in the construction of decision trees for selecting variables.

In [24]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectKBest

def select_features_MIC(X_train, y_train, X_test):
MIC = SelectKBest(score_func=mutual_info_classif, k='all')
MIC.fit(X_train, y_train)
X_train_MIC = MIC.transform(X_train)
X_test_MIC = MIC.transform(X_test)
return X_train_MIC, X_test_MIC, MIC

In [25]:
import time
start_time = time.time() ## pomiar czasu: start pomiaru czasu
print(time.ctime())

X_train_MIC, X_test_MIC, MIC = select_features_MIC(X_train, y_train, X_test)

print('minutes: ',
(time.time() - start_time)/60) ## koniec pomiaru czasu

Thu Mar 26 11:54:57 2020
minutes:  0.0006418029467264811


## Printing the results of the Mutual_info_classif () selection (the highest value is the best value)¶

In [26]:
df2.columns

Out[26]:
Index(['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked', 'Age2', 'Survived'], dtype='object')
In [27]:
for i in range(len(MIC.scores_)):
print('Feature %d: %f' % (i, MIC.scores_[i]))

Feature 0: 0.066209
Feature 1: 0.175240
Feature 2: 0.029764
Feature 3: 0.010300
Feature 4: 0.041607
Feature 5: 0.000000

In [28]:
importance_mic = np.round(MIC.scores_, decimals=3)
importance_mic

Out[28]:
array([0.066, 0.175, 0.03 , 0.01 , 0.042, 0.   ])
In [29]:
KOT_MIC = dict(zip(df2, importance_mic))
KOT_sorted_keys_MIC = sorted(KOT_MIC, key=KOT_MIC.get, reverse=True)

for r in KOT_sorted_keys_MIC:
print (r, KOT_MIC[r])

Sex 0.175
Pclass 0.066
Embarked 0.042
SibSp 0.03
Parch 0.01
Age2 0.0

In [30]:
KOT_MIC

Out[30]:
{'Pclass': 0.066,
'Sex': 0.175,
'SibSp': 0.03,
'Parch': 0.01,
'Embarked': 0.042,
'Age2': 0.0}
In [31]:
import matplotlib.pyplot as plt

plt.bar(*zip(*KOT_MIC.items()))
plt.xticks(rotation=90)
plt.show

Out[31]:
<function matplotlib.pyplot.show(*args, **kw)>
In [32]:
import seaborn as sns

sns.barplot(list(KOT_MIC.keys()), list(KOT_MIC.values()))
plt.title('Mutual Information Feature Selection')
plt.ylabel('variance')
plt.xlabel('independent variables')
plt.xticks(rotation=90)

Out[32]:
(array([0, 1, 2, 3, 4, 5]), <a list of 6 Text xticklabel objects>)

## Logistic regression for Mutual Information¶

Logistic regression is a good model for testing feature selection methods because it can work better if irrelevant features are removed from the model.

In [33]:
def red(text):
print('33[31m', text, '33[0m', sep='')

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

MIC_model = LogisticRegression(solver='lbfgs')
MIC_model.fit(X_train_MIC, y_train)
# evaluate the model
yMIC = MIC_model.predict(X_test_MIC)

In [34]:
Classification_Assessment(MIC_model ,X_train_MIC, y_train, X_test_MIC, y_test, yMIC)

Recall Training data:      0.6823
Precision Training data:   0.7939
----------------------------------------------------------------------
Recall Test data:          0.7188
Precision Test data:       0.7582
----------------------------------------------------------------------
Confusion Matrix Test data
[[117  22]
[ 27  69]]
----------------------------------------------------------------------
precision    recall  f1-score   support

0       0.81      0.84      0.83       139
1       0.76      0.72      0.74        96

accuracy                           0.79       235
macro avg       0.79      0.78      0.78       235
weighted avg       0.79      0.79      0.79       235


auc 0.780238309352518


# Feature Selection by Chi-Squared¶

In [35]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

def select_features_CH2(X_train, y_train, X_test):
CH2 = SelectKBest(score_func=chi2, k=4)
CH2.fit(X_train, y_train)
X_train_CH2 = CH2.transform(X_train)
X_test_CH2 = CH2.transform(X_test)
return X_train_CH2, X_test_CH2, CH2

In [36]:
import time
start_time = time.time() ## pomiar czasu: start pomiaru czasu
print(time.ctime())

X_train_CH2, X_test_CH2, CH2 = select_features_CH2(X_train, y_train, X_test)

print('minutes: ',
(time.time() - start_time)/60) ## koniec pomiaru czasu

Thu Mar 26 11:54:58 2020
minutes:  4.2319297790527344e-05


### Printing the results of the Chi-Squared selection (the highest value is the best value)¶

In [37]:
df2.columns

Out[37]:
Index(['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked', 'Age2', 'Survived'], dtype='object')
In [38]:
for i in range(len(CH2.scores_)):
print('Feature %d: %f' % (i, CH2.scores_[i]))

Feature 0: 35.246035
Feature 1: 47.804273
Feature 2: 0.612330
Feature 3: 5.163853
Feature 4: 5.452485
Feature 5: 0.016795

In [39]:
importance_CH2 = np.round(CH2.scores_, decimals=3)
importance_CH2

KOT_CH2 = dict(zip(df2, importance_CH2))
KOT_CH2_sorted_keys = sorted(KOT_CH2, key=KOT_CH2.get, reverse=True)

for r in KOT_CH2_sorted_keys:
print (r, KOT_CH2[r])

Sex 47.804
Pclass 35.246
Embarked 5.452
Parch 5.164
SibSp 0.612
Age2 0.017

In [40]:
import seaborn as sns

sns.barplot(list(KOT_CH2.keys()), list(KOT_CH2.values()))

plt.title('Chi-Squared Feature Selection')
plt.ylabel('variance')
plt.xlabel('independent variables')

plt.xticks(rotation=90)

Out[40]:
(array([0, 1, 2, 3, 4, 5]), <a list of 6 Text xticklabel objects>)

## Logistic regression for Chi-Squared¶

In [41]:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

CH2_model = LogisticRegression(solver='lbfgs')
CH2_model.fit(X_train_CH2, y_train)
# evaluate the model
yCH2 = CH2_model.predict(X_test_CH2)

In [42]:
Classification_Assessment(CH2_model ,X_train_CH2, y_train, X_test_CH2, y_test, yCH2)

Recall Training data:      0.6562
Precision Training data:   0.773
----------------------------------------------------------------------
Recall Test data:          0.7188
Precision Test data:       0.7263
----------------------------------------------------------------------
Confusion Matrix Test data
[[113  26]
[ 27  69]]
----------------------------------------------------------------------
precision    recall  f1-score   support

0       0.81      0.81      0.81       139
1       0.73      0.72      0.72        96

accuracy                           0.77       235
macro avg       0.77      0.77      0.77       235
weighted avg       0.77      0.77      0.77       235


auc 0.7658498201438849