Feature Selection Techniques – Backward Elimination

March 30, 2020 admin Feature Selection Techniques 0


In backward elimination, we start with all the features and, at each iteration, remove the least significant one, provided that removing it improves the model's performance. We repeat this until removing a feature no longer brings any improvement.
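The loop described above can be sketched with NumPy alone. This is only an illustration under stated assumptions: "performance" is taken to be the mean squared error on a held-out part of synthetic data, and the helper names `holdout_mse` and `backward_elimination_by_score` are made up for this sketch. The notebook below uses p-values instead.

```python
import numpy as np

def holdout_mse(X_tr, y_tr, X_va, y_va):
    """Fit least squares with an intercept on the training part, score on the rest."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_va = np.column_stack([np.ones(len(X_va)), X_va])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((y_va - A_va @ coef) ** 2))

def backward_elimination_by_score(X_tr, y_tr, X_va, y_va):
    """Drop one feature per round for as long as the hold-out MSE gets smaller."""
    kept = list(range(X_tr.shape[1]))
    best = holdout_mse(X_tr, y_tr, X_va, y_va)
    while len(kept) > 1:
        # Score every model that leaves exactly one of the kept features out.
        trials = []
        for j in kept:
            subset = [k for k in kept if k != j]
            mse = holdout_mse(X_tr[:, subset], y_tr, X_va[:, subset], y_va)
            trials.append((mse, j))
        mse, drop = min(trials)
        if mse >= best:
            break  # no single removal improves the score, so stop
        best, kept = mse, [k for k in kept if k != drop]
    return kept, best

# Synthetic data (an assumption of this sketch): only columns 0 and 1 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)
X_tr, X_va, y_tr, y_va = X[:240], X[240:], y[:240], y[240:]

kept, best = backward_elimination_by_score(X_tr, y_tr, X_va, y_va)
print("kept column indices:", kept)
print("hold-out MSE:", round(best, 4))
```

Because the same held-out rows are used to choose the features and to report the score, the final MSE is somewhat optimistic. A separate test set would give a fairer number.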

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)

In [2]:

## colorful prints (ANSI escape codes; '\033[0m' resets the color)
def black(text):
    print('\033[30m', text, '\033[0m', sep='')
def red(text):
    print('\033[31m', text, '\033[0m', sep='')
def green(text):
    print('\033[32m', text, '\033[0m', sep='')
def yellow(text):
    print('\033[33m', text, '\033[0m', sep='')
def blue(text):
    print('\033[34m', text, '\033[0m', sep='')
def magenta(text):
    print('\033[35m', text, '\033[0m', sep='')
def cyan(text):
    print('\033[36m', text, '\033[0m', sep='')
def gray(text):
    print('\033[90m', text, '\033[0m', sep='')

data source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

In [3]:

df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)

(569, 33)

Out[3]:

	id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	…	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst	Unnamed: 32
0	842302	M	17.99	10.38	122.8	1001.0	0.11840	0.27760	0.3001	0.14710	…	17.33	184.6	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890	NaN
1	842517	M	20.57	17.77	132.9	1326.0	0.08474	0.07864	0.0869	0.07017	…	23.41	158.8	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902	NaN
2	84300903	M	19.69	21.25	130.0	1203.0	0.10960	0.15990	0.1974	0.12790	…	25.53	152.5	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758	NaN

3 rows × 33 columns

Deleting unneeded columns¶

In [4]:

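# Add underscore-named copies of the three 'concave points' columns.
# The originals are kept, so each of these features now appears twice.
df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

# Drop the empty trailing column, the class label and the record id.
del df['Unnamed: 32']
del df['diagnosis']
del df['id']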
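The cell above copies the three 'concave points' columns under new names and keeps the originals, which is why the frame still has 33 columns afterwards. A possible alternative, shown here on a toy frame (the frame `toy` is an assumption of this sketch, with two values per column taken from the first rows above), is to rename the columns so that nothing is duplicated.

```python
import pandas as pd

# Toy frame standing in for the real data.
toy = pd.DataFrame({
    "concave points_mean": [0.14710, 0.07017],
    "radius_mean": [17.99, 20.57],
})

# Copying adds a second, identical column under the new name.
copied = toy.copy()
copied["concave_points_mean"] = copied["concave points_mean"]

# Renaming replaces the space in every column name and adds nothing.
renamed = toy.rename(columns=lambda name: name.replace(" ", "_"))

print(copied.shape)
print(renamed.shape)
print(list(renamed.columns))
```

Switching the notebook to renaming would change the shapes and the feature list reported further down, so this is an alternative and not what the post actually ran.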

In [5]:

df.isnull().sum()

Out[5]:

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

In [6]:

import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fba4f05f3d0>

Deleting duplicates¶

There were no duplicate rows.

In [7]:

green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)

(569, 33)
(569, 33)
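Note that `drop_duplicates` compares rows only, so identical columns are not detected by the check above. A minimal sketch of a column-wise check on a made-up frame (`toy` and its column names are assumptions of this example):

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 5.0, 6.0],
    "a_copy": [1.0, 2.0, 3.0],  # same values as "a"
})

# drop_duplicates works row by row, so the repeated column goes unnoticed.
print(toy.drop_duplicates().shape)

# Transposing turns columns into rows, which duplicated() can then compare.
dup_mask = toy.T.duplicated(keep="first")
dup_cols = list(toy.columns[dup_mask.to_numpy()])
print(dup_cols)
```

Transposing is fine for a frame of this size. On very wide or mixed-type data it can be slow.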

In [8]:

blue(df.dtypes)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

In [9]:

df.columns

Out[9]:

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose a continuous target variable: compactness_mean¶

In [10]:

print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))

max: 0.3454
min: 0.01938

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fba4bc79750>

Backward Elimination¶

In [18]:

x = df.drop('compactness_mean', axis=1)   # predictors
y = df['compactness_mean']                # continuous target

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=123)
# If this raises an error, remove stratify=y.

In [21]:

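import statsmodels.api as sm

cols = list(x.columns)
while len(cols) > 0:
    # Refit OLS (with an intercept) on the features that are still kept.
    x_1 = sm.add_constant(x[cols])
    model = sm.OLS(y, x_1).fit()
    # p-values of the features only; position 0 belongs to the intercept.
    p = pd.Series(model.pvalues.values[1:], index=cols)
    pmax = p.max()
    feature_with_p_max = p.idxmax()
    if pmax > 0.05:
        # The least significant feature is above the 5% level: drop it and refit.
        cols.remove(feature_with_p_max)
    else:
        break
new_cols = cols
print(new_cols)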

['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'concavity_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'compactness_se', 'concave points_se', 'fractal_dimension_se', 'radius_worst', 'perimeter_worst', 'compactness_worst', 'concavity_worst', 'symmetry_worst', 'fractal_dimension_worst', 'concave_points_se']
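To see what the p-value rule rests on, the same loop can be sketched with NumPy only. The assumptions are loud ones: instead of exact p-values the sketch compares |t| with 1.96, the normal critical value, which approximates the 5% level only for large samples; the data are synthetic; and the helpers `t_statistics` and `backward_eliminate` are hypothetical names, not part of statsmodels.

```python
import numpy as np

def t_statistics(X, y):
    """OLS t-statistics for the columns of X (intercept fitted, then skipped)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma2 = float(resid @ resid) / (n - p - 1)   # residual variance
    cov = sigma2 * np.linalg.pinv(A.T @ A)        # coefficient covariance
    se = np.sqrt(np.diag(cov))
    return (coef / se)[1:]

def backward_eliminate(X, y, names, t_crit=1.96):
    """Drop the feature with the smallest |t| until every |t| >= t_crit."""
    kept = list(range(X.shape[1]))
    while kept:
        t = np.abs(t_statistics(X[:, kept], y))
        weakest = int(np.argmin(t))   # smallest |t| means largest p-value
        if t[weakest] >= t_crit:
            break
        del kept[weakest]
    return [names[k] for k in kept]

# Synthetic data: only x0 and x1 influence y, the rest is noise.
rng = np.random.default_rng(1)
names = ["x0", "x1", "x2", "x3", "x4"]
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=300)

selected = backward_eliminate(X, y, names)
print(selected)
```

On data where the truth is known, the informative features should survive. A pure-noise feature can still pass the 5% threshold by chance, so the selected set is not guaranteed to be exactly the true one.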

In [23]:

df2 = df[new_cols]
blue(df2.shape)
df2.head(3)

(569, 18)

Out[23]:

	radius_mean	perimeter_mean	area_mean	smoothness_mean	concavity_mean	symmetry_mean	fractal_dimension_mean	texture_se	compactness_se	concave points_se	fractal_dimension_se	radius_worst	perimeter_worst	compactness_worst	concavity_worst	symmetry_worst	fractal_dimension_worst	concave_points_se
0	17.99	122.8	1001.0	0.11840	0.3001	0.2419	0.07871	0.9053	0.04904	0.01587	0.006193	25.38	184.6	0.6656	0.7119	0.4601	0.11890	0.01587
1	20.57	132.9	1326.0	0.08474	0.0869	0.1812	0.05667	0.7339	0.01308	0.01340	0.003532	24.99	158.8	0.1866	0.2416	0.2750	0.08902	0.01340
2	19.69	130.0	1203.0	0.10960	0.1974	0.2069	0.05999	0.7869	0.04006	0.02058	0.004571	23.57	152.5	0.4245	0.4504	0.3613	0.08758	0.02058

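Backward elimination kept 18 of the 32 candidate features. The other 14 were removed one at a time, each because it had the highest p-value (above 0.05) in the model fitted at that step. Note that 'concave points_se' and 'concave_points_se' are the same feature under two names, so the reduced set contains 17 distinct variables.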

OLS linear regression model for variables before reduction¶

In [12]:

blue(df.shape)

(569, 33)

In [13]:

X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']

In [14]:

from statsmodels.formula.api import ols
import statsmodels.api as sm

# Full model: all predictors plus an intercept.
model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: %f' % model_fit.rsquared)
# blue(model_fit.summary())

R2: 0.980200
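R2 never decreases when more regressors are added, so the value for the full model says little on its own about whether the removed features were needed. Adjusted R2, which penalizes the number of regressors, is the usual companion figure for a before/after comparison. A small sketch of the formula follows; the function name and the numbers in the call are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 for a model with an intercept and n_features regressors."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than fitted parameters")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Hypothetical case: R^2 = 0.5 from 11 samples and 5 regressors.
print(adjusted_r2(0.5, 11, 5))
```

For a fitted statsmodels model the same quantity is available as `rsquared_adj` on the results object, so the function is needed only when just R2 and the counts are known.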
