
RFECV differs from Recursive Feature Elimination (RFE) in how it selects features: it determines the OPTIMAL NUMBER OF VARIABLES by cross-validation, instead of keeping a user-designated number of best variables.
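A minimal sketch of that difference on synthetic data (the make_regression problem and all parameter values here are illustrative assumptions, not part of this notebook):

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVR

X_demo, y_demo = make_regression(n_samples=100, n_features=10, n_informative=4, random_state=0)

# RFE keeps exactly as many features as you ask for
rfe = RFE(SVR(kernel="linear"), n_features_to_select=4).fit(X_demo, y_demo)

# RFECV lets cross-validation decide how many features to keep
rfecv = RFECV(SVR(kernel="linear"), cv=5).fit(X_demo, y_demo)

print('RFE kept:  ', rfe.n_features_)    # always 4 - we chose it up front
print('RFECV kept:', rfecv.n_features_)  # chosen by cross-validation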
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)
In [2]:
## colorful prints (ANSI escape codes)
def black(text):
    print('\033[30m', text, '\033[0m', sep='')
def red(text):
    print('\033[31m', text, '\033[0m', sep='')
def green(text):
    print('\033[32m', text, '\033[0m', sep='')
def yellow(text):
    print('\033[33m', text, '\033[0m', sep='')
def blue(text):
    print('\033[34m', text, '\033[0m', sep='')
def magenta(text):
    print('\033[35m', text, '\033[0m', sep='')
def cyan(text):
    print('\033[36m', text, '\033[0m', sep='')
def gray(text):
    print('\033[90m', text, '\033[0m', sep='')
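A quick usage check (the colors render only in terminals and notebooks that honour ANSI escape codes):

green('setup complete')
red('something went wrong')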
In [3]:
df = pd.read_csv('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)
Out[3]:
Deleting unneeded columns¶
In [4]:
# copy the columns whose names contain spaces under underscore names
# (the original space-named columns are left in place)
df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']
# drop the unneeded columns
del df['Unnamed: 32']
del df['diagnosis']
del df['id']
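As an alternative to the copy-and-delete approach above, the same cleanup can be done more idiomatically in two lines; note this sketch renames the space-named columns in place rather than keeping both versions:

df = df.rename(columns=lambda c: c.replace(' ', '_'))
df = df.drop(columns=['Unnamed: 32', 'diagnosis', 'id'])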
In [5]:
df.isnull().sum()
Out[5]:
In [6]:
import seaborn as sns
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[6]:
Deleting duplicates¶
There were no duplicates – the shape is unchanged after drop_duplicates.
In [7]:
green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)
In [8]:
blue(df.dtypes)
In [9]:
df.columns
Out[9]:
We choose the continuous variable compactness_mean as the regression target¶
In [10]:
print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())
sns.distplot(np.array(df['compactness_mean']))
Out[10]:
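Note: distplot is deprecated in newer seaborn releases (0.11+); the rough equivalent there would be:

sns.histplot(df['compactness_mean'], kde=True)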
Recursive Feature Elimination and cross-validated selection (RFECV)¶
In [11]:
X = df.drop('compactness_mean', axis=1)
y = df['compactness_mean']
I set the minimum number of variables that will remain in the model¶
In [12]:
min_v = 2
In [26]:
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
estimator = SVR(kernel="linear")
RCV = RFECV(estimator, step=1, min_features_to_select=min_v, cv=5)
RCV = RCV.fit(X, y)
print('The mask of selected features: ', RCV.support_)
print()
print('The feature ranking:', RCV.ranking_)
print()
print('The external estimator:', RCV.estimator_)
print('Optimal number of features:', RCV.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross-validation score (R² of the estimator)")
plt.plot(range(min_v, len(RCV.grid_scores_) + min_v), RCV.grid_scores_)
plt.show()
The RFECV algorithm evaluated every candidate feature-set size with cross-validation, and the plot shows that 15 variables is the optimal number.
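A compatibility note: grid_scores_ was deprecated in scikit-learn 1.0 and removed in 1.2 in favour of cv_results_; on a recent version the plotting lines above would read roughly:

scores = RCV.cv_results_['mean_test_score']
plt.plot(range(min_v, len(scores) + min_v), scores)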
The zip method for displaying the feature ranking¶
In [14]:
PPS = RCV.ranking_
KOT_MIC = dict(zip(X.columns, PPS))   # ranking_ is aligned with the columns of X, not df
KOT_sorted_keys_MIC = sorted(KOT_MIC, key=KOT_MIC.get, reverse=True)
for r in KOT_sorted_keys_MIC:
    print(r, KOT_MIC[r])
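The same ranking can be produced more compactly with a pandas Series built from the objects already defined above:

ranking = pd.Series(RCV.ranking_, index=X.columns).sort_values()
print(ranking)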
In [15]:
new_cols = X.columns[RCV.support_]
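Equivalently, the fitted selector can produce the reduced feature matrix directly; note that transform returns a NumPy array rather than a DataFrame:

X_reduced = RCV.transform(X)
print(X_reduced.shape)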
In [16]:
df2 = df[new_cols]
blue(df2.shape)
df2.head(3)
Out[16]:
We’re adding a result variable¶
In [17]:
df2['compactness_mean'] = df['compactness_mean']
df2.head(3)
Out[17]:
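Note: df2 = df[new_cols] is a slice of df, so the assignment in the cell above can trigger pandas' SettingWithCopyWarning (silenced here by the warnings filter). A safer variant takes an explicit copy first:

df2 = df[new_cols].copy()
df2['compactness_mean'] = df['compactness_mean']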
RFECV performs a backward elimination with cross-validation; here it reduced the predictor set to the 15 selected variables, which df2 now contains alongside the result variable.
OLS linear regression model for variables before reduction¶
In [18]:
blue(df.shape)
In [19]:
X1 = df.drop('compactness_mean', axis=1)
y1 = df['compactness_mean']
In [20]:
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()
print('R2:', model_fit.rsquared)
#blue(model_fit.summary())
OLS linear regression model for variables after reduction¶
In [21]:
blue(df2.shape)
In [22]:
X2 = df2.drop('compactness_mean', axis=1)
y2 = df2['compactness_mean']
In [23]:
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()
print('R2:', model_fit.rsquared)
#blue(model_fit.summary())
red("The reduction of dimensions caused the deterioration of the model's properties")
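Because the two models use different numbers of predictors, plain R² mechanically favours the larger model; statsmodels also exposes an adjusted R² that penalises extra regressors, which makes the comparison fairer. It can be printed for each fitted model in the same way:

print('Adjusted R2:', model_fit.rsquared_adj)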