Feature Selection Techniques - THE DATA SCIENCE LIBRARY https://sigmaquality.pl/category/models/feature-selection-techniques/ Wojciech Moszczyński Thu, 09 Apr 2020 09:54:38 +0000 pl-PL hourly 1 https://wordpress.org/?v=6.8.3 https://sigmaquality.pl/wp-content/uploads/2019/02/cropped-ryba-32x32.png Feature Selection Techniques - THE DATA SCIENCE LIBRARY https://sigmaquality.pl/category/models/feature-selection-techniques/ 32 32 Homemade loop to search for the best functions for the regression model (Feature Selection Techniques) https://sigmaquality.pl/models/feature-selection-techniques/homemade-loop-to-search-for-the-best-functions-for-the-regression-model-feature-selection-techniques-090420201150/ Thu, 09 Apr 2020 09:52:05 +0000 http://sigmaquality.pl/homemade-loop-to-search-for-the-best-functions-for-the-regression-model-feature-selection-techniques-090420201150/ 090420201150 In [1]: import pandas as pd df = pd.read_csv('/home/wojciech/Pulpit/1/tit_train.csv', na_values="-1") df.head(2) Out[1]: Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare [...]

Artykuł Homemade loop to search for the best functions for the regression model (Feature Selection Techniques) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
090420201150

In [1]:
import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/tit_train.csv', na_values="-1")
df.head(2)
Out[1]:
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C

I started with a loop connecting two functions in pairs

In [2]:
## ile jest zmiennych
a,b = df.shape     #<- ile mamy kolumn
b
Out[2]:
13
In [3]:
for i in range(1,b):
    i = df.columns[i]
    for f in range (1,b):
        f = df.columns[f]
        print(i,f)
       
PassengerId PassengerId
PassengerId Survived
PassengerId Pclass
PassengerId Name
PassengerId Sex
PassengerId Age
PassengerId SibSp
PassengerId Parch
PassengerId Ticket
PassengerId Fare
PassengerId Cabin
PassengerId Embarked
Survived PassengerId
Survived Survived
Survived Pclass
Survived Name
Survived Sex
Survived Age
Survived SibSp
Survived Parch
Survived Ticket
Survived Fare
Survived Cabin
Survived Embarked
Pclass PassengerId
Pclass Survived
Pclass Pclass
Pclass Name
Pclass Sex
Pclass Age
Pclass SibSp
Pclass Parch
Pclass Ticket
Pclass Fare
Pclass Cabin
Pclass Embarked
Name PassengerId
Name Survived
Name Pclass
Name Name
Name Sex
Name Age
Name SibSp
Name Parch
Name Ticket
Name Fare
Name Cabin
Name Embarked
Sex PassengerId
Sex Survived
Sex Pclass
Sex Name
Sex Sex
Sex Age
Sex SibSp
Sex Parch
Sex Ticket
Sex Fare
Sex Cabin
Sex Embarked
Age PassengerId
Age Survived
Age Pclass
Age Name
Age Sex
Age Age
Age SibSp
Age Parch
Age Ticket
Age Fare
Age Cabin
Age Embarked
SibSp PassengerId
SibSp Survived
SibSp Pclass
SibSp Name
SibSp Sex
SibSp Age
SibSp SibSp
SibSp Parch
SibSp Ticket
SibSp Fare
SibSp Cabin
SibSp Embarked
Parch PassengerId
Parch Survived
Parch Pclass
Parch Name
Parch Sex
Parch Age
Parch SibSp
Parch Parch
Parch Ticket
Parch Fare
Parch Cabin
Parch Embarked
Ticket PassengerId
Ticket Survived
Ticket Pclass
Ticket Name
Ticket Sex
Ticket Age
Ticket SibSp
Ticket Parch
Ticket Ticket
Ticket Fare
Ticket Cabin
Ticket Embarked
Fare PassengerId
Fare Survived
Fare Pclass
Fare Name
Fare Sex
Fare Age
Fare SibSp
Fare Parch
Fare Ticket
Fare Fare
Fare Cabin
Fare Embarked
Cabin PassengerId
Cabin Survived
Cabin Pclass
Cabin Name
Cabin Sex
Cabin Age
Cabin SibSp
Cabin Parch
Cabin Ticket
Cabin Fare
Cabin Cabin
Cabin Embarked
Embarked PassengerId
Embarked Survived
Embarked Pclass
Embarked Name
Embarked Sex
Embarked Age
Embarked SibSp
Embarked Parch
Embarked Ticket
Embarked Fare
Embarked Cabin
Embarked Embarked

Using loops in place of gaps I insert values out of range

In [4]:
print('NUMBER OF EMPTY RECORDS vs. FULL RECORDS')
print('----------------------------------------')
for i in range(1,b):
    i = df.columns[i]
    r = df[i].isnull().sum()
    h = df[i].count()
   
    if r > 0:
        print(i,"--------",r,"--------",h) 
NUMBER OF EMPTY RECORDS vs. FULL RECORDS
----------------------------------------
Age -------- 177 -------- 714
Cabin -------- 687 -------- 204
Embarked -------- 2 -------- 889
In [5]:
df.fillna(-777, inplace=True)
In [6]:
df = df.dropna(how='any')
df.isnull().sum()
Out[6]:
Unnamed: 0     0
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
In [7]:
df.shape
Out[7]:
(891, 13)

Encodes discrete (categorical) variables

In [8]:
import numpy as np

a,b = df.shape     #<- ile mamy kolumn
b


print('DISCRETE FUNCTIONS CODED')
print('------------------------')
for i in range(1,b):
    i = df.columns[i]
    f = df[i].dtypes
    if f == np.object:
        print(i,"---",f)   
    
        if f == np.object:
        
            df[i] = pd.Categorical(df[i]).codes
        
            continue
    
DISCRETE FUNCTIONS CODED
------------------------
Name --- object
Sex --- object
Ticket --- object
Cabin --- object
Embarked --- object

I run the LinearRegression () model

In [9]:
y = df['Survived']
X = df.drop('Survived', axis=1)
In [10]:
from sklearn.linear_model import LinearRegression

regr = LinearRegression() 
 

I create loops for two variables based on LinearRegression ()

In [11]:
c,b = df.shape     #<- ile mamy kolumn
print('b: ',b)

a = list(range(1,b))
print('a :', a)
b:  13
a : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
In [12]:
from sklearn import metrics
b= b-2


for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        
        y = df['Survived']       
        X = df.drop('Survived', axis=1)
        
        #a = X.columns[i]
        
        #b = X.columns[f]
        
        col = X.columns[[i,f]]   #<-- nazwy kolumn
        X = X[col]               #<-- FAKTYCZNE warianty zbioru X
        regr.fit(X, y)
        y_pred = regr.predict(X)
        R = regr.score(X, y)
        R2 = np.sqrt(metrics.mean_squared_error(y, y_pred))
        RR2 = R2+R
        if RR2 > 0.72:
        # print(' R2: 
            print(' RR2: 
        
 RR2: 0.754 Index(['Pclass', 'Sex'], dtype='object')
 RR2: 0.754 Index(['Sex', 'Pclass'], dtype='object')
 RR2: 0.722 Index(['Sex', 'Fare'], dtype='object')
 RR2: 0.733 Index(['Sex', 'Cabin'], dtype='object')
 RR2: 0.722 Index(['Fare', 'Sex'], dtype='object')
 RR2: 0.733 Index(['Cabin', 'Sex'], dtype='object')

I create loops for three variables based on LinearRegression ()

In [13]:
from sklearn import metrics

for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        for g in range (1,b):
            g = a[g]
        
        
            y = df['Survived']       
            X = df.drop('Survived', axis=1)
        
            
            col = X.columns[[i,f,g]]   #<-- nazwy kolumn
            X = X[col]               #<-- FAKTYCZNE warianty zbioru X
            regr.fit(X, y)
            y_pred = regr.predict(X)
            R = regr.score(X, y)
            R2 = np.sqrt(metrics.mean_squared_error(y, y_pred))
            RR2 = R2+R
            if RR2 >= 0.757:
       
                print(' RR2: 
        
 RR2: 0.758 Index(['Pclass', 'Sex', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'SibSp', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Cabin', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Embarked', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Sex', 'SibSp', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Cabin', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Embarked', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Sex', 'Pclass'], dtype='object')

I create loops for four variables based on LinearRegression ()

In [14]:
from sklearn import metrics

for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        for g in range (1,b):
            g = a[g]
            for r in range (1,b):
                r = a[r]
        
            y = df['Survived']       
            X = df.drop('Survived', axis=1)
        
            
            col = X.columns[[i,f,g,r]]   #<-- nazwy kolumn
            X = X[col]               #<-- FAKTYCZNE warianty zbioru X
            regr.fit(X, y)
            y_pred = regr.predict(X)
            R = regr.score(X, y)
            R2 = np.sqrt(metrics.mean_squared_error(y, y_pred))
            RR2 = R2+R
            if RR2 >= 0.761:
       
                print(' RR2: 
        
 RR2: 0.762 Index(['Pclass', 'Sex', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Pclass', 'Cabin', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Pclass', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Cabin', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Sex', 'Pclass', 'Embarked'], dtype='object')

I am starting the RandomForestRegressor model

In [15]:
y = df['Survived']       
X = df.drop('Survived', axis=1)
print(X.shape)
print(y.shape)
(891, 12)
(891,)
In [16]:
from sklearn.ensemble import RandomForestRegressor

model_RFC1 = RandomForestRegressor().fit(X, y)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [17]:
from sklearn import metrics

for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        for g in range (1,b):
            g = a[g]
        
        
            y = df['Survived']       
            X = df.drop('Survived', axis=1)
        
            
            col = X.columns[[i,f,g]]   #<-- nazwy kolumn
            X = X[col]               #<-- FAKTYCZNE warianty zbioru X
            model_RFC1.fit(X, y)
            y_pred2 = model_RFC1.predict(X)
            R = model_RFC1.score(X, y)
            R2 = np.sqrt(metrics.mean_squared_error(y, y_pred2))
            RR2 = R2+R
            if RR2 >= 1.05:
       
                print(' RR2: 
 RR2: 1.051 Index(['Name', 'Ticket', 'Sex'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Name', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Name', 'Fare'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Age', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Ticket', 'Name'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Fare', 'Name'], dtype='object')
 RR2: 1.051 Index(['Ticket', 'Name', 'Sex'], dtype='object')
 RR2: 1.053 Index(['Ticket', 'Sex', 'Name'], dtype='object')
 RR2: 1.052 Index(['Fare', 'Sex', 'Name'], dtype='object')

My own tractor comes to similar conclusions as other tools in the series: Feature Selection Techniques.
Only that my tractor is probably faster in calculations …..

Artykuł Homemade loop to search for the best functions for the regression model (Feature Selection Techniques) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Feature Selection Techniques [categorical result] – Step Forward Selection https://sigmaquality.pl/models/feature-selection-techniques/feature-selection-techniques-categorical-result-step-forward-selection-010420201017/ Wed, 01 Apr 2020 08:18:59 +0000 http://sigmaquality.pl/feature-selection-techniques-categorical-result-step-forward-selection-010420201017/ 010420201017 Forward selection is an iterative method in which we start with no function in the model. In each iteration, we add a function that [...]

Artykuł Feature Selection Techniques [categorical result] – Step Forward Selection pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>

010420201017

Forward selection is an iterative method in which we start with no function in the model. In each iteration, we add a function that best improves our model until adding a new variable improves the model’s performance.

In [12]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)
In [13]:
##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')
In [14]:
df = pd.read_csv ('/home/wojciech/Pulpit/6/qsar_oral_toxicity.csv', sep=';')
green(df.shape)
df.head(3)
(8991, 1025)
Out[14]:
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.962 0.963 0.964 0.965 0.966 0.967 0.968 0.969 0.970 negative
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 negative
1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 negative
2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 negative

3 rows × 1025 columns

I’m looking for empty cells¶

In [15]:
import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fca62173d90>
In [16]:
null_value = df.isnull().sum(axis=0)
null_value[null_value != 0]
Out[16]:
Series([], dtype: int64)

Mark empty cells as -999

In [50]:
df.fillna(-999, inplace=True)
In [18]:
df.shape
Out[18]:
(8991, 1025)

Deletes duplicates

there were no duplicates

In [19]:
green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)
(8991, 1025)
(8514, 1025)
In [20]:
blue(df.dtypes)
0            int64
0.1          int64
0.2          int64
0.3          int64
0.4          int64
             ...  
0.967        int64
0.968        int64
0.969        int64
0.970        int64
negative    object
Length: 1025, dtype: object
In [21]:
df.columns
Out[21]:
Index(['0', '0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9',
       ...
       '0.962', '0.963', '0.964', '0.965', '0.966', '0.967', '0.968', '0.969',
       '0.970', 'negative'],
      dtype='object', length=1025)

Encodes the resulting value

In [25]:
df['negative'] = pd.Categorical(df['negative']).codes
df['negative'].value_counts()
Out[25]:
0    7795
1     719
Name: negative, dtype: int64
In [27]:
df.rename(columns={'negative':'ident'}, inplace=True)
df['ident'].head(2)
Out[27]:
0    0
1    0
Name: ident, dtype: int8

Step Forward Selection

In [28]:
X = df.drop('ident', axis=1) 
y = df['ident']  

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
# Jeżeli się rzuca wtedy wycinamy stratify=y.
I specify how many programs should indicate the best variables:

In [29]:
k_features = 15
In [30]:
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

LR = LogisticRegression()

sfs1 = sfs(LR,k_features = k_features, forward=True, floating=False, scoring='r2',verbose=2,cv=5)
sfs1 = sfs1.fit(X_train,y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1024 out of 1024 | elapsed:   39.9s finished

[2020-04-01 09:39:39] Features: 1/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1023 out of 1023 | elapsed:   43.9s finished

[2020-04-01 09:40:23] Features: 2/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1022 out of 1022 | elapsed:   46.6s finished

[2020-04-01 09:41:09] Features: 3/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1021 out of 1021 | elapsed:   50.6s finished

[2020-04-01 09:42:00] Features: 4/15 -- score: -0.0917097195802096[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1020 out of 1020 | elapsed:   51.1s finished

[2020-04-01 09:42:51] Features: 5/15 -- score: -0.06352720295000061[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1019 out of 1019 | elapsed:   54.3s finished

[2020-04-01 09:43:45] Features: 6/15 -- score: -0.050378334885553946[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1018 out of 1018 | elapsed:   58.9s finished

[2020-04-01 09:44:44] Features: 7/15 -- score: -0.035404146912808354[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1017 out of 1017 | elapsed:  1.0min finished

[2020-04-01 09:45:45] Features: 8/15 -- score: -0.014731021991353432[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1016 out of 1016 | elapsed:  1.0min finished

[2020-04-01 09:46:46] Features: 9/15 -- score: 0.007752557690146133[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1015 out of 1015 | elapsed:  1.1min finished

[2020-04-01 09:47:49] Features: 10/15 -- score: 0.034005698374276985[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1014 out of 1014 | elapsed:  1.1min finished

[2020-04-01 09:48:54] Features: 11/15 -- score: 0.0415299552312852[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1013 out of 1013 | elapsed:  1.3min finished

[2020-04-01 09:50:10] Features: 12/15 -- score: 0.050864666848338125[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1012 out of 1012 | elapsed:  1.3min finished

[2020-04-01 09:51:28] Features: 13/15 -- score: 0.058403788853600556[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1011 out of 1011 | elapsed:  1.3min finished

[2020-04-01 09:52:48] Features: 14/15 -- score: 0.062143619559723404[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1010 out of 1010 | elapsed:  1.3min finished

[2020-04-01 09:54:09] Features: 15/15 -- score: 0.06776823076716185
In [32]:
feat_cols =list(sfs1.k_feature_idx_)
print(feat_cols)
[0, 1, 2, 67, 93, 231, 426, 506, 512, 526, 558, 559, 696, 795, 939]
In [33]:
PPS = feat_cols

KOT_lasso = dict(zip(df, PPS))
KOT_sorted_keys_lasso = sorted(KOT_lasso, key=KOT_lasso.get, reverse=True)

for r in KOT_sorted_keys_lasso:
    print (r, (KOT_lasso[r]))
0.13 939
0.12 795
0.11 696
0.10 559
1 558
0.9 526
0.8 512
0.7 506
0.6 426
0.5 231
0.4 93
0.3 67
0.2 2
0.1 1
0 0
In [37]:
new_cols = df.columns[feat_cols]
new_cols
Out[37]:
Index(['0', '0.1', '0.2', '0.64', '0.87', '0.218', '0.406', '0.479', '0.485',
       '0.499', '0.530', '0.531', '0.658', '0.751', '0.891'],
      dtype='object')

Creates a dataset with reduced columns¶

In [39]:
df2 = df[new_cols]
df2['ident']=df['ident']
df2.head(3)
Out[39]:
0 0.1 0.2 0.64 0.87 0.218 0.406 0.479 0.485 0.499 0.530 0.531 0.658 0.751 0.891 ident
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [40]:
# Classification Assessment
def Classification_Assessment(model ,Xtrain, ytrain, Xtest, ytest, y_pred):
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
    from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

    print("Recall Training data:     ", np.round(recall_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Precision Training data:  ", np.round(precision_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Recall Test data:         ", np.round(recall_score(ytest, model.predict(Xtest)), decimals=4)) 
    print("Precision Test data:      ", np.round(precision_score(ytest, model.predict(Xtest)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Confusion Matrix Test data")
    print(confusion_matrix(ytest, model.predict(Xtest)))
    print("----------------------------------------------------------------------")
    print(classification_report(ytest, model.predict(Xtest)))
    
    y_pred_proba = model.predict_proba(Xtest)[::,1]
    fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred)
    auc = metrics.roc_auc_score(ytest, y_pred)
    plt.plot(fpr, tpr, label='Logistic Regression (auc = 
    plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
    plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.legend(loc=4)
    plt.plot([0, 1], [0, 1],'r--')
    plt.show()
    print('auc',auc)

Logistic regression model for variables before reduction

In [41]:
blue(df.shape)
(8514, 1025)
In [42]:
X1 = df.drop('ident', axis=1) 
y1 = df['ident']  


from sklearn.model_selection import train_test_split 

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.20, random_state=123,stratify=y1)
In [43]:
from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X1_train,y1_train)
y1_pred = logmodel.predict(X1_test)
In [44]:
Classification_Assessment(logmodel ,X1_train, y1_train, X1_test, y1_test, y1_pred)
Recall Training data:      0.6887
Precision Training data:   0.9188
----------------------------------------------------------------------
Recall Test data:          0.3958
Precision Test data:       0.6196
----------------------------------------------------------------------
Confusion Matrix Test data
[[1524   35]
 [  87   57]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1559
           1       0.62      0.40      0.48       144

    accuracy                           0.93      1703
   macro avg       0.78      0.69      0.72      1703
weighted avg       0.92      0.93      0.92      1703

auc 0.6866915223433825

Logistic regression model for variables after reduction

In [45]:
blue(df2.shape)
(8514, 16)
In [47]:
X2 = df2.drop('ident', axis=1) 
y2 = df2['ident']  


from sklearn.model_selection import train_test_split 

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.20, random_state=123,stratify=y2)
In [48]:
from sklearn.linear_model import LogisticRegression

logmodel2 = LogisticRegression()
logmodel2.fit(X2_train,y2_train)
y2_pred = logmodel2.predict(X2_test)
In [49]:
Classification_Assessment(logmodel2 ,X2_train, y2_train, X2_test, y2_test, y2_pred)
Recall Training data:      0.2104
Precision Training data:   0.7707
----------------------------------------------------------------------
Recall Test data:          0.1458
Precision Test data:       0.6562
----------------------------------------------------------------------
Confusion Matrix Test data
[[1548   11]
 [ 123   21]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      1559
           1       0.66      0.15      0.24       144

    accuracy                           0.92      1703
   macro avg       0.79      0.57      0.60      1703
weighted avg       0.90      0.92      0.90      1703

auc 0.569388764165063

Artykuł Feature Selection Techniques [categorical result] – Step Forward Selection pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Feature Selection Techniques – Recursive Feature Elimination and cross-validated selection (RFECV) https://sigmaquality.pl/models/feature-selection-techniques/feature-selection-techniques-recursive-feature-elimination-and-cross-validated-selection-rfecv-300320202100/ Mon, 30 Mar 2020 19:02:58 +0000 http://sigmaquality.pl/feature-selection-techniques-recursive-feature-elimination-and-cross-validated-selection-rfecv-300320202100/ 300320202100 RFECV differs from Recursive Feature Elimination (RFE) in the function selection process in that it indicates the OPTIMAL NUMBER OF VARIABLES and not the [...]

Artykuł Feature Selection Techniques – Recursive Feature Elimination and cross-validated selection (RFECV) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
300320202100

RFECV differs from Recursive Feature Elimination (RFE) in the function selection process in that it indicates the OPTIMAL NUMBER OF VARIABLES and not the designated number of best variables.
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)
In [2]:
##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')
In [3]:
df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)
(569, 33)
Out[3]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.0 1203.0 0.10960 0.15990 0.1974 0.12790 25.53 152.5 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN

3 rows × 33 columns

Deleting unneeded columns

In [4]:
df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']
In [5]:
df.isnull().sum()
Out[5]:
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64
In [6]:
import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb2d0717810>

Deletes duplicates

there were no duplicates

In [7]:
green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)
(569, 33)
(569, 33)
In [8]:
blue(df.dtypes)
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object
In [9]:
df.columns
Out[9]:
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose the continuous variable – compactness_mean

In [10]:
print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))
max: 0.3454
min: 0.01938
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb2c7c94ad0>

Recursive Feature Elimination and cross-validated selection (RFECV)

In [11]:
X = df.drop('compactness_mean', axis=1) 
y = df['compactness_mean']  

I set the minimum number of variables that will remain in the model

In [12]:
min_v = 2
In [26]:
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

estimator = SVR(kernel="linear")
RCV = RFECV(estimator, step=1,min_features_to_select=min_v, cv=5)
RCV = RCV.fit(X, y)
RCV.support_


print('The mask of selected features: ',RCV.support_)
print()
print('The feature ranking:',RCV.ranking_)
print()
print('The external estimator:',RCV.estimator_)



print("Optimal number of features : 

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(RCV.grid_scores_) + 1), RCV.grid_scores_)
plt.show()
OPTIMAL Number of selected functions:   15

The mask of selected features:  [ True  True  True False False  True False  True False  True False  True
 False False False False False False False  True  True  True False False
  True  True  True  True False  True False False]

The feature ranking: [ 1  1  1  2  9  1  3  1 13  1  7  1 11 17  8 10 16 14 18  1  1  1 12  6
  1  1  1  1  5  1 15  4]

The external estimator: SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='linear', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)
Optimal number of features : 15

The RFECV algorithm checked all combinations and showed on the graph that the number of 15 variables was optimal.

Metoda zip na wyświetlenie rankingu cech

In [14]:
PPS = RCV.ranking_

KOT_MIC = dict(zip(df, PPS))
KOT_sorted_keys_MIC = sorted(KOT_MIC, key=KOT_MIC.get, reverse=True)

for r in KOT_sorted_keys_MIC:
    print (r, KOT_MIC[r])
symmetry_se 18
area_se 17
concavity_se 16
concave_points_worst 15
concave points_se 14
symmetry_mean 13
perimeter_worst 12
perimeter_se 11
compactness_se 10
smoothness_mean 9
smoothness_se 8
radius_se 7
area_worst 6
symmetry_worst 5
concave_points_se 4
concavity_mean 3
area_mean 2
radius_mean 1
texture_mean 1
perimeter_mean 1
compactness_mean 1
concave points_mean 1
fractal_dimension_mean 1
texture_se 1
fractal_dimension_se 1
radius_worst 1
texture_worst 1
smoothness_worst 1
compactness_worst 1
concavity_worst 1
concave points_worst 1
fractal_dimension_worst 1
In [15]:
new_cols = X.columns[RCV.support_]
In [16]:
df2 = df[new_cols]
blue(df2.shape)
df2.head(3)
(569, 15)
Out[16]:
radius_mean texture_mean perimeter_mean concavity_mean symmetry_mean radius_se perimeter_se radius_worst texture_worst perimeter_worst compactness_worst concavity_worst concave points_worst symmetry_worst concave_points_worst
0 17.99 10.38 122.8 0.3001 0.2419 1.0950 8.589 25.38 17.33 184.6 0.6656 0.7119 0.2654 0.4601 0.2654
1 20.57 17.77 132.9 0.0869 0.1812 0.5435 3.398 24.99 23.41 158.8 0.1866 0.2416 0.1860 0.2750 0.1860
2 19.69 21.25 130.0 0.1974 0.2069 0.7456 4.585 23.57 25.53 152.5 0.4245 0.4504 0.2430 0.3613 0.2430

We’re adding a result variable

In [17]:
df2['compactness_mean'] = df['compactness_mean']
df2.head(3)
Out[17]:
radius_mean texture_mean perimeter_mean concavity_mean symmetry_mean radius_se perimeter_se radius_worst texture_worst perimeter_worst compactness_worst concavity_worst concave points_worst symmetry_worst concave_points_worst compactness_mean
0 17.99 10.38 122.8 0.3001 0.2419 1.0950 8.589 25.38 17.33 184.6 0.6656 0.7119 0.2654 0.4601 0.2654 0.27760
1 20.57 17.77 132.9 0.0869 0.1812 0.5435 3.398 24.99 23.41 158.8 0.1866 0.2416 0.1860 0.2750 0.1860 0.07864
2 19.69 21.25 130.0 0.1974 0.2069 0.7456 4.585 23.57 25.53 152.5 0.4245 0.4504 0.2430 0.3613 0.2430 0.15990

The Backward Elimination algorithm stated that reducing variables does not improve the model. Therefore, the number of variables was left unchanged.

OLS linear regression model for variables before reduction

In [18]:
blue(df.shape)
(569, 33)
In [19]:
X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']  
In [20]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
R2: 0.980200

OLS linear regression model for variables after reduction

In [21]:
blue(df2.shape)
(569, 16)
In [22]:
X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']  
In [23]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')
R2: 0.960155
The reduction of dimensions caused the deterioration of the models properties

Artykuł Feature Selection Techniques – Recursive Feature Elimination and cross-validated selection (RFECV) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Feature Selection Techniques – Embedded Method (Lasso) https://sigmaquality.pl/models/feature-selection-techniques/feature-selection-techniques-embedded-method-lasso-300320202027/ Mon, 30 Mar 2020 18:28:19 +0000 http://sigmaquality.pl/feature-selection-techniques-embedded-method-lasso-300320202027/ 300320202027 Embedded methods are iterative in a sense that takes care of each iteration of the model training process and carefully extract those features which [...]

Artykuł Feature Selection Techniques – Embedded Method (Lasso) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
300320202027

Embedded methods are iterative in a sense that takes care of each iteration of the model training process and carefully extract those features which contribute the most to the training for a particular iteration. Regularization methods are the most commonly used embedded methods which penalize a feature given a coefficient threshold. Here we will do feature selection using Lasso regularization. If the feature is irrelevant, lasso penalizes its coefficient and make it 0. Hence the features with coefficient = 0 are removed and the rest are taken.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)
In [2]:
##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')
In [3]:
df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)
(569, 33)
Out[3]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.0 1203.0 0.10960 0.15990 0.1974 0.12790 25.53 152.5 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN

3 rows × 33 columns

Deleting unneeded columns

In [4]:
df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']
In [5]:
df.isnull().sum()
Out[5]:
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64
In [6]:
import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f82c6742350>

Deletes duplicates

there were no duplicates

In [7]:
green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)
(569, 33)
(569, 33)
In [8]:
blue(df.dtypes)
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object
In [9]:
df.columns
Out[9]:
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose the continuous variable – compactness_mean

In [10]:
print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))
max: 0.3454
min: 0.01938
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f82c66b0090>

Lasso

In [11]:
X = df.drop('compactness_mean', axis=1) 
y = df['compactness_mean']  

I set the number of variables that will remain in the model

In [12]:
Num_v = 15
In [13]:
from sklearn import linear_model

#rlasso = RandomizedLasso(alpha=0.025)

# Standaryzacja zmiennych

clf = linear_model.Lasso(alpha=0.1, positive=True)
clf.fit(X, y)


blue(clf.coef_)
print()
green(clf.intercept_)
print()
red(clf.score(X,y))
[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 2.11821738e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 8.17079026e-04 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]

0.015845670027763575

0.3452546166160324

The positive parameter, which on Truei forces the coefficients to be positive. In addition, setting alpha regularization to a value close to 0 (i.e., 0.001) causes Lasso to mimic linear regression without regularization.

Metoda zip na wyświetlenie rankingu cech

In [14]:
PPS = clf.coef_

KOT_lasso = dict(zip(df, PPS))
KOT_sorted_keys_lasso = sorted(KOT_lasso, key=KOT_lasso.get, reverse=True)

for r in KOT_sorted_keys_lasso:
    print (r, (KOT_lasso[r]))
texture_worst 0.0008170790257354554
perimeter_se 2.118217382166424e-05
radius_mean 0.0
texture_mean 0.0
perimeter_mean 0.0
area_mean 0.0
smoothness_mean 0.0
compactness_mean 0.0
concavity_mean 0.0
concave points_mean 0.0
symmetry_mean 0.0
fractal_dimension_mean 0.0
radius_se 0.0
texture_se 0.0
area_se 0.0
smoothness_se 0.0
compactness_se 0.0
concavity_se 0.0
concave points_se 0.0
symmetry_se 0.0
fractal_dimension_se 0.0
radius_worst 0.0
perimeter_worst 0.0
area_worst 0.0
smoothness_worst 0.0
compactness_worst 0.0
concavity_worst 0.0
concave points_worst 0.0
symmetry_worst 0.0
fractal_dimension_worst 0.0
concave_points_worst 0.0
concave_points_se 0.0

We’re adding a result variable

In [15]:
df2 = df[['compactness_mean','texture_worst','perimeter_se']]
df2.head(3)
Out[15]:
compactness_mean texture_worst perimeter_se
0 0.27760 17.33 8.589
1 0.07864 23.41 3.398
2 0.15990 25.53 4.585

The Backward Elimination algorithm stated that reducing variables does not improve the model. Therefore, the number of variables was left unchanged.

OLS linear regression model for variables before reduction

In [16]:
blue(df.shape)
(569, 33)
In [17]:
X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']  
In [18]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
R2: 0.980200

OLS linear regression model for variables after reduction

In [19]:
blue(df2.shape)
(569, 3)
In [20]:
X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']  
In [22]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The R2 coefficient is approximately similar to the previously calculated clf.score (X, y).')
R2: 0.321180
The R2 coefficient is approximately similar to the previously calculated clf.score (X, y).

Artykuł Feature Selection Techniques – Embedded Method (Lasso) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Feature Selection Techniques – Recursive Feature Elimination (RFE) https://sigmaquality.pl/models/feature-selection-techniques/feature-selection-techniques-recursive-feature-elimination-rfe-300320201719/ Mon, 30 Mar 2020 15:23:27 +0000 http://sigmaquality.pl/feature-selection-techniques-recursive-feature-elimination-rfe-300320201719/ 300320201719 It is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and keeps aside the best [...]

Artykuł Feature Selection Techniques – Recursive Feature Elimination (RFE) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>

300320201719

It is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each iteration. It constructs the next model with the left features until all the features are exhausted. It then ranks the features based on the order of their elimination.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)
In [2]:
##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')
In [3]:
df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)
(569, 33)
Out[3]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.0 1203.0 0.10960 0.15990 0.1974 0.12790 25.53 152.5 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN

3 rows × 33 columns

Deleting unneeded columns

In [4]:
df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']
In [5]:
df.isnull().sum()
Out[5]:
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64
In [6]:
import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f31871d2350>

Deletes duplicates

there were no duplicates

In [7]:
green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)
(569, 33)
(569, 33)
In [8]:
blue(df.dtypes)
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object
In [9]:
df.columns
Out[9]:
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose the continuous variable – compactness_mean

In [10]:
print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))
max: 0.3454
min: 0.01938
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f317fdf2950>

Recursive Feature elimination

In [11]:
X = df.drop('compactness_mean', axis=1) 
y = df['compactness_mean']  

from sklearn.model_selection import train_test_split

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
# Jeżeli się rzuca wtedy wycinamy stratify=y.

I set the number of variables that will remain in the model

In [12]:
Num_v = 15
In [13]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

model=LinearRegression()
rfe=RFE(model,Num_v)

# Standaryzacja zmiennych

X_rfe = rfe.fit_transform(X,y)

model.fit(X_rfe,y)

print('Number of selected functions:  ',rfe.n_features_)
print()
print('The mask of selected features: ',rfe.support_)
print()
print('The feature ranking:',rfe.ranking_)
print()
print('The external estimator:',rfe.estimator_)
Number of selected functions:   15

The mask of selected features:  [False False False False  True  True  True False  True False False False
 False  True  True False  True  True  True False False False False  True
  True  True False False  True False  True  True]

The feature ranking: [ 3 14  4 15  1  1  1  2  1  7 11  8 16  1  1  6  1  1  1 12 17 13 18  1
  1  1 10  5  1  9  1  1]

The external estimator: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Metoda zip na wyświetlenie rankingu cech

In [14]:
PPS = rfe.ranking_

KOT_MIC = dict(zip(df, PPS))
KOT_sorted_keys_MIC = sorted(KOT_MIC, key=KOT_MIC.get, reverse=True)

for r in KOT_sorted_keys_MIC:
    print (r, KOT_MIC[r])
perimeter_worst 18
radius_worst 17
perimeter_se 16
area_mean 15
texture_mean 14
texture_worst 13
fractal_dimension_se 12
radius_se 11
concavity_worst 10
fractal_dimension_worst 9
texture_se 8
fractal_dimension_mean 7
compactness_se 6
concave points_worst 5
perimeter_mean 4
radius_mean 3
concave points_mean 2
smoothness_mean 1
compactness_mean 1
concavity_mean 1
symmetry_mean 1
area_se 1
smoothness_se 1
concavity_se 1
concave points_se 1
symmetry_se 1
area_worst 1
smoothness_worst 1
compactness_worst 1
symmetry_worst 1
concave_points_worst 1
concave_points_se 1
In [15]:
new_cols = X.columns[rfe.support_]
In [16]:
df2 = df[new_cols]
df2.head(3)
Out[16]:
smoothness_mean concavity_mean concave points_mean fractal_dimension_mean smoothness_se compactness_se concave points_se symmetry_se fractal_dimension_se smoothness_worst compactness_worst concavity_worst fractal_dimension_worst concave_points_se concave_points_mean
0 0.11840 0.3001 0.14710 0.07871 0.006399 0.04904 0.01587 0.03003 0.006193 0.1622 0.6656 0.7119 0.11890 0.01587 0.14710
1 0.08474 0.0869 0.07017 0.05667 0.005225 0.01308 0.01340 0.01389 0.003532 0.1238 0.1866 0.2416 0.08902 0.01340 0.07017
2 0.10960 0.1974 0.12790 0.05999 0.006150 0.04006 0.02058 0.02250 0.004571 0.1444 0.4245 0.4504 0.08758 0.02058 0.12790

We’re adding a result variable

In [17]:
df2['compactness_mean'] = df['compactness_mean']
df2.head(3)
Out[17]:
smoothness_mean concavity_mean concave points_mean fractal_dimension_mean smoothness_se compactness_se concave points_se symmetry_se fractal_dimension_se smoothness_worst compactness_worst concavity_worst fractal_dimension_worst concave_points_se concave_points_mean compactness_mean
0 0.11840 0.3001 0.14710 0.07871 0.006399 0.04904 0.01587 0.03003 0.006193 0.1622 0.6656 0.7119 0.11890 0.01587 0.14710 0.27760
1 0.08474 0.0869 0.07017 0.05667 0.005225 0.01308 0.01340 0.01389 0.003532 0.1238 0.1866 0.2416 0.08902 0.01340 0.07017 0.07864
2 0.10960 0.1974 0.12790 0.05999 0.006150 0.04006 0.02058 0.02250 0.004571 0.1444 0.4245 0.4504 0.08758 0.02058 0.12790 0.15990

The Backward Elimination algorithm stated that reducing variables does not improve the model. Therefore, the number of variables was left unchanged.

OLS linear regression model for variables before reduction

In [18]:
blue(df.shape)
(569, 33)
In [19]:
X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']  
In [20]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
R2: 0.980200

OLS linear regression model for variables after reduction

In [21]:
blue(df2.shape)
(569, 16)
In [22]:
X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']  
In [23]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')
R2: 0.960830
The reduction of dimensions caused the deterioration of the models properties

Artykuł Feature Selection Techniques – Recursive Feature Elimination (RFE) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Feature Selection Techniques – Backward Elimination https://sigmaquality.pl/models/feature-selection-techniques/feature-selection-techniques-backward-elimination-300320201313/ Mon, 30 Mar 2020 11:14:46 +0000 http://sigmaquality.pl/feature-selection-techniques-backward-elimination-300320201313/ 300320201313 In backward elimination, we start with all the features and removes the least significant feature at each iteration which improves the performance of the [...]

Artykuł Feature Selection Techniques – Backward Elimination pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>

300320201313

In backward elimination, we start with all the features and removes the least significant feature at each iteration which improves the performance of the model. We repeat this until no improvement is observed on removal of features.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)
In [2]:
##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')
In [3]:
df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)
(569, 33)
Out[3]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.0 1203.0 0.10960 0.15990 0.1974 0.12790 25.53 152.5 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN

3 rows × 33 columns

Deleting unneeded columns

In [4]:
df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']
In [5]:
df.isnull().sum()
Out[5]:
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64
In [6]:
import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fba4f05f3d0>

Deletes duplicates

there were no duplicates

In [7]:
green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)
(569, 33)
(569, 33)
In [8]:
blue(df.dtypes)
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object
In [9]:
df.columns
Out[9]:
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose the continuous variable – compactness_mean

In [10]:
print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))
max: 0.3454
min: 0.01938
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fba4bc79750>

Backward Elimination

In [18]:
x = df.drop('compactness_mean', axis=1) 
y = df['compactness_mean']  

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
# Jeżeli się rzuca wtedy wycinamy stratify=y.
In [21]:
cols=list(x.columns)
pmax=1
while (len(cols)>0):
    p=[]
    x_1 = x[cols]
    x_1 = sm.add_constant(x_1)
    model=sm.OLS(y,x_1).fit()
    p=pd.Series(model.pvalues.values[1:],index=cols)
    pmax=max(p)
    features_with_p_max=p.idxmax()
    if(pmax>0.05):
        cols.remove(features_with_p_max)
    else:
        break
new_cols=cols
print(new_cols)
['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'concavity_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'compactness_se', 'concave points_se', 'fractal_dimension_se', 'radius_worst', 'perimeter_worst', 'compactness_worst', 'concavity_worst', 'symmetry_worst', 'fractal_dimension_worst', 'concave_points_se']
In [23]:
df2 = df[new_cols]
blue(df.shape)
df2.head(3)
(569, 33)
Out[23]:
radius_mean perimeter_mean area_mean smoothness_mean concavity_mean symmetry_mean fractal_dimension_mean texture_se compactness_se concave points_se fractal_dimension_se radius_worst perimeter_worst compactness_worst concavity_worst symmetry_worst fractal_dimension_worst concave_points_se
0 17.99 122.8 1001.0 0.11840 0.3001 0.2419 0.07871 0.9053 0.04904 0.01587 0.006193 25.38 184.6 0.6656 0.7119 0.4601 0.11890 0.01587
1 20.57 132.9 1326.0 0.08474 0.0869 0.1812 0.05667 0.7339 0.01308 0.01340 0.003532 24.99 158.8 0.1866 0.2416 0.2750 0.08902 0.01340
2 19.69 130.0 1203.0 0.10960 0.1974 0.2069 0.05999 0.7869 0.04006 0.02058 0.004571 23.57 152.5 0.4245 0.4504 0.3613 0.08758 0.02058

The Backward Elimination algorithm stated that reducing variables does not improve the model. Therefore, the number of variables was left unchanged.

OLS linear regression model for variables before reduction

In [12]:
blue(df.shape)
(569, 33)
In [13]:
X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']  
In [14]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
R2: 0.980200

Artykuł Feature Selection Techniques – Backward Elimination pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Feature Selection Techniques [numerical result] – Step Forward Selection https://sigmaquality.pl/models/feature-selection-techniques/feature-selection-techniques-step-forward-selection-300320201248/ Mon, 30 Mar 2020 10:49:55 +0000 http://sigmaquality.pl/feature-selection-techniques-step-forward-selection-300320201248/ 300320201248 Forward selection is an iterative method in which we start with no function in the model. In each iteration, we add a function that [...]

Artykuł Feature Selection Techniques [numerical result] – Step Forward Selection pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
300320201248

Forward selection is an iterative method in which we start with no function in the model. In each iteration, we add a function that best improves our model until adding a new variable improves the model’s performance.
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)
In [2]:
##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')
In [3]:
df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)
(569, 33)
Out[3]:
  id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.0 1203.0 0.10960 0.15990 0.1974 0.12790 25.53 152.5 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN

3 rows × 33 columns

 

Deleting unneeded columns

In [4]:
df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']
In [5]:
df.isnull().sum()
Out[5]:
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64
In [6]:
import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f87ef7ed310>
 

Deletes duplicates

there were no duplicates

In [7]:
green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)
(569, 33)
(569, 33)
In [8]:
blue(df.dtypes)
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object
In [9]:
df.columns
Out[9]:
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')
 

We choose the continuous variable – compactness_mean

In [10]:
print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))
max: 0.3454
min: 0.01938
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f87ec3f3050>
 

Step Forward Selection

In [11]:
X = df.drop('compactness_mean', axis=1) 
y = df['compactness_mean']  

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
# Jeżeli się rzuca wtedy wycinamy stratify=y.
 
I specify how many programs should indicate the best variables:

In [12]:
k_features = 16
In [13]:
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

LR = LinearRegression()

sfs1 = sfs(LR,k_features = k_features, forward=True, floating=False, scoring='r2',verbose=2,cv=5)
sfs1 = sfs1.fit(X_train,y_train)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  32 out of  32 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 1/16 -- score: 0.7605648031784296[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  31 out of  31 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 2/16 -- score: 0.8592594816229919[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.2s finished

[2020-03-30 12:43:15] Features: 3/16 -- score: 0.9171881609890725[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  29 out of  29 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 4/16 -- score: 0.9392541495763911[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  28 out of  28 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 5/16 -- score: 0.9483152571280057[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 6/16 -- score: 0.95545376115284[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  26 out of  26 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 7/16 -- score: 0.9575365106130604[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 8/16 -- score: 0.9679393948794752[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 9/16 -- score: 0.9722927912279392[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 10/16 -- score: 0.9734667931156942[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 11/16 -- score: 0.9743145044074704[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 12/16 -- score: 0.9751371831838199[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 13/16 -- score: 0.9753888664795454[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19 out of  19 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 14/16 -- score: 0.9756613892479665[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 15/16 -- score: 0.9758538991452695[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 16/16 -- score: 0.9768921740889114
In [14]:
feat_cols =list(sfs1.k_feature_idx_)
print(feat_cols)
[0, 2, 3, 4, 5, 6, 7, 8, 14, 18, 19, 21, 24, 25, 27, 28]
In [15]:
X.columns
Out[15]:
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se',
       'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se',
       'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')
In [16]:
new_cols = df.columns[feat_cols]
new_cols
Out[16]:
Index(['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'smoothness_se', 'symmetry_se', 'fractal_dimension_se',
       'texture_worst', 'smoothness_worst', 'compactness_worst',
       'concave points_worst', 'symmetry_worst'],
      dtype='object')
 

I create a dataset with reduced columns.

In [17]:
df2 = df[new_cols]
df2.head(3)
Out[17]:
  radius_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean smoothness_se symmetry_se fractal_dimension_se texture_worst smoothness_worst compactness_worst concave points_worst symmetry_worst
0 17.99 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.006399 0.03003 0.006193 17.33 0.1622 0.6656 0.2654 0.4601
1 20.57 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.005225 0.01389 0.003532 23.41 0.1238 0.1866 0.1860 0.2750
2 19.69 130.0 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.006150 0.02250 0.004571 25.53 0.1444 0.4245 0.2430 0.3613
 

OLS linear regression model for variables before reduction

In [18]:
blue(df.shape)
(569, 33)
In [19]:
X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']  
In [20]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
R2: 0.980200
 

OLS linear regression model for variables after reduction

In [21]:
X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']  
In [22]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')
R2: 0.966559
The reduction of dimensions caused the deterioration of the models properties

Artykuł Feature Selection Techniques [numerical result] – Step Forward Selection pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Feature Selection Techniques – Variance Inflation Factor (VIF) https://sigmaquality.pl/models/feature-selection-techniques/feature-selection-techniques-variance-inflation-factor-vif-290320202006/ Sun, 29 Mar 2020 18:09:08 +0000 http://sigmaquality.pl/feature-selection-techniques-variance-inflation-factor-vif-290320202006/ 290320202006 Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. The Variance Inflation [...]

Artykuł Feature Selection Techniques – Variance Inflation Factor (VIF) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
290320202006

Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset.

The Variance Inflation Factor (VIF) technique from the Feature Selection Techniques collection is not intended to improve the quality of the model, but to remove the autocorrelation of independent variables.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)
In [2]:
##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')
In [3]:
df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)
(569, 33)
Out[3]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.0 1203.0 0.10960 0.15990 0.1974 0.12790 25.53 152.5 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN

3 rows × 33 columns

Deleting unneeded columns

In [4]:
df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']
In [5]:
df.isnull().sum()
Out[5]:
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64
In [6]:
import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbfd5915f10>

Deletes duplicates

there were no duplicates

In [7]:
green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)
(569, 33)
(569, 33)
In [8]:
blue(df.dtypes)
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object
In [9]:
df.columns
Out[9]:
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose the continuous variable – compactness_mean

In [10]:
print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))
max: 0.3454
min: 0.01938
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbfd42f2610>

Variance Inflation Factor (VIF)

In [11]:
import pandas as pd
import statsmodels.formula.api as smf

def get_vif(exogs, data):
    '''Return VIF (variance inflation factor) DataFrame

    Args:
    exogs (list): list of exogenous/independent variables
    data (DataFrame): the df storing all variables

    Returns:
    VIF and Tolerance DataFrame for each exogenous variable

    Notes:
    Assume we have a list of exogenous variable [X1, X2, X3, X4].
    To calculate the VIF and Tolerance for each variable, we regress
    each of them against other exogenous variables. For instance, the
    regression model for X3 is defined as:
                        X3 ~ X1 + X2 + X4
    And then we extract the R-squared from the model to calculate:
                    VIF = 1 / (1 - R-squared)
                    Tolerance = 1 - R-squared
    The cutoff to detect multicollinearity:
                    VIF > 10 or Tolerance < 0.1
    '''

    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # create formula for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        formula = f"{exog} ~ {' + '.join(not_exog)}"

        # extract r-squared from the fit
        r_squared = smf.ols(formula, data=data).fit().rsquared

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif
In [12]:
# import warnings
# warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from sklearn.linear_model import LinearRegression

def sklearn_vif(exogs, data):

    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # form input data for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        X, y = data[not_exog], data[exog]

        # extract r-squared from the fit
        r_squared = LinearRegression().fit(X, y).score(X, y)

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif
In [13]:
df.columns
exogs =['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave_points_worst',
       'symmetry_worst', 'fractal_dimension_worst']
In [14]:
print('Jeżeli VIF wynosi więcej niż 5 prawdopodobnie występuje multicollinearity' )

pks = sklearn_vif(exogs, df)
pks.sort_values('VIF').round(1)
print()
blue('LinearRegression in sklearn')
blue(pks[pks['VIF']<=10])


kot = get_vif(exogs, df)
kot.sort_values('VIF').round(1)
print()
green('LinearRegression in statasmodels')
green(kot[kot['VIF']<=10])
Jeżeli VIF wynosi więcej niż 5 prawdopodobnie występuje multicollinearity

LinearRegression in sklearn
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

LinearRegression in statasmodels
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

OLS linear regression model for variables before reduction

In [15]:
blue(df.shape)
(569, 33)
In [16]:
X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']  
In [17]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
R2: 0.980200

OLS linear regression model for variables after reduction

In [22]:
df2 =df[['smoothness_mean','symmetry_mean','texture_se','smoothness_se', 'fractal_dimension_se','symmetry_worst','compactness_mean']]
In [23]:
X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']  
In [24]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')
R2: 0.649990
The reduction of dimensions caused the deterioration of the models properties

Artykuł Feature Selection Techniques – Variance Inflation Factor (VIF) pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Feature Selection Techniques – Pearson correlation https://sigmaquality.pl/models/feature-selection-techniques/feature-selection-by-filter-methods-pearson-correlation-290320201454/ Sun, 29 Mar 2020 12:55:16 +0000 http://sigmaquality.pl/feature-selection-by-filter-methods-pearson-correlation-290320201454/ 290320201454 In [1]: import numpy as np import pandas as pd import seaborn as sns import matplotlib.pyplot as plt from sklearn.preprocessing import LabelEncoder, OneHotEncoder import warnings [...]

Artykuł Feature Selection Techniques – Pearson correlation pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
290320201454

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)
In [2]:
##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')
In [3]:
df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)
(569, 33)
Out[3]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.0 1203.0 0.10960 0.15990 0.1974 0.12790 25.53 152.5 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN

3 rows × 33 columns

Deleting unneeded columns

In [4]:
del df['Unnamed: 32']
del df['diagnosis']
del df['id']
In [5]:
df.isnull().sum()
Out[5]:
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64
In [6]:
import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7feb12ef9c10>

Deletes duplicates

there were no duplicates

In [7]:
green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)
(569, 30)
(569, 30)
In [8]:
blue(df.dtypes)
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
dtype: object
In [9]:
df.columns
Out[9]:
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

We choose the continuous variable – compactness_mean

In [10]:
print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))
max: 0.3454
min: 0.01938
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7feb0fac1410>

Pearson correlation

In [11]:
def matrix_plot(df,title):

    sns.set(style="ticks")

    corr = df.corr()
    corr = np.round(corr, decimals=2)


    mask = np.zeros_like(corr, dtype=np.bool)
    mask[np.triu_indices_from(mask)] = True

    f, ax = plt.subplots(figsize=(20, 20))
    #cmap = sns.diverging_palette(580, 10, as_cmap=True)
    cmap = sns.diverging_palette(180, 90, as_cmap=True) #Inna paleta barw

    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=0.3, center=0.03,annot=True,
                square=True, linewidths=.9, cbar_kws={"shrink": 0.8})
    plt.xticks(rotation=90)
    plt.title(title,fontsize=32,color='#0c343d',alpha=0.5)
    plt.show
In [12]:
matrix_plot(df,'Pearson correlation')

Correlation to the result variable

In [13]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
CORREL = df.corr().sort_values('compactness_mean')
CORREL['compactness_mean'].plot(kind='barh',color='#0c343d',alpha=0.5)
plt.title('Correlation to the result variable', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')
Out[13]:
Text(0, 0.5, 'Continuous independent variables')

I find variables that are highly correlated with the result variable

In [14]:
kot = abs(CORREL['compactness_mean'])
FAT = kot[kot>=0.7]
FAT
Out[14]:
compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

Compares variables in pairs

In [15]:
plt.barh(*zip(*FAT.items()),color='#0c343d',alpha=0.5) 
plt.xticks(rotation=90)
Out[15]:
(array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2]),
 <a list of 7 Text xticklabel objects>)

High autocorrelation chart

In [16]:
CORR = df.corr()

kot = CORR[CORR>=.9]
plt.figure(figsize=(6,4))
sns.heatmap(kot, cmap="Greens")
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7feb0f85e350>

Deleting correlated independent variables

The code we compare the correlation between variables and remove one of two features whose correlation is higher than 0.9

In [17]:
corr = df.corr()
kot = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i+1, corr.shape[0]):
        if corr.iloc[i,j] >= 0.9:
            if kot[j]:
                kot[j] = False
selected_columns = df.columns[kot]
df2 = df[selected_columns]
In [18]:
kot   #<== PĘTLA ZROBIŁA NAM wektor 31 elementów True- False
Out[18]:
array([ True,  True, False, False,  True,  True,  True, False,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True, False, False, False, False,  True,  True,  True,
       False,  True,  True])

Dimensions have been reduced

In [19]:
blue(df.shape)
green(df2.shape)
(569, 30)
(569, 20)

OLS linear regression model for variables before reduction

In [20]:
blue(df.shape)
green(df2.shape)
(569, 30)
(569, 20)
In [21]:
X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']  
In [22]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
R2: 0.980200

OLS linear regression model for variables after reduction

In [23]:
X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']  
In [24]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')
R2: 0.965952
The reduction of dimensions caused the deterioration of the models properties

Eliminates variables previously selected in the FAT procedure

In [25]:
FAT
Out[25]:
compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64
In [26]:
df3 = df.drop(['compactness_se','concave points_worst','concavity_worst','concave points_mean','compactness_worst','concavity_mean'],1)
In [27]:
X3 = df3.drop('compactness_mean', axis=1) 
y3 = df3['compactness_mean']  
In [28]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y3, sm.add_constant(X3))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')
R2: 0.962911
The reduction of dimensions caused the deterioration of the models properties

Artykuł Feature Selection Techniques – Pearson correlation pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>
Feature Selection Techniques (by filter methods): numerical_ input, categorical output https://sigmaquality.pl/models/feature-selection-techniques/feature-selection-by-filter-methods-numerical_-input-categorical-output-280320200940/ Sat, 28 Mar 2020 08:41:26 +0000 http://sigmaquality.pl/feature-selection-by-filter-methods-numerical_-input-categorical-output-280320200940/ 280320200940 Source of data: https://archive.ics.uci.edu/ml/datasets/Air+Quality In this case, statistical methods are used: We always have continuous and discrete variables in the data set. This procedure [...]

Artykuł Feature Selection Techniques (by filter methods): numerical_ input, categorical output pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>

In this case, statistical methods are used:
We always have continuous and discrete variables in the data set.
This procedure applies to the relations of discrete independent variables in relation to discrete result variables.
Below I show the analysis of numerical variables when the resulting value is discrete.

How to Choose a Feature Selection Method For Machine Learning

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)
In [2]:
##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')
In [3]:
df = pd.read_csv ('/home/wojciech/Pulpit/1/AirQualityUCI.csv', sep=';',nrows=1000)
green(df.shape)
df.head(3)
(1000, 17)
Out[3]:
Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Unnamed: 15 Unnamed: 16
0 10/03/2004 18.00.00 2,6 1360 150 11,9 1046 166 1056 113 1692 1268 13,6 48,9 0,7578 NaN NaN
1 10/03/2004 19.00.00 2 1292 112 9,4 955 103 1174 92 1559 972 13,3 47,7 0,7255 NaN NaN
2 10/03/2004 20.00.00 2,2 1402 88 9,0 939 131 1140 114 1555 1074 11,9 54,0 0,7502 NaN NaN

Usuwanie niepotrzebnych kolumn

In [4]:
del df['Unnamed: 15']
del df['Unnamed: 16']

Kasuje brakujące rekordy

In [5]:
green(df.shape)
df.isnull().sum()
df = df.dropna(how='any')
blue(df.shape)
blue(df.isnull().sum())
(1000, 15)
(1000, 15)
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

Kasuje duplikaty

nie było duplikatów

In [6]:
green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)
(1000, 15)
(1000, 15)

Z daty wyciągam dzień tygodnia, miesiąc, oraz godzinę jako zmienne ciągłe

In [7]:
df['Date'] = pd.to_datetime(df.Date)
df['day'] = df['Date'].dt.weekday
df['month'] = df['Date'].dt.month
df['hour'] = df['Time'].str.slice(0,2)
df[['Date','day','month','hour']].head(3)
Out[7]:
Date day month hour
0 2004-10-03 6 10 18
1 2004-10-03 6 10 19
2 2004-10-03 6 10 20
In [8]:
del df['Date']
del df['Time']

Kasuje zmienną -200 oznaczającą błąd danych

In [9]:
df[['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)',
       'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)',
       'T', 'RH', 'AH', 'day', 'month', 'hour']] = df[['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)',
       'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)',
       'T', 'RH', 'AH', 'day', 'month', 'hour']].replace(-200,np.NaN)
In [10]:
df.isnull().sum()
Out[10]:
CO(GT)             0
PT08.S1(CO)       27
NMHC(GT)         274
C6H6(GT)           0
PT08.S2(NMHC)     27
NOx(GT)          206
PT08.S3(NOx)      27
NO2(GT)          206
PT08.S4(NO2)      27
PT08.S5(O3)       27
T                  0
RH                 0
AH                 0
day                0
month              0
hour               0
dtype: int64
In [11]:
del df['NMHC(GT)']
green(df.shape)
df.isnull().sum()
df = df.dropna(how='any')
blue(df.shape)
blue(df.isnull().sum())
(1000, 15)
(768, 15)
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
day              0
month            0
hour             0
dtype: int64

Zamieniam zmienne na wartości numeryczne

In [12]:
blue(df.dtypes)
CO(GT)            object
PT08.S1(CO)      float64
C6H6(GT)          object
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                 object
RH                object
AH                object
day                int64
month              int64
hour              object
dtype: object

Macierz korelacji

In [13]:
df['CO(GT)'] = df['CO(GT)'].str.replace(',', '.')
In [14]:
df['C6H6(GT)'] = df['C6H6(GT)'].str.replace(',', '.')
In [15]:
df['T'] = df['T'].str.replace(',', '.')
In [16]:
df['RH'] = df['RH'].str.replace(',', '.')
In [17]:
df['AH'] = df['AH'].str.replace(',', '.')
In [18]:
df[['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)',
       'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)',
       'T', 'RH', 'AH', 'day', 'month', 'hour']] = df[['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)',
       'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)',
       'T', 'RH', 'AH', 'day', 'month', 'hour']].astype(float)
In [19]:
CORREL = df.corr()
plt.figure(figsize=(10,6))
sns.heatmap(CORREL, annot=True, cbar=False, cmap="coolwarm")
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcdc2507d90>

Koduje zmienną kategoryczną wynikową – C6H6(GT)

In [20]:
print('max:',df['C6H6(GT)'].max())
print('min:',df['C6H6(GT)'].min())

sns.distplot(np.array(df['C6H6(GT)']))
max: 39.2
min: 0.5
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fcdbef8a810>
In [21]:
df['C6H6(GT)'] = df['C6H6(GT)'].apply(lambda x: 1 if x > 10 else 0)
df['C6H6(GT)'].value_counts()
Out[21]:
0    446
1    322
Name: C6H6(GT), dtype: int64
In [22]:
df['C6H6(GT)'] = pd.Categorical(df['C6H6(GT)']).codes
df['C6H6(GT)'].value_counts()
Out[22]:
0    446
1    322
Name: C6H6(GT), dtype: int64

Model regresji liniowej bez redukcji zmiennych

In [23]:
blue(df.dtypes)
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)            int8
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
day              float64
month            float64
hour             float64
dtype: object
In [24]:
X = df.drop('C6H6(GT)', axis=1) 
y = df['C6H6(GT)']  

Podział na dane treningowe i testowe

In [25]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)

Definicje

In [26]:
# Classification Assessment
def Classification_Assessment(model ,Xtrain, ytrain, Xtest, ytest, y_pred):
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
    from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

    print("Recall Training data:     ", np.round(recall_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Precision Training data:  ", np.round(precision_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Recall Test data:         ", np.round(recall_score(ytest, model.predict(Xtest)), decimals=4)) 
    print("Precision Test data:      ", np.round(precision_score(ytest, model.predict(Xtest)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Confusion Matrix Test data")
    print(confusion_matrix(ytest, model.predict(Xtest)))
    print("----------------------------------------------------------------------")
    print(classification_report(ytest, model.predict(Xtest)))
    
    y_pred_proba = model.predict_proba(Xtest)[::,1]
    fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred)
    auc = metrics.roc_auc_score(ytest, y_pred)
    plt.plot(fpr, tpr, label='Logistic Regression (auc = 
    plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
    plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.legend(loc=4)
    plt.plot([0, 1], [0, 1],'r--')
    plt.show()
    print('auc',auc)
In [27]:
blue(X.shape)
green(X_train.shape)
green(X_test.shape)
(768, 14)
(614, 14)
(154, 14)

Modelu klasyfikacji bez wyboru funkcji

In [28]:
import numpy as np
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid.fit(X_train, y_train) 
y_pred_LRC = LR_Grid.predict(X_test)
In [29]:
Classification_Assessment(LR_Grid ,X_train, y_train, X_test, y_test, y_pred_LRC)
Recall Training data:      0.9728
Precision Training data:   0.9766
----------------------------------------------------------------------
Recall Test data:          0.9692
Precision Test data:       0.9844
----------------------------------------------------------------------
Confusion Matrix Test data
[[88  1]
 [ 2 63]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.99      0.98        89
           1       0.98      0.97      0.98        65

    accuracy                           0.98       154
   macro avg       0.98      0.98      0.98       154
weighted avg       0.98      0.98      0.98       154

auc 0.9789974070872948

Redukcja zmiennych niezależnych za pomocą OLS

In [30]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y, sm.add_constant(X))
model_fit = model.fit()

blue(model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               C6H6(GT)   R-squared:                       0.691
Model:                            OLS   Adj. R-squared:                  0.685
Method:                 Least Squares   F-statistic:                     120.2
Date:                Sat, 28 Mar 2020   Prob (F-statistic):          4.93e-181
Time:                        09:35:48   Log-Likelihood:                -96.437
No. Observations:                 768   AIC:                             222.9
Df Residuals:                     753   BIC:                             292.5
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -2.0863      0.283     -7.371      0.000      -2.642      -1.531
CO(GT)           -0.0003      0.000     -0.938      0.348      -0.001       0.000
PT08.S1(CO)      -0.0002      0.000     -1.127      0.260      -0.001       0.000
PT08.S2(NMHC)     0.0030      0.000      8.413      0.000       0.002       0.004
NOx(GT)          -0.0002      0.000     -0.395      0.693      -0.001       0.001
PT08.S3(NOx)      0.0007      0.000      5.578      0.000       0.000       0.001
NO2(GT)           0.0013      0.001      1.408      0.160      -0.001       0.003
PT08.S4(NO2)     -0.0007      0.000     -2.763      0.006      -0.001      -0.000
PT08.S5(O3)    3.589e-05   8.45e-05      0.425      0.671      -0.000       0.000
T                 0.0008      0.011      0.073      0.942      -0.021       0.022
RH               -0.0015      0.004     -0.392      0.695      -0.009       0.006
AH                0.4456      0.278      1.602      0.109      -0.100       0.992
day              -0.0008      0.005     -0.138      0.890      -0.011       0.010
month            -0.0009      0.004     -0.243      0.808      -0.008       0.006
hour             -0.0028      0.002     -1.495      0.135      -0.006       0.001
==============================================================================
Omnibus:                       43.664   Durbin-Watson:                   1.370
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.380
Skew:                           0.068   Prob(JB):                     0.000168
Kurtosis:                       2.276   Cond. No.                     8.46e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.46e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [31]:
p_values = model_fit.summary2().tables[1]['P>|t|']
## zaokrąglam


p_values = np.round(p_values, decimals=2)
p_values= p_values.sort_values()

plt.figure(figsize=(3,8))
p_values.plot(kind='barh')
plt.title('p-value for independent variables in OLS')
plt.grid(True)
plt.ylabel('independent variables')
plt.xlabel('p-value')
plt.xticks(rotation=90)
Out[31]:
(array([0. , 0.2, 0.4, 0.6, 0.8, 1. ]), <a list of 6 Text xticklabel objects>)

Wybieramy zmienne z p-value < 0.1

In [32]:
df.columns
Out[32]:
Index(['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)',
       'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH',
       'AH', 'day', 'month', 'hour'],
      dtype='object')
In [33]:
df2= df[['PT08.S4(NO2)','PT08.S3(NOx)','PT08.S2(NMHC)','AH','C6H6(GT)']]
In [34]:
y= y.to_frame()
y.head(4)
Out[34]:
C6H6(GT)
0 1
1 0
2 0
3 0
In [35]:
fig = plt.figure(figsize = (20, 25))
j = 0
for i in df2.columns:
    plt.subplot(6, 4, j+1)
    j = 1+j
    sns.distplot(df2[i][y['C6H6(GT)']==0], color='#999999', label = '0')
    sns.distplot(df2[i][y['C6H6(GT)']==1], color='#ff0000', label = '1')
    plt.legend(loc='best',fontsize=10)
fig.suptitle('Classification charts',fontsize=34,color='#ff0000',alpha=0.3)
fig.tight_layout()
fig.subplots_adjust(top=0.95)
plt.show()
In [36]:
def scientist_plot(data, y, AAA, Title):
    fig = plt.figure(figsize = (20, 25))
    j = 0
    for i in df2.columns:
        plt.subplot(6, 4, j+1)
        j = 1+j
        sns.distplot(data[i][y[AAA]==0], color='#999999', label = '0')
        sns.distplot(data[i][y[AAA]==1], color='#274e13', label = '1')
        plt.legend(loc='best',fontsize=10)
    fig.suptitle(Title,fontsize=34,color='#274e13',alpha=0.5)
    fig.tight_layout()
    fig.subplots_adjust(top=0.95)
    plt.show()
In [37]:
scientist_plot(df2, y, 'C6H6(GT)','Classification charts')
In [38]:
fig = plt.figure(figsize = (20, 25))
kot = ['#999999','#274e13']
sns.pairplot(data=df2[['PT08.S4(NO2)','PT08.S3(NOx)','PT08.S2(NMHC)','AH','C6H6(GT)']], hue='C6H6(GT)', dropna=True, height=2, palette=kot)
fig.suptitle('Classification charts',fontsize=34,color='#274e13',alpha=0.3)
fig.tight_layout()
fig.subplots_adjust(top=0.95)
plt.show()
<Figure size 1440x1800 with 0 Axes>

Artykuł Feature Selection Techniques (by filter methods): numerical_ input, categorical output pochodzi z serwisu THE DATA SCIENCE LIBRARY.

]]>