In [1]:

import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/tit_train.csv', na_values="-1")
df.head(2)

I started with a loop connecting two functions in pairs¶

## ile jest zmiennych
a,b = df.shape     #<- ile mamy kolumn
b

13

for i in range(1,b):
    i = df.columns[i]
    for f in range (1,b):
        f = df.columns[f]
        print(i,f)

PassengerId PassengerId
PassengerId Survived
PassengerId Pclass
PassengerId Name
PassengerId Sex
PassengerId Age
PassengerId SibSp
PassengerId Parch
PassengerId Ticket
PassengerId Fare
PassengerId Cabin
PassengerId Embarked
Survived PassengerId
Survived Survived
Survived Pclass
Survived Name
Survived Sex
Survived Age
Survived SibSp
Survived Parch
Survived Ticket
Survived Fare
Survived Cabin
Survived Embarked
Pclass PassengerId
Pclass Survived
Pclass Pclass
Pclass Name
Pclass Sex
Pclass Age
Pclass SibSp
Pclass Parch
Pclass Ticket
Pclass Fare
Pclass Cabin
Pclass Embarked
Name PassengerId
Name Survived
Name Pclass
Name Name
Name Sex
Name Age
Name SibSp
Name Parch
Name Ticket
Name Fare
Name Cabin
Name Embarked
Sex PassengerId
Sex Survived
Sex Pclass
Sex Name
Sex Sex
Sex Age
Sex SibSp
Sex Parch
Sex Ticket
Sex Fare
Sex Cabin
Sex Embarked
Age PassengerId
Age Survived
Age Pclass
Age Name
Age Sex
Age Age
Age SibSp
Age Parch
Age Ticket
Age Fare
Age Cabin
Age Embarked
SibSp PassengerId
SibSp Survived
SibSp Pclass
SibSp Name
SibSp Sex
SibSp Age
SibSp SibSp
SibSp Parch
SibSp Ticket
SibSp Fare
SibSp Cabin
SibSp Embarked
Parch PassengerId
Parch Survived
Parch Pclass
Parch Name
Parch Sex
Parch Age
Parch SibSp
Parch Parch
Parch Ticket
Parch Fare
Parch Cabin
Parch Embarked
Ticket PassengerId
Ticket Survived
Ticket Pclass
Ticket Name
Ticket Sex
Ticket Age
Ticket SibSp
Ticket Parch
Ticket Ticket
Ticket Fare
Ticket Cabin
Ticket Embarked
Fare PassengerId
Fare Survived
Fare Pclass
Fare Name
Fare Sex
Fare Age
Fare SibSp
Fare Parch
Fare Ticket
Fare Fare
Fare Cabin
Fare Embarked
Cabin PassengerId
Cabin Survived
Cabin Pclass
Cabin Name
Cabin Sex
Cabin Age
Cabin SibSp
Cabin Parch
Cabin Ticket
Cabin Fare
Cabin Cabin
Cabin Embarked
Embarked PassengerId
Embarked Survived
Embarked Pclass
Embarked Name
Embarked Sex
Embarked Age
Embarked SibSp
Embarked Parch
Embarked Ticket
Embarked Fare
Embarked Cabin
Embarked Embarked

NUMBER OF EMPTY RECORDS vs. FULL RECORDS
----------------------------------------
Age -------- 177 -------- 714
Cabin -------- 687 -------- 204
Embarked -------- 2 -------- 889

Unnamed: 0     0
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

(891, 13)

DISCRETE FUNCTIONS CODED
------------------------
Name --- object
Sex --- object
Ticket --- object
Cabin --- object
Embarked --- object

b:  13
a : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

 RR2: 0.754 Index(['Pclass', 'Sex'], dtype='object')
 RR2: 0.754 Index(['Sex', 'Pclass'], dtype='object')
 RR2: 0.722 Index(['Sex', 'Fare'], dtype='object')
 RR2: 0.733 Index(['Sex', 'Cabin'], dtype='object')
 RR2: 0.722 Index(['Fare', 'Sex'], dtype='object')
 RR2: 0.733 Index(['Cabin', 'Sex'], dtype='object')

 RR2: 0.758 Index(['Pclass', 'Sex', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'SibSp', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Cabin', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Embarked', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Sex', 'SibSp', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Cabin', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Embarked', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Sex', 'Pclass'], dtype='object')

 RR2: 0.762 Index(['Pclass', 'Sex', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Pclass', 'Cabin', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Pclass', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Cabin', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Sex', 'Pclass', 'Embarked'], dtype='object')

(891, 12)
(891,)

Using loops in place of gaps I insert values out of range¶

print('NUMBER OF EMPTY RECORDS vs. FULL RECORDS')
print('----------------------------------------')
for i in range(1,b):
    i = df.columns[i]
    r = df[i].isnull().sum()
    h = df[i].count()
   
    if r > 0:
        print(i,"--------",r,"--------",h)

NUMBER OF EMPTY RECORDS vs. FULL RECORDS
----------------------------------------
Age -------- 177 -------- 714
Cabin -------- 687 -------- 204
Embarked -------- 2 -------- 889

Unnamed: 0     0
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

(891, 13)

DISCRETE FUNCTIONS CODED
------------------------
Name --- object
Sex --- object
Ticket --- object
Cabin --- object
Embarked --- object

b:  13
a : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

 RR2: 0.754 Index(['Pclass', 'Sex'], dtype='object')
 RR2: 0.754 Index(['Sex', 'Pclass'], dtype='object')
 RR2: 0.722 Index(['Sex', 'Fare'], dtype='object')
 RR2: 0.733 Index(['Sex', 'Cabin'], dtype='object')
 RR2: 0.722 Index(['Fare', 'Sex'], dtype='object')
 RR2: 0.733 Index(['Cabin', 'Sex'], dtype='object')

 RR2: 0.758 Index(['Pclass', 'Sex', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'SibSp', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Cabin', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Embarked', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Sex', 'SibSp', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Cabin', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Embarked', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Sex', 'Pclass'], dtype='object')

 RR2: 0.762 Index(['Pclass', 'Sex', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Pclass', 'Cabin', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Pclass', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Cabin', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Sex', 'Pclass', 'Embarked'], dtype='object')

(891, 12)
(891,)

/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

df.fillna(-777, inplace=True)

df = df.dropna(how='any')
df.isnull().sum()

Unnamed: 0     0
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

df.shape

(891, 13)

Encodes discrete (categorical) variables¶

import numpy as np

a,b = df.shape     #<- ile mamy kolumn
b


print('DISCRETE FUNCTIONS CODED')
print('------------------------')
for i in range(1,b):
    i = df.columns[i]
    f = df[i].dtypes
    if f == np.object:
        print(i,"---",f)   
    
        if f == np.object:
        
            df[i] = pd.Categorical(df[i]).codes
        
            continue

DISCRETE FUNCTIONS CODED
------------------------
Name --- object
Sex --- object
Ticket --- object
Cabin --- object
Embarked --- object

b:  13
a : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

 RR2: 0.754 Index(['Pclass', 'Sex'], dtype='object')
 RR2: 0.754 Index(['Sex', 'Pclass'], dtype='object')
 RR2: 0.722 Index(['Sex', 'Fare'], dtype='object')
 RR2: 0.733 Index(['Sex', 'Cabin'], dtype='object')
 RR2: 0.722 Index(['Fare', 'Sex'], dtype='object')
 RR2: 0.733 Index(['Cabin', 'Sex'], dtype='object')

 RR2: 0.758 Index(['Pclass', 'Sex', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'SibSp', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Cabin', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Embarked', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Sex', 'SibSp', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Cabin', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Embarked', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Sex', 'Pclass'], dtype='object')

 RR2: 0.762 Index(['Pclass', 'Sex', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Pclass', 'Cabin', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Pclass', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Cabin', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Sex', 'Pclass', 'Embarked'], dtype='object')

(891, 12)
(891,)

/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

 RR2: 1.051 Index(['Name', 'Ticket', 'Sex'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Name', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Name', 'Fare'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Age', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Ticket', 'Name'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Fare', 'Name'], dtype='object')
 RR2: 1.051 Index(['Ticket', 'Name', 'Sex'], dtype='object')
 RR2: 1.053 Index(['Ticket', 'Sex', 'Name'], dtype='object')
 RR2: 1.052 Index(['Fare', 'Sex', 'Name'], dtype='object')

I run the LinearRegression () model

y = df['Survived']
X = df.drop('Survived', axis=1)

from sklearn.linear_model import LinearRegression

regr = LinearRegression()

I create loops for two variables based on LinearRegression ()¶

c,b = df.shape     #<- ile mamy kolumn
print('b: ',b)

a = list(range(1,b))
print('a :', a)

b:  13
a : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

 RR2: 0.754 Index(['Pclass', 'Sex'], dtype='object')
 RR2: 0.754 Index(['Sex', 'Pclass'], dtype='object')
 RR2: 0.722 Index(['Sex', 'Fare'], dtype='object')
 RR2: 0.733 Index(['Sex', 'Cabin'], dtype='object')
 RR2: 0.722 Index(['Fare', 'Sex'], dtype='object')
 RR2: 0.733 Index(['Cabin', 'Sex'], dtype='object')

 RR2: 0.758 Index(['Pclass', 'Sex', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'SibSp', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Cabin', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Embarked', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Sex', 'SibSp', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Cabin', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Embarked', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Sex', 'Pclass'], dtype='object')

 RR2: 0.762 Index(['Pclass', 'Sex', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Pclass', 'Cabin', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Pclass', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Cabin', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Sex', 'Pclass', 'Embarked'], dtype='object')

(891, 12)
(891,)

/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

 RR2: 1.051 Index(['Name', 'Ticket', 'Sex'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Name', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Name', 'Fare'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Age', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Ticket', 'Name'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Fare', 'Name'], dtype='object')
 RR2: 1.051 Index(['Ticket', 'Name', 'Sex'], dtype='object')
 RR2: 1.053 Index(['Ticket', 'Sex', 'Name'], dtype='object')
 RR2: 1.052 Index(['Fare', 'Sex', 'Name'], dtype='object')

from sklearn import metrics
b= b-2


for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        
        y = df['Survived']       
        X = df.drop('Survived', axis=1)
        
        #a = X.columns[i]
        
        #b = X.columns[f]
        
        col = X.columns[[i,f]]   #<-- nazwy kolumn
        X = X[col]               #<-- FAKTYCZNE warianty zbioru X
        regr.fit(X, y)
        y_pred = regr.predict(X)
        R = regr.score(X, y)
        R2 = np.sqrt(metrics.mean_squared_error(y, y_pred))
        RR2 = R2+R
        if RR2 > 0.72:
        # print(' R2: 
            print(' RR2:

 RR2: 0.754 Index(['Pclass', 'Sex'], dtype='object')
 RR2: 0.754 Index(['Sex', 'Pclass'], dtype='object')
 RR2: 0.722 Index(['Sex', 'Fare'], dtype='object')
 RR2: 0.733 Index(['Sex', 'Cabin'], dtype='object')
 RR2: 0.722 Index(['Fare', 'Sex'], dtype='object')
 RR2: 0.733 Index(['Cabin', 'Sex'], dtype='object')

 RR2: 0.758 Index(['Pclass', 'Sex', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'SibSp', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Cabin', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Embarked', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Sex', 'SibSp', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Cabin', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Embarked', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Sex', 'Pclass'], dtype='object')

 RR2: 0.762 Index(['Pclass', 'Sex', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Pclass', 'Cabin', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Pclass', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Cabin', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Sex', 'Pclass', 'Embarked'], dtype='object')

(891, 12)
(891,)

/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

 RR2: 1.051 Index(['Name', 'Ticket', 'Sex'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Name', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Name', 'Fare'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Age', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Ticket', 'Name'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Fare', 'Name'], dtype='object')
 RR2: 1.051 Index(['Ticket', 'Name', 'Sex'], dtype='object')
 RR2: 1.053 Index(['Ticket', 'Sex', 'Name'], dtype='object')
 RR2: 1.052 Index(['Fare', 'Sex', 'Name'], dtype='object')

I create loops for three variables based on LinearRegression ()¶

from sklearn import metrics

for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        for g in range (1,b):
            g = a[g]
        
        
            y = df['Survived']       
            X = df.drop('Survived', axis=1)
        
            
            col = X.columns[[i,f,g]]   #<-- nazwy kolumn
            X = X[col]               #<-- FAKTYCZNE warianty zbioru X
            regr.fit(X, y)
            y_pred = regr.predict(X)
            R = regr.score(X, y)
            R2 = np.sqrt(metrics.mean_squared_error(y, y_pred))
            RR2 = R2+R
            if RR2 >= 0.757:
       
                print(' RR2:

 RR2: 0.758 Index(['Pclass', 'Sex', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'SibSp', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Cabin', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Embarked', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Sex', 'SibSp', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Cabin', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Embarked', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Sex', 'Pclass'], dtype='object')

 RR2: 0.762 Index(['Pclass', 'Sex', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Pclass', 'Cabin', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Pclass', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Cabin', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Sex', 'Pclass', 'Embarked'], dtype='object')

(891, 12)
(891,)

/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

 RR2: 1.051 Index(['Name', 'Ticket', 'Sex'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Name', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Name', 'Fare'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Age', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Ticket', 'Name'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Fare', 'Name'], dtype='object')
 RR2: 1.051 Index(['Ticket', 'Name', 'Sex'], dtype='object')
 RR2: 1.053 Index(['Ticket', 'Sex', 'Name'], dtype='object')
 RR2: 1.052 Index(['Fare', 'Sex', 'Name'], dtype='object')

I create loops for four variables based on LinearRegression ()¶

from sklearn import metrics

for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        for g in range (1,b):
            g = a[g]
            for r in range (1,b):
                r = a[r]
        
            y = df['Survived']       
            X = df.drop('Survived', axis=1)
        
            
            col = X.columns[[i,f,g,r]]   #<-- nazwy kolumn
            X = X[col]               #<-- FAKTYCZNE warianty zbioru X
            regr.fit(X, y)
            y_pred = regr.predict(X)
            R = regr.score(X, y)
            R2 = np.sqrt(metrics.mean_squared_error(y, y_pred))
            RR2 = R2+R
            if RR2 >= 0.761:
       
                print(' RR2:

 RR2: 0.762 Index(['Pclass', 'Sex', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Pclass', 'Cabin', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Pclass', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Cabin', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Sex', 'Pclass', 'Embarked'], dtype='object')

(891, 12)
(891,)

/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

 RR2: 1.051 Index(['Name', 'Ticket', 'Sex'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Name', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Name', 'Fare'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Age', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Ticket', 'Name'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Fare', 'Name'], dtype='object')
 RR2: 1.051 Index(['Ticket', 'Name', 'Sex'], dtype='object')
 RR2: 1.053 Index(['Ticket', 'Sex', 'Name'], dtype='object')
 RR2: 1.052 Index(['Fare', 'Sex', 'Name'], dtype='object')

I am starting the RandomForestRegressor model

y = df['Survived']       
X = df.drop('Survived', axis=1)
print(X.shape)
print(y.shape)

(891, 12)
(891,)

/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

 RR2: 1.051 Index(['Name', 'Ticket', 'Sex'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Name', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Name', 'Fare'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Age', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Ticket', 'Name'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Fare', 'Name'], dtype='object')
 RR2: 1.051 Index(['Ticket', 'Name', 'Sex'], dtype='object')
 RR2: 1.053 Index(['Ticket', 'Sex', 'Name'], dtype='object')
 RR2: 1.052 Index(['Fare', 'Sex', 'Name'], dtype='object')

from sklearn.ensemble import RandomForestRegressor

model_RFC1 = RandomForestRegressor().fit(X, y)

/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

 RR2: 1.051 Index(['Name', 'Ticket', 'Sex'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Name', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Name', 'Fare'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Age', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Ticket', 'Name'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Fare', 'Name'], dtype='object')
 RR2: 1.051 Index(['Ticket', 'Name', 'Sex'], dtype='object')
 RR2: 1.053 Index(['Ticket', 'Sex', 'Name'], dtype='object')
 RR2: 1.052 Index(['Fare', 'Sex', 'Name'], dtype='object')

from sklearn import metrics

for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        for g in range (1,b):
            g = a[g]
        
        
            y = df['Survived']       
            X = df.drop('Survived', axis=1)
        
            
            col = X.columns[[i,f,g]]   #<-- nazwy kolumn
            X = X[col]               #<-- FAKTYCZNE warianty zbioru X
            model_RFC1.fit(X, y)
            y_pred2 = model_RFC1.predict(X)
            R = model_RFC1.score(X, y)
            R2 = np.sqrt(metrics.mean_squared_error(y, y_pred2))
            RR2 = R2+R
            if RR2 >= 1.05:
       
                print(' RR2:

 RR2: 1.051 Index(['Name', 'Ticket', 'Sex'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Name', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Name', 'Fare'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Age', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Ticket', 'Name'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Fare', 'Name'], dtype='object')
 RR2: 1.051 Index(['Ticket', 'Name', 'Sex'], dtype='object')
 RR2: 1.053 Index(['Ticket', 'Sex', 'Name'], dtype='object')
 RR2: 1.052 Index(['Fare', 'Sex', 'Name'], dtype='object')

My own tractor comes to similar conclusions as other tools in the series: Feature Selection Techniques.
Only that my tractor is probably faster in calculations …..

010420201017

Forward selection is an iterative method in which we start with no function in the model. In each iteration, we add a function that best improves our model until adding a new variable improves the model’s performance.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)

##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')

data source: https://archive.ics.uci.edu/ml/datasets/QSAR+oral+toxicity

df = pd.read_csv ('/home/wojciech/Pulpit/6/qsar_oral_toxicity.csv', sep=';')
green(df.shape)
df.head(3)

(8991, 1025)

Series([], dtype: int64)

(8991, 1025)

(8991, 1025)
(8514, 1025)

0            int64
0.1          int64
0.2          int64
0.3          int64
0.4          int64
             ...  
0.967        int64
0.968        int64
0.969        int64
0.970        int64
negative    object
Length: 1025, dtype: object

Index(['0', '0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9',
       ...
       '0.962', '0.963', '0.964', '0.965', '0.966', '0.967', '0.968', '0.969',
       '0.970', 'negative'],
      dtype='object', length=1025)

0    7795
1     719
Name: negative, dtype: int64

0    0
1    0
Name: ident, dtype: int8

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1024 out of 1024 | elapsed:   39.9s finished

[2020-04-01 09:39:39] Features: 1/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1023 out of 1023 | elapsed:   43.9s finished

[2020-04-01 09:40:23] Features: 2/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1022 out of 1022 | elapsed:   46.6s finished

[2020-04-01 09:41:09] Features: 3/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1021 out of 1021 | elapsed:   50.6s finished

[2020-04-01 09:42:00] Features: 4/15 -- score: -0.0917097195802096[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1020 out of 1020 | elapsed:   51.1s finished

[2020-04-01 09:42:51] Features: 5/15 -- score: -0.06352720295000061[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1019 out of 1019 | elapsed:   54.3s finished

[2020-04-01 09:43:45] Features: 6/15 -- score: -0.050378334885553946[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1018 out of 1018 | elapsed:   58.9s finished

[2020-04-01 09:44:44] Features: 7/15 -- score: -0.035404146912808354[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1017 out of 1017 | elapsed:  1.0min finished

[2020-04-01 09:45:45] Features: 8/15 -- score: -0.014731021991353432[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1016 out of 1016 | elapsed:  1.0min finished

[2020-04-01 09:46:46] Features: 9/15 -- score: 0.007752557690146133[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1015 out of 1015 | elapsed:  1.1min finished

[2020-04-01 09:47:49] Features: 10/15 -- score: 0.034005698374276985[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1014 out of 1014 | elapsed:  1.1min finished

[2020-04-01 09:48:54] Features: 11/15 -- score: 0.0415299552312852[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1013 out of 1013 | elapsed:  1.3min finished

[2020-04-01 09:50:10] Features: 12/15 -- score: 0.050864666848338125[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1012 out of 1012 | elapsed:  1.3min finished

[2020-04-01 09:51:28] Features: 13/15 -- score: 0.058403788853600556[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1011 out of 1011 | elapsed:  1.3min finished

[2020-04-01 09:52:48] Features: 14/15 -- score: 0.062143619559723404[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1010 out of 1010 | elapsed:  1.3min finished

[2020-04-01 09:54:09] Features: 15/15 -- score: 0.06776823076716185

I’m looking for empty cells¶

import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Series([], dtype: int64)

(8991, 1025)

(8991, 1025)
(8514, 1025)

0            int64
0.1          int64
0.2          int64
0.3          int64
0.4          int64
             ...  
0.967        int64
0.968        int64
0.969        int64
0.970        int64
negative    object
Length: 1025, dtype: object

Index(['0', '0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9',
       ...
       '0.962', '0.963', '0.964', '0.965', '0.966', '0.967', '0.968', '0.969',
       '0.970', 'negative'],
      dtype='object', length=1025)

0    7795
1     719
Name: negative, dtype: int64

0    0
1    0
Name: ident, dtype: int8

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1024 out of 1024 | elapsed:   39.9s finished

[2020-04-01 09:39:39] Features: 1/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1023 out of 1023 | elapsed:   43.9s finished

[2020-04-01 09:40:23] Features: 2/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1022 out of 1022 | elapsed:   46.6s finished

[2020-04-01 09:41:09] Features: 3/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1021 out of 1021 | elapsed:   50.6s finished

[2020-04-01 09:42:00] Features: 4/15 -- score: -0.0917097195802096[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1020 out of 1020 | elapsed:   51.1s finished

[2020-04-01 09:42:51] Features: 5/15 -- score: -0.06352720295000061[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1019 out of 1019 | elapsed:   54.3s finished

[2020-04-01 09:43:45] Features: 6/15 -- score: -0.050378334885553946[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1018 out of 1018 | elapsed:   58.9s finished

[2020-04-01 09:44:44] Features: 7/15 -- score: -0.035404146912808354[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1017 out of 1017 | elapsed:  1.0min finished

[2020-04-01 09:45:45] Features: 8/15 -- score: -0.014731021991353432[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1016 out of 1016 | elapsed:  1.0min finished

[2020-04-01 09:46:46] Features: 9/15 -- score: 0.007752557690146133[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1015 out of 1015 | elapsed:  1.1min finished

[2020-04-01 09:47:49] Features: 10/15 -- score: 0.034005698374276985[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1014 out of 1014 | elapsed:  1.1min finished

[2020-04-01 09:48:54] Features: 11/15 -- score: 0.0415299552312852[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1013 out of 1013 | elapsed:  1.3min finished

[2020-04-01 09:50:10] Features: 12/15 -- score: 0.050864666848338125[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1012 out of 1012 | elapsed:  1.3min finished

[2020-04-01 09:51:28] Features: 13/15 -- score: 0.058403788853600556[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1011 out of 1011 | elapsed:  1.3min finished

[2020-04-01 09:52:48] Features: 14/15 -- score: 0.062143619559723404[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1010 out of 1010 | elapsed:  1.3min finished

[2020-04-01 09:54:09] Features: 15/15 -- score: 0.06776823076716185

[0, 1, 2, 67, 93, 231, 426, 506, 512, 526, 558, 559, 696, 795, 939]

null_value = df.isnull().sum(axis=0)
null_value[null_value != 0]

Series([], dtype: int64)

Mark empty cells as -999¶

df.fillna(-999, inplace=True)

df.shape

(8991, 1025)

Deletes duplicates¶

there were no duplicates

green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)

(8991, 1025)
(8514, 1025)

0            int64
0.1          int64
0.2          int64
0.3          int64
0.4          int64
             ...  
0.967        int64
0.968        int64
0.969        int64
0.970        int64
negative    object
Length: 1025, dtype: object

Index(['0', '0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9',
       ...
       '0.962', '0.963', '0.964', '0.965', '0.966', '0.967', '0.968', '0.969',
       '0.970', 'negative'],
      dtype='object', length=1025)

0    7795
1     719
Name: negative, dtype: int64

0    0
1    0
Name: ident, dtype: int8

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1024 out of 1024 | elapsed:   39.9s finished

[2020-04-01 09:39:39] Features: 1/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1023 out of 1023 | elapsed:   43.9s finished

[2020-04-01 09:40:23] Features: 2/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1022 out of 1022 | elapsed:   46.6s finished

[2020-04-01 09:41:09] Features: 3/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1021 out of 1021 | elapsed:   50.6s finished

[2020-04-01 09:42:00] Features: 4/15 -- score: -0.0917097195802096[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1020 out of 1020 | elapsed:   51.1s finished

[2020-04-01 09:42:51] Features: 5/15 -- score: -0.06352720295000061[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1019 out of 1019 | elapsed:   54.3s finished

[2020-04-01 09:43:45] Features: 6/15 -- score: -0.050378334885553946[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1018 out of 1018 | elapsed:   58.9s finished

[2020-04-01 09:44:44] Features: 7/15 -- score: -0.035404146912808354[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1017 out of 1017 | elapsed:  1.0min finished

[2020-04-01 09:45:45] Features: 8/15 -- score: -0.014731021991353432[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1016 out of 1016 | elapsed:  1.0min finished

[2020-04-01 09:46:46] Features: 9/15 -- score: 0.007752557690146133[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1015 out of 1015 | elapsed:  1.1min finished

[2020-04-01 09:47:49] Features: 10/15 -- score: 0.034005698374276985[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1014 out of 1014 | elapsed:  1.1min finished

[2020-04-01 09:48:54] Features: 11/15 -- score: 0.0415299552312852[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1013 out of 1013 | elapsed:  1.3min finished

[2020-04-01 09:50:10] Features: 12/15 -- score: 0.050864666848338125[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1012 out of 1012 | elapsed:  1.3min finished

[2020-04-01 09:51:28] Features: 13/15 -- score: 0.058403788853600556[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1011 out of 1011 | elapsed:  1.3min finished

[2020-04-01 09:52:48] Features: 14/15 -- score: 0.062143619559723404[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1010 out of 1010 | elapsed:  1.3min finished

[2020-04-01 09:54:09] Features: 15/15 -- score: 0.06776823076716185

[0, 1, 2, 67, 93, 231, 426, 506, 512, 526, 558, 559, 696, 795, 939]

0.13 939
0.12 795
0.11 696
0.10 559
1 558
0.9 526
0.8 512
0.7 506
0.6 426
0.5 231
0.4 93
0.3 67
0.2 2
0.1 1
0 0

Index(['0', '0.1', '0.2', '0.64', '0.87', '0.218', '0.406', '0.479', '0.485',
       '0.499', '0.530', '0.531', '0.658', '0.751', '0.891'],
      dtype='object')

(8514, 1025)

blue(df.dtypes)

0            int64
0.1          int64
0.2          int64
0.3          int64
0.4          int64
             ...  
0.967        int64
0.968        int64
0.969        int64
0.970        int64
negative    object
Length: 1025, dtype: object

Index(['0', '0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9',
       ...
       '0.962', '0.963', '0.964', '0.965', '0.966', '0.967', '0.968', '0.969',
       '0.970', 'negative'],
      dtype='object', length=1025)

0    7795
1     719
Name: negative, dtype: int64

0    0
1    0
Name: ident, dtype: int8

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1024 out of 1024 | elapsed:   39.9s finished

[2020-04-01 09:39:39] Features: 1/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1023 out of 1023 | elapsed:   43.9s finished

[2020-04-01 09:40:23] Features: 2/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1022 out of 1022 | elapsed:   46.6s finished

[2020-04-01 09:41:09] Features: 3/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1021 out of 1021 | elapsed:   50.6s finished

[2020-04-01 09:42:00] Features: 4/15 -- score: -0.0917097195802096[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1020 out of 1020 | elapsed:   51.1s finished

[2020-04-01 09:42:51] Features: 5/15 -- score: -0.06352720295000061[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1019 out of 1019 | elapsed:   54.3s finished

[2020-04-01 09:43:45] Features: 6/15 -- score: -0.050378334885553946[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1018 out of 1018 | elapsed:   58.9s finished

[2020-04-01 09:44:44] Features: 7/15 -- score: -0.035404146912808354[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1017 out of 1017 | elapsed:  1.0min finished

[2020-04-01 09:45:45] Features: 8/15 -- score: -0.014731021991353432[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1016 out of 1016 | elapsed:  1.0min finished

[2020-04-01 09:46:46] Features: 9/15 -- score: 0.007752557690146133[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1015 out of 1015 | elapsed:  1.1min finished

[2020-04-01 09:47:49] Features: 10/15 -- score: 0.034005698374276985[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1014 out of 1014 | elapsed:  1.1min finished

[2020-04-01 09:48:54] Features: 11/15 -- score: 0.0415299552312852[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1013 out of 1013 | elapsed:  1.3min finished

[2020-04-01 09:50:10] Features: 12/15 -- score: 0.050864666848338125[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1012 out of 1012 | elapsed:  1.3min finished

[2020-04-01 09:51:28] Features: 13/15 -- score: 0.058403788853600556[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1011 out of 1011 | elapsed:  1.3min finished

[2020-04-01 09:52:48] Features: 14/15 -- score: 0.062143619559723404[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1010 out of 1010 | elapsed:  1.3min finished

[2020-04-01 09:54:09] Features: 15/15 -- score: 0.06776823076716185

[0, 1, 2, 67, 93, 231, 426, 506, 512, 526, 558, 559, 696, 795, 939]

0.13 939
0.12 795
0.11 696
0.10 559
1 558
0.9 526
0.8 512
0.7 506
0.6 426
0.5 231
0.4 93
0.3 67
0.2 2
0.1 1
0 0

Index(['0', '0.1', '0.2', '0.64', '0.87', '0.218', '0.406', '0.479', '0.485',
       '0.499', '0.530', '0.531', '0.658', '0.751', '0.891'],
      dtype='object')

(8514, 1025)

Recall Training data:      0.6887
Precision Training data:   0.9188
----------------------------------------------------------------------
Recall Test data:          0.3958
Precision Test data:       0.6196
----------------------------------------------------------------------
Confusion Matrix Test data
[[1524   35]
 [  87   57]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1559
           1       0.62      0.40      0.48       144

    accuracy                           0.93      1703
   macro avg       0.78      0.69      0.72      1703
weighted avg       0.92      0.93      0.92      1703

df.columns

Index(['0', '0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9',
       ...
       '0.962', '0.963', '0.964', '0.965', '0.966', '0.967', '0.968', '0.969',
       '0.970', 'negative'],
      dtype='object', length=1025)

Encodes the resulting value¶

df['negative'] = pd.Categorical(df['negative']).codes
df['negative'].value_counts()

0    7795
1     719
Name: negative, dtype: int64

df.rename(columns={'negative':'ident'}, inplace=True)
df['ident'].head(2)

0    0
1    0
Name: ident, dtype: int8

Step Forward Selection¶

X = df.drop('ident', axis=1) 
y = df['ident']  

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
# Jeżeli się rzuca wtedy wycinamy stratify=y.

I specify how many programs should indicate the best variables:

¶

k_features = 15

from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

LR = LogisticRegression()

sfs1 = sfs(LR,k_features = k_features, forward=True, floating=False, scoring='r2',verbose=2,cv=5)
sfs1 = sfs1.fit(X_train,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1024 out of 1024 | elapsed:   39.9s finished

[2020-04-01 09:39:39] Features: 1/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1023 out of 1023 | elapsed:   43.9s finished

[2020-04-01 09:40:23] Features: 2/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1022 out of 1022 | elapsed:   46.6s finished

[2020-04-01 09:41:09] Features: 3/15 -- score: -0.09360936522977936[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1021 out of 1021 | elapsed:   50.6s finished

[2020-04-01 09:42:00] Features: 4/15 -- score: -0.0917097195802096[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1020 out of 1020 | elapsed:   51.1s finished

[2020-04-01 09:42:51] Features: 5/15 -- score: -0.06352720295000061[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1019 out of 1019 | elapsed:   54.3s finished

[2020-04-01 09:43:45] Features: 6/15 -- score: -0.050378334885553946[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1018 out of 1018 | elapsed:   58.9s finished

[2020-04-01 09:44:44] Features: 7/15 -- score: -0.035404146912808354[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1017 out of 1017 | elapsed:  1.0min finished

[2020-04-01 09:45:45] Features: 8/15 -- score: -0.014731021991353432[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1016 out of 1016 | elapsed:  1.0min finished

[2020-04-01 09:46:46] Features: 9/15 -- score: 0.007752557690146133[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1015 out of 1015 | elapsed:  1.1min finished

[2020-04-01 09:47:49] Features: 10/15 -- score: 0.034005698374276985[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1014 out of 1014 | elapsed:  1.1min finished

[2020-04-01 09:48:54] Features: 11/15 -- score: 0.0415299552312852[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1013 out of 1013 | elapsed:  1.3min finished

[2020-04-01 09:50:10] Features: 12/15 -- score: 0.050864666848338125[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1012 out of 1012 | elapsed:  1.3min finished

[2020-04-01 09:51:28] Features: 13/15 -- score: 0.058403788853600556[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1011 out of 1011 | elapsed:  1.3min finished

[2020-04-01 09:52:48] Features: 14/15 -- score: 0.062143619559723404[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 1010 out of 1010 | elapsed:  1.3min finished

[2020-04-01 09:54:09] Features: 15/15 -- score: 0.06776823076716185

[0, 1, 2, 67, 93, 231, 426, 506, 512, 526, 558, 559, 696, 795, 939]

0.13 939
0.12 795
0.11 696
0.10 559
1 558
0.9 526
0.8 512
0.7 506
0.6 426
0.5 231
0.4 93
0.3 67
0.2 2
0.1 1
0 0

Index(['0', '0.1', '0.2', '0.64', '0.87', '0.218', '0.406', '0.479', '0.485',
       '0.499', '0.530', '0.531', '0.658', '0.751', '0.891'],
      dtype='object')

(8514, 1025)

Recall Training data:      0.6887
Precision Training data:   0.9188
----------------------------------------------------------------------
Recall Test data:          0.3958
Precision Test data:       0.6196
----------------------------------------------------------------------
Confusion Matrix Test data
[[1524   35]
 [  87   57]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1559
           1       0.62      0.40      0.48       144

    accuracy                           0.93      1703
   macro avg       0.78      0.69      0.72      1703
weighted avg       0.92      0.93      0.92      1703

auc 0.6866915223433825

(8514, 16)

Recall Training data:      0.2104
Precision Training data:   0.7707
----------------------------------------------------------------------
Recall Test data:          0.1458
Precision Test data:       0.6562
----------------------------------------------------------------------
Confusion Matrix Test data
[[1548   11]
 [ 123   21]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      1559
           1       0.66      0.15      0.24       144

    accuracy                           0.92      1703
   macro avg       0.79      0.57      0.60      1703
weighted avg       0.90      0.92      0.90      1703

auc 0.569388764165063

feat_cols =list(sfs1.k_feature_idx_)
print(feat_cols)

[0, 1, 2, 67, 93, 231, 426, 506, 512, 526, 558, 559, 696, 795, 939]

0.13 939
0.12 795
0.11 696
0.10 559
1 558
0.9 526
0.8 512
0.7 506
0.6 426
0.5 231
0.4 93
0.3 67
0.2 2
0.1 1
0 0

Index(['0', '0.1', '0.2', '0.64', '0.87', '0.218', '0.406', '0.479', '0.485',
       '0.499', '0.530', '0.531', '0.658', '0.751', '0.891'],
      dtype='object')

(8514, 1025)

Recall Training data:      0.6887
Precision Training data:   0.9188
----------------------------------------------------------------------
Recall Test data:          0.3958
Precision Test data:       0.6196
----------------------------------------------------------------------
Confusion Matrix Test data
[[1524   35]
 [  87   57]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1559
           1       0.62      0.40      0.48       144

    accuracy                           0.93      1703
   macro avg       0.78      0.69      0.72      1703
weighted avg       0.92      0.93      0.92      1703

auc 0.6866915223433825

(8514, 16)

Recall Training data:      0.2104
Precision Training data:   0.7707
----------------------------------------------------------------------
Recall Test data:          0.1458
Precision Test data:       0.6562
----------------------------------------------------------------------
Confusion Matrix Test data
[[1548   11]
 [ 123   21]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      1559
           1       0.66      0.15      0.24       144

    accuracy                           0.92      1703
   macro avg       0.79      0.57      0.60      1703
weighted avg       0.90      0.92      0.90      1703

auc 0.569388764165063

PPS = feat_cols

KOT_lasso = dict(zip(df, PPS))
KOT_sorted_keys_lasso = sorted(KOT_lasso, key=KOT_lasso.get, reverse=True)

for r in KOT_sorted_keys_lasso:
    print (r, (KOT_lasso[r]))

0.13 939
0.12 795
0.11 696
0.10 559
1 558
0.9 526
0.8 512
0.7 506
0.6 426
0.5 231
0.4 93
0.3 67
0.2 2
0.1 1
0 0

Index(['0', '0.1', '0.2', '0.64', '0.87', '0.218', '0.406', '0.479', '0.485',
       '0.499', '0.530', '0.531', '0.658', '0.751', '0.891'],
      dtype='object')

(8514, 1025)

Recall Training data:      0.6887
Precision Training data:   0.9188
----------------------------------------------------------------------
Recall Test data:          0.3958
Precision Test data:       0.6196
----------------------------------------------------------------------
Confusion Matrix Test data
[[1524   35]
 [  87   57]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1559
           1       0.62      0.40      0.48       144

    accuracy                           0.93      1703
   macro avg       0.78      0.69      0.72      1703
weighted avg       0.92      0.93      0.92      1703

auc 0.6866915223433825

(8514, 16)

Recall Training data:      0.2104
Precision Training data:   0.7707
----------------------------------------------------------------------
Recall Test data:          0.1458
Precision Test data:       0.6562
----------------------------------------------------------------------
Confusion Matrix Test data
[[1548   11]
 [ 123   21]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      1559
           1       0.66      0.15      0.24       144

    accuracy                           0.92      1703
   macro avg       0.79      0.57      0.60      1703
weighted avg       0.90      0.92      0.90      1703

auc 0.569388764165063

new_cols = df.columns[feat_cols]
new_cols

Index(['0', '0.1', '0.2', '0.64', '0.87', '0.218', '0.406', '0.479', '0.485',
       '0.499', '0.530', '0.531', '0.658', '0.751', '0.891'],
      dtype='object')

Creates a dataset with reduced columns¶

df2 = df[new_cols]
df2['ident']=df['ident']
df2.head(3)

# Classification Assessment
def Classification_Assessment(model ,Xtrain, ytrain, Xtest, ytest, y_pred):
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
    from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

    print("Recall Training data:     ", np.round(recall_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Precision Training data:  ", np.round(precision_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Recall Test data:         ", np.round(recall_score(ytest, model.predict(Xtest)), decimals=4)) 
    print("Precision Test data:      ", np.round(precision_score(ytest, model.predict(Xtest)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Confusion Matrix Test data")
    print(confusion_matrix(ytest, model.predict(Xtest)))
    print("----------------------------------------------------------------------")
    print(classification_report(ytest, model.predict(Xtest)))
    
    y_pred_proba = model.predict_proba(Xtest)[::,1]
    fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred)
    auc = metrics.roc_auc_score(ytest, y_pred)
    plt.plot(fpr, tpr, label='Logistic Regression (auc = 
    plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
    plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.legend(loc=4)
    plt.plot([0, 1], [0, 1],'r--')
    plt.show()
    print('auc',auc)

Logistic regression model for variables before reduction¶

blue(df.shape)

(8514, 1025)

Recall Training data:      0.6887
Precision Training data:   0.9188
----------------------------------------------------------------------
Recall Test data:          0.3958
Precision Test data:       0.6196
----------------------------------------------------------------------
Confusion Matrix Test data
[[1524   35]
 [  87   57]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1559
           1       0.62      0.40      0.48       144

    accuracy                           0.93      1703
   macro avg       0.78      0.69      0.72      1703
weighted avg       0.92      0.93      0.92      1703

auc 0.6866915223433825

(8514, 16)

Recall Training data:      0.2104
Precision Training data:   0.7707
----------------------------------------------------------------------
Recall Test data:          0.1458
Precision Test data:       0.6562
----------------------------------------------------------------------
Confusion Matrix Test data
[[1548   11]
 [ 123   21]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      1559
           1       0.66      0.15      0.24       144

    accuracy                           0.92      1703
   macro avg       0.79      0.57      0.60      1703
weighted avg       0.90      0.92      0.90      1703

auc 0.569388764165063

X1 = df.drop('ident', axis=1) 
y1 = df['ident']  


from sklearn.model_selection import train_test_split 

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.20, random_state=123,stratify=y1)

from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X1_train,y1_train)
y1_pred = logmodel.predict(X1_test)

Classification_Assessment(logmodel ,X1_train, y1_train, X1_test, y1_test, y1_pred)

Recall Training data:      0.6887
Precision Training data:   0.9188
----------------------------------------------------------------------
Recall Test data:          0.3958
Precision Test data:       0.6196
----------------------------------------------------------------------
Confusion Matrix Test data
[[1524   35]
 [  87   57]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1559
           1       0.62      0.40      0.48       144

    accuracy                           0.93      1703
   macro avg       0.78      0.69      0.72      1703
weighted avg       0.92      0.93      0.92      1703

auc 0.6866915223433825

(8514, 16)

Recall Training data:      0.2104
Precision Training data:   0.7707
----------------------------------------------------------------------
Recall Test data:          0.1458
Precision Test data:       0.6562
----------------------------------------------------------------------
Confusion Matrix Test data
[[1548   11]
 [ 123   21]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      1559
           1       0.66      0.15      0.24       144

    accuracy                           0.92      1703
   macro avg       0.79      0.57      0.60      1703
weighted avg       0.90      0.92      0.90      1703

auc 0.569388764165063

Logistic regression model for variables after reduction¶

blue(df2.shape)

(8514, 16)

Recall Training data:      0.2104
Precision Training data:   0.7707
----------------------------------------------------------------------
Recall Test data:          0.1458
Precision Test data:       0.6562
----------------------------------------------------------------------
Confusion Matrix Test data
[[1548   11]
 [ 123   21]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      1559
           1       0.66      0.15      0.24       144

    accuracy                           0.92      1703
   macro avg       0.79      0.57      0.60      1703
weighted avg       0.90      0.92      0.90      1703

auc 0.569388764165063

X2 = df2.drop('ident', axis=1) 
y2 = df2['ident']  


from sklearn.model_selection import train_test_split 

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.20, random_state=123,stratify=y2)

from sklearn.linear_model import LogisticRegression

logmodel2 = LogisticRegression()
logmodel2.fit(X2_train,y2_train)
y2_pred = logmodel2.predict(X2_test)

Classification_Assessment(logmodel2 ,X2_train, y2_train, X2_test, y2_test, y2_pred)

Recall Training data:      0.2104
Precision Training data:   0.7707
----------------------------------------------------------------------
Recall Test data:          0.1458
Precision Test data:       0.6562
----------------------------------------------------------------------
Confusion Matrix Test data
[[1548   11]
 [ 123   21]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      1559
           1       0.66      0.15      0.24       144

    accuracy                           0.92      1703
   macro avg       0.79      0.57      0.60      1703
weighted avg       0.90      0.92      0.90      1703

auc 0.569388764165063

RFECV differs from Recursive Feature Elimination (RFE) in the function selection process in that it indicates the OPTIMAL NUMBER OF VARIABLES and not the designated number of best variables.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)

##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')

data source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)

(569, 33)

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

OPTIMAL Number of selected functions:   15

The mask of selected features:  [ True  True  True False False  True False  True False  True False  True
 False False False False False False False  True  True  True False False
  True  True  True  True False  True False False]

The feature ranking: [ 1  1  1  2  9  1  3  1 13  1  7  1 11 17  8 10 16 14 18  1  1  1 12  6
  1  1  1  1  5  1 15  4]

The external estimator: SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='linear', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)
Optimal number of features : 15

symmetry_se 18
area_se 17
concavity_se 16
concave_points_worst 15
concave points_se 14
symmetry_mean 13
perimeter_worst 12
perimeter_se 11
compactness_se 10
smoothness_mean 9
smoothness_se 8
radius_se 7
area_worst 6
symmetry_worst 5
concave_points_se 4
concavity_mean 3
area_mean 2
radius_mean 1
texture_mean 1
perimeter_mean 1
compactness_mean 1
concave points_mean 1
fractal_dimension_mean 1
texture_se 1
fractal_dimension_se 1
radius_worst 1
texture_worst 1
smoothness_worst 1
compactness_worst 1
concavity_worst 1
concave points_worst 1
fractal_dimension_worst 1

Deleting unneeded columns¶

df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']

df.isnull().sum()

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

OPTIMAL Number of selected functions:   15

The mask of selected features:  [ True  True  True False False  True False  True False  True False  True
 False False False False False False False  True  True  True False False
  True  True  True  True False  True False False]

The feature ranking: [ 1  1  1  2  9  1  3  1 13  1  7  1 11 17  8 10 16 14 18  1  1  1 12  6
  1  1  1  1  5  1 15  4]

The external estimator: SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='linear', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)
Optimal number of features : 15

symmetry_se 18
area_se 17
concavity_se 16
concave_points_worst 15
concave points_se 14
symmetry_mean 13
perimeter_worst 12
perimeter_se 11
compactness_se 10
smoothness_mean 9
smoothness_se 8
radius_se 7
area_worst 6
symmetry_worst 5
concave_points_se 4
concavity_mean 3
area_mean 2
radius_mean 1
texture_mean 1
perimeter_mean 1
compactness_mean 1
concave points_mean 1
fractal_dimension_mean 1
texture_se 1
fractal_dimension_se 1
radius_worst 1
texture_worst 1
smoothness_worst 1
compactness_worst 1
concavity_worst 1
concave points_worst 1
fractal_dimension_worst 1

(569, 15)

(569, 33)

Deletes duplicates¶

there were no duplicates

green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

OPTIMAL Number of selected functions:   15

The mask of selected features:  [ True  True  True False False  True False  True False  True False  True
 False False False False False False False  True  True  True False False
  True  True  True  True False  True False False]

The feature ranking: [ 1  1  1  2  9  1  3  1 13  1  7  1 11 17  8 10 16 14 18  1  1  1 12  6
  1  1  1  1  5  1 15  4]

The external estimator: SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='linear', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)
Optimal number of features : 15

symmetry_se 18
area_se 17
concavity_se 16
concave_points_worst 15
concave points_se 14
symmetry_mean 13
perimeter_worst 12
perimeter_se 11
compactness_se 10
smoothness_mean 9
smoothness_se 8
radius_se 7
area_worst 6
symmetry_worst 5
concave_points_se 4
concavity_mean 3
area_mean 2
radius_mean 1
texture_mean 1
perimeter_mean 1
compactness_mean 1
concave points_mean 1
fractal_dimension_mean 1
texture_se 1
fractal_dimension_se 1
radius_worst 1
texture_worst 1
smoothness_worst 1
compactness_worst 1
concavity_worst 1
concave points_worst 1
fractal_dimension_worst 1

(569, 15)

(569, 33)

R2: 0.980200

blue(df.dtypes)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

OPTIMAL Number of selected functions:   15

The mask of selected features:  [ True  True  True False False  True False  True False  True False  True
 False False False False False False False  True  True  True False False
  True  True  True  True False  True False False]

The feature ranking: [ 1  1  1  2  9  1  3  1 13  1  7  1 11 17  8 10 16 14 18  1  1  1 12  6
  1  1  1  1  5  1 15  4]

The external estimator: SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='linear', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)
Optimal number of features : 15

symmetry_se 18
area_se 17
concavity_se 16
concave_points_worst 15
concave points_se 14
symmetry_mean 13
perimeter_worst 12
perimeter_se 11
compactness_se 10
smoothness_mean 9
smoothness_se 8
radius_se 7
area_worst 6
symmetry_worst 5
concave_points_se 4
concavity_mean 3
area_mean 2
radius_mean 1
texture_mean 1
perimeter_mean 1
compactness_mean 1
concave points_mean 1
fractal_dimension_mean 1
texture_se 1
fractal_dimension_se 1
radius_worst 1
texture_worst 1
smoothness_worst 1
compactness_worst 1
concavity_worst 1
concave points_worst 1
fractal_dimension_worst 1

(569, 15)

(569, 33)

R2: 0.980200

(569, 16)

df.columns

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose the continuous variable – compactness_mean¶

print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))

max: 0.3454
min: 0.01938

OPTIMAL Number of selected functions:   15

The mask of selected features:  [ True  True  True False False  True False  True False  True False  True
 False False False False False False False  True  True  True False False
  True  True  True  True False  True False False]

The feature ranking: [ 1  1  1  2  9  1  3  1 13  1  7  1 11 17  8 10 16 14 18  1  1  1 12  6
  1  1  1  1  5  1 15  4]

The external estimator: SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='linear', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)
Optimal number of features : 15

symmetry_se 18
area_se 17
concavity_se 16
concave_points_worst 15
concave points_se 14
symmetry_mean 13
perimeter_worst 12
perimeter_se 11
compactness_se 10
smoothness_mean 9
smoothness_se 8
radius_se 7
area_worst 6
symmetry_worst 5
concave_points_se 4
concavity_mean 3
area_mean 2
radius_mean 1
texture_mean 1
perimeter_mean 1
compactness_mean 1
concave points_mean 1
fractal_dimension_mean 1
texture_se 1
fractal_dimension_se 1
radius_worst 1
texture_worst 1
smoothness_worst 1
compactness_worst 1
concavity_worst 1
concave points_worst 1
fractal_dimension_worst 1

(569, 15)

(569, 33)

R2: 0.980200

(569, 16)

R2: 0.960155
The reduction of dimensions caused the deterioration of the models properties

Recursive Feature Elimination and cross-validated selection (RFECV)¶

X = df.drop('compactness_mean', axis=1) 
y = df['compactness_mean']

I set the minimum number of variables that will remain in the model¶

min_v = 2

from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

estimator = SVR(kernel="linear")
RCV = RFECV(estimator, step=1,min_features_to_select=min_v, cv=5)
RCV = RCV.fit(X, y)
RCV.support_


print('The mask of selected features: ',RCV.support_)
print()
print('The feature ranking:',RCV.ranking_)
print()
print('The external estimator:',RCV.estimator_)



print("Optimal number of features : 

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(RCV.grid_scores_) + 1), RCV.grid_scores_)
plt.show()

OPTIMAL Number of selected functions:   15

The mask of selected features:  [ True  True  True False False  True False  True False  True False  True
 False False False False False False False  True  True  True False False
  True  True  True  True False  True False False]

The feature ranking: [ 1  1  1  2  9  1  3  1 13  1  7  1 11 17  8 10 16 14 18  1  1  1 12  6
  1  1  1  1  5  1 15  4]

The external estimator: SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='linear', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)
Optimal number of features : 15

symmetry_se 18
area_se 17
concavity_se 16
concave_points_worst 15
concave points_se 14
symmetry_mean 13
perimeter_worst 12
perimeter_se 11
compactness_se 10
smoothness_mean 9
smoothness_se 8
radius_se 7
area_worst 6
symmetry_worst 5
concave_points_se 4
concavity_mean 3
area_mean 2
radius_mean 1
texture_mean 1
perimeter_mean 1
compactness_mean 1
concave points_mean 1
fractal_dimension_mean 1
texture_se 1
fractal_dimension_se 1
radius_worst 1
texture_worst 1
smoothness_worst 1
compactness_worst 1
concavity_worst 1
concave points_worst 1
fractal_dimension_worst 1

(569, 15)

(569, 33)

R2: 0.980200

(569, 16)

R2: 0.960155
The reduction of dimensions caused the deterioration of the models properties

The RFECV algorithm checked all combinations and showed on the graph that the number of 15 variables was optimal.

Metoda zip na wyświetlenie rankingu cech¶

PPS = RCV.ranking_

KOT_MIC = dict(zip(df, PPS))
KOT_sorted_keys_MIC = sorted(KOT_MIC, key=KOT_MIC.get, reverse=True)

for r in KOT_sorted_keys_MIC:
    print (r, KOT_MIC[r])

symmetry_se 18
area_se 17
concavity_se 16
concave_points_worst 15
concave points_se 14
symmetry_mean 13
perimeter_worst 12
perimeter_se 11
compactness_se 10
smoothness_mean 9
smoothness_se 8
radius_se 7
area_worst 6
symmetry_worst 5
concave_points_se 4
concavity_mean 3
area_mean 2
radius_mean 1
texture_mean 1
perimeter_mean 1
compactness_mean 1
concave points_mean 1
fractal_dimension_mean 1
texture_se 1
fractal_dimension_se 1
radius_worst 1
texture_worst 1
smoothness_worst 1
compactness_worst 1
concavity_worst 1
concave points_worst 1
fractal_dimension_worst 1

(569, 15)

(569, 33)

R2: 0.980200

(569, 16)

R2: 0.960155
The reduction of dimensions caused the deterioration of the models properties

new_cols = X.columns[RCV.support_]

df2 = df[new_cols]
blue(df2.shape)
df2.head(3)

(569, 15)

(569, 33)

R2: 0.980200

(569, 16)

R2: 0.960155
The reduction of dimensions caused the deterioration of the models properties

We’re adding a result variable¶

df2['compactness_mean'] = df['compactness_mean']
df2.head(3)

The Backward Elimination algorithm stated that reducing variables does not improve the model. Therefore, the number of variables was left unchanged.

OLS linear regression model for variables before reduction¶

blue(df.shape)

(569, 33)

R2: 0.980200

(569, 16)

R2: 0.960155
The reduction of dimensions caused the deterioration of the models properties

X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())

R2: 0.980200

(569, 16)

R2: 0.960155
The reduction of dimensions caused the deterioration of the models properties

OLS linear regression model for variables after reduction¶

blue(df2.shape)

(569, 16)

R2: 0.960155
The reduction of dimensions caused the deterioration of the models properties

X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')

R2: 0.960155
The reduction of dimensions caused the deterioration of the models properties

Embedded methods are iterative in a sense that takes care of each iteration of the model training process and carefully extract those features which contribute the most to the training for a particular iteration. Regularization methods are the most commonly used embedded methods which penalize a feature given a coefficient threshold. Here we will do feature selection using Lasso regularization. If the feature is irrelevant, lasso penalizes its coefficient and make it 0. Hence the features with coefficient = 0 are removed and the rest are taken.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)

##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')

data source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)

(569, 33)

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 2.11821738e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 8.17079026e-04 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]

0.015845670027763575

0.3452546166160324

texture_worst 0.0008170790257354554
perimeter_se 2.118217382166424e-05
radius_mean 0.0
texture_mean 0.0
perimeter_mean 0.0
area_mean 0.0
smoothness_mean 0.0
compactness_mean 0.0
concavity_mean 0.0
concave points_mean 0.0
symmetry_mean 0.0
fractal_dimension_mean 0.0
radius_se 0.0
texture_se 0.0
area_se 0.0
smoothness_se 0.0
compactness_se 0.0
concavity_se 0.0
concave points_se 0.0
symmetry_se 0.0
fractal_dimension_se 0.0
radius_worst 0.0
perimeter_worst 0.0
area_worst 0.0
smoothness_worst 0.0
compactness_worst 0.0
concavity_worst 0.0
concave points_worst 0.0
symmetry_worst 0.0
fractal_dimension_worst 0.0
concave_points_worst 0.0
concave_points_se 0.0

Deleting unneeded columns¶

df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']

df.isnull().sum()

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 2.11821738e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 8.17079026e-04 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]

0.015845670027763575

0.3452546166160324

texture_worst 0.0008170790257354554
perimeter_se 2.118217382166424e-05
radius_mean 0.0
texture_mean 0.0
perimeter_mean 0.0
area_mean 0.0
smoothness_mean 0.0
compactness_mean 0.0
concavity_mean 0.0
concave points_mean 0.0
symmetry_mean 0.0
fractal_dimension_mean 0.0
radius_se 0.0
texture_se 0.0
area_se 0.0
smoothness_se 0.0
compactness_se 0.0
concavity_se 0.0
concave points_se 0.0
symmetry_se 0.0
fractal_dimension_se 0.0
radius_worst 0.0
perimeter_worst 0.0
area_worst 0.0
smoothness_worst 0.0
compactness_worst 0.0
concavity_worst 0.0
concave points_worst 0.0
symmetry_worst 0.0
fractal_dimension_worst 0.0
concave_points_worst 0.0
concave_points_se 0.0

(569, 33)

R2: 0.980200

Deletes duplicates¶

there were no duplicates

green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 2.11821738e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 8.17079026e-04 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]

0.015845670027763575

0.3452546166160324

texture_worst 0.0008170790257354554
perimeter_se 2.118217382166424e-05
radius_mean 0.0
texture_mean 0.0
perimeter_mean 0.0
area_mean 0.0
smoothness_mean 0.0
compactness_mean 0.0
concavity_mean 0.0
concave points_mean 0.0
symmetry_mean 0.0
fractal_dimension_mean 0.0
radius_se 0.0
texture_se 0.0
area_se 0.0
smoothness_se 0.0
compactness_se 0.0
concavity_se 0.0
concave points_se 0.0
symmetry_se 0.0
fractal_dimension_se 0.0
radius_worst 0.0
perimeter_worst 0.0
area_worst 0.0
smoothness_worst 0.0
compactness_worst 0.0
concavity_worst 0.0
concave points_worst 0.0
symmetry_worst 0.0
fractal_dimension_worst 0.0
concave_points_worst 0.0
concave_points_se 0.0

(569, 33)

R2: 0.980200

(569, 3)

blue(df.dtypes)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 2.11821738e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 8.17079026e-04 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]

0.015845670027763575

0.3452546166160324

texture_worst 0.0008170790257354554
perimeter_se 2.118217382166424e-05
radius_mean 0.0
texture_mean 0.0
perimeter_mean 0.0
area_mean 0.0
smoothness_mean 0.0
compactness_mean 0.0
concavity_mean 0.0
concave points_mean 0.0
symmetry_mean 0.0
fractal_dimension_mean 0.0
radius_se 0.0
texture_se 0.0
area_se 0.0
smoothness_se 0.0
compactness_se 0.0
concavity_se 0.0
concave points_se 0.0
symmetry_se 0.0
fractal_dimension_se 0.0
radius_worst 0.0
perimeter_worst 0.0
area_worst 0.0
smoothness_worst 0.0
compactness_worst 0.0
concavity_worst 0.0
concave points_worst 0.0
symmetry_worst 0.0
fractal_dimension_worst 0.0
concave_points_worst 0.0
concave_points_se 0.0

(569, 33)

R2: 0.980200

(569, 3)

R2: 0.321180
The R2 coefficient is approximately similar to the previously calculated clf.score (X, y).

df.columns

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose the continuous variable – compactness_mean¶

print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))

max: 0.3454
min: 0.01938

[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 2.11821738e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 8.17079026e-04 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]

0.015845670027763575

0.3452546166160324

texture_worst 0.0008170790257354554
perimeter_se 2.118217382166424e-05
radius_mean 0.0
texture_mean 0.0
perimeter_mean 0.0
area_mean 0.0
smoothness_mean 0.0
compactness_mean 0.0
concavity_mean 0.0
concave points_mean 0.0
symmetry_mean 0.0
fractal_dimension_mean 0.0
radius_se 0.0
texture_se 0.0
area_se 0.0
smoothness_se 0.0
compactness_se 0.0
concavity_se 0.0
concave points_se 0.0
symmetry_se 0.0
fractal_dimension_se 0.0
radius_worst 0.0
perimeter_worst 0.0
area_worst 0.0
smoothness_worst 0.0
compactness_worst 0.0
concavity_worst 0.0
concave points_worst 0.0
symmetry_worst 0.0
fractal_dimension_worst 0.0
concave_points_worst 0.0
concave_points_se 0.0

(569, 33)

R2: 0.980200

(569, 3)

R2: 0.321180
The R2 coefficient is approximately similar to the previously calculated clf.score (X, y).

Lasso¶

X = df.drop('compactness_mean', axis=1) 
y = df['compactness_mean']

I set the number of variables that will remain in the model¶

Num_v = 15

from sklearn import linear_model

#rlasso = RandomizedLasso(alpha=0.025)

# Standaryzacja zmiennych

clf = linear_model.Lasso(alpha=0.1, positive=True)
clf.fit(X, y)


blue(clf.coef_)
print()
green(clf.intercept_)
print()
red(clf.score(X,y))

[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 2.11821738e-05 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 8.17079026e-04 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]

0.015845670027763575

0.3452546166160324

texture_worst 0.0008170790257354554
perimeter_se 2.118217382166424e-05
radius_mean 0.0
texture_mean 0.0
perimeter_mean 0.0
area_mean 0.0
smoothness_mean 0.0
compactness_mean 0.0
concavity_mean 0.0
concave points_mean 0.0
symmetry_mean 0.0
fractal_dimension_mean 0.0
radius_se 0.0
texture_se 0.0
area_se 0.0
smoothness_se 0.0
compactness_se 0.0
concavity_se 0.0
concave points_se 0.0
symmetry_se 0.0
fractal_dimension_se 0.0
radius_worst 0.0
perimeter_worst 0.0
area_worst 0.0
smoothness_worst 0.0
compactness_worst 0.0
concavity_worst 0.0
concave points_worst 0.0
symmetry_worst 0.0
fractal_dimension_worst 0.0
concave_points_worst 0.0
concave_points_se 0.0

(569, 33)

R2: 0.980200

(569, 3)

R2: 0.321180
The R2 coefficient is approximately similar to the previously calculated clf.score (X, y).

The positive parameter, which on Truei forces the coefficients to be positive. In addition, setting alpha regularization to a value close to 0 (i.e., 0.001) causes Lasso to mimic linear regression without regularization.

Metoda zip na wyświetlenie rankingu cech¶

PPS = clf.coef_

KOT_lasso = dict(zip(df, PPS))
KOT_sorted_keys_lasso = sorted(KOT_lasso, key=KOT_lasso.get, reverse=True)

for r in KOT_sorted_keys_lasso:
    print (r, (KOT_lasso[r]))

texture_worst 0.0008170790257354554
perimeter_se 2.118217382166424e-05
radius_mean 0.0
texture_mean 0.0
perimeter_mean 0.0
area_mean 0.0
smoothness_mean 0.0
compactness_mean 0.0
concavity_mean 0.0
concave points_mean 0.0
symmetry_mean 0.0
fractal_dimension_mean 0.0
radius_se 0.0
texture_se 0.0
area_se 0.0
smoothness_se 0.0
compactness_se 0.0
concavity_se 0.0
concave points_se 0.0
symmetry_se 0.0
fractal_dimension_se 0.0
radius_worst 0.0
perimeter_worst 0.0
area_worst 0.0
smoothness_worst 0.0
compactness_worst 0.0
concavity_worst 0.0
concave points_worst 0.0
symmetry_worst 0.0
fractal_dimension_worst 0.0
concave_points_worst 0.0
concave_points_se 0.0

(569, 33)

R2: 0.980200

(569, 3)

R2: 0.321180
The R2 coefficient is approximately similar to the previously calculated clf.score (X, y).

We’re adding a result variable¶

df2 = df[['compactness_mean','texture_worst','perimeter_se']]
df2.head(3)

The Backward Elimination algorithm stated that reducing variables does not improve the model. Therefore, the number of variables was left unchanged.

OLS linear regression model for variables before reduction¶

blue(df.shape)

(569, 33)

R2: 0.980200

(569, 3)

R2: 0.321180
The R2 coefficient is approximately similar to the previously calculated clf.score (X, y).

X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())

R2: 0.980200

(569, 3)

R2: 0.321180
The R2 coefficient is approximately similar to the previously calculated clf.score (X, y).

OLS linear regression model for variables after reduction¶

blue(df2.shape)

(569, 3)

R2: 0.321180
The R2 coefficient is approximately similar to the previously calculated clf.score (X, y).

X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The R2 coefficient is approximately similar to the previously calculated clf.score (X, y).')

R2: 0.321180
The R2 coefficient is approximately similar to the previously calculated clf.score (X, y).

300320201719

It is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each iteration. It constructs the next model with the left features until all the features are exhausted. It then ranks the features based on the order of their elimination.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)

##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')

data source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)

(569, 33)

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

Number of selected functions:   15

The mask of selected features:  [False False False False  True  True  True False  True False False False
 False  True  True False  True  True  True False False False False  True
  True  True False False  True False  True  True]

The feature ranking: [ 3 14  4 15  1  1  1  2  1  7 11  8 16  1  1  6  1  1  1 12 17 13 18  1
  1  1 10  5  1  9  1  1]

The external estimator: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

perimeter_worst 18
radius_worst 17
perimeter_se 16
area_mean 15
texture_mean 14
texture_worst 13
fractal_dimension_se 12
radius_se 11
concavity_worst 10
fractal_dimension_worst 9
texture_se 8
fractal_dimension_mean 7
compactness_se 6
concave points_worst 5
perimeter_mean 4
radius_mean 3
concave points_mean 2
smoothness_mean 1
compactness_mean 1
concavity_mean 1
symmetry_mean 1
area_se 1
smoothness_se 1
concavity_se 1
concave points_se 1
symmetry_se 1
area_worst 1
smoothness_worst 1
compactness_worst 1
symmetry_worst 1
concave_points_worst 1
concave_points_se 1

Deleting unneeded columns¶

df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']

df.isnull().sum()

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

Number of selected functions:   15

The mask of selected features:  [False False False False  True  True  True False  True False False False
 False  True  True False  True  True  True False False False False  True
  True  True False False  True False  True  True]

The feature ranking: [ 3 14  4 15  1  1  1  2  1  7 11  8 16  1  1  6  1  1  1 12 17 13 18  1
  1  1 10  5  1  9  1  1]

The external estimator: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

perimeter_worst 18
radius_worst 17
perimeter_se 16
area_mean 15
texture_mean 14
texture_worst 13
fractal_dimension_se 12
radius_se 11
concavity_worst 10
fractal_dimension_worst 9
texture_se 8
fractal_dimension_mean 7
compactness_se 6
concave points_worst 5
perimeter_mean 4
radius_mean 3
concave points_mean 2
smoothness_mean 1
compactness_mean 1
concavity_mean 1
symmetry_mean 1
area_se 1
smoothness_se 1
concavity_se 1
concave points_se 1
symmetry_se 1
area_worst 1
smoothness_worst 1
compactness_worst 1
symmetry_worst 1
concave_points_worst 1
concave_points_se 1

(569, 33)

R2: 0.980200

Deletes duplicates¶

there were no duplicates

green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

Number of selected functions:   15

The mask of selected features:  [False False False False  True  True  True False  True False False False
 False  True  True False  True  True  True False False False False  True
  True  True False False  True False  True  True]

The feature ranking: [ 3 14  4 15  1  1  1  2  1  7 11  8 16  1  1  6  1  1  1 12 17 13 18  1
  1  1 10  5  1  9  1  1]

The external estimator: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

perimeter_worst 18
radius_worst 17
perimeter_se 16
area_mean 15
texture_mean 14
texture_worst 13
fractal_dimension_se 12
radius_se 11
concavity_worst 10
fractal_dimension_worst 9
texture_se 8
fractal_dimension_mean 7
compactness_se 6
concave points_worst 5
perimeter_mean 4
radius_mean 3
concave points_mean 2
smoothness_mean 1
compactness_mean 1
concavity_mean 1
symmetry_mean 1
area_se 1
smoothness_se 1
concavity_se 1
concave points_se 1
symmetry_se 1
area_worst 1
smoothness_worst 1
compactness_worst 1
symmetry_worst 1
concave_points_worst 1
concave_points_se 1

(569, 33)

R2: 0.980200

(569, 16)

blue(df.dtypes)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

Number of selected functions:   15

The mask of selected features:  [False False False False  True  True  True False  True False False False
 False  True  True False  True  True  True False False False False  True
  True  True False False  True False  True  True]

The feature ranking: [ 3 14  4 15  1  1  1  2  1  7 11  8 16  1  1  6  1  1  1 12 17 13 18  1
  1  1 10  5  1  9  1  1]

The external estimator: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

perimeter_worst 18
radius_worst 17
perimeter_se 16
area_mean 15
texture_mean 14
texture_worst 13
fractal_dimension_se 12
radius_se 11
concavity_worst 10
fractal_dimension_worst 9
texture_se 8
fractal_dimension_mean 7
compactness_se 6
concave points_worst 5
perimeter_mean 4
radius_mean 3
concave points_mean 2
smoothness_mean 1
compactness_mean 1
concavity_mean 1
symmetry_mean 1
area_se 1
smoothness_se 1
concavity_se 1
concave points_se 1
symmetry_se 1
area_worst 1
smoothness_worst 1
compactness_worst 1
symmetry_worst 1
concave_points_worst 1
concave_points_se 1

(569, 33)

R2: 0.980200

(569, 16)

R2: 0.960830
The reduction of dimensions caused the deterioration of the models properties

df.columns

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose the continuous variable – compactness_mean¶

print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))

max: 0.3454
min: 0.01938

Number of selected functions:   15

The mask of selected features:  [False False False False  True  True  True False  True False False False
 False  True  True False  True  True  True False False False False  True
  True  True False False  True False  True  True]

The feature ranking: [ 3 14  4 15  1  1  1  2  1  7 11  8 16  1  1  6  1  1  1 12 17 13 18  1
  1  1 10  5  1  9  1  1]

The external estimator: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

perimeter_worst 18
radius_worst 17
perimeter_se 16
area_mean 15
texture_mean 14
texture_worst 13
fractal_dimension_se 12
radius_se 11
concavity_worst 10
fractal_dimension_worst 9
texture_se 8
fractal_dimension_mean 7
compactness_se 6
concave points_worst 5
perimeter_mean 4
radius_mean 3
concave points_mean 2
smoothness_mean 1
compactness_mean 1
concavity_mean 1
symmetry_mean 1
area_se 1
smoothness_se 1
concavity_se 1
concave points_se 1
symmetry_se 1
area_worst 1
smoothness_worst 1
compactness_worst 1
symmetry_worst 1
concave_points_worst 1
concave_points_se 1

(569, 33)

R2: 0.980200

(569, 16)

R2: 0.960830
The reduction of dimensions caused the deterioration of the models properties

Recursive Feature elimination¶

X = df.drop('compactness_mean', axis=1) 
y = df['compactness_mean']  

from sklearn.model_selection import train_test_split

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
# Jeżeli się rzuca wtedy wycinamy stratify=y.

I set the number of variables that will remain in the model¶

Num_v = 15

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

model=LinearRegression()
rfe=RFE(model,Num_v)

# Standaryzacja zmiennych

X_rfe = rfe.fit_transform(X,y)

model.fit(X_rfe,y)

print('Number of selected functions:  ',rfe.n_features_)
print()
print('The mask of selected features: ',rfe.support_)
print()
print('The feature ranking:',rfe.ranking_)
print()
print('The external estimator:',rfe.estimator_)

Number of selected functions:   15

The mask of selected features:  [False False False False  True  True  True False  True False False False
 False  True  True False  True  True  True False False False False  True
  True  True False False  True False  True  True]

The feature ranking: [ 3 14  4 15  1  1  1  2  1  7 11  8 16  1  1  6  1  1  1 12 17 13 18  1
  1  1 10  5  1  9  1  1]

The external estimator: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

perimeter_worst 18
radius_worst 17
perimeter_se 16
area_mean 15
texture_mean 14
texture_worst 13
fractal_dimension_se 12
radius_se 11
concavity_worst 10
fractal_dimension_worst 9
texture_se 8
fractal_dimension_mean 7
compactness_se 6
concave points_worst 5
perimeter_mean 4
radius_mean 3
concave points_mean 2
smoothness_mean 1
compactness_mean 1
concavity_mean 1
symmetry_mean 1
area_se 1
smoothness_se 1
concavity_se 1
concave points_se 1
symmetry_se 1
area_worst 1
smoothness_worst 1
compactness_worst 1
symmetry_worst 1
concave_points_worst 1
concave_points_se 1

(569, 33)

R2: 0.980200

(569, 16)

R2: 0.960830
The reduction of dimensions caused the deterioration of the models properties

Metoda zip na wyświetlenie rankingu cech¶

PPS = rfe.ranking_

KOT_MIC = dict(zip(df, PPS))
KOT_sorted_keys_MIC = sorted(KOT_MIC, key=KOT_MIC.get, reverse=True)

for r in KOT_sorted_keys_MIC:
    print (r, KOT_MIC[r])

perimeter_worst 18
radius_worst 17
perimeter_se 16
area_mean 15
texture_mean 14
texture_worst 13
fractal_dimension_se 12
radius_se 11
concavity_worst 10
fractal_dimension_worst 9
texture_se 8
fractal_dimension_mean 7
compactness_se 6
concave points_worst 5
perimeter_mean 4
radius_mean 3
concave points_mean 2
smoothness_mean 1
compactness_mean 1
concavity_mean 1
symmetry_mean 1
area_se 1
smoothness_se 1
concavity_se 1
concave points_se 1
symmetry_se 1
area_worst 1
smoothness_worst 1
compactness_worst 1
symmetry_worst 1
concave_points_worst 1
concave_points_se 1

(569, 33)

R2: 0.980200

(569, 16)

R2: 0.960830
The reduction of dimensions caused the deterioration of the models properties

new_cols = X.columns[rfe.support_]

df2 = df[new_cols]
df2.head(3)

We’re adding a result variable¶

df2['compactness_mean'] = df['compactness_mean']
df2.head(3)

The Backward Elimination algorithm stated that reducing variables does not improve the model. Therefore, the number of variables was left unchanged.

OLS linear regression model for variables before reduction¶

blue(df.shape)

(569, 33)

R2: 0.980200

(569, 16)

R2: 0.960830
The reduction of dimensions caused the deterioration of the models properties

X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())

R2: 0.980200

(569, 16)

R2: 0.960830
The reduction of dimensions caused the deterioration of the models properties

OLS linear regression model for variables after reduction¶

blue(df2.shape)

(569, 16)

R2: 0.960830
The reduction of dimensions caused the deterioration of the models properties

X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')

R2: 0.960830
The reduction of dimensions caused the deterioration of the models properties

300320201313

In backward elimination, we start with all the features and removes the least significant feature at each iteration which improves the performance of the model. We repeat this until no improvement is observed on removal of features.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)

##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')

data source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)

(569, 33)

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'concavity_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'compactness_se', 'concave points_se', 'fractal_dimension_se', 'radius_worst', 'perimeter_worst', 'compactness_worst', 'concavity_worst', 'symmetry_worst', 'fractal_dimension_worst', 'concave_points_se']

(569, 33)

Deleting unneeded columns¶

df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']

df.isnull().sum()

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'concavity_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'compactness_se', 'concave points_se', 'fractal_dimension_se', 'radius_worst', 'perimeter_worst', 'compactness_worst', 'concavity_worst', 'symmetry_worst', 'fractal_dimension_worst', 'concave_points_se']

(569, 33)

(569, 33)

R2: 0.980200

Deletes duplicates¶

there were no duplicates

green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'concavity_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'compactness_se', 'concave points_se', 'fractal_dimension_se', 'radius_worst', 'perimeter_worst', 'compactness_worst', 'concavity_worst', 'symmetry_worst', 'fractal_dimension_worst', 'concave_points_se']

(569, 33)

(569, 33)

R2: 0.980200

blue(df.dtypes)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'concavity_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'compactness_se', 'concave points_se', 'fractal_dimension_se', 'radius_worst', 'perimeter_worst', 'compactness_worst', 'concavity_worst', 'symmetry_worst', 'fractal_dimension_worst', 'concave_points_se']

(569, 33)

(569, 33)

R2: 0.980200

df.columns

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose the continuous variable – compactness_mean¶

print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))

max: 0.3454
min: 0.01938

['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'concavity_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'compactness_se', 'concave points_se', 'fractal_dimension_se', 'radius_worst', 'perimeter_worst', 'compactness_worst', 'concavity_worst', 'symmetry_worst', 'fractal_dimension_worst', 'concave_points_se']

(569, 33)

(569, 33)

R2: 0.980200

Backward Elimination¶

x = df.drop('compactness_mean', axis=1) 
y = df['compactness_mean']  

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
# Jeżeli się rzuca wtedy wycinamy stratify=y.

cols=list(x.columns)
pmax=1
while (len(cols)>0):
    p=[]
    x_1 = x[cols]
    x_1 = sm.add_constant(x_1)
    model=sm.OLS(y,x_1).fit()
    p=pd.Series(model.pvalues.values[1:],index=cols)
    pmax=max(p)
    features_with_p_max=p.idxmax()
    if(pmax>0.05):
        cols.remove(features_with_p_max)
    else:
        break
new_cols=cols
print(new_cols)

['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'concavity_mean', 'symmetry_mean', 'fractal_dimension_mean', 'texture_se', 'compactness_se', 'concave points_se', 'fractal_dimension_se', 'radius_worst', 'perimeter_worst', 'compactness_worst', 'concavity_worst', 'symmetry_worst', 'fractal_dimension_worst', 'concave_points_se']

(569, 33)

(569, 33)

R2: 0.980200

df2 = df[new_cols]
blue(df.shape)
df2.head(3)

(569, 33)

(569, 33)

R2: 0.980200

The Backward Elimination algorithm stated that reducing variables does not improve the model. Therefore, the number of variables was left unchanged.

OLS linear regression model for variables before reduction¶

blue(df.shape)

(569, 33)

R2: 0.980200

X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())

R2: 0.980200

Forward selection is an iterative method in which we start with no function in the model. In each iteration, we add a function that best improves our model until adding a new variable improves the model’s performance.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)

##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')

data source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)

(569, 33)

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  32 out of  32 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 1/16 -- score: 0.7605648031784296[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  31 out of  31 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 2/16 -- score: 0.8592594816229919[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.2s finished

[2020-03-30 12:43:15] Features: 3/16 -- score: 0.9171881609890725[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  29 out of  29 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 4/16 -- score: 0.9392541495763911[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  28 out of  28 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 5/16 -- score: 0.9483152571280057[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 6/16 -- score: 0.95545376115284[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  26 out of  26 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 7/16 -- score: 0.9575365106130604[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 8/16 -- score: 0.9679393948794752[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 9/16 -- score: 0.9722927912279392[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 10/16 -- score: 0.9734667931156942[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 11/16 -- score: 0.9743145044074704[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 12/16 -- score: 0.9751371831838199[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 13/16 -- score: 0.9753888664795454[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19 out of  19 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 14/16 -- score: 0.9756613892479665[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 15/16 -- score: 0.9758538991452695[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 16/16 -- score: 0.9768921740889114

[0, 2, 3, 4, 5, 6, 7, 8, 14, 18, 19, 21, 24, 25, 27, 28]

Deleting unneeded columns¶

df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']

df.isnull().sum()

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  32 out of  32 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 1/16 -- score: 0.7605648031784296[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  31 out of  31 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 2/16 -- score: 0.8592594816229919[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.2s finished

[2020-03-30 12:43:15] Features: 3/16 -- score: 0.9171881609890725[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  29 out of  29 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 4/16 -- score: 0.9392541495763911[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  28 out of  28 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 5/16 -- score: 0.9483152571280057[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 6/16 -- score: 0.95545376115284[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  26 out of  26 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 7/16 -- score: 0.9575365106130604[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 8/16 -- score: 0.9679393948794752[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 9/16 -- score: 0.9722927912279392[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 10/16 -- score: 0.9734667931156942[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 11/16 -- score: 0.9743145044074704[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 12/16 -- score: 0.9751371831838199[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 13/16 -- score: 0.9753888664795454[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19 out of  19 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 14/16 -- score: 0.9756613892479665[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 15/16 -- score: 0.9758538991452695[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 16/16 -- score: 0.9768921740889114

[0, 2, 3, 4, 5, 6, 7, 8, 14, 18, 19, 21, 24, 25, 27, 28]

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se',
       'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se',
       'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

Index(['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'smoothness_se', 'symmetry_se', 'fractal_dimension_se',
       'texture_worst', 'smoothness_worst', 'compactness_worst',
       'concave points_worst', 'symmetry_worst'],
      dtype='object')

Deletes duplicates¶

there were no duplicates

green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  32 out of  32 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 1/16 -- score: 0.7605648031784296[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  31 out of  31 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 2/16 -- score: 0.8592594816229919[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.2s finished

[2020-03-30 12:43:15] Features: 3/16 -- score: 0.9171881609890725[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  29 out of  29 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 4/16 -- score: 0.9392541495763911[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  28 out of  28 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 5/16 -- score: 0.9483152571280057[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 6/16 -- score: 0.95545376115284[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  26 out of  26 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 7/16 -- score: 0.9575365106130604[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 8/16 -- score: 0.9679393948794752[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 9/16 -- score: 0.9722927912279392[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 10/16 -- score: 0.9734667931156942[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 11/16 -- score: 0.9743145044074704[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 12/16 -- score: 0.9751371831838199[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 13/16 -- score: 0.9753888664795454[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19 out of  19 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 14/16 -- score: 0.9756613892479665[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 15/16 -- score: 0.9758538991452695[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 16/16 -- score: 0.9768921740889114

[0, 2, 3, 4, 5, 6, 7, 8, 14, 18, 19, 21, 24, 25, 27, 28]

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se',
       'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se',
       'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

Index(['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'smoothness_se', 'symmetry_se', 'fractal_dimension_se',
       'texture_worst', 'smoothness_worst', 'compactness_worst',
       'concave points_worst', 'symmetry_worst'],
      dtype='object')

(569, 33)

blue(df.dtypes)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  32 out of  32 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 1/16 -- score: 0.7605648031784296[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  31 out of  31 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 2/16 -- score: 0.8592594816229919[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.2s finished

[2020-03-30 12:43:15] Features: 3/16 -- score: 0.9171881609890725[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  29 out of  29 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 4/16 -- score: 0.9392541495763911[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  28 out of  28 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 5/16 -- score: 0.9483152571280057[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 6/16 -- score: 0.95545376115284[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  26 out of  26 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 7/16 -- score: 0.9575365106130604[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 8/16 -- score: 0.9679393948794752[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 9/16 -- score: 0.9722927912279392[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 10/16 -- score: 0.9734667931156942[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 11/16 -- score: 0.9743145044074704[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 12/16 -- score: 0.9751371831838199[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 13/16 -- score: 0.9753888664795454[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19 out of  19 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 14/16 -- score: 0.9756613892479665[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 15/16 -- score: 0.9758538991452695[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 16/16 -- score: 0.9768921740889114

[0, 2, 3, 4, 5, 6, 7, 8, 14, 18, 19, 21, 24, 25, 27, 28]

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se',
       'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se',
       'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

Index(['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'smoothness_se', 'symmetry_se', 'fractal_dimension_se',
       'texture_worst', 'smoothness_worst', 'compactness_worst',
       'concave points_worst', 'symmetry_worst'],
      dtype='object')

(569, 33)

R2: 0.980200

df.columns

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose the continuous variable – compactness_mean¶

print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))

max: 0.3454
min: 0.01938

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  32 out of  32 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 1/16 -- score: 0.7605648031784296[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  31 out of  31 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 2/16 -- score: 0.8592594816229919[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.2s finished

[2020-03-30 12:43:15] Features: 3/16 -- score: 0.9171881609890725[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  29 out of  29 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 4/16 -- score: 0.9392541495763911[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  28 out of  28 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 5/16 -- score: 0.9483152571280057[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 6/16 -- score: 0.95545376115284[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  26 out of  26 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 7/16 -- score: 0.9575365106130604[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 8/16 -- score: 0.9679393948794752[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 9/16 -- score: 0.9722927912279392[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 10/16 -- score: 0.9734667931156942[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 11/16 -- score: 0.9743145044074704[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 12/16 -- score: 0.9751371831838199[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 13/16 -- score: 0.9753888664795454[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19 out of  19 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 14/16 -- score: 0.9756613892479665[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 15/16 -- score: 0.9758538991452695[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 16/16 -- score: 0.9768921740889114

[0, 2, 3, 4, 5, 6, 7, 8, 14, 18, 19, 21, 24, 25, 27, 28]

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se',
       'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se',
       'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

Index(['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'smoothness_se', 'symmetry_se', 'fractal_dimension_se',
       'texture_worst', 'smoothness_worst', 'compactness_worst',
       'concave points_worst', 'symmetry_worst'],
      dtype='object')

(569, 33)

R2: 0.980200

R2: 0.966559
The reduction of dimensions caused the deterioration of the models properties

Step Forward Selection¶

X = df.drop('compactness_mean', axis=1) 
y = df['compactness_mean']  

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
# Jeżeli się rzuca wtedy wycinamy stratify=y.

I specify how many programs should indicate the best variables:

¶

k_features = 16

from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

LR = LinearRegression()

sfs1 = sfs(LR,k_features = k_features, forward=True, floating=False, scoring='r2',verbose=2,cv=5)
sfs1 = sfs1.fit(X_train,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  32 out of  32 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 1/16 -- score: 0.7605648031784296[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  31 out of  31 | elapsed:    0.3s finished

[2020-03-30 12:43:15] Features: 2/16 -- score: 0.8592594816229919[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.2s finished

[2020-03-30 12:43:15] Features: 3/16 -- score: 0.9171881609890725[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  29 out of  29 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 4/16 -- score: 0.9392541495763911[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  28 out of  28 | elapsed:    0.3s finished

[2020-03-30 12:43:16] Features: 5/16 -- score: 0.9483152571280057[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 6/16 -- score: 0.95545376115284[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  26 out of  26 | elapsed:    0.2s finished

[2020-03-30 12:43:16] Features: 7/16 -- score: 0.9575365106130604[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 8/16 -- score: 0.9679393948794752[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 9/16 -- score: 0.9722927912279392[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  23 out of  23 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 10/16 -- score: 0.9734667931156942[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 11/16 -- score: 0.9743145044074704[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:    0.2s finished

[2020-03-30 12:43:17] Features: 12/16 -- score: 0.9751371831838199[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 13/16 -- score: 0.9753888664795454[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19 out of  19 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 14/16 -- score: 0.9756613892479665[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 15/16 -- score: 0.9758538991452695[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  17 out of  17 | elapsed:    0.2s finished

[2020-03-30 12:43:18] Features: 16/16 -- score: 0.9768921740889114

[0, 2, 3, 4, 5, 6, 7, 8, 14, 18, 19, 21, 24, 25, 27, 28]

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se',
       'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se',
       'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

Index(['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'smoothness_se', 'symmetry_se', 'fractal_dimension_se',
       'texture_worst', 'smoothness_worst', 'compactness_worst',
       'concave points_worst', 'symmetry_worst'],
      dtype='object')

(569, 33)

R2: 0.980200

R2: 0.966559
The reduction of dimensions caused the deterioration of the models properties

feat_cols =list(sfs1.k_feature_idx_)
print(feat_cols)

[0, 2, 3, 4, 5, 6, 7, 8, 14, 18, 19, 21, 24, 25, 27, 28]

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se',
       'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se',
       'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

Index(['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'smoothness_se', 'symmetry_se', 'fractal_dimension_se',
       'texture_worst', 'smoothness_worst', 'compactness_worst',
       'concave points_worst', 'symmetry_worst'],
      dtype='object')

(569, 33)

R2: 0.980200

R2: 0.966559
The reduction of dimensions caused the deterioration of the models properties

X.columns

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se',
       'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se',
       'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

new_cols = df.columns[feat_cols]
new_cols

Index(['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean',
       'symmetry_mean', 'smoothness_se', 'symmetry_se', 'fractal_dimension_se',
       'texture_worst', 'smoothness_worst', 'compactness_worst',
       'concave points_worst', 'symmetry_worst'],
      dtype='object')

I create a dataset with reduced columns.

df2 = df[new_cols]
df2.head(3)

OLS linear regression model for variables before reduction¶

blue(df.shape)

(569, 33)

R2: 0.980200

R2: 0.966559
The reduction of dimensions caused the deterioration of the models properties

X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())

R2: 0.980200

R2: 0.966559
The reduction of dimensions caused the deterioration of the models properties

OLS linear regression model for variables after reduction¶

X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')

R2: 0.966559
The reduction of dimensions caused the deterioration of the models properties

Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset.

The Variance Inflation Factor (VIF) technique from the Feature Selection Techniques collection is not intended to improve the quality of the model, but to remove the autocorrelation of independent variables.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)

##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')

data source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)

(569, 33)

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

Jeżeli VIF wynosi więcej niż 5 prawdopodobnie występuje multicollinearity

LinearRegression in sklearn
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

LinearRegression in statasmodels
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

(569, 33)

Deleting unneeded columns¶

df['concave_points_worst'] = df['concave points_worst']
df['concave_points_se'] = df['concave points_se']
df['concave_points_mean'] = df['concave points_mean']

del df['Unnamed: 32']
del df['diagnosis']
del df['id']

df.isnull().sum()

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
concave_points_worst       0
concave_points_se          0
concave_points_mean        0
dtype: int64

import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

Jeżeli VIF wynosi więcej niż 5 prawdopodobnie występuje multicollinearity

LinearRegression in sklearn
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

LinearRegression in statasmodels
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

(569, 33)

R2: 0.980200

R2: 0.649990
The reduction of dimensions caused the deterioration of the models properties

Deletes duplicates¶

there were no duplicates

green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)

(569, 33)
(569, 33)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

Jeżeli VIF wynosi więcej niż 5 prawdopodobnie występuje multicollinearity

LinearRegression in sklearn
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

LinearRegression in statasmodels
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

(569, 33)

R2: 0.980200

R2: 0.649990
The reduction of dimensions caused the deterioration of the models properties

blue(df.dtypes)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
concave_points_worst       float64
concave_points_se          float64
concave_points_mean        float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

max: 0.3454
min: 0.01938

Jeżeli VIF wynosi więcej niż 5 prawdopodobnie występuje multicollinearity

LinearRegression in sklearn
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

LinearRegression in statasmodels
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

(569, 33)

R2: 0.980200

R2: 0.649990
The reduction of dimensions caused the deterioration of the models properties

df.columns

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'concave_points_worst',
       'concave_points_se', 'concave_points_mean'],
      dtype='object')

We choose the continuous variable – compactness_mean¶

print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))

max: 0.3454
min: 0.01938

Jeżeli VIF wynosi więcej niż 5 prawdopodobnie występuje multicollinearity

LinearRegression in sklearn
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

LinearRegression in statasmodels
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

(569, 33)

R2: 0.980200

R2: 0.649990
The reduction of dimensions caused the deterioration of the models properties

Variance Inflation Factor (VIF)¶

import pandas as pd
import statsmodels.formula.api as smf

def get_vif(exogs, data):
    '''Return VIF (variance inflation factor) DataFrame

    Args:
    exogs (list): list of exogenous/independent variables
    data (DataFrame): the df storing all variables

    Returns:
    VIF and Tolerance DataFrame for each exogenous variable

    Notes:
    Assume we have a list of exogenous variable [X1, X2, X3, X4].
    To calculate the VIF and Tolerance for each variable, we regress
    each of them against other exogenous variables. For instance, the
    regression model for X3 is defined as:
                        X3 ~ X1 + X2 + X4
    And then we extract the R-squared from the model to calculate:
                    VIF = 1 / (1 - R-squared)
                    Tolerance = 1 - R-squared
    The cutoff to detect multicollinearity:
                    VIF > 10 or Tolerance < 0.1
    '''

    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # create formula for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        formula = f"{exog} ~ {' + '.join(not_exog)}"

        # extract r-squared from the fit
        r_squared = smf.ols(formula, data=data).fit().rsquared

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif

# import warnings
# warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from sklearn.linear_model import LinearRegression

def sklearn_vif(exogs, data):

    # initialize dictionaries
    vif_dict, tolerance_dict = {}, {}

    # form input data for each exogenous variable
    for exog in exogs:
        not_exog = [i for i in exogs if i != exog]
        X, y = data[not_exog], data[exog]

        # extract r-squared from the fit
        r_squared = LinearRegression().fit(X, y).score(X, y)

        # calculate VIF
        vif = 1/(1 - r_squared)
        vif_dict[exog] = vif

        # calculate tolerance
        tolerance = 1 - r_squared
        tolerance_dict[exog] = tolerance

    # return VIF DataFrame
    df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict})

    return df_vif

df.columns
exogs =['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave_points_worst',
       'symmetry_worst', 'fractal_dimension_worst']

print('Jeżeli VIF wynosi więcej niż 5 prawdopodobnie występuje multicollinearity' )

pks = sklearn_vif(exogs, df)
pks.sort_values('VIF').round(1)
print()
blue('LinearRegression in sklearn')
blue(pks[pks['VIF']<=10])


kot = get_vif(exogs, df)
kot.sort_values('VIF').round(1)
print()
green('LinearRegression in statasmodels')
green(kot[kot['VIF']<=10])

Jeżeli VIF wynosi więcej niż 5 prawdopodobnie występuje multicollinearity

LinearRegression in sklearn
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

LinearRegression in statasmodels
                           VIF  Tolerance
smoothness_mean       8.194282   0.122036
symmetry_mean         4.220656   0.236930
texture_se            4.205423   0.237788
smoothness_se         4.027923   0.248267
symmetry_se           5.175426   0.193221
fractal_dimension_se  9.717987   0.102902
symmetry_worst        9.520570   0.105036

(569, 33)

R2: 0.980200

R2: 0.649990
The reduction of dimensions caused the deterioration of the models properties

OLS linear regression model for variables before reduction¶

blue(df.shape)

(569, 33)

R2: 0.980200

R2: 0.649990
The reduction of dimensions caused the deterioration of the models properties

X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())

R2: 0.980200

R2: 0.649990
The reduction of dimensions caused the deterioration of the models properties

OLS linear regression model for variables after reduction¶

df2 =df[['smoothness_mean','symmetry_mean','texture_se','smoothness_se', 'fractal_dimension_se','symmetry_worst','compactness_mean']]

X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')

R2: 0.649990
The reduction of dimensions caused the deterioration of the models properties

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)

##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')

data source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

df = pd.read_csv ('/home/wojciech/Pulpit/6/Breast_Cancer_Wisconsin.csv')
green(df.shape)
df.head(3)

(569, 33)

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

(569, 30)
(569, 30)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

max: 0.3454
min: 0.01938

Text(0, 0.5, 'Continuous independent variables')

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

Deleting unneeded columns¶

del df['Unnamed: 32']
del df['diagnosis']
del df['id']

df.isnull().sum()

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

import seaborn as sns

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

(569, 30)
(569, 30)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

max: 0.3454
min: 0.01938

Text(0, 0.5, 'Continuous independent variables')

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2]),
 )

Deletes duplicates¶

there were no duplicates

green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)

(569, 30)
(569, 30)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

max: 0.3454
min: 0.01938

Text(0, 0.5, 'Continuous independent variables')

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2]),
 )

array([ True,  True, False, False,  True,  True,  True, False,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True, False, False, False, False,  True,  True,  True,
       False,  True,  True])

blue(df.dtypes)

radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
dtype: object

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

max: 0.3454
min: 0.01938

Text(0, 0.5, 'Continuous independent variables')

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2]),
 )

array([ True,  True, False, False,  True,  True,  True, False,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True, False, False, False, False,  True,  True,  True,
       False,  True,  True])

(569, 30)
(569, 20)

df.columns

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

We choose the continuous variable – compactness_mean¶

print('max:',df['compactness_mean'].max())
print('min:',df['compactness_mean'].min())

sns.distplot(np.array(df['compactness_mean']))

max: 0.3454
min: 0.01938

Text(0, 0.5, 'Continuous independent variables')

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2]),
 )

array([ True,  True, False, False,  True,  True,  True, False,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True, False, False, False, False,  True,  True,  True,
       False,  True,  True])

(569, 30)
(569, 20)

(569, 30)
(569, 20)

R2: 0.980200

Pearson correlation¶

def matrix_plot(df,title):

    sns.set(style="ticks")

    corr = df.corr()
    corr = np.round(corr, decimals=2)


    mask = np.zeros_like(corr, dtype=np.bool)
    mask[np.triu_indices_from(mask)] = True

    f, ax = plt.subplots(figsize=(20, 20))
    #cmap = sns.diverging_palette(580, 10, as_cmap=True)
    cmap = sns.diverging_palette(180, 90, as_cmap=True) #Inna paleta barw

    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=0.3, center=0.03,annot=True,
                square=True, linewidths=.9, cbar_kws={"shrink": 0.8})
    plt.xticks(rotation=90)
    plt.title(title,fontsize=32,color='#0c343d',alpha=0.5)
    plt.show

matrix_plot(df,'Pearson correlation')

Text(0, 0.5, 'Continuous independent variables')

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2]),
 )

array([ True,  True, False, False,  True,  True,  True, False,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True, False, False, False, False,  True,  True,  True,
       False,  True,  True])

(569, 30)
(569, 20)

(569, 30)
(569, 20)

R2: 0.980200

R2: 0.965952
The reduction of dimensions caused the deterioration of the models properties

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

Correlation to the result variable¶

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
CORREL = df.corr().sort_values('compactness_mean')
CORREL['compactness_mean'].plot(kind='barh',color='#0c343d',alpha=0.5)
plt.title('Correlation to the result variable', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')

Text(0, 0.5, 'Continuous independent variables')

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2]),
 )

array([ True,  True, False, False,  True,  True,  True, False,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True, False, False, False, False,  True,  True,  True,
       False,  True,  True])

(569, 30)
(569, 20)

(569, 30)
(569, 20)

R2: 0.980200

R2: 0.965952
The reduction of dimensions caused the deterioration of the models properties

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

I find variables that are highly correlated with the result variable¶

kot = abs(CORREL['compactness_mean'])
FAT = kot[kot>=0.7]
FAT

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

Compares variables in pairs¶

plt.barh(*zip(*FAT.items()),color='#0c343d',alpha=0.5) 
plt.xticks(rotation=90)

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2]),
 )

array([ True,  True, False, False,  True,  True,  True, False,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True, False, False, False, False,  True,  True,  True,
       False,  True,  True])

(569, 30)
(569, 20)

(569, 30)
(569, 20)

R2: 0.980200

R2: 0.965952
The reduction of dimensions caused the deterioration of the models properties

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

R2: 0.962911
The reduction of dimensions caused the deterioration of the models properties

High autocorrelation chart ¶

CORR = df.corr()

kot = CORR[CORR>=.9]
plt.figure(figsize=(6,4))
sns.heatmap(kot, cmap="Greens")

array([ True,  True, False, False,  True,  True,  True, False,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True, False, False, False, False,  True,  True,  True,
       False,  True,  True])

(569, 30)
(569, 20)

(569, 30)
(569, 20)

R2: 0.980200

R2: 0.965952
The reduction of dimensions caused the deterioration of the models properties

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

R2: 0.962911
The reduction of dimensions caused the deterioration of the models properties

Deleting correlated independent variables¶

The code we compare the correlation between variables and remove one of two features whose correlation is higher than 0.9

corr = df.corr()
kot = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i+1, corr.shape[0]):
        if corr.iloc[i,j] >= 0.9:
            if kot[j]:
                kot[j] = False
selected_columns = df.columns[kot]
df2 = df[selected_columns]

kot   #<== PĘTLA ZROBIŁA NAM wektor 31 elementów True- False

array([ True,  True, False, False,  True,  True,  True, False,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True, False, False, False, False,  True,  True,  True,
       False,  True,  True])

Dimensions have been reduced¶

blue(df.shape)
green(df2.shape)

(569, 30)
(569, 20)

(569, 30)
(569, 20)

R2: 0.980200

R2: 0.965952
The reduction of dimensions caused the deterioration of the models properties

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

R2: 0.962911
The reduction of dimensions caused the deterioration of the models properties

OLS linear regression model for variables before reduction¶

blue(df.shape)
green(df2.shape)

(569, 30)
(569, 20)

R2: 0.980200

R2: 0.965952
The reduction of dimensions caused the deterioration of the models properties

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

R2: 0.962911
The reduction of dimensions caused the deterioration of the models properties

X1 = df.drop('compactness_mean', axis=1) 
y1 = df['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y1, sm.add_constant(X1))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())

R2: 0.980200

R2: 0.965952
The reduction of dimensions caused the deterioration of the models properties

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

R2: 0.962911
The reduction of dimensions caused the deterioration of the models properties

OLS linear regression model for variables after reduction¶

X2 = df2.drop('compactness_mean', axis=1) 
y2 = df2['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y2, sm.add_constant(X2))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')

R2: 0.965952
The reduction of dimensions caused the deterioration of the models properties

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

R2: 0.962911
The reduction of dimensions caused the deterioration of the models properties

Eliminates variables previously selected in the FAT procedure¶

FAT

compactness_se          0.738722
concave points_worst    0.815573
concavity_worst         0.816275
concave points_mean     0.831135
compactness_worst       0.865809
concavity_mean          0.883121
compactness_mean        1.000000
Name: compactness_mean, dtype: float64

df3 = df.drop(['compactness_se','concave points_worst','concavity_worst','concave points_mean','compactness_worst','concavity_mean'],1)

X3 = df3.drop('compactness_mean', axis=1) 
y3 = df3['compactness_mean']

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y3, sm.add_constant(X3))
model_fit = model.fit()

print('R2: 
#blue(model_fit.summary())
red('The reduction of dimensions caused the deterioration of the models properties')

R2: 0.962911
The reduction of dimensions caused the deterioration of the models properties

280320200940

Source of data: https://archive.ics.uci.edu/ml/datasets/Air+Quality

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
np.random.seed(123)

##  colorful prints
def black(text):
     print('33[30m', text, '33[0m', sep='')  
def red(text):
     print('33[31m', text, '33[0m', sep='')  
def green(text):
     print('33[32m', text, '33[0m', sep='')  
def yellow(text):
     print('33[33m', text, '33[0m', sep='')  
def blue(text):
     print('33[34m', text, '33[0m', sep='') 
def magenta(text):
     print('33[35m', text, '33[0m', sep='')  
def cyan(text):
     print('33[36m', text, '33[0m', sep='')  
def gray(text):
     print('33[90m', text, '33[0m', sep='')

df = pd.read_csv ('/home/wojciech/Pulpit/1/AirQualityUCI.csv', sep=';',nrows=1000)
green(df.shape)
df.head(3)

(1000, 17)

(1000, 15)
(1000, 15)
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

(1000, 15)
(1000, 15)

CO(GT)             0
PT08.S1(CO)       27
NMHC(GT)         274
C6H6(GT)           0
PT08.S2(NMHC)     27
NOx(GT)          206
PT08.S3(NOx)      27
NO2(GT)          206
PT08.S4(NO2)      27
PT08.S5(O3)       27
T                  0
RH                 0
AH                 0
day                0
month              0
hour               0
dtype: int64

(1000, 15)
(768, 15)
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
day              0
month            0
hour             0
dtype: int64

CO(GT)            object
PT08.S1(CO)      float64
C6H6(GT)          object
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                 object
RH                object
AH                object
day                int64
month              int64
hour              object
dtype: object

max: 39.2
min: 0.5

0    446
1    322
Name: C6H6(GT), dtype: int64

Usuwanie niepotrzebnych kolumn¶

del df['Unnamed: 15']
del df['Unnamed: 16']

Kasuje brakujące rekordy¶

green(df.shape)
df.isnull().sum()
df = df.dropna(how='any')
blue(df.shape)
blue(df.isnull().sum())

(1000, 15)
(1000, 15)
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

(1000, 15)
(1000, 15)

CO(GT)             0
PT08.S1(CO)       27
NMHC(GT)         274
C6H6(GT)           0
PT08.S2(NMHC)     27
NOx(GT)          206
PT08.S3(NOx)      27
NO2(GT)          206
PT08.S4(NO2)      27
PT08.S5(O3)       27
T                  0
RH                 0
AH                 0
day                0
month              0
hour               0
dtype: int64

(1000, 15)
(768, 15)
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
day              0
month            0
hour             0
dtype: int64

CO(GT)            object
PT08.S1(CO)      float64
C6H6(GT)          object
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                 object
RH                object
AH                object
day                int64
month              int64
hour              object
dtype: object

max: 39.2
min: 0.5

0    446
1    322
Name: C6H6(GT), dtype: int64

0    446
1    322
Name: C6H6(GT), dtype: int64

Kasuje duplikaty¶

nie było duplikatów

green(df.shape)
df.drop_duplicates(keep='first', inplace=True)
blue(df.shape)

(1000, 15)
(1000, 15)

CO(GT)             0
PT08.S1(CO)       27
NMHC(GT)         274
C6H6(GT)           0
PT08.S2(NMHC)     27
NOx(GT)          206
PT08.S3(NOx)      27
NO2(GT)          206
PT08.S4(NO2)      27
PT08.S5(O3)       27
T                  0
RH                 0
AH                 0
day                0
month              0
hour               0
dtype: int64

(1000, 15)
(768, 15)
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
day              0
month            0
hour             0
dtype: int64

CO(GT)            object
PT08.S1(CO)      float64
C6H6(GT)          object
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                 object
RH                object
AH                object
day                int64
month              int64
hour              object
dtype: object

max: 39.2
min: 0.5

0    446
1    322
Name: C6H6(GT), dtype: int64

0    446
1    322
Name: C6H6(GT), dtype: int64

CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)            int8
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
day              float64
month            float64
hour             float64
dtype: object

Z daty wyciągam dzień tygodnia, miesiąc, oraz godzinę jako zmienne ciągłe¶

df['Date'] = pd.to_datetime(df.Date)
df['day'] = df['Date'].dt.weekday
df['month'] = df['Date'].dt.month
df['hour'] = df['Time'].str.slice(0,2)
df[['Date','day','month','hour']].head(3)

del df['Date']
del df['Time']

Kasuje zmienną -200 oznaczającą błąd danych¶

df[['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)',
       'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)',
       'T', 'RH', 'AH', 'day', 'month', 'hour']] = df[['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)',
       'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)',
       'T', 'RH', 'AH', 'day', 'month', 'hour']].replace(-200,np.NaN)

df.isnull().sum()

CO(GT)             0
PT08.S1(CO)       27
NMHC(GT)         274
C6H6(GT)           0
PT08.S2(NMHC)     27
NOx(GT)          206
PT08.S3(NOx)      27
NO2(GT)          206
PT08.S4(NO2)      27
PT08.S5(O3)       27
T                  0
RH                 0
AH                 0
day                0
month              0
hour               0
dtype: int64

del df['NMHC(GT)']
green(df.shape)
df.isnull().sum()
df = df.dropna(how='any')
blue(df.shape)
blue(df.isnull().sum())

(1000, 15)
(768, 15)
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
day              0
month            0
hour             0
dtype: int64

CO(GT)            object
PT08.S1(CO)      float64
C6H6(GT)          object
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                 object
RH                object
AH                object
day                int64
month              int64
hour              object
dtype: object

max: 39.2
min: 0.5

0    446
1    322
Name: C6H6(GT), dtype: int64

0    446
1    322
Name: C6H6(GT), dtype: int64

CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)            int8
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
day              float64
month            float64
hour             float64
dtype: object

(768, 14)
(614, 14)
(154, 14)

Recall Training data:      0.9728
Precision Training data:   0.9766
----------------------------------------------------------------------
Recall Test data:          0.9692
Precision Test data:       0.9844
----------------------------------------------------------------------
Confusion Matrix Test data
[[88  1]
 [ 2 63]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.99      0.98        89
           1       0.98      0.97      0.98        65

    accuracy                           0.98       154
   macro avg       0.98      0.98      0.98       154
weighted avg       0.98      0.98      0.98       154

Zamieniam zmienne na wartości numeryczne¶

blue(df.dtypes)

CO(GT)            object
PT08.S1(CO)      float64
C6H6(GT)          object
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                 object
RH                object
AH                object
day                int64
month              int64
hour              object
dtype: object

max: 39.2
min: 0.5

0    446
1    322
Name: C6H6(GT), dtype: int64

0    446
1    322
Name: C6H6(GT), dtype: int64

CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)            int8
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
day              float64
month            float64
hour             float64
dtype: object

(768, 14)
(614, 14)
(154, 14)

Recall Training data:      0.9728
Precision Training data:   0.9766
----------------------------------------------------------------------
Recall Test data:          0.9692
Precision Test data:       0.9844
----------------------------------------------------------------------
Confusion Matrix Test data
[[88  1]
 [ 2 63]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.99      0.98        89
           1       0.98      0.97      0.98        65

    accuracy                           0.98       154
   macro avg       0.98      0.98      0.98       154
weighted avg       0.98      0.98      0.98       154

auc 0.9789974070872948

Macierz korelacji¶

df['CO(GT)'] = df['CO(GT)'].str.replace(',', '.')

df['C6H6(GT)'] = df['C6H6(GT)'].str.replace(',', '.')

df['T'] = df['T'].str.replace(',', '.')

df['RH'] = df['RH'].str.replace(',', '.')

df['AH'] = df['AH'].str.replace(',', '.')

df[['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)',
       'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)',
       'T', 'RH', 'AH', 'day', 'month', 'hour']] = df[['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)',
       'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)',
       'T', 'RH', 'AH', 'day', 'month', 'hour']].astype(float)

CORREL = df.corr()
plt.figure(figsize=(10,6))
sns.heatmap(CORREL, annot=True, cbar=False, cmap="coolwarm")

max: 39.2
min: 0.5

0    446
1    322
Name: C6H6(GT), dtype: int64

0    446
1    322
Name: C6H6(GT), dtype: int64

CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)            int8
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
day              float64
month            float64
hour             float64
dtype: object

(768, 14)
(614, 14)
(154, 14)

Recall Training data:      0.9728
Precision Training data:   0.9766
----------------------------------------------------------------------
Recall Test data:          0.9692
Precision Test data:       0.9844
----------------------------------------------------------------------
Confusion Matrix Test data
[[88  1]
 [ 2 63]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.99      0.98        89
           1       0.98      0.97      0.98        65

    accuracy                           0.98       154
   macro avg       0.98      0.98      0.98       154
weighted avg       0.98      0.98      0.98       154

auc 0.9789974070872948

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               C6H6(GT)   R-squared:                       0.691
Model:                            OLS   Adj. R-squared:                  0.685
Method:                 Least Squares   F-statistic:                     120.2
Date:                Sat, 28 Mar 2020   Prob (F-statistic):          4.93e-181
Time:                        09:35:48   Log-Likelihood:                -96.437
No. Observations:                 768   AIC:                             222.9
Df Residuals:                     753   BIC:                             292.5
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -2.0863      0.283     -7.371      0.000      -2.642      -1.531
CO(GT)           -0.0003      0.000     -0.938      0.348      -0.001       0.000
PT08.S1(CO)      -0.0002      0.000     -1.127      0.260      -0.001       0.000
PT08.S2(NMHC)     0.0030      0.000      8.413      0.000       0.002       0.004
NOx(GT)          -0.0002      0.000     -0.395      0.693      -0.001       0.001
PT08.S3(NOx)      0.0007      0.000      5.578      0.000       0.000       0.001
NO2(GT)           0.0013      0.001      1.408      0.160      -0.001       0.003
PT08.S4(NO2)     -0.0007      0.000     -2.763      0.006      -0.001      -0.000
PT08.S5(O3)    3.589e-05   8.45e-05      0.425      0.671      -0.000       0.000
T                 0.0008      0.011      0.073      0.942      -0.021       0.022
RH               -0.0015      0.004     -0.392      0.695      -0.009       0.006
AH                0.4456      0.278      1.602      0.109      -0.100       0.992
day              -0.0008      0.005     -0.138      0.890      -0.011       0.010
month            -0.0009      0.004     -0.243      0.808      -0.008       0.006
hour             -0.0028      0.002     -1.495      0.135      -0.006       0.001
==============================================================================
Omnibus:                       43.664   Durbin-Watson:                   1.370
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.380
Skew:                           0.068   Prob(JB):                     0.000168
Kurtosis:                       2.276   Cond. No.                     8.46e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.46e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Koduje zmienną kategoryczną wynikową – C6H6(GT)¶

print('max:',df['C6H6(GT)'].max())
print('min:',df['C6H6(GT)'].min())

sns.distplot(np.array(df['C6H6(GT)']))

max: 39.2
min: 0.5

0    446
1    322
Name: C6H6(GT), dtype: int64

0    446
1    322
Name: C6H6(GT), dtype: int64

CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)            int8
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
day              float64
month            float64
hour             float64
dtype: object

(768, 14)
(614, 14)
(154, 14)

Recall Training data:      0.9728
Precision Training data:   0.9766
----------------------------------------------------------------------
Recall Test data:          0.9692
Precision Test data:       0.9844
----------------------------------------------------------------------
Confusion Matrix Test data
[[88  1]
 [ 2 63]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.99      0.98        89
           1       0.98      0.97      0.98        65

    accuracy                           0.98       154
   macro avg       0.98      0.98      0.98       154
weighted avg       0.98      0.98      0.98       154

auc 0.9789974070872948

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               C6H6(GT)   R-squared:                       0.691
Model:                            OLS   Adj. R-squared:                  0.685
Method:                 Least Squares   F-statistic:                     120.2
Date:                Sat, 28 Mar 2020   Prob (F-statistic):          4.93e-181
Time:                        09:35:48   Log-Likelihood:                -96.437
No. Observations:                 768   AIC:                             222.9
Df Residuals:                     753   BIC:                             292.5
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -2.0863      0.283     -7.371      0.000      -2.642      -1.531
CO(GT)           -0.0003      0.000     -0.938      0.348      -0.001       0.000
PT08.S1(CO)      -0.0002      0.000     -1.127      0.260      -0.001       0.000
PT08.S2(NMHC)     0.0030      0.000      8.413      0.000       0.002       0.004
NOx(GT)          -0.0002      0.000     -0.395      0.693      -0.001       0.001
PT08.S3(NOx)      0.0007      0.000      5.578      0.000       0.000       0.001
NO2(GT)           0.0013      0.001      1.408      0.160      -0.001       0.003
PT08.S4(NO2)     -0.0007      0.000     -2.763      0.006      -0.001      -0.000
PT08.S5(O3)    3.589e-05   8.45e-05      0.425      0.671      -0.000       0.000
T                 0.0008      0.011      0.073      0.942      -0.021       0.022
RH               -0.0015      0.004     -0.392      0.695      -0.009       0.006
AH                0.4456      0.278      1.602      0.109      -0.100       0.992
day              -0.0008      0.005     -0.138      0.890      -0.011       0.010
month            -0.0009      0.004     -0.243      0.808      -0.008       0.006
hour             -0.0028      0.002     -1.495      0.135      -0.006       0.001
==============================================================================
Omnibus:                       43.664   Durbin-Watson:                   1.370
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.380
Skew:                           0.068   Prob(JB):                     0.000168
Kurtosis:                       2.276   Cond. No.                     8.46e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.46e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. ]), )

df['C6H6(GT)'] = df['C6H6(GT)'].apply(lambda x: 1 if x > 10 else 0)
df['C6H6(GT)'].value_counts()

0    446
1    322
Name: C6H6(GT), dtype: int64

df['C6H6(GT)'] = pd.Categorical(df['C6H6(GT)']).codes
df['C6H6(GT)'].value_counts()

0    446
1    322
Name: C6H6(GT), dtype: int64

Model regresji liniowej bez redukcji zmiennych¶

blue(df.dtypes)

CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)            int8
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
day              float64
month            float64
hour             float64
dtype: object

(768, 14)
(614, 14)
(154, 14)

Recall Training data:      0.9728
Precision Training data:   0.9766
----------------------------------------------------------------------
Recall Test data:          0.9692
Precision Test data:       0.9844
----------------------------------------------------------------------
Confusion Matrix Test data
[[88  1]
 [ 2 63]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.99      0.98        89
           1       0.98      0.97      0.98        65

    accuracy                           0.98       154
   macro avg       0.98      0.98      0.98       154
weighted avg       0.98      0.98      0.98       154

auc 0.9789974070872948

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               C6H6(GT)   R-squared:                       0.691
Model:                            OLS   Adj. R-squared:                  0.685
Method:                 Least Squares   F-statistic:                     120.2
Date:                Sat, 28 Mar 2020   Prob (F-statistic):          4.93e-181
Time:                        09:35:48   Log-Likelihood:                -96.437
No. Observations:                 768   AIC:                             222.9
Df Residuals:                     753   BIC:                             292.5
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -2.0863      0.283     -7.371      0.000      -2.642      -1.531
CO(GT)           -0.0003      0.000     -0.938      0.348      -0.001       0.000
PT08.S1(CO)      -0.0002      0.000     -1.127      0.260      -0.001       0.000
PT08.S2(NMHC)     0.0030      0.000      8.413      0.000       0.002       0.004
NOx(GT)          -0.0002      0.000     -0.395      0.693      -0.001       0.001
PT08.S3(NOx)      0.0007      0.000      5.578      0.000       0.000       0.001
NO2(GT)           0.0013      0.001      1.408      0.160      -0.001       0.003
PT08.S4(NO2)     -0.0007      0.000     -2.763      0.006      -0.001      -0.000
PT08.S5(O3)    3.589e-05   8.45e-05      0.425      0.671      -0.000       0.000
T                 0.0008      0.011      0.073      0.942      -0.021       0.022
RH               -0.0015      0.004     -0.392      0.695      -0.009       0.006
AH                0.4456      0.278      1.602      0.109      -0.100       0.992
day              -0.0008      0.005     -0.138      0.890      -0.011       0.010
month            -0.0009      0.004     -0.243      0.808      -0.008       0.006
hour             -0.0028      0.002     -1.495      0.135      -0.006       0.001
==============================================================================
Omnibus:                       43.664   Durbin-Watson:                   1.370
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.380
Skew:                           0.068   Prob(JB):                     0.000168
Kurtosis:                       2.276   Cond. No.                     8.46e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.46e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. ]), )

Index(['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)',
       'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH',
       'AH', 'day', 'month', 'hour'],
      dtype='object')

X = df.drop('C6H6(GT)', axis=1) 
y = df['C6H6(GT)']

Podział na dane treningowe i testowe¶

from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)

Definicje¶

# Classification Assessment
def Classification_Assessment(model ,Xtrain, ytrain, Xtest, ytest, y_pred):
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
    from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

    print("Recall Training data:     ", np.round(recall_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Precision Training data:  ", np.round(precision_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Recall Test data:         ", np.round(recall_score(ytest, model.predict(Xtest)), decimals=4)) 
    print("Precision Test data:      ", np.round(precision_score(ytest, model.predict(Xtest)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Confusion Matrix Test data")
    print(confusion_matrix(ytest, model.predict(Xtest)))
    print("----------------------------------------------------------------------")
    print(classification_report(ytest, model.predict(Xtest)))
    
    y_pred_proba = model.predict_proba(Xtest)[::,1]
    fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred)
    auc = metrics.roc_auc_score(ytest, y_pred)
    plt.plot(fpr, tpr, label='Logistic Regression (auc = 
    plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
    plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.legend(loc=4)
    plt.plot([0, 1], [0, 1],'r--')
    plt.show()
    print('auc',auc)

blue(X.shape)
green(X_train.shape)
green(X_test.shape)

(768, 14)
(614, 14)
(154, 14)

Recall Training data:      0.9728
Precision Training data:   0.9766
----------------------------------------------------------------------
Recall Test data:          0.9692
Precision Test data:       0.9844
----------------------------------------------------------------------
Confusion Matrix Test data
[[88  1]
 [ 2 63]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.99      0.98        89
           1       0.98      0.97      0.98        65

    accuracy                           0.98       154
   macro avg       0.98      0.98      0.98       154
weighted avg       0.98      0.98      0.98       154

auc 0.9789974070872948

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               C6H6(GT)   R-squared:                       0.691
Model:                            OLS   Adj. R-squared:                  0.685
Method:                 Least Squares   F-statistic:                     120.2
Date:                Sat, 28 Mar 2020   Prob (F-statistic):          4.93e-181
Time:                        09:35:48   Log-Likelihood:                -96.437
No. Observations:                 768   AIC:                             222.9
Df Residuals:                     753   BIC:                             292.5
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -2.0863      0.283     -7.371      0.000      -2.642      -1.531
CO(GT)           -0.0003      0.000     -0.938      0.348      -0.001       0.000
PT08.S1(CO)      -0.0002      0.000     -1.127      0.260      -0.001       0.000
PT08.S2(NMHC)     0.0030      0.000      8.413      0.000       0.002       0.004
NOx(GT)          -0.0002      0.000     -0.395      0.693      -0.001       0.001
PT08.S3(NOx)      0.0007      0.000      5.578      0.000       0.000       0.001
NO2(GT)           0.0013      0.001      1.408      0.160      -0.001       0.003
PT08.S4(NO2)     -0.0007      0.000     -2.763      0.006      -0.001      -0.000
PT08.S5(O3)    3.589e-05   8.45e-05      0.425      0.671      -0.000       0.000
T                 0.0008      0.011      0.073      0.942      -0.021       0.022
RH               -0.0015      0.004     -0.392      0.695      -0.009       0.006
AH                0.4456      0.278      1.602      0.109      -0.100       0.992
day              -0.0008      0.005     -0.138      0.890      -0.011       0.010
month            -0.0009      0.004     -0.243      0.808      -0.008       0.006
hour             -0.0028      0.002     -1.495      0.135      -0.006       0.001
==============================================================================
Omnibus:                       43.664   Durbin-Watson:                   1.370
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.380
Skew:                           0.068   Prob(JB):                     0.000168
Kurtosis:                       2.276   Cond. No.                     8.46e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.46e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. ]), )

Index(['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)',
       'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH',
       'AH', 'day', 'month', 'hour'],
      dtype='object')

Modelu klasyfikacji bez wyboru funkcji¶

import numpy as np
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid.fit(X_train, y_train) 
y_pred_LRC = LR_Grid.predict(X_test)

Classification_Assessment(LR_Grid ,X_train, y_train, X_test, y_test, y_pred_LRC)

Recall Training data:      0.9728
Precision Training data:   0.9766
----------------------------------------------------------------------
Recall Test data:          0.9692
Precision Test data:       0.9844
----------------------------------------------------------------------
Confusion Matrix Test data
[[88  1]
 [ 2 63]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.99      0.98        89
           1       0.98      0.97      0.98        65

    accuracy                           0.98       154
   macro avg       0.98      0.98      0.98       154
weighted avg       0.98      0.98      0.98       154

auc 0.9789974070872948

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               C6H6(GT)   R-squared:                       0.691
Model:                            OLS   Adj. R-squared:                  0.685
Method:                 Least Squares   F-statistic:                     120.2
Date:                Sat, 28 Mar 2020   Prob (F-statistic):          4.93e-181
Time:                        09:35:48   Log-Likelihood:                -96.437
No. Observations:                 768   AIC:                             222.9
Df Residuals:                     753   BIC:                             292.5
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -2.0863      0.283     -7.371      0.000      -2.642      -1.531
CO(GT)           -0.0003      0.000     -0.938      0.348      -0.001       0.000
PT08.S1(CO)      -0.0002      0.000     -1.127      0.260      -0.001       0.000
PT08.S2(NMHC)     0.0030      0.000      8.413      0.000       0.002       0.004
NOx(GT)          -0.0002      0.000     -0.395      0.693      -0.001       0.001
PT08.S3(NOx)      0.0007      0.000      5.578      0.000       0.000       0.001
NO2(GT)           0.0013      0.001      1.408      0.160      -0.001       0.003
PT08.S4(NO2)     -0.0007      0.000     -2.763      0.006      -0.001      -0.000
PT08.S5(O3)    3.589e-05   8.45e-05      0.425      0.671      -0.000       0.000
T                 0.0008      0.011      0.073      0.942      -0.021       0.022
RH               -0.0015      0.004     -0.392      0.695      -0.009       0.006
AH                0.4456      0.278      1.602      0.109      -0.100       0.992
day              -0.0008      0.005     -0.138      0.890      -0.011       0.010
month            -0.0009      0.004     -0.243      0.808      -0.008       0.006
hour             -0.0028      0.002     -1.495      0.135      -0.006       0.001
==============================================================================
Omnibus:                       43.664   Durbin-Watson:                   1.370
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.380
Skew:                           0.068   Prob(JB):                     0.000168
Kurtosis:                       2.276   Cond. No.                     8.46e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.46e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. ]), )

Index(['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)',
       'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH',
       'AH', 'day', 'month', 'hour'],
      dtype='object')

Redukcja zmiennych niezależnych za pomocą OLS¶

from statsmodels.formula.api import ols
import statsmodels.api as sm

model = sm.OLS(y, sm.add_constant(X))
model_fit = model.fit()

blue(model_fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               C6H6(GT)   R-squared:                       0.691
Model:                            OLS   Adj. R-squared:                  0.685
Method:                 Least Squares   F-statistic:                     120.2
Date:                Sat, 28 Mar 2020   Prob (F-statistic):          4.93e-181
Time:                        09:35:48   Log-Likelihood:                -96.437
No. Observations:                 768   AIC:                             222.9
Df Residuals:                     753   BIC:                             292.5
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -2.0863      0.283     -7.371      0.000      -2.642      -1.531
CO(GT)           -0.0003      0.000     -0.938      0.348      -0.001       0.000
PT08.S1(CO)      -0.0002      0.000     -1.127      0.260      -0.001       0.000
PT08.S2(NMHC)     0.0030      0.000      8.413      0.000       0.002       0.004
NOx(GT)          -0.0002      0.000     -0.395      0.693      -0.001       0.001
PT08.S3(NOx)      0.0007      0.000      5.578      0.000       0.000       0.001
NO2(GT)           0.0013      0.001      1.408      0.160      -0.001       0.003
PT08.S4(NO2)     -0.0007      0.000     -2.763      0.006      -0.001      -0.000
PT08.S5(O3)    3.589e-05   8.45e-05      0.425      0.671      -0.000       0.000
T                 0.0008      0.011      0.073      0.942      -0.021       0.022
RH               -0.0015      0.004     -0.392      0.695      -0.009       0.006
AH                0.4456      0.278      1.602      0.109      -0.100       0.992
day              -0.0008      0.005     -0.138      0.890      -0.011       0.010
month            -0.0009      0.004     -0.243      0.808      -0.008       0.006
hour             -0.0028      0.002     -1.495      0.135      -0.006       0.001
==============================================================================
Omnibus:                       43.664   Durbin-Watson:                   1.370
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.380
Skew:                           0.068   Prob(JB):                     0.000168
Kurtosis:                       2.276   Cond. No.                     8.46e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.46e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. ]), )

Index(['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)',
       'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH',
       'AH', 'day', 'month', 'hour'],
      dtype='object')

p_values = model_fit.summary2().tables[1]['P>|t|']
## zaokrąglam


p_values = np.round(p_values, decimals=2)
p_values= p_values.sort_values()

plt.figure(figsize=(3,8))
p_values.plot(kind='barh')
plt.title('p-value for independent variables in OLS')
plt.grid(True)
plt.ylabel('independent variables')
plt.xlabel('p-value')
plt.xticks(rotation=90)

(array([0. , 0.2, 0.4, 0.6, 0.8, 1. ]), )

Index(['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)',
       'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH',
       'AH', 'day', 'month', 'hour'],
      dtype='object')

Wybieramy zmienne z p-value < 0.1 ¶

df.columns

Index(['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)',
       'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH',
       'AH', 'day', 'month', 'hour'],
      dtype='object')

df2= df[['PT08.S4(NO2)','PT08.S3(NOx)','PT08.S2(NMHC)','AH','C6H6(GT)']]

y= y.to_frame()
y.head(4)

fig = plt.figure(figsize = (20, 25))
j = 0
for i in df2.columns:
    plt.subplot(6, 4, j+1)
    j = 1+j
    sns.distplot(df2[i][y['C6H6(GT)']==0], color='#999999', label = '0')
    sns.distplot(df2[i][y['C6H6(GT)']==1], color='#ff0000', label = '1')
    plt.legend(loc='best',fontsize=10)
fig.suptitle('Classification charts',fontsize=34,color='#ff0000',alpha=0.3)
fig.tight_layout()
fig.subplots_adjust(top=0.95)
plt.show()

def scientist_plot(data, y, AAA, Title):
    fig = plt.figure(figsize = (20, 25))
    j = 0
    for i in df2.columns:
        plt.subplot(6, 4, j+1)
        j = 1+j
        sns.distplot(data[i][y[AAA]==0], color='#999999', label = '0')
        sns.distplot(data[i][y[AAA]==1], color='#274e13', label = '1')
        plt.legend(loc='best',fontsize=10)
    fig.suptitle(Title,fontsize=34,color='#274e13',alpha=0.5)
    fig.tight_layout()
    fig.subplots_adjust(top=0.95)
    plt.show()

scientist_plot(df2, y, 'C6H6(GT)','Classification charts')

fig = plt.figure(figsize = (20, 25))
kot = ['#999999','#274e13']
sns.pairplot(data=df2[['PT08.S4(NO2)','PT08.S3(NOx)','PT08.S2(NMHC)','AH','C6H6(GT)']], hue='C6H6(GT)', dropna=True, height=2, palette=kot)
fig.suptitle('Classification charts',fontsize=34,color='#274e13',alpha=0.3)
fig.tight_layout()
fig.subplots_adjust(top=0.95)
plt.show()

	id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	…	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst	Unnamed: 32
0	842302	M	17.99	10.38	122.8	1001.0	0.11840	0.27760	0.3001	0.14710	…	17.33	184.6	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890	NaN
1	842517	M	20.57	17.77	132.9	1326.0	0.08474	0.07864	0.0869	0.07017	…	23.41	158.8	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902	NaN
2	84300903	M	19.69	21.25	130.0	1203.0	0.10960	0.15990	0.1974	0.12790	…	25.53	152.5	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758	NaN

	radius_mean	texture_mean	perimeter_mean	concavity_mean	symmetry_mean	radius_se	perimeter_se	radius_worst	texture_worst	perimeter_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	concave_points_worst
0	17.99	10.38	122.8	0.3001	0.2419	1.0950	8.589	25.38	17.33	184.6	0.6656	0.7119	0.2654	0.4601	0.2654
1	20.57	17.77	132.9	0.0869	0.1812	0.5435	3.398	24.99	23.41	158.8	0.1866	0.2416	0.1860	0.2750	0.1860
2	19.69	21.25	130.0	0.1974	0.2069	0.7456	4.585	23.57	25.53	152.5	0.4245	0.4504	0.2430	0.3613	0.2430

	radius_mean	texture_mean	perimeter_mean	concavity_mean	symmetry_mean	radius_se	perimeter_se	radius_worst	texture_worst	perimeter_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	concave_points_worst	compactness_mean
0	17.99	10.38	122.8	0.3001	0.2419	1.0950	8.589	25.38	17.33	184.6	0.6656	0.7119	0.2654	0.4601	0.2654	0.27760
1	20.57	17.77	132.9	0.0869	0.1812	0.5435	3.398	24.99	23.41	158.8	0.1866	0.2416	0.1860	0.2750	0.1860	0.07864
2	19.69	21.25	130.0	0.1974	0.2069	0.7456	4.585	23.57	25.53	152.5	0.4245	0.4504	0.2430	0.3613	0.2430	0.15990

	id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	…	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst	Unnamed: 32
0	842302	M	17.99	10.38	122.8	1001.0	0.11840	0.27760	0.3001	0.14710	…	17.33	184.6	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890	NaN
1	842517	M	20.57	17.77	132.9	1326.0	0.08474	0.07864	0.0869	0.07017	…	23.41	158.8	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902	NaN
2	84300903	M	19.69	21.25	130.0	1203.0	0.10960	0.15990	0.1974	0.12790	…	25.53	152.5	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758	NaN

	compactness_mean	texture_worst	perimeter_se
0	0.27760	17.33	8.589
1	0.07864	23.41	3.398
2	0.15990	25.53	4.585

	Unnamed: 0	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th…	female	38.0	1	0	PC 17599	71.2833	C85	C

	0.2	0.7	…	0.964	negative
0	1	0	…	0	negative
1	0	0	…	1	negative
2	0	1	…	0	negative

	Date	Time	CO(GT)	PT08.S1(CO)	NMHC(GT)	C6H6(GT)	PT08.S2(NMHC)	NOx(GT)	PT08.S3(NOx)	NO2(GT)	PT08.S4(NO2)	PT08.S5(O3)	T	RH	AH	Unnamed: 15	Unnamed: 16
0	10/03/2004	18.00.00	2,6	1360	150	11,9	1046	166	1056	113	1692	1268	13,6	48,9	0,7578	NaN	NaN
1	10/03/2004	19.00.00	2	1292	112	9,4	955	103	1174	92	1559	972	13,3	47,7	0,7255	NaN	NaN
2	10/03/2004	20.00.00	2,2	1402	88	9,0	939	131	1140	114	1555	1074	11,9	54,0	0,7502	NaN	NaN

	0	0.1	0.2	0.64	0.87	0.218	0.406	0.479	0.485	0.499	0.530	0.531	0.658	0.751	0.891	ident
0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

Feature Selection Techniques - THE DATA SCIENCE LIBRARY

Homemade loop to search for the best functions for the regression model (Feature Selection Techniques)

I started with a loop connecting two functions in pairs¶

Using loops in place of gaps I insert values out of range¶

Encodes discrete (categorical) variables¶

I run the LinearRegression () model

I create loops for two variables based on LinearRegression ()¶

I create loops for three variables based on LinearRegression ()¶

I create loops for four variables based on LinearRegression ()¶

I am starting the RandomForestRegressor model

Feature Selection Techniques [categorical result] – Step Forward Selection

I’m looking for empty cells¶

Mark empty cells as -999¶

Deletes duplicates¶

Encodes the resulting value¶

Step Forward Selection¶

Creates a dataset with reduced columns¶

Logistic regression model for variables before reduction¶

Logistic regression model for variables after reduction¶

Feature Selection Techniques – Recursive Feature Elimination and cross-validated selection (RFECV)

Deleting unneeded columns¶

Deletes duplicates¶

We choose the continuous variable – compactness_mean¶

Recursive Feature Elimination and cross-validated selection (RFECV)¶

I set the minimum number of variables that will remain in the model¶

Metoda zip na wyświetlenie rankingu cech¶

We’re adding a result variable¶

OLS linear regression model for variables before reduction¶

OLS linear regression model for variables after reduction¶

Feature Selection Techniques – Embedded Method (Lasso)

Deleting unneeded columns¶

Deletes duplicates¶

We choose the continuous variable – compactness_mean¶

Lasso¶

I set the number of variables that will remain in the model¶

Metoda zip na wyświetlenie rankingu cech¶

We’re adding a result variable¶

OLS linear regression model for variables before reduction¶

OLS linear regression model for variables after reduction¶

Feature Selection Techniques – Recursive Feature Elimination (RFE)

Deleting unneeded columns¶

Deletes duplicates¶

We choose the continuous variable – compactness_mean¶

Recursive Feature elimination¶

I set the number of variables that will remain in the model¶

Metoda zip na wyświetlenie rankingu cech¶

We’re adding a result variable¶

OLS linear regression model for variables before reduction¶

OLS linear regression model for variables after reduction¶

Feature Selection Techniques – Backward Elimination

Deleting unneeded columns¶

Deletes duplicates¶

We choose the continuous variable – compactness_mean¶

Backward Elimination¶

OLS linear regression model for variables before reduction¶

Feature Selection Techniques [numerical result] – Step Forward Selection

Deleting unneeded columns¶

Deletes duplicates¶

We choose the continuous variable – compactness_mean¶

Step Forward Selection¶

I create a dataset with reduced columns.

OLS linear regression model for variables before reduction¶

OLS linear regression model for variables after reduction¶

Feature Selection Techniques – Variance Inflation Factor (VIF)

Deleting unneeded columns¶

Deletes duplicates¶

We choose the continuous variable – compactness_mean¶

Variance Inflation Factor (VIF)¶

OLS linear regression model for variables before reduction¶

OLS linear regression model for variables after reduction¶

Feature Selection Techniques – Pearson correlation

Deleting unneeded columns¶

Deletes duplicates¶

We choose the continuous variable – compactness_mean¶

Pearson correlation¶

Correlation to the result variable¶

I find variables that are highly correlated with the result variable¶

Compares variables in pairs¶

High autocorrelation chart¶

Deleting correlated independent variables¶

High autocorrelation chart ¶

Wybieramy zmienne z p-value < 0.1 ¶