Homemade loop to search for the best functions for the regression model (Feature Selection Techniques)

090420201150

In [1]:
import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/tit_train.csv', na_values="-1")
df.head(2)
Out[1]:
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C

I started with a loop connecting two functions in pairs

In [2]:
## ile jest zmiennych
a,b = df.shape     #<- ile mamy kolumn
b
Out[2]:
13
In [3]:
for i in range(1,b):
    i = df.columns[i]
    for f in range (1,b):
        f = df.columns[f]
        print(i,f)
       
PassengerId PassengerId
PassengerId Survived
PassengerId Pclass
PassengerId Name
PassengerId Sex
PassengerId Age
PassengerId SibSp
PassengerId Parch
PassengerId Ticket
PassengerId Fare
PassengerId Cabin
PassengerId Embarked
Survived PassengerId
Survived Survived
Survived Pclass
Survived Name
Survived Sex
Survived Age
Survived SibSp
Survived Parch
Survived Ticket
Survived Fare
Survived Cabin
Survived Embarked
Pclass PassengerId
Pclass Survived
Pclass Pclass
Pclass Name
Pclass Sex
Pclass Age
Pclass SibSp
Pclass Parch
Pclass Ticket
Pclass Fare
Pclass Cabin
Pclass Embarked
Name PassengerId
Name Survived
Name Pclass
Name Name
Name Sex
Name Age
Name SibSp
Name Parch
Name Ticket
Name Fare
Name Cabin
Name Embarked
Sex PassengerId
Sex Survived
Sex Pclass
Sex Name
Sex Sex
Sex Age
Sex SibSp
Sex Parch
Sex Ticket
Sex Fare
Sex Cabin
Sex Embarked
Age PassengerId
Age Survived
Age Pclass
Age Name
Age Sex
Age Age
Age SibSp
Age Parch
Age Ticket
Age Fare
Age Cabin
Age Embarked
SibSp PassengerId
SibSp Survived
SibSp Pclass
SibSp Name
SibSp Sex
SibSp Age
SibSp SibSp
SibSp Parch
SibSp Ticket
SibSp Fare
SibSp Cabin
SibSp Embarked
Parch PassengerId
Parch Survived
Parch Pclass
Parch Name
Parch Sex
Parch Age
Parch SibSp
Parch Parch
Parch Ticket
Parch Fare
Parch Cabin
Parch Embarked
Ticket PassengerId
Ticket Survived
Ticket Pclass
Ticket Name
Ticket Sex
Ticket Age
Ticket SibSp
Ticket Parch
Ticket Ticket
Ticket Fare
Ticket Cabin
Ticket Embarked
Fare PassengerId
Fare Survived
Fare Pclass
Fare Name
Fare Sex
Fare Age
Fare SibSp
Fare Parch
Fare Ticket
Fare Fare
Fare Cabin
Fare Embarked
Cabin PassengerId
Cabin Survived
Cabin Pclass
Cabin Name
Cabin Sex
Cabin Age
Cabin SibSp
Cabin Parch
Cabin Ticket
Cabin Fare
Cabin Cabin
Cabin Embarked
Embarked PassengerId
Embarked Survived
Embarked Pclass
Embarked Name
Embarked Sex
Embarked Age
Embarked SibSp
Embarked Parch
Embarked Ticket
Embarked Fare
Embarked Cabin
Embarked Embarked

Using loops in place of gaps I insert values out of range

In [4]:
print('NUMBER OF EMPTY RECORDS vs. FULL RECORDS')
print('----------------------------------------')
for i in range(1,b):
    i = df.columns[i]
    r = df[i].isnull().sum()
    h = df[i].count()
   
    if r > 0:
        print(i,"--------",r,"--------",h) 
NUMBER OF EMPTY RECORDS vs. FULL RECORDS
----------------------------------------
Age -------- 177 -------- 714
Cabin -------- 687 -------- 204
Embarked -------- 2 -------- 889
In [5]:
df.fillna(-777, inplace=True)
In [6]:
df = df.dropna(how='any')
df.isnull().sum()
Out[6]:
Unnamed: 0     0
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
In [7]:
df.shape
Out[7]:
(891, 13)

Encodes discrete (categorical) variables

In [8]:
import numpy as np

a,b = df.shape     #<- ile mamy kolumn
b


print('DISCRETE FUNCTIONS CODED')
print('------------------------')
for i in range(1,b):
    i = df.columns[i]
    f = df[i].dtypes
    if f == np.object:
        print(i,"---",f)   
    
        if f == np.object:
        
            df[i] = pd.Categorical(df[i]).codes
        
            continue
    
DISCRETE FUNCTIONS CODED
------------------------
Name --- object
Sex --- object
Ticket --- object
Cabin --- object
Embarked --- object

I run the LinearRegression () model

In [9]:
y = df['Survived']
X = df.drop('Survived', axis=1)
In [10]:
from sklearn.linear_model import LinearRegression

regr = LinearRegression() 
 

I create loops for two variables based on LinearRegression ()

In [11]:
c,b = df.shape     #<- ile mamy kolumn
print('b: ',b)

a = list(range(1,b))
print('a :', a)
b:  13
a : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
In [12]:
from sklearn import metrics
b= b-2


for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        
        y = df['Survived']       
        X = df.drop('Survived', axis=1)
        
        #a = X.columns[i]
        
        #b = X.columns[f]
        
        col = X.columns[[i,f]]   #<-- nazwy kolumn
        X = X[col]               #<-- FAKTYCZNE warianty zbioru X
        regr.fit(X, y)
        y_pred = regr.predict(X)
        R = regr.score(X, y)
        R2 = np.sqrt(metrics.mean_squared_error(y, y_pred))
        RR2 = R2+R
        if RR2 > 0.72:
        # print(' R2: %.3f' %R2, ' regr.score:%.3f' %regr.score(X, y), col)
            print(' RR2: %.3f' %RR2, col)
        
 RR2: 0.754 Index(['Pclass', 'Sex'], dtype='object')
 RR2: 0.754 Index(['Sex', 'Pclass'], dtype='object')
 RR2: 0.722 Index(['Sex', 'Fare'], dtype='object')
 RR2: 0.733 Index(['Sex', 'Cabin'], dtype='object')
 RR2: 0.722 Index(['Fare', 'Sex'], dtype='object')
 RR2: 0.733 Index(['Cabin', 'Sex'], dtype='object')

I create loops for three variables based on LinearRegression ()

In [13]:
from sklearn import metrics

for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        for g in range (1,b):
            g = a[g]
        
        
            y = df['Survived']       
            X = df.drop('Survived', axis=1)
        
            
            col = X.columns[[i,f,g]]   #<-- nazwy kolumn
            X = X[col]               #<-- FAKTYCZNE warianty zbioru X
            regr.fit(X, y)
            y_pred = regr.predict(X)
            R = regr.score(X, y)
            R2 = np.sqrt(metrics.mean_squared_error(y, y_pred))
            RR2 = R2+R
            if RR2 >= 0.757:
       
                print(' RR2: %.3f' %RR2, col)
        
 RR2: 0.758 Index(['Pclass', 'Sex', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'SibSp', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Cabin', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Pclass', 'Embarked', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'SibSp'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Cabin'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.758 Index(['Sex', 'SibSp', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Cabin', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Sex', 'Embarked', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['SibSp', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Cabin', 'Sex', 'Pclass'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Pclass', 'Sex'], dtype='object')
 RR2: 0.758 Index(['Embarked', 'Sex', 'Pclass'], dtype='object')

I create loops for four variables based on LinearRegression ()

In [14]:
from sklearn import metrics

for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        for g in range (1,b):
            g = a[g]
            for r in range (1,b):
                r = a[r]
        
            y = df['Survived']       
            X = df.drop('Survived', axis=1)
        
            
            col = X.columns[[i,f,g,r]]   #<-- nazwy kolumn
            X = X[col]               #<-- FAKTYCZNE warianty zbioru X
            regr.fit(X, y)
            y_pred = regr.predict(X)
            R = regr.score(X, y)
            R2 = np.sqrt(metrics.mean_squared_error(y, y_pred))
            RR2 = R2+R
            if RR2 >= 0.761:
       
                print(' RR2: %.3f' %RR2, col)
        
 RR2: 0.762 Index(['Pclass', 'Sex', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Pclass', 'Cabin', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Pclass', 'Cabin', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Sex', 'Cabin', 'Pclass', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Pclass', 'Sex', 'Embarked'], dtype='object')
 RR2: 0.762 Index(['Cabin', 'Sex', 'Pclass', 'Embarked'], dtype='object')

I am starting the RandomForestRegressor model

In [15]:
y = df['Survived']       
X = df.drop('Survived', axis=1)
print(X.shape)
print(y.shape)
(891, 12)
(891,)
In [16]:
from sklearn.ensemble import RandomForestRegressor

model_RFC1 = RandomForestRegressor().fit(X, y)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [17]:
from sklearn import metrics

for i in range(1,b):
    i = a[i]
    for f in range (1,b):
        f = a[f]
        for g in range (1,b):
            g = a[g]
        
        
            y = df['Survived']       
            X = df.drop('Survived', axis=1)
        
            
            col = X.columns[[i,f,g]]   #<-- nazwy kolumn
            X = X[col]               #<-- FAKTYCZNE warianty zbioru X
            model_RFC1.fit(X, y)
            y_pred2 = model_RFC1.predict(X)
            R = model_RFC1.score(X, y)
            R2 = np.sqrt(metrics.mean_squared_error(y, y_pred2))
            RR2 = R2+R
            if RR2 >= 1.05:
       
                print(' RR2: %.3f' %RR2, col)
 RR2: 1.051 Index(['Name', 'Ticket', 'Sex'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Name', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Name', 'Fare'], dtype='object')
 RR2: 1.050 Index(['Sex', 'Age', 'Ticket'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Ticket', 'Name'], dtype='object')
 RR2: 1.052 Index(['Sex', 'Fare', 'Name'], dtype='object')
 RR2: 1.051 Index(['Ticket', 'Name', 'Sex'], dtype='object')
 RR2: 1.053 Index(['Ticket', 'Sex', 'Name'], dtype='object')
 RR2: 1.052 Index(['Fare', 'Sex', 'Name'], dtype='object')

My own tractor comes to similar conclusions as other tools in the series: Feature Selection Techniques.
Only that my tractor is probably faster in calculations …..