Improving the linear regression model using synthetic variables

Parking Birmingham occupancy analysis

Source of data: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham

In [1]:
import pandas as pd

df = pd.read_csv('c:/TF/ParkingBirmingham.csv')
df.head(3)
Out[1]:
SystemCodeNumber Capacity Occupancy LastUpdated
0 BHMBCCMKT01 577 61 2016-10-04 07:59:42
1 BHMBCCMKT01 577 64 2016-10-04 08:25:42
2 BHMBCCMKT01 577 80 2016-10-04 08:59:42

Which car parks in Birmingham does the dataset cover?

In [2]:
df['SystemCodeNumber'].value_counts()
Out[2]:
BHMNCPHST01         1312
Broad Street        1312
BHMBCCMKT01         1312
BHMEURBRD01         1312
Shopping            1312
Others-CCCPS119a    1312
Others-CCCPS105a    1312
BHMNCPNST01         1312
Others-CCCPS98      1312
BHMMBMMBX01         1312
Others-CCCPS135a    1312
Others-CCCPS202     1312
BHMBCCTHL01         1312
Others-CCCPS8       1312
BHMBCCSNH01         1294
Others-CCCPS133     1294
BHMNCPLDH01         1292
BHMNCPPLS01         1291
BHMBCCPST01         1276
BHMEURBRD02         1276
NIA South           1204
NIA Car Parks       1204
BHMNCPRAN01         1186
BHMBRCBRG02         1186
BHMBRCBRG03         1186
Bull Ring           1186
BHMBRCBRG01         1186
BHMNCPNHS01         1038
NIA North            162
BHMBRTARC01           88
Name: SystemCodeNumber, dtype: int64

Checking the completeness of the data

In [3]:
df.isnull().sum()
Out[3]:
SystemCodeNumber    0
Capacity            0
Occupancy           0
LastUpdated         0
dtype: int64

Checking data type

In [4]:
df.dtypes
Out[4]:
SystemCodeNumber    object
Capacity             int64
Occupancy            int64
LastUpdated         object
dtype: object

We need a proper datetime variable, because the time of measurement probably explains car park occupancy best.

In [5]:
df.LastUpdated = pd.to_datetime(df.LastUpdated)
df.dtypes
Out[5]:
SystemCodeNumber            object
Capacity                     int64
Occupancy                    int64
LastUpdated         datetime64[ns]
dtype: object

I derive date-related features from the timestamp.

In [6]:
df['month'] = df.LastUpdated.dt.month
df['hour'] = df.LastUpdated.dt.hour
# dt.weekday_name was removed in pandas 1.0; dt.day_name() returns the same labels
df['weekday_name'] = df.LastUpdated.dt.day_name()
df['weekday'] = df.LastUpdated.dt.weekday
In [7]:
df.head(4)
Out[7]:
SystemCodeNumber Capacity Occupancy LastUpdated month hour weekday_name weekday
0 BHMBCCMKT01 577 61 2016-10-04 07:59:42 10 7 Tuesday 1
1 BHMBCCMKT01 577 64 2016-10-04 08:25:42 10 8 Tuesday 1
2 BHMBCCMKT01 577 80 2016-10-04 08:59:42 10 8 Tuesday 1
3 BHMBCCMKT01 577 107 2016-10-04 09:32:46 10 9 Tuesday 1
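
The date features above can be reproduced on a tiny hand-made frame. This is a minimal sketch, assuming a pandas version in which `Series.dt.day_name()` is available; the frame `demo` is hypothetical, and its two timestamps are copied from the preview rows shown above.

```python
import pandas as pd

# Hypothetical two-row frame; the timestamps are copied from the dataset preview.
demo = pd.DataFrame({'LastUpdated': ['2016-10-04 07:59:42', '2016-10-04 09:32:46']})
demo['LastUpdated'] = pd.to_datetime(demo['LastUpdated'])

# The .dt accessor exposes calendar components of a datetime column.
demo['month'] = demo['LastUpdated'].dt.month
demo['hour'] = demo['LastUpdated'].dt.hour
demo['weekday_name'] = demo['LastUpdated'].dt.day_name()
demo['weekday'] = demo['LastUpdated'].dt.weekday  # Monday=0 ... Sunday=6

print(demo)
```

Note that `weekday` is a plain integer, so a linear model will treat Sunday (6) as "six times" Monday-plus-one rather than as a separate category.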

I select one car park for analysis.

In [8]:
# BHMBCCMKT01
# BHMNCPNST01
# BHMBCCTHL01 (48)
# BHMMBMMBX01 (34)
# BHMBCCSNH01
df2 = df.loc[df['SystemCodeNumber']=='BHMBCCTHL01'] 
df2.shape
Out[8]:
(1312, 8)
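
Because new columns are added to the filtered frame later on, it can help to take an explicit copy of the selected rows. The following is a minimal sketch on a hypothetical miniature table; the occupancy values and the `is_busy` column are made up for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the parking table (values are invented).
demo = pd.DataFrame({
    'SystemCodeNumber': ['BHMBCCTHL01', 'BHMBCCMKT01', 'BHMBCCTHL01'],
    'Occupancy': [120, 61, 135],
})

# A boolean mask keeps one car park; .copy() makes the result independent of demo.
subset = demo.loc[demo['SystemCodeNumber'] == 'BHMBCCTHL01'].copy()

# Adding a column now changes only subset, not demo.
subset['is_busy'] = subset['Occupancy'] > 125

print(subset.shape)
print('is_busy' in demo.columns)
```

With an explicit copy, pandas has no reason to warn that an assignment might be hitting a temporary slice of the original frame.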

Sklearn linear regression model

In [9]:
X = df2[['month', 'hour', 'weekday'] ].values
y = df2['Occupancy'].values
In [10]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[10]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [11]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
In [12]:
from sklearn import metrics

# r2_score returns the coefficient of determination (R squared), not the mean squared error
print('R squared:     ', metrics.r2_score(y_test, y_pred))
R squared:      0.47651714547905455

The R squared of the sklearn linear regression model is about 0.48, so the model explains less than half of the variance in occupancy and is still very weak.
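
The score reported above is the coefficient of determination, which is a different quantity from the mean squared error. The sketch below, using NumPy only, shows how the two differ; the helper names `r_squared` and `mean_squared_error` and the toy numbers are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

def mean_squared_error(y_true, y_pred):
    """Average squared residual, in squared units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Toy values, invented for the example.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

r2 = r_squared(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print('R squared:', r2)
print('MSE:      ', mse)
```

R squared is unitless and at most 1, while the mean squared error depends on the scale of the target, so the two numbers are not interchangeable.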

Improving the linear regression model

Let’s start by checking the correlation with the dependent variable.

In [13]:
# Restrict to numeric columns: newer pandas versions raise an error if corr() meets text columns
CORREL = df2.select_dtypes('number').corr()
CORREL['Occupancy'].sort_values().to_frame()
Out[13]:
Occupancy
weekday -0.027324
month 0.314196
hour 0.662273
Occupancy 1.000000
Capacity NaN

One of the basic conditions for building a linear regression model is a linear relationship between the dependent and independent variables.
When we want to park in the city, the hour matters a great deal: it is difficult to find a space at peak times. Yet at the same hour on a Saturday the car park may be almost empty.
We therefore create a synthetic variable from the combination of weekday and hour, because a given hour on Sunday is not the same as that hour on Wednesday.
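
The correlation coefficients shown earlier are Pearson coefficients (the default of `corr()`), which capture only linear association. A minimal NumPy sketch with invented data illustrates the point: a variable that rises to a peak and then falls symmetrically has a Pearson correlation of essentially zero, even though it is fully determined by the input.

```python
import numpy as np

x = np.arange(7)               # 0, 1, ..., 6 (for example, a day index)
peaked = -(x - 3) ** 2         # rises to a peak at x = 3, then falls symmetrically
linear = 2 * x + 1             # a perfectly linear relationship for comparison

r_peaked = np.corrcoef(x, peaked)[0, 1]
r_linear = np.corrcoef(x, linear)[0, 1]

print('peaked:', round(r_peaked, 6))
print('linear:', round(r_linear, 6))
```

A low coefficient, such as the one for `weekday`, therefore does not prove that the variable carries no information; it may only mean that the relationship is not linear.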

We create a synthetic variable: Hour on Weekday (HoW)

In [14]:
# Work on an explicit copy so that adding columns does not raise SettingWithCopyWarning
df2 = df2.copy()
# Join weekday name and hour into one label, e.g. 'Tuesday-7'
df2['combined_col'] = df2[['weekday_name', 'hour']].astype(str).apply('-'.join, axis=1)
# Encode each distinct label as an integer code
df2['HoW'] = pd.Categorical(df2['combined_col']).codes
df2['HoW'].head(3)
Out[14]:
3882    57
3883    58
3884    58
Name: HoW, dtype: int8

We’ve created a synthetic variable: a column containing an integer code for each combination of day of the week and hour. The code is easy to read back.

In [15]:
df2[['HoW', 'weekday_name', 'hour']].head(4)
Out[15]:
HoW weekday_name hour
3882 57 Tuesday 7
3883 58 Tuesday 8
3884 58 Tuesday 8
3885 59 Tuesday 9
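
It is worth seeing how these integer codes are assigned. As far as the pandas documentation describes it, `pd.Categorical` sorts the distinct labels (when they are sortable) and numbers them in that order, so for strings the codes follow alphabetical rather than chronological order. A small sketch with hypothetical labels:

```python
import pandas as pd

# Hypothetical labels in the same 'weekday-hour' format as combined_col.
labels = ['Tuesday-7', 'Tuesday-10', 'Monday-9', 'Tuesday-7']

cat = pd.Categorical(labels)
categories = list(cat.categories)   # distinct labels, sorted as strings
codes = cat.codes.tolist()          # position of each label within categories

print(categories)
print(codes)
```

Here 'Tuesday-10' sorts before 'Tuesday-7' because the comparison is character by character, so the numeric value of a code carries no ordering that a linear model could meaningfully use.
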
In [16]:
# Restrict to numeric columns: newer pandas versions raise an error if corr() meets text columns
CORREL = df2.select_dtypes('number').corr()
CORREL['Occupancy'].sort_values().to_frame()
Out[16]:
Occupancy
weekday -0.027324
HoW 0.059408
month 0.314196
hour 0.662273
Occupancy 1.000000
Capacity NaN

Now I add the new synthetic variable to the sklearn linear regression model.

In [17]:
X = df2[['month', 'hour', 'weekday','HoW'] ].values
y = df2['Occupancy'].values
In [18]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[18]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [19]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
In [20]:
from sklearn import metrics

# r2_score returns the coefficient of determination (R squared), not the mean squared error
print('R squared:     ', metrics.r2_score(y_test, y_pred))
R squared:      0.48074077157048334

As we can see, introducing the synthetic variable improved the model only slightly: R squared rose from about 0.477 to about 0.481.