Parking Birmingham occupancy analysis

Source of data: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham

import pandas as pd

df = pd.read_csv('c:/TF/ParkingBirmingham.csv')
df.head(3)

Which car parks in Birmingham are in the data set?

df['SystemCodeNumber'].value_counts()

BHMNCPHST01         1312
Broad Street        1312
BHMBCCMKT01         1312
BHMEURBRD01         1312
Shopping            1312
Others-CCCPS119a    1312
Others-CCCPS105a    1312
BHMNCPNST01         1312
Others-CCCPS98      1312
BHMMBMMBX01         1312
Others-CCCPS135a    1312
Others-CCCPS202     1312
BHMBCCTHL01         1312
Others-CCCPS8       1312
BHMBCCSNH01         1294
Others-CCCPS133     1294
BHMNCPLDH01         1292
BHMNCPPLS01         1291
BHMBCCPST01         1276
BHMEURBRD02         1276
NIA South           1204
NIA Car Parks       1204
BHMNCPRAN01         1186
BHMBRCBRG02         1186
BHMBRCBRG03         1186
Bull Ring           1186
BHMBRCBRG01         1186
BHMNCPNHS01         1038
NIA North            162
BHMBRTARC01           88
Name: SystemCodeNumber, dtype: int64

Checking the completeness of the data

df.isnull().sum()

SystemCodeNumber    0
Capacity            0
Occupancy           0
LastUpdated         0
dtype: int64

Checking data types

df.dtypes

SystemCodeNumber    object
Capacity             int64
Occupancy            int64
LastUpdated         object
dtype: object

We need a proper date variable, because the time of the reading probably explains car park occupancy best.

df.LastUpdated = pd.to_datetime(df.LastUpdated)
df.dtypes

SystemCodeNumber            object
Capacity                     int64
Occupancy                    int64
LastUpdated         datetime64[ns]
dtype: object

I create date-related variables.

df['month'] = df.LastUpdated.dt.month
df['hour'] = df.LastUpdated.dt.hour
df['weekday_name'] = df.LastUpdated.dt.day_name()  # .dt.weekday_name was removed in pandas 1.0
df['weekday'] = df.LastUpdated.dt.weekday

df.head(4)
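The date features above can be checked on a tiny frame. This is only a sketch: the two timestamps are hypothetical, not rows from the Birmingham file, and it uses `dt.day_name()` because the older `dt.weekday_name` attribute is no longer available in pandas 1.0 and later.

```python
import pandas as pd

# Hypothetical timestamps, not rows from the Birmingham data set
toy = pd.DataFrame({'LastUpdated': ['2016-10-04 07:59:42',
                                    '2016-10-08 16:30:00']})
toy['LastUpdated'] = pd.to_datetime(toy['LastUpdated'])

toy['month'] = toy['LastUpdated'].dt.month
toy['hour'] = toy['LastUpdated'].dt.hour
toy['weekday_name'] = toy['LastUpdated'].dt.day_name()  # 'Monday', 'Tuesday', ...
toy['weekday'] = toy['LastUpdated'].dt.weekday          # Monday=0 ... Sunday=6

print(toy)
```

Note that `weekday` is only a numeric label: 0 means Monday and 6 means Sunday, which says nothing about how busy a car park is on that day.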

I choose one car park for the analysis.

# BHMBCCMKT01
# BHMNCPNST01
# BHMBCCTHL01 (48)
# BHMMBMMBX01 (34)
# BHMBCCSNH01
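# .copy() gives an independent frame, so new columns can be added later without SettingWithCopyWarning
df2 = df.loc[df['SystemCodeNumber']=='BHMBCCTHL01'].copy()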
df2.shape

(1312, 8)

Sklearn linear regression model

X = df2[['month', 'hour', 'weekday'] ].values
y = df2['Occupancy'].values

from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()  
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
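A fitted `LinearRegression` keeps its parameters in `coef_` and `intercept_`, which is a quick way to see what the model learned. The sketch below uses hypothetical noise-free data (the names `x1` and `x2` are made up), so the recovered values are known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: y = 5 + 2*x1 + 3*x2
X_toy = np.array([[0, 1], [1, 0], [2, 2], [3, 1], [4, 3]], dtype=float)
y_toy = 5 + 2 * X_toy[:, 0] + 3 * X_toy[:, 1]

model = LinearRegression().fit(X_toy, y_toy)

# One coefficient per input column, in the same order as the columns of X
for name, coef in zip(['x1', 'x2'], model.coef_):
    print(name, round(coef, 4))
print('intercept', round(model.intercept_, 4))
```

For the car park model, the same loop over `['month', 'hour', 'weekday']` and `regressor.coef_` would show how much each feature moves the prediction.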

import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)

from sklearn import metrics

print('R squared:     ', metrics.r2_score(y_test, y_pred))

R squared:      0.47651714547905455

The R² of the sklearn linear regression model is about 0.48, so the model is still very weak.
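The figure printed above comes from `metrics.r2_score`, which is R squared and not the mean squared error. A minimal sketch of the difference, on hypothetical values chosen only for illustration:

```python
import numpy as np
from sklearn import metrics

# Hypothetical values, chosen only to illustrate the two metrics
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

# R squared by hand: 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2 = metrics.r2_score(y_true, y_hat)
mse = metrics.mean_squared_error(y_true, y_hat)

print('R squared:         ', r2)
print('R squared by hand: ', r2_manual)
print('Mean Squared Error:', mse)
```

R squared has no unit and is at most 1 (higher is better), while the mean squared error is in squared units of the target (lower is better), so the two numbers cannot be read on the same scale.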

Improving the linear regression model

Let’s start by checking the correlation with the dependent variable.

# Correlate numeric columns only; text columns such as weekday_name cannot be correlated
CORREL = df2.select_dtypes('number').corr()
CORREL['Occupancy'].to_frame().sort_values('Occupancy')

One of the basic conditions for building a linear regression model is a linear relationship between the dependent and the independent variables.
Anyone who has tried to park in a city knows that the hour matters a great deal: it is difficult to park at peak times. Yet at the same hour on a Saturday the car park may be almost empty.
So we create a synthetic variable from the combination of weekday and hour, because a given hour on Sunday does not mean the same as that hour on Wednesday.
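The linearity condition is easy to illustrate. In the sketch below the occupancy profile is hypothetical: the car park fills up towards noon and empties again, so occupancy is fully determined by the hour, yet the Pearson correlation between the two is practically zero.

```python
import numpy as np
import pandas as pd

# Hypothetical occupancy profile: low early and late, highest at noon
hours = np.arange(6, 19)                   # 6, 7, ..., 18
occupancy = 400 - 10 * (hours - 12) ** 2   # symmetric peak at 12

toy = pd.DataFrame({'hour': hours, 'Occupancy': occupancy})
r = toy['hour'].corr(toy['Occupancy'])     # Pearson correlation
print(abs(round(r, 6)))
```

A correlation near zero therefore does not mean that a variable is useless. It only means that a straight line cannot describe the relationship, which is the motivation for building a different variable from the same information.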

We create a synthetic variable: Hour on Weekday (HoW)

df2['combined_col'] = df2[['weekday_name', 'hour']].astype(str).apply('-'.join, axis=1)
df2['HoW'] = pd.Categorical(df2['combined_col']).codes
df2['HoW'].head(3)

3882    57
3883    58
3884    58
Name: HoW, dtype: int8

We have created a synthetic variable: a column holding an integer code for each combination of weekday and hour. The code is easy to read back.

df2[['HoW', 'weekday_name', 'hour']].head(4)
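How the codes come about can be seen on a small example. The rows below are hypothetical; the sketch only shows that `pd.Categorical` numbers the labels in sorted (alphabetical) order and that `categories` serves as the lookup table back to the label.

```python
import pandas as pd

# Hypothetical rows, not taken from the Birmingham data set
toy = pd.DataFrame({'weekday_name': ['Tuesday', 'Tuesday', 'Saturday', 'Friday'],
                    'hour': [8, 9, 8, 16]})

toy['combined_col'] = toy['weekday_name'] + '-' + toy['hour'].astype(str)
cat = pd.Categorical(toy['combined_col'])
toy['HoW'] = cat.codes

# Lookup table: code -> label; codes follow the sorted order of the labels
lookup = dict(enumerate(cat.categories))
print(lookup)
print(toy[['HoW', 'weekday_name', 'hour']])
```

Because the labels are sorted as text, the numeric order of the codes is alphabetical and not chronological. A linear model still treats the code as an ordinary number, which is worth keeping in mind when judging how much the variable can help.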

# Correlate numeric columns only; text columns such as weekday_name cannot be correlated
CORREL = df2.select_dtypes('number').corr()
CORREL['Occupancy'].to_frame().sort_values('Occupancy')

Now I add the new synthetic variable to the sklearn linear regression model.

X = df2[['month', 'hour', 'weekday','HoW'] ].values
y = df2['Occupancy'].values

from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()  
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)

from sklearn import metrics

print('R squared:     ', metrics.r2_score(y_test, y_pred))

R squared:      0.48074077157048334

As we can see, adding the synthetic variable improved the model only slightly: R² rose from 0.4765 to 0.4807.
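The split, fit and score steps were run twice with different feature lists. One way to make such comparisons less repetitive is a small helper; the sketch below is an assumption, not part of the original notebook. The name `fit_and_score` and the toy columns `a` and `b` are hypothetical, and the rounding of predictions is left out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(frame, features, target='Occupancy'):
    """Hypothetical helper: split, fit a linear model, return test-set R squared."""
    X = frame[features].values
    y = frame[target].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# Hypothetical noise-free data in which the target depends on both columns
idx = np.arange(50)
toy = pd.DataFrame({'a': idx % 7, 'b': idx % 5})
toy['Occupancy'] = 1 + 3 * toy['a'] + 2 * toy['b']

r2_a = fit_and_score(toy, ['a'])
r2_ab = fit_and_score(toy, ['a', 'b'])
print('R squared, a only: ', r2_a)
print('R squared, a and b:', r2_ab)
```

On the article's data the two calls would be `fit_and_score(df2, ['month', 'hour', 'weekday'])` and `fit_and_score(df2, ['month', 'hour', 'weekday', 'HoW'])`, with the same `random_state` so that both models are scored on the same test rows.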

THE DATA SCIENCE LIBRARY

Wojciech Moszczyński

Improving the linear regression model using synthetic variables Part. 1