
Parking Birmingham occupancy analysis
Source of data: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham
import pandas as pd
df = pd.read_csv('c:/TF/ParkingBirmingham.csv')
df.head(3)
Which car parks in Birmingham does the dataset cover?
df['SystemCodeNumber'].value_counts()
Checking the completeness of the data
df.isnull().sum()
Checking data types
df.dtypes
We need a proper datetime variable, because the time of the reading probably explains car park occupancy best.
df.LastUpdated = pd.to_datetime(df.LastUpdated)
df.dtypes
I create date-related features.
df['month'] = df.LastUpdated.dt.month
df['hour'] = df.LastUpdated.dt.hour
# dt.weekday_name was removed in pandas 1.0; dt.day_name() replaces it
df['weekday_name'] = df.LastUpdated.dt.day_name()
df['weekday'] = df.LastUpdated.dt.weekday
df.head(4)
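To make the four derived columns concrete, here is a minimal sketch on two made-up timestamps (they are not rows from the dataset). It assumes a recent pandas, where the weekday name comes from `dt.day_name()`.

```python
import pandas as pd

# Two invented timestamps, used only to illustrate the derived columns
toy = pd.DataFrame({'LastUpdated': ['2016-10-04 07:59:42', '2016-10-09 16:30:00']})
toy['LastUpdated'] = pd.to_datetime(toy['LastUpdated'])

toy['month'] = toy['LastUpdated'].dt.month
toy['hour'] = toy['LastUpdated'].dt.hour
toy['weekday_name'] = toy['LastUpdated'].dt.day_name()
toy['weekday'] = toy['LastUpdated'].dt.weekday  # Monday=0 ... Sunday=6
print(toy)
```

Note that `weekday` is an integer with Monday as 0, so Sunday is 6.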
I choose one car park for the analysis.
# BHMBCCMKT01
# BHMNCPNST01
# BHMBCCTHL01 (48)
# BHMMBMMBX01 (34)
# BHMBCCSNH01
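# .copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning
df2 = df.loc[df['SystemCodeNumber']=='BHMBCCTHL01'].copy()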
df2.shape
Sklearn linear regression model
X = df2[['month', 'hour', 'weekday'] ].values
y = df2['Occupancy'].values
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
import numpy as np
y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
from sklearn import metrics
print('R squared: ', metrics.r2_score(y_test, y_pred))
The R squared value for the sklearn linear regression model is a little better, but the model is still very weak.
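The value returned by `metrics.r2_score` is the coefficient of determination, which is a different quantity from the mean squared error. The sketch below, on invented numbers, computes both so the two are not confused. A model that always predicts the mean of the target scores an R squared of 0.

```python
import numpy as np
from sklearn import metrics

# Invented values, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

mse = metrics.mean_squared_error(y_true, y_hat)  # average squared error, in squared units of y
r2 = metrics.r2_score(y_true, y_hat)             # share of variance explained, at most 1

# Baseline: always predict the mean of y_true
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = metrics.r2_score(y_true, baseline)

print('MSE:', mse)
print('R squared:', r2)
print('R squared of mean baseline:', r2_baseline)
```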
Improving the linear regression model
Let’s start by checking the correlation with the dependent variable.
# numeric_only=True skips text and datetime columns (pandas 2.0 raises an error without it)
CORREL = df2.corr(numeric_only=True)
CORREL['Occupancy'].to_frame().sort_values('Occupancy')
One of the basic conditions for building a linear regression model is the linear relationship between dependent and independent variables.
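A low correlation coefficient does not mean a variable is irrelevant, only that the relationship is not linear. The sketch below uses an invented occupancy curve that peaks around midday (an assumption for illustration, not the real data): the Pearson correlation with the raw hour is essentially zero, while the correlation with the squared distance from midday is perfect.

```python
import numpy as np

hour = np.arange(24)
# Invented occupancy curve: highest around midday, symmetric on both sides
occupancy = 100 - (hour - 11.5) ** 2

# Pearson correlation with the raw hour: the rise and the fall cancel out
r_linear = np.corrcoef(hour, occupancy)[0, 1]

# Pearson correlation with a transformed feature that matches the curve's shape
dist_from_midday_sq = (hour - 11.5) ** 2
r_transformed = np.corrcoef(dist_from_midday_sq, occupancy)[0, 1]

print(round(r_linear, 6), round(r_transformed, 6))
```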
Anyone who has tried to park in the city knows that the hour matters a great deal: it is hard to find a space at peak times. Yet at the same hour on a Saturday there may be hardly any cars.
We therefore build a synthetic variable that combines the weekday and the hour, because a given hour on Sunday is not equivalent to the same hour on Wednesday.
We create a synthetic variable: Hour on Weekday (HoW)
df2['combined_col'] = df2[['weekday_name', 'hour']].astype(str).apply('-'.join, axis=1)
df2['HoW'] = pd.Categorical(df2['combined_col']).codes
df2['HoW'].head(3)
We’ve created a synthetic variable: a column containing a coded combination of the day of the week and the hour. The code is easy to read back.
df2[['HoW', 'weekday_name', 'hour']].head(4)
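It helps to know how `pd.Categorical` assigns these codes. As the sketch below shows on a few made-up labels, the categories are the sorted unique labels and each code is the position of a label in that sorted list. Because the labels are strings, the order is alphabetical, so the numeric value of a code says nothing by itself about how busy that weekday and hour are.

```python
import pandas as pd

# Made-up labels in the same 'weekday-hour' format as combined_col
labels = pd.Series(['Monday-8', 'Monday-17', 'Sunday-8', 'Friday-8'])
cat = pd.Categorical(labels)

# Categories are the sorted unique strings; 'Monday-17' sorts before 'Monday-8'
print(list(cat.categories))
# Each code is the index of the label in the categories list
print(list(cat.codes))
```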
# numeric_only=True skips text and datetime columns (pandas 2.0 raises an error without it)
CORREL = df2.corr(numeric_only=True)
CORREL['Occupancy'].to_frame().sort_values('Occupancy')
Now I add the new synthetic variable to the sklearn linear regression model.
X = df2[['month', 'hour', 'weekday','HoW'] ].values
y = df2['Occupancy'].values
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
import numpy as np
y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
from sklearn import metrics
print('R squared: ', metrics.r2_score(y_test, y_pred))
As we can see, the introduction of a synthetic variable slightly improved the quality of the model.
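Comparing feature sets this way repeats the same split, fit and score steps each time. One possible way to keep the comparison tidy is a small helper; `fit_and_score` below is a hypothetical name, and the data is synthetic, generated only to show the pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def fit_and_score(X, y, random_state=0):
    """Hypothetical helper: split, fit a linear regression, return the test R squared."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

# Synthetic data (made up): the target depends on column 0 only
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

r2_without = fit_and_score(X[:, [1]], y)  # uninformative feature only
r2_with = fit_and_score(X, y)             # informative feature included
print(round(r2_without, 3), round(r2_with, 3))
```

Using the same `random_state` for both calls keeps the train/test split identical, so the two scores are directly comparable.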