TensorFlow – Calculation of R Square for linear regression

Parking Birmingham occupancy

In [35]:
import pandas as pd

df = pd.read_csv('c:/TF/ParkingBirmingham.csv')
df.head(3)
Out[35]:
SystemCodeNumber Capacity Occupancy LastUpdated
0 BHMBCCMKT01 577 61 2016-10-04 07:59:42
1 BHMBCCMKT01 577 64 2016-10-04 08:25:42
2 BHMBCCMKT01 577 80 2016-10-04 08:59:42
In [2]:
df.LastUpdated = pd.to_datetime(df.LastUpdated)
df.dtypes
Out[2]:
SystemCodeNumber            object
Capacity                     int64
Occupancy                    int64
LastUpdated         datetime64[ns]
dtype: object
In [3]:
df['month'] = df.LastUpdated.dt.month
df['hour'] = df.LastUpdated.dt.hour
df['weekday_name'] = df.LastUpdated.dt.day_name()  # dt.weekday_name was removed in pandas 1.0
df['weekday'] = df.LastUpdated.dt.weekday
In [4]:
df.head(4)
Out[4]:
SystemCodeNumber Capacity Occupancy LastUpdated month hour weekday_name weekday
0 BHMBCCMKT01 577 61 2016-10-04 07:59:42 10 7 Tuesday 1
1 BHMBCCMKT01 577 64 2016-10-04 08:25:42 10 8 Tuesday 1
2 BHMBCCMKT01 577 80 2016-10-04 08:59:42 10 8 Tuesday 1
3 BHMBCCMKT01 577 107 2016-10-04 09:32:46 10 9 Tuesday 1
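
The date features above can be checked by hand with the standard library. This is a minimal sketch, assuming only the first timestamp from the table; like pandas, datetime.weekday() counts Monday as 0, so a Tuesday gives 1. The variable names are illustrative and not part of the notebook.

```python
from datetime import datetime

# First timestamp from the table above (2016-10-04 07:59:42).
ts = datetime(2016, 10, 4, 7, 59, 42)

month = ts.month        # calendar month, 1..12
hour = ts.hour          # hour of day, 0..23
weekday = ts.weekday()  # Monday = 0, so Tuesday = 1

# The day name is locale-dependent; it is "Tuesday" in an English locale.
weekday_name = ts.strftime("%A")

print(month, hour, weekday, weekday_name)
```

The three numeric values match the month, hour and weekday columns of the first row.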
In [5]:
df = df.loc[df['SystemCodeNumber']=='BHMMBMMBX01'] 
df.shape
Out[5]:
(1312, 8)
In [6]:
import tensorflow as tf

Step 1: Convert Data

We convert the numeric variables into the format TensorFlow expects. For continuous variables, TensorFlow provides the conversion method tf.feature_column.numeric_column().

In [7]:
FEATURES = ['month', 'hour', 'weekday'] 
LABEL = 'Occupancy'
In [8]:
PKS = [tf.feature_column.numeric_column(k) for k in FEATURES] 
PKS
Out[8]:
[_NumericColumn(key='month', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='hour', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='weekday', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

Step 2: Defining the estimator

TensorFlow will automatically create a directory called “ABC” in your working directory. Point TensorBoard at this path to inspect the training run. The estimator receives the feature columns defined above, that is, the independent variables.

In [9]:
estimator = tf.estimator.LinearRegressor( feature_columns=PKS, model_dir="ABC")
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'ABC', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000147BB11B940>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

To tell TensorFlow how to feed the model, you can use pandas_input_fn. It takes five main parameters:

  • x: feature data
  • y: label data
  • batch_size: batch size (default 128)
  • num_epochs: number of epochs (default 1)
  • shuffle: whether to shuffle the data (default None)

In [10]:
def get_input_fn(data_set, num_epochs=None, n_batch = 128, shuffle=True): 
    return tf.estimator.inputs.pandas_input_fn( x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),
                                               y = pd.Series(data_set[LABEL].values), batch_size=n_batch, num_epochs=num_epochs, shuffle=shuffle)
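
To make batch_size and num_epochs concrete, here is a simplified sketch of how an input function cuts a table into batches. It is not the TensorFlow implementation (which uses queues, and may let batches span epoch boundaries); batch_sizes is a hypothetical helper that only reports the size of each batch, assuming the last batch of an epoch may be smaller.

```python
def batch_sizes(n_rows, batch_size=128, num_epochs=1):
    """Return the size of every batch when n_rows are read num_epochs times.

    Simplified model: each epoch is cut into full batches plus one
    smaller final batch if n_rows is not a multiple of batch_size.
    """
    sizes = []
    for _ in range(num_epochs):
        remaining = n_rows
        while remaining > 0:
            size = min(batch_size, remaining)
            sizes.append(size)
            remaining -= size
    return sizes


# 262 rows is the size of the test set used later in this notebook.
test_batches = batch_sizes(262, batch_size=128, num_epochs=1)
print(test_batches)
```

With one epoch, 262 rows and a batch size of 128, this gives two full batches and one batch of 6 rows.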

Step 3: Model training

  • To feed the model, use the function created above: get_input_fn.
  • Then instruct the model to run 1000 training steps.
  • Note that the number of epochs (num_epochs) is left unspecified, that is, set to None.
  • It is better to set the number of epochs to None and define the number of steps instead.
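
A rough calculation shows what 1000 steps mean in terms of passes over the data. The sketch below assumes the values used in this notebook: a batch size of 128 and a training set of 1,050 rows. The variable names are illustrative.

```python
n_train_rows = 1050   # rows in the training set of this notebook
batch_size = 128      # rows consumed per training step
steps = 1000          # number of training steps requested

# Each step consumes one batch, so this is the number of rows processed.
examples_seen = steps * batch_size

# Equivalent number of full passes (epochs) over the training set.
passes_over_data = examples_seen / n_train_rows

print(examples_seen, round(passes_over_data, 1))
```

With num_epochs=None the input function keeps supplying data, so training stops only when the step limit is reached, after roughly 122 passes over the training rows.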

To test the model, we must divide the data set into a test set and a training set.

In [11]:
df_train=df.sample(frac=0.8,random_state=200) 
df_test=df.drop(df_train.index) 
print(df_train.shape, df_test.shape)
(1050, 8) (262, 8)
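
The logic of the split can be sketched on row positions alone, without pandas. This is only an illustration: Python's random module will not pick the same rows as df.sample(random_state=200), and the names below are hypothetical.

```python
import random

n_rows = 1312                  # rows left after filtering one car park
n_train = round(n_rows * 0.8)  # 80 % of the rows, rounded

rng = random.Random(200)       # the seed only makes this sketch repeatable
all_idx = list(range(n_rows))

# Draw the training rows without replacement ...
train_idx = set(rng.sample(all_idx, n_train))
# ... and keep everything else for testing, as df.drop(df_train.index) does.
test_idx = [i for i in all_idx if i not in train_idx]

print(len(train_idx), len(test_idx))
```

The two sets are disjoint and together cover every row, which is what makes the test set a fair check of the model.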
In [12]:
estimator.train(input_fn=get_input_fn(df_train, num_epochs=None, n_batch = 128, shuffle=False), steps=1000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from ABCmodel.ckpt-20000
INFO:tensorflow:Saving checkpoints for 20001 into ABCmodel.ckpt.
INFO:tensorflow:loss = 1604473.0, step = 20001
INFO:tensorflow:global_step/sec: 524.813
INFO:tensorflow:loss = 1890832.8, step = 20101 (0.191 sec)
INFO:tensorflow:global_step/sec: 595.828
INFO:tensorflow:loss = 1691072.0, step = 20201 (0.183 sec)
INFO:tensorflow:global_step/sec: 581.214
INFO:tensorflow:loss = 1660972.2, step = 20301 (0.172 sec)
INFO:tensorflow:global_step/sec: 577.628
INFO:tensorflow:loss = 1830299.8, step = 20401 (0.158 sec)
INFO:tensorflow:global_step/sec: 591.553
INFO:tensorflow:loss = 1564311.5, step = 20501 (0.169 sec)
INFO:tensorflow:global_step/sec: 659.048
INFO:tensorflow:loss = 1851407.0, step = 20601 (0.167 sec)
INFO:tensorflow:global_step/sec: 565.153
INFO:tensorflow:loss = 1717692.1, step = 20701 (0.161 sec)
INFO:tensorflow:global_step/sec: 597.055
INFO:tensorflow:loss = 1668234.1, step = 20801 (0.167 sec)
INFO:tensorflow:global_step/sec: 597.223
INFO:tensorflow:loss = 1785292.5, step = 20901 (0.167 sec)
INFO:tensorflow:Saving checkpoints for 21000 into ABCmodel.ckpt.
INFO:tensorflow:Loss for final step: 1761262.6.
Out[12]:
<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x147bb11bd30>

Step 4: Model evaluation

To evaluate the model on the test set, use the following code:

In [13]:
ev = estimator.evaluate( input_fn=get_input_fn(df_test, num_epochs=1, n_batch = 128, shuffle=False))
INFO:tensorflow:Starting evaluation at 2019-12-03-10:35:11
INFO:tensorflow:Restoring parameters from ABCmodel.ckpt-21000
INFO:tensorflow:Finished evaluation at 2019-12-03-10:35:11
INFO:tensorflow:Saving dict for global step 21000: average_loss = 12334.496, global_step = 21000, loss = 1077212.6
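
The two loss figures in the log can be related with a little arithmetic. The sketch below assumes that average_loss is the mean squared error per test row and that loss is the summed squared error averaged over the batches; that reading is an assumption, but it is consistent with the logged numbers. The variable names are illustrative.

```python
import math

average_loss = 12334.496  # from the evaluation log above
n_test_rows = 262         # rows in the test set
batch_size = 128          # batch size passed to get_input_fn

# Total squared error over the whole test set.
sum_squared_error = average_loss * n_test_rows

# 262 rows in batches of 128 give 3 batches (128, 128 and 6 rows).
n_batches = math.ceil(n_test_rows / batch_size)

# Summed squared error per batch, averaged over the batches.
loss_per_batch = sum_squared_error / n_batches

# Root mean squared error, in the same units as Occupancy.
rmse = math.sqrt(average_loss)

print(round(loss_per_batch, 1), round(rmse, 2))
```

The per-batch figure comes out close to the logged loss of 1077212.6, and the RMSE of about 111 parked cars is easier to interpret than either loss value.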

Step 5: Calculation of R Square

Calculation of R Square parameter using TensorFlow

First, I make predictions on the test set.

In [14]:
y = estimator.predict(    
         input_fn=get_input_fn(df_test,                          
         num_epochs=1,                          
         n_batch = 256,                          
         shuffle=False))
In [15]:
import itertools

predictions = list(p["predictions"] for p in itertools.islice(y, 1871))
#print("Predictions: {}".format(str(predictions)))
INFO:tensorflow:Restoring parameters from ABCmodel.ckpt-21000
In [16]:
predictions
Out[16]:
[array([319.3249], dtype=float32),
 array([437.01642], dtype=float32),
 array([476.24692], dtype=float32),
 array([495.86215], dtype=float32),

The model returned y as a generator of predictions, which I collected into a list of one-element arrays. I now stack that list into a single NumPy array.
In [17]:
import numpy as np

conc = np.vstack(predictions)
conc
Out[17]:
array([[319.3249 ],
       [437.01642],
       [476.24692],
       [495.86215],
       [326.4933 ],
       [424.56955],
       [444.1848 ],
      
In [18]:
ZHP = pd.DataFrame(conc)
ZHP.rename(columns={0:'y_pred'}, inplace=True)

kot = ZHP['y_pred'].values
kot = kot.astype('float32')
kot.dtype
Out[18]:
dtype('float32')

Now I create an array of the actual y values from the test set.

In [19]:
y = df_test['Occupancy'].values
y = y.astype('float32')
y.dtype
Out[19]:
dtype('float32')

Now I create a dataframe with y-real and y-predicted variables.

In [20]:
PZU = pd.DataFrame({'y': y, 'y_pred': kot })
PZU.dtypes
Out[20]:
y         float32
y_pred    float32
dtype: object
In [21]:
def R_squared(y, y_pred):
    
  residual = tf.reduce_sum(tf.square(tf.subtract(y,y_pred)))
  total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
  r2 = tf.subtract(1.0, tf.div(residual, total))
  return r2

To use this function, both variables must have the same data type.

In [22]:
y.dtype
Out[22]:
dtype('float32')
In [23]:
kot.dtype
Out[23]:
dtype('float32')
In [24]:
residual = tf.reduce_sum(tf.square(tf.subtract(y,kot)))
In [25]:
total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
In [26]:
r2 = tf.subtract(1.0, tf.div(residual, total))
In [27]:
r2
Out[27]:
<tf.Tensor 'Sub_2:0' shape=() dtype=float32>
In [28]:
sess = tf.Session()
a = sess.run(r2)
print('R Square parameter: ',a)
R Square parameter:  0.13424665
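
The same formula, R Square = 1 - residual / total, can be written in plain Python as a cross-check that needs no TensorFlow session. This is a minimal sketch; r_squared is a hypothetical helper and the two small inputs are toy data chosen so that the result is obvious.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared differences between actual and predicted values.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Sum of squared differences between actual values and their mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Perfect predictions leave no residual.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# Always predicting the mean explains none of the variance.
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])

print(perfect, mean_only)
```

A value of about 0.13, as obtained above, therefore means the model is only slightly better than predicting the mean occupancy for every row.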

Calculation of R Square parameter using Pandas

In [29]:
PZU.head(5)
Out[29]:
y y_pred
0 264.0 319.324890
1 651.0 437.016418
2 572.0 476.246918
3 471.0 495.862152
4 282.0 326.493286

Point 1. We calculate the squared difference between the empirical values y and the predicted values y_pred

In [30]:
PZU['SSE'] = (PZU['y'] - PZU['y_pred'])**2
PZU.head(3)
Out[30]:
y y_pred SSE
0 264.0 319.324890 3060.843506
1 651.0 437.016418 45788.972656
2 572.0 476.246918 9168.652344

Point 2. We calculate the average empirical value of y

In [31]:
PZU['ave_y'] = PZU['y'].mean()
PZU.head(3)
Out[31]:
y y_pred SSE ave_y
0 264.0 319.324890 3060.843506 463.973297
1 651.0 437.016418 45788.972656 463.973297
2 572.0 476.246918 9168.652344 463.973297

Point 3. We calculate the squared difference between the empirical values y and the average of the empirical values y

In [32]:
PZU['SST'] = (PZU['y'] - PZU['ave_y'])**2
PZU.head(3)
Out[32]:
y y_pred SSE ave_y SST
0 264.0 319.324890 3060.843506 463.973297 39989.320312
1 651.0 437.016418 45788.972656 463.973297 34978.988281
2 572.0 476.246918 9168.652344 463.973297 11669.768555

Point 4. We calculate the difference between the sum of SST and the sum of SSE

In [33]:
Sum_SST = PZU['SST'].sum()
print('Sum_SST :',Sum_SST)
Sum_SSE = PZU['SSE'].sum()
print('Sum_SSE :',Sum_SSE)
SSR = Sum_SST - Sum_SSE
Sum_SST : 3732746.8
Sum_SSE : 3231638.2

Point 5. We calculate the R Square parameter

In [34]:
r2 = SSR/Sum_SST
print('R Square parameter: ',r2)
R Square parameter:  0.13424659
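
The pandas result agrees with the TensorFlow result because (SST - SSE) / SST is algebraically the same as 1 - SSE / SST. A minimal check, assuming only the two sums printed above (the variable names are illustrative):

```python
sum_sst = 3732746.8  # total sum of squares, as printed above
sum_sse = 3231638.2  # sum of squared errors, as printed above

# Form used in the pandas section: SSR / SST with SSR = SST - SSE.
r2_from_ssr = (sum_sst - sum_sse) / sum_sst

# Form used in the TensorFlow section: 1 - residual / total.
r2_from_sse = 1.0 - sum_sse / sum_sst

print(r2_from_ssr, r2_from_sse)
```

Both forms give about 0.1342; the small difference in the last digits between the two sections of the notebook comes from float32 rounding.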