Tutorial: Linear Regression – Tensorflow, calculation of R Square (#4)

We continue to learn how to build multiple linear regression models. This time we will build a model using the Tensorflow library. As before, the data file: AirQ_filled2.csv comes from previous episodes of this cycle.

In [1]:
import tensorflow as tf
import pandas as pd

df = pd.read_csv('c:/TF/AirQ_filled2.csv', usecols=['CO(GT)','PT08.S1(CO)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH', 'AH'
        ,'Month','Weekday','Hours'])
df.head(3)

Out[1]:
CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Month Weekday Hours
0 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578 3 2 18
1 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255 3 2 19
2 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502 3 2 20

Step 1: Convert Data

We convert numeric variables in the correct Tensorflow format. Tensorflow provides a continuous variable conversion method: tf.feature_column.numeric_column ().

Separation of a column into an independent variable and a dependent variable.

In [2]:
df.columns
Out[2]:
Index(['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)',
       'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH',
       'AH', 'Month', 'Weekday', 'Hours'],
      dtype='object')
In [3]:
df.columns = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
In [4]:
df.dtypes
Out[4]:
CO_GT           float64
PT08.S1_CO      float64
C6H6_GT         float64
PT08.S2_NMHC    float64
NOx_GT          float64
PT08.S3_NOx     float64
NO2_GT          float64
PT08.S4_NO2     float64
PT08.S5_O3      float64
T               float64
RH              float64
AH              float64
Month             int64
Weekday           int64
Hours             int64
dtype: object
In [5]:
FEATURES = ['PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
LABEL = 'CO_GT'
In [6]:
PKS = [tf.feature_column.numeric_column(k) for k in FEATURES]
PKS
Out[6]:
[_NumericColumn(key='PT08.S1_CO', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='C6H6_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S2_NMHC', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NOx_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S3_NOx', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NO2_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S4_NO2', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S5_O3', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='T', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='RH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='AH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Month', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Weekday', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Hours', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

Step 2: Defining the estimator

Tensorflow will automatically create a file called “Air” in your working directory. You must use this path to access Tensorboard. The estimator applies to independent variables.

In [7]:
estimator = tf.estimator.LinearRegressor(    
        feature_columns=PKS,   
        model_dir="Air")
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'Air', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000017E850F7CC0>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

To instruct Tensorflow how to feed the model, you can use pandas_input_fn. This object needs 5 parameters: x: function data y: label data batch_size: batch. Default 128 num_epoch: by default number of epochs 1 random: Random or not data. Default None

In [8]:
def get_input_fn(data_set, num_epochs=None, n_batch = 128, shuffle=True):    
         return tf.estimator.inputs.pandas_input_fn(       
         x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),       
         y = pd.Series(data_set[LABEL].values),       
         batch_size=n_batch,          
         num_epochs=num_epochs,       
         shuffle=shuffle)

Step 3: Model training

- To feed the model you can use the function created above: get_input_fn.
- Then you instruct the model to iterate 1000 times.
- Remember that you do not specify the number of epochs (num_epochs).
- It is better to set the number of epochs to none and define the number of iterations.

To test the model, we must divide the data set into a test set and a training set.

In [9]:
df_train=df.sample(frac=0.8,random_state=200)
df_test=df.drop(df_train.index)
print(df_train.shape, df_test.shape)
(7486, 15) (1871, 15)
In [10]:
estimator.train(input_fn=get_input_fn(df_train,                                       
                                           num_epochs=None,                                      
                                           n_batch = 128,                                      
                                           shuffle=False),                                      
                                           steps=1000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from Airmodel.ckpt-10000
INFO:tensorflow:Saving checkpoints for 10001 into Airmodel.ckpt.
INFO:tensorflow:loss = 27.90989, step = 10001
INFO:tensorflow:global_step/sec: 231.067
INFO:tensorflow:loss = 19.266008, step = 10101 (0.443 sec)
INFO:tensorflow:global_step/sec: 250.047
INFO:tensorflow:loss = 21.174185, step = 10201 (0.389 sec)
INFO:tensorflow:global_step/sec: 244.378
INFO:tensorflow:loss = 26.823406, step = 10301 (0.409 sec)
INFO:tensorflow:global_step/sec: 263.037
INFO:tensorflow:loss = 16.690845, step = 10401 (0.380 sec)
INFO:tensorflow:global_step/sec: 250.698
INFO:tensorflow:loss = 24.08421, step = 10501 (0.399 sec)
INFO:tensorflow:global_step/sec: 254.447
INFO:tensorflow:loss = 16.630123, step = 10601 (0.406 sec)
INFO:tensorflow:global_step/sec: 248.812
INFO:tensorflow:loss = 25.998842, step = 10701 (0.389 sec)
INFO:tensorflow:global_step/sec: 269.371
INFO:tensorflow:loss = 31.432064, step = 10801 (0.387 sec)
INFO:tensorflow:global_step/sec: 255.634
INFO:tensorflow:loss = 22.70269, step = 10901 (0.391 sec)
INFO:tensorflow:Saving checkpoints for 11000 into Airmodel.ckpt.
INFO:tensorflow:Loss for final step: 24.21025.
Out[10]:
<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x17e850f7828>

Step 4. Model evaluation

To enter a test set, use the following code:

In [11]:
ev = estimator.evaluate(    
          input_fn=get_input_fn(df_test,                          
          num_epochs=1,                          
          n_batch = 356,                          
          shuffle=False))
INFO:tensorflow:Starting evaluation at 2019-11-28-13:40:17
INFO:tensorflow:Restoring parameters from Airmodel.ckpt-11000
INFO:tensorflow:Finished evaluation at 2019-11-28-13:40:17
INFO:tensorflow:Saving dict for global step 11000: average_loss = 0.18934268, global_step = 11000, loss = 59.04336

Print the loss using by the code below:

In [12]:
loss_score = ev["loss"]
print("Loss: {0:f}".format(loss_score))	
Loss: 59.043362

Calculation of R Square parameter using Tensorflow

I make a prediction on a test set

In [13]:
y = estimator.predict(    
         input_fn=get_input_fn(df_test,                          
         num_epochs=1,                          
         n_batch = 256,                          
         shuffle=False))
In [14]:
import itertools

predictions = list(p["predictions"] for p in itertools.islice(y, 1871))
#print("Predictions: {}".format(str(predictions)))
INFO:tensorflow:Restoring parameters from Airmodel.ckpt-11000
In [15]:
predictions
Out[15]:
[array([2.2904341], dtype=float32),
 array([1.4195127], dtype=float32),
 array([0.9917113], dtype=float32),
 array([1.4134599], dtype=float32),
 array([1.2086823], dtype=float32),
 array([1.4521222], dtype=float32),
 ...]

The model gave us a result string y. I am now processing this result string into a list.

In [16]:
import numpy as np

conc = np.vstack(predictions)
conc
Out[16]:
array([[2.2904341],
       [1.4195127],
       [0.9917113],
       ...,
       [1.2040666],
       [0.4435346],
       [3.111309 ]], dtype=float32)
In [48]:
ZHP = pd.DataFrame(conc)
ZHP.rename(columns={0:'y_pred'}, inplace=True)

kot = ZHP['y_pred'].values
kot = kot.astype('float32')
kot.dtype
Out[48]:
dtype('float32')

Now I’m creating a list of real y values from the test set.

In [50]:
y = df_test['CO_GT'].values
y = y.astype('float32')
y.dtype
Out[50]:
dtype('float32')

Now I create a dataframe with y-real and y-predicted variables.

In [47]:
PZU = pd.DataFrame({'y': y, 'y_pred': kot })
PZU.dtypes
Out[47]:
y         float64
y_pred    float64
dtype: object
In [63]:
def R_squared(y, y_pred):
    
  residual = tf.reduce_sum(tf.square(tf.subtract(y,y_pred)))
  total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
  r2 = tf.subtract(1.0, tf.div(residual, total))
  return r2

To use this function, both variables must have the same data type.

In [51]:
y.dtype
Out[51]:
dtype('float32')
In [52]:
kot.dtype
Out[52]:
dtype('float32')
In [65]:
residual = tf.reduce_sum(tf.square(tf.subtract(y,kot)))
In [66]:
total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
In [67]:
r2 = tf.subtract(1.0, tf.div(residual, total))
In [68]:
r2
Out[68]:
<tf.Tensor 'Sub_27:0' shape=() dtype=float32>
In [77]:
sess = tf.Session()
a = sess.run(r2)
print('R Square parameter: ',a)
R Square parameter:  0.90320766

Calculation of R Square parameter using Pandas

In [78]:
PZU.head(5)
Out[78]:
y y_pred
0 2.2 2.290434
1 1.2 1.419513
2 1.0 0.991711
3 1.5 1.413460
4 1.6 1.471673
In [80]:
PZU['SSE'] = (PZU['y'] - PZU['y_pred'])**2
PZU.head(3)
Out[80]:
y y_pred SSE
0 2.2 2.290434 0.008178
1 1.2 1.419513 0.048186
2 1.0 0.991711 0.000069

Point 2. We calculate the average empirical value of y

In [81]:
PZU['ave_y'] = PZU['y'].mean()
PZU.head(3)
Out[81]:
y y_pred SSE ave_y
0 2.2 2.290434 0.008178 2.061304
1 1.2 1.419513 0.048186 2.061304
2 1.0 0.991711 0.000069 2.061304

Point 3. We calculate the difference between empirical values y and the average of empirical values y

In [83]:
PZU['SST'] = (PZU['y'] - PZU['ave_y'])**2
PZU.head(3)
Out[83]:
y y_pred SSE ave_y SST
0 2.2 2.290434 0.008178 2.061304 0.019237
1 1.2 1.419513 0.048186 2.061304 0.741845
2 1.0 0.991711 0.000069 2.061304 1.126366

Point 4. We calculate the difference between sum of SST and sum of SSE

In [84]:
Sum_SST = PZU['SST'].sum()
print('Sum_SST :',Sum_SST)
Sum_SSE = PZU['SSE'].sum()
print('Sum_SSE :',Sum_SSE)
SSR = Sum_SST - Sum_SSE
Sum_SST : 3659.9984179583107
Sum_SSE : 354.26016629427124

Point 5. We calculate the R Square parameter

In [85]:
r2 = SSR/Sum_SST
print('R Square parameter: ',r2)
R Square parameter:  0.903207562998923