Data standardization – Tensorflow linear regression model

EN231220191405

Theoretical explanation

Standardizing data for machine learning models means transforming the raw data so that each column has a mean of 0 and a standard deviation of 1. The column mean is subtracted from every value in that column, and the result is divided by the column's standard deviation: z = (x - mean) / std. The process is applied to each column separately.
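
As a minimal sketch of this definition, the snippet below standardizes a small made-up matrix column by column with NumPy. The array `x` and its values are illustrative assumptions, not data from the air-quality file.

```python
import numpy as np

# Toy data (hypothetical values): 4 rows, 2 columns on very different scales.
x = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

col_mean = x.mean(axis=0)      # mean of each column
col_std = x.std(axis=0)        # standard deviation of each column (ddof=0)

z = (x - col_mean) / col_std   # broadcasting applies the column statistics to every row

print(z.mean(axis=0))          # close to 0 for every column
print(z.std(axis=0))           # close to 1 for every column
```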

In [1]:
import tensorflow as tf
import pandas as pd
from sklearn import model_selection
import numpy as np

df = pd.read_csv('c:/TF/AirQ_filled2.csv', usecols=['CO(GT)','PT08.S1(CO)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH', 'AH'
        ,'Month','Weekday','Hours'])
df.head(3)

Out[1]:
CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Month Weekday Hours
0 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578 3 2 18
1 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255 3 2 19
2 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502 3 2 20

In the previous post I built a linear regression model that reached an R-squared of 90%. This time I will apply several additional steps, which I hope will improve the model's performance.

http://sigmaquality.pl/python/linear-regression-3/

I rename the columns so that they can be used with the TensorFlow model.

In [2]:
df.columns = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']

Standardization of data in Sklearn

In [3]:
df.head(2)
Out[3]:
CO_GT PT08.S1_CO C6H6_GT PT08.S2_NMHC NOx_GT PT08.S3_NOx NO2_GT PT08.S4_NO2 PT08.S5_O3 T RH AH Month Weekday Hours
0 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578 3 2 18
1 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255 3 2 19

Matrix created from the table above.

In [4]:
a = np.array(df)
a
Out[4]:
array([[   2.6, 1360. ,   11.9, ...,    3. ,    2. ,   18. ],
       [   2. , 1292. ,    9.4, ...,    3. ,    2. ,   19. ],
       [   2.2, 1402. ,    9. , ...,    3. ,    2. ,   20. ],
       ...,
       [   2.4, 1142. ,   12.4, ...,    4. ,    0. ,   12. ],
       [   2.1, 1003. ,    9.5, ...,    4. ,    0. ,   13. ],
       [   2.2, 1071. ,   11.9, ...,    4. ,    0. ,   14. ]])

The average of the columns is:

In [5]:
np.mean(a, axis=0)
Out[5]:
array([2.09193117e+00, 1.10273036e+03, 1.01903922e+01, 9.42548253e+02,
       2.34058566e+02, 8.32742225e+02, 1.09698942e+02, 1.45301453e+03,
       1.03051192e+03, 1.83173560e+01, 4.88174308e+01, 1.01738155e+00,
       6.31035588e+00, 3.00993908e+00, 1.14985572e+01])

The standard deviation of the columns is:

In [6]:
np.std(a, axis=0)
Out[6]:
array([1.43839252e+00, 2.19576367e+02, 7.56536693e+00, 2.69566963e+02,
       2.04971518e+02, 2.55695758e+02, 4.75175481e+01, 3.47415518e+02,
       4.10894801e+02, 8.82141160e+00, 1.73533985e+01, 4.04807227e-01,
       3.43797585e+00, 2.00021575e+00, 6.92281165e+00])
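
One detail worth keeping in mind: `np.std` divides by n by default (`ddof=0`), which is also what `StandardScaler` uses, whereas the pandas `std` method divides by n - 1 by default (`ddof=1`). The sketch below shows the difference on a made-up vector; `values` is a hypothetical sample.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical sample

std_population = np.std(values)        # ddof=0: divide by n
std_sample = np.std(values, ddof=1)    # ddof=1: divide by n - 1

n = len(values)
print(std_population, std_sample)
# The two estimates differ only by the factor sqrt(n / (n - 1)).
print(np.isclose(std_sample, std_population * np.sqrt(n / (n - 1))))
```

With more than 9000 rows, as in this data set, the two variants are almost identical, but they will not match to the last digit.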

We rescale the data so that each column has a mean of 0 and a standard deviation of 1. Standardization changes only the location and scale of a column, not the shape of its distribution.

In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the column means and standard deviations, then transform, in one step
k = scaler.fit_transform(a)
k
Out[7]:
array([[ 0.35321987,  1.17166361,  0.22597817, ..., -0.96287933,
        -0.50491507,  0.93913327],
       [-0.06391244,  0.86197636, -0.10447507, ..., -0.96287933,
        -0.50491507,  1.08358325],
       [ 0.07513167,  1.36294102, -0.15734759, ..., -0.96287933,
        -0.50491507,  1.22803323],
       ...,
       [ 0.21417577,  0.17884273,  0.29206882, ..., -0.6720105 ,
        -1.50480721,  0.0724334 ],
       [ 0.00560961, -0.45419443, -0.09125694, ..., -0.6720105 ,
        -1.50480721,  0.21688338],
       [ 0.07513167, -0.14450718,  0.22597817, ..., -0.6720105 ,
        -1.50480721,  0.36133336]])

The mean and standard deviation of the columns after standardization.

In [8]:
print("standard deviation: ",np.std(k, axis=0))
#k=k.astype(int)
print()
print("mean: ",np.mean(k, axis=0))
standard deviation:  [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

mean:  [ 4.55622145e-17 -2.27811073e-16 -2.42998478e-17 -2.00473744e-16
  3.64497716e-17  1.64023972e-16 -7.28995433e-17  2.55148401e-16
 -1.15424277e-16 -2.06548706e-16  1.21499239e-16 -2.42998478e-16
  7.28995433e-17 -5.90410363e-17 -3.95821739e-17]
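
The means above are on the order of 1e-16 rather than exactly 0, which is ordinary floating-point round-off. A minimal sketch of how I would check the result with a tolerance, and of how the fitted scaler can be inspected and reversed, is shown below. The random data in `raw` is an assumption made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two columns on different scales.
raw = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 250.0], size=(500, 2))

scaler = StandardScaler()
z = scaler.fit_transform(raw)

# Compare with a tolerance instead of testing for exact equality with 0 and 1.
print(np.allclose(z.mean(axis=0), 0.0))
print(np.allclose(z.std(axis=0), 1.0))

# The fitted statistics match the NumPy column mean and ddof=0 standard deviation.
print(np.allclose(scaler.mean_, raw.mean(axis=0)))
print(np.allclose(scaler.scale_, raw.std(axis=0)))

# inverse_transform maps standardized values back to the original units.
restored = scaler.inverse_transform(z)
print(np.allclose(restored, raw))
```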

We prepare the standardized data for TensorFlow

In [9]:
conc = np.vstack(k)
conc
Out[9]:
array([[ 0.35321987,  1.17166361,  0.22597817, ..., -0.96287933,
        -0.50491507,  0.93913327],
       [-0.06391244,  0.86197636, -0.10447507, ..., -0.96287933,
        -0.50491507,  1.08358325],
       [ 0.07513167,  1.36294102, -0.15734759, ..., -0.96287933,
        -0.50491507,  1.22803323],
       ...,
       [ 0.21417577,  0.17884273,  0.29206882, ..., -0.6720105 ,
        -1.50480721,  0.0724334 ],
       [ 0.00560961, -0.45419443, -0.09125694, ..., -0.6720105 ,
        -1.50480721,  0.21688338],
       [ 0.07513167, -0.14450718,  0.22597817, ..., -0.6720105 ,
        -1.50480721,  0.36133336]])
In [10]:
SKS = pd.DataFrame(conc)
SKS.columns = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
In [11]:
SKS.head(3)
Out[11]:
CO_GT PT08.S1_CO C6H6_GT PT08.S2_NMHC NOx_GT PT08.S3_NOx NO2_GT PT08.S4_NO2 PT08.S5_O3 T RH AH Month Weekday Hours
0 0.353220 1.171664 0.225978 0.383770 -0.332039 0.873138 0.069470 0.687895 0.577978 -0.534762 0.004758 -0.641247 -0.962879 -0.504915 0.939133
1 -0.063912 0.861976 -0.104475 0.046192 -0.639399 1.334624 -0.372472 0.305068 -0.142401 -0.568770 -0.064393 -0.721038 -0.962879 -0.504915 1.083583
2 0.075132 1.362941 -0.157348 -0.013163 -0.502795 1.201654 0.090515 0.293555 0.105838 -0.727475 0.298649 -0.660022 -0.962879 -0.504915 1.228033
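
The column names do not have to be typed again: a scaled array can take its labels from the source table. This is a minimal sketch on a made-up three-column frame; `df_demo` is a hypothetical stand-in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the original table.
df_demo = pd.DataFrame({"CO_GT": [2.6, 2.0, 2.2],
                        "T": [13.6, 13.3, 11.9],
                        "RH": [48.9, 47.7, 54.0]})

arr = df_demo.to_numpy()
scaled = (arr - arr.mean(axis=0)) / arr.std(axis=0)

# Reuse the column labels and the row index of the source frame.
scaled_df = pd.DataFrame(scaled, columns=df_demo.columns, index=df_demo.index)
print(scaled_df.head())
```

Keeping the original index also makes a later `pd.concat(..., axis=1)` line up row by row.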

Now we add the target value, which has not been standardized.

Now that we have standardized data, we can create a new Tensorflow linear regression model

Tensorflow linear regression model without standardization

We take the original data without standardization

Step 1. Split the data into training and test sets

In [12]:
df_train=df.sample(frac=0.8,random_state=200)
df_test=df.drop(df_train.index)

print(df_train.shape, df_test.shape)
(7486, 15) (1871, 15)
In [13]:
df_train.head(3)
Out[13]:
CO_GT PT08.S1_CO C6H6_GT PT08.S2_NMHC NOx_GT PT08.S3_NOx NO2_GT PT08.S4_NO2 PT08.S5_O3 T RH AH Month Weekday Hours
6632 2.6 1099.0 10.4 994.0 401.0 715.0 117.0 1164.0 1186.0 6.8 57.8 0.5768 12 6 2
7123 2.2 1149.0 8.4 914.0 382.0 742.0 147.0 1072.0 1242.0 9.5 41.2 0.4908 1 5 13
7599 5.7 1578.0 29.0 1527.0 875.0 419.0 179.0 1761.0 2086.0 7.9 60.0 0.6406 1 4 9
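
The split pattern used here, `sample` followed by `drop`, can be checked on a small made-up table. The frame `demo` below is a hypothetical example, not the air-quality data.

```python
import pandas as pd

# Hypothetical 10-row table standing in for the air-quality data.
demo = pd.DataFrame({"x": range(10), "y": range(100, 110)})

train = demo.sample(frac=0.8, random_state=200)   # random 80% of the rows
test = demo.drop(train.index)                     # the remaining 20%

print(train.shape, test.shape)
# No row appears in both parts, and together they cover the whole table.
print(train.index.intersection(test.index).empty)
print(len(train) + len(test) == len(demo))
```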

Step 2. Convert the data to TensorFlow format

In [14]:
COL = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']


features = [tf.feature_column.numeric_column(k) for k in COL]
features
Out[14]:
[_NumericColumn(key='CO_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S1_CO', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='C6H6_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S2_NMHC', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NOx_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S3_NOx', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NO2_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S4_NO2', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S5_O3', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='T', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='RH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='AH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Month', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Weekday', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Hours', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

Step 3. Tensorflow Linear Regression Estimator

directory: train_Wojtek

In [15]:
model = tf.estimator.LinearRegressor(model_dir="train_Wojtek", feature_columns=features)
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'train_Wojtek', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001E130F139E8>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

Step 4. Define how to feed the model and which variable is the target

In [16]:
FEATURES = ['CO_GT','PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
LABEL= 'CO_GT'

def get_input_fn(data_set, num_epochs=None, n_batch = 128, shuffle=True):
    return tf.estimator.inputs.pandas_input_fn(
       x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),
       y = pd.Series(data_set[LABEL].values),
       batch_size=n_batch,   
       num_epochs=num_epochs,
       shuffle=shuffle)
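
Note that FEATURES as written (like COL in Step 2) also lists CO_GT, which is the LABEL, so the model receives the value it is asked to predict as one of its inputs. The following is a minimal sketch, assuming the intent is to predict CO_GT from the remaining 14 columns. The names `ALL_COLUMNS` and `FEATURES_NO_LABEL` are hypothetical and do not appear elsewhere in this post.

```python
ALL_COLUMNS = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
               'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
               'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
LABEL = 'CO_GT'

# Keep every column except the label as a model input.
FEATURES_NO_LABEL = [c for c in ALL_COLUMNS if c != LABEL]

print(len(FEATURES_NO_LABEL))
print(LABEL in FEATURES_NO_LABEL)
```

The results reported in this post were obtained with CO_GT included among the inputs, so they would change if a list like this were used instead.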

Step 5. Training the model

In [17]:
model.train(input_fn=get_input_fn(df_train, 
                                      num_epochs=None,
                                      n_batch = 128,
                                      shuffle=False),
                                      steps=1000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from train_Wojtekmodel.ckpt-32000
INFO:tensorflow:Saving checkpoints for 32001 into train_Wojtekmodel.ckpt.
INFO:tensorflow:loss = 5.8126163, step = 32001
INFO:tensorflow:global_step/sec: 263.507
INFO:tensorflow:loss = 3.952513, step = 32101 (0.379 sec)
INFO:tensorflow:global_step/sec: 299.618
INFO:tensorflow:loss = 3.8751945, step = 32201 (0.349 sec)
INFO:tensorflow:global_step/sec: 282.839
INFO:tensorflow:loss = 4.1194134, step = 32301 (0.338 sec)
INFO:tensorflow:global_step/sec: 303.789
INFO:tensorflow:loss = 3.1714904, step = 32401 (0.329 sec)
INFO:tensorflow:global_step/sec: 298.869
INFO:tensorflow:loss = 4.1107397, step = 32501 (0.335 sec)
INFO:tensorflow:global_step/sec: 300.399
INFO:tensorflow:loss = 1.9511085, step = 32601 (0.349 sec)
INFO:tensorflow:global_step/sec: 294.727
INFO:tensorflow:loss = 4.6305227, step = 32701 (0.339 sec)
INFO:tensorflow:global_step/sec: 296.434
INFO:tensorflow:loss = 4.6675496, step = 32801 (0.322 sec)
INFO:tensorflow:global_step/sec: 296.704
INFO:tensorflow:loss = 3.619099, step = 32901 (0.337 sec)
INFO:tensorflow:Saving checkpoints for 33000 into train_Wojtekmodel.ckpt.
INFO:tensorflow:Loss for final step: 3.9190884.
Out[17]:
<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x1e130f13630>
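
Because `num_epochs=None` lets the input function repeat the data indefinitely, the amount of training is set by `steps`. The short calculation below converts steps into passes over the training set, using the numbers from this post. The log also shows that training resumed from checkpoint 32000 in the same directory, so these 1000 steps come on top of earlier runs.

```python
n_rows = 7486        # training rows, from the shape printed in Step 1
batch_size = 128     # n_batch passed to get_input_fn
steps = 1000         # steps passed to model.train

rows_seen = steps * batch_size
epochs = rows_seen / n_rows

print(rows_seen)
print(round(epochs, 1))
```

So one call to `model.train` corresponds to roughly 17 passes over the training data.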

Step 6. Model assessment

In [18]:
ev = model.evaluate(    
          input_fn=get_input_fn(df_test,                          
          num_epochs=1,                          
          n_batch = 356,                          
          shuffle=False))
INFO:tensorflow:Starting evaluation at 2019-12-23-14:53:48
INFO:tensorflow:Restoring parameters from train_Wojtekmodel.ckpt-33000
INFO:tensorflow:Finished evaluation at 2019-12-23-14:53:49
INFO:tensorflow:Saving dict for global step 33000: average_loss = 0.03159822, global_step = 33000, loss = 9.853379
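
The two loss figures in the log seem to be related in a simple way: `average_loss` is the mean squared error per example, while `loss` is consistent with the summed squared error per batch, averaged over the evaluation batches. This reading is inferred from the numbers themselves, not taken from the TensorFlow documentation, so treat it as an assumption.

```python
import math

n_test = 1871               # test rows, from the shape printed in Step 1
batch_size = 356            # n_batch passed to model.evaluate
average_loss = 0.03159822   # from the evaluation log
reported_loss = 9.853379    # from the evaluation log

n_batches = math.ceil(n_test / batch_size)   # five full batches plus one partial batch
loss_from_average = average_loss * n_test / n_batches

print(n_batches)
print(loss_from_average)
print(math.isclose(loss_from_average, reported_loss, rel_tol=1e-4))
```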

Calculation of R2

I make predictions on the test set.

In [19]:
y = model.predict(    
         input_fn=get_input_fn(df_test,                          
         num_epochs=1,                          
         n_batch = 256,                          
         shuffle=False))
In [20]:
import itertools

predictions = list(p["predictions"] for p in itertools.islice(y, 3000))
#print("Predictions: {}".format(str(predictions)))
INFO:tensorflow:Restoring parameters from train_Wojtekmodel.ckpt-33000

I convert the result into a DataFrame

In [21]:
import numpy as np

conc = np.vstack(predictions)
ZHP = pd.DataFrame(conc)
ZHP.rename(columns={0:'y_pred'}, inplace=True)

kot = ZHP['y_pred'].values
kot = kot.astype('float32')
kot.dtype
Out[21]:
dtype('float32')
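
The same flat vector can be obtained without going through a DataFrame. This sketch assumes the predictions arrive as a list of length-1 arrays, as in the cell above; `predictions_demo` holds made-up values.

```python
import numpy as np

# Hypothetical predictions: one length-1 array per test row.
predictions_demo = [np.array([2.5], dtype=np.float32),
                    np.array([1.9], dtype=np.float32),
                    np.array([2.2], dtype=np.float32)]

stacked = np.vstack(predictions_demo)   # shape (3, 1): one row per prediction
y_pred = stacked.ravel()                # shape (3,): flat vector

print(stacked.shape, y_pred.shape, y_pred.dtype)
```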

I make the data types of the predicted and observed values match

In [22]:
y = df_test['CO_GT'].values
y = y.astype('float32')
y.dtype
Out[22]:
dtype('float32')
In [23]:
PZU = pd.DataFrame({'y': y, 'y_pred': kot })
PZU.dtypes
Out[23]:
y         float32
y_pred    float32
dtype: object
In [24]:
def R_squared(y, y_pred):
    
  residual = tf.reduce_sum(tf.square(tf.subtract(y,y_pred)))
  total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
  r2 = tf.subtract(1.0, tf.div(residual, total))
  return r2
In [25]:
residual = tf.reduce_sum(tf.square(tf.subtract(y,kot)))
total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
r2 = tf.subtract(1.0, tf.div(residual, total))
r2
Out[25]:
<tf.Tensor 'Sub_2:0' shape=() dtype=float32>
In [26]:
sess = tf.Session()
a = sess.run(r2)
print('R Square parameter: ',a)
R Square parameter:  0.9838469
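
The same statistic can be computed without a TensorFlow session. Below is a minimal NumPy sketch of the formula R2 = 1 - SS_res / SS_tot; the function name `r_squared` and the sample values are my own illustrative choices.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted values.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y_obs, y_hat))   # approximately 0.98
```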

Tensorflow linear regression model with standardization

We now substitute the standardized data.

We standardize only the independent variables, because we want the result of the linear regression model to stay easy to read.

Step 1. Split the data into training and test sets

In [27]:
del SKS['CO_GT']
WKD = pd.concat([df['CO_GT'], SKS], axis=1, sort=False)

WKD.head(3)
Out[27]:
CO_GT PT08.S1_CO C6H6_GT PT08.S2_NMHC NOx_GT PT08.S3_NOx NO2_GT PT08.S4_NO2 PT08.S5_O3 T RH AH Month Weekday Hours
0 2.6 1.171664 0.225978 0.383770 -0.332039 0.873138 0.069470 0.687895 0.577978 -0.534762 0.004758 -0.641247 -0.962879 -0.504915 0.939133
1 2.0 0.861976 -0.104475 0.046192 -0.639399 1.334624 -0.372472 0.305068 -0.142401 -0.568770 -0.064393 -0.721038 -0.962879 -0.504915 1.083583
2 2.2 1.362941 -0.157348 -0.013163 -0.502795 1.201654 0.090515 0.293555 0.105838 -0.727475 0.298649 -0.660022 -0.962879 -0.504915 1.228033
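
An alternative to deleting the standardized target and concatenating the original one back is to standardize only the feature columns in place. The sketch below does this on a made-up table; `data`, `target` and `feature_cols` are hypothetical names.

```python
import pandas as pd

# Hypothetical table: one target and two features.
data = pd.DataFrame({"CO_GT": [2.6, 2.0, 2.2, 2.4],
                     "T": [13.6, 13.3, 11.9, 11.0],
                     "RH": [48.9, 47.7, 54.0, 60.0]})
target = "CO_GT"
feature_cols = [c for c in data.columns if c != target]

standardized = data.copy()
# ddof=0 matches the standard deviation used by np.std and StandardScaler.
standardized[feature_cols] = (
    (data[feature_cols] - data[feature_cols].mean()) / data[feature_cols].std(ddof=0)
)

print(standardized)
```

This avoids relying on `pd.concat` to align rows by index, which only works as intended when both frames carry the same index.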
In [28]:
df_trainS=WKD.sample(frac=0.8,random_state=200)
df_testS=WKD.drop(df_trainS.index)

print(df_trainS.shape, df_testS.shape)
df_trainS.head(3)
(7486, 15) (1871, 15)
Out[28]:
CO_GT PT08.S1_CO C6H6_GT PT08.S2_NMHC NOx_GT PT08.S3_NOx NO2_GT PT08.S4_NO2 PT08.S5_O3 T RH AH Month Weekday Hours
6632 2.6 -0.016989 0.027706 0.190868 0.814462 -0.460478 0.153650 -0.831899 0.378413 -1.305614 0.517626 -1.088374 1.654940 1.494869 -1.372066
7123 2.2 0.210722 -0.236656 -0.105904 0.721766 -0.354884 0.784995 -1.096711 0.514701 -0.999540 -0.438959 -1.300821 -1.544617 0.994923 0.216883
7599 5.7 2.164484 2.486278 2.168113 3.126978 -1.618104 1.458431 0.886505 2.568755 -1.180917 0.644402 -0.930768 -1.544617 0.494977 -0.360917
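
In this post the scaler was fitted on the whole table before the split, so the test rows contribute to the means and standard deviations. A common alternative, sketched below on made-up data, is to fit the scaler on the training rows only and reuse the same statistics for the test rows. This is an assumption about good practice, not what the cells above do.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical data: a target and two features.
demo = pd.DataFrame({"CO_GT": rng.normal(2.0, 1.4, 200),
                     "T": rng.normal(18.0, 8.8, 200),
                     "RH": rng.normal(49.0, 17.0, 200)})
feature_cols = ["T", "RH"]

train = demo.sample(frac=0.8, random_state=200)
test = demo.drop(train.index)

scaler = StandardScaler().fit(train[feature_cols])   # statistics from training rows only

train_s = train.copy()
test_s = test.copy()
train_s[feature_cols] = scaler.transform(train[feature_cols])
test_s[feature_cols] = scaler.transform(test[feature_cols])   # same mean and scale reused

print(train_s[feature_cols].mean().round(6).tolist())   # approximately 0 on the training part
print(test_s[feature_cols].mean().round(3).tolist())    # near 0, but not exactly
```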

Step 2. Convert the data to TensorFlow format

In [29]:
FOL = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']


PCK = [tf.feature_column.numeric_column(k) for k in FOL]
PCK
Out[29]:
[_NumericColumn(key='CO_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S1_CO', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='C6H6_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S2_NMHC', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NOx_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S3_NOx', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NO2_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S4_NO2', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S5_O3', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='T', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='RH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='AH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Month', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Weekday', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Hours', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

Step 3. Tensorflow Linear Regression Estimator

directory: train_Wojtek7

In [30]:
model = tf.estimator.LinearRegressor(model_dir="train_Wojtek7", feature_columns=PCK)
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'train_Wojtek7', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001E131AA1358>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

Step 4. Define how to feed the model and which variable is the target

The get_input_fn function, FEATURES and LABEL defined for the first model are reused here.

Step 5. Training the model

In [31]:
model.train(input_fn=get_input_fn(df_trainS, 
                                      num_epochs=None,
                                      n_batch = 128,
                                      shuffle=False),
                                      steps=1000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from train_Wojtek7model.ckpt-5000
INFO:tensorflow:Saving checkpoints for 5001 into train_Wojtek7model.ckpt.
INFO:tensorflow:loss = 5.9656813e-09, step = 5001
INFO:tensorflow:global_step/sec: 275.63
INFO:tensorflow:loss = 4.005897e-09, step = 5101 (0.370 sec)
INFO:tensorflow:global_step/sec: 322.183
INFO:tensorflow:loss = 2.7887523e-09, step = 5201 (0.318 sec)
INFO:tensorflow:global_step/sec: 257.555
INFO:tensorflow:loss = 3.2279417e-09, step = 5301 (0.388 sec)
INFO:tensorflow:global_step/sec: 258.823
INFO:tensorflow:loss = 2.1228068e-09, step = 5401 (0.371 sec)
INFO:tensorflow:global_step/sec: 223.394
INFO:tensorflow:loss = 1.5789352e-09, step = 5501 (0.448 sec)
INFO:tensorflow:global_step/sec: 219.565
INFO:tensorflow:loss = 1.301308e-09, step = 5601 (0.455 sec)
INFO:tensorflow:global_step/sec: 204.645
INFO:tensorflow:loss = 1.0119683e-09, step = 5701 (0.501 sec)
INFO:tensorflow:global_step/sec: 191.976
INFO:tensorflow:loss = 7.569053e-10, step = 5801 (0.524 sec)
INFO:tensorflow:global_step/sec: 229.883
INFO:tensorflow:loss = 1.9050654e-09, step = 5901 (0.419 sec)
INFO:tensorflow:Saving checkpoints for 6000 into train_Wojtek7model.ckpt.
INFO:tensorflow:Loss for final step: 3.4729108e-10.
Out[31]:
<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x1e131438860>

Step 6. Model assessment

In [32]:
ev = model.evaluate(    
          input_fn=get_input_fn(df_testS,                          
          num_epochs=1,                          
          n_batch = 356,                          
          shuffle=False))
INFO:tensorflow:Starting evaluation at 2019-12-23-14:53:56
INFO:tensorflow:Restoring parameters from train_Wojtek7model.ckpt-6000
INFO:tensorflow:Finished evaluation at 2019-12-23-14:53:57
INFO:tensorflow:Saving dict for global step 6000: average_loss = 2.491106e-12, global_step = 6000, loss = 7.768099e-10

Calculation of R2

I make predictions on the test set.

In [33]:
y = model.predict(    
         input_fn=get_input_fn(df_testS,                          
         num_epochs=1,                          
         n_batch = 256,                          
         shuffle=False))
In [34]:
import itertools

predictions = list(p["predictions"] for p in itertools.islice(y, 3000))
INFO:tensorflow:Restoring parameters from train_Wojtek7model.ckpt-6000
In [35]:
import numpy as np

conc = np.vstack(predictions)
ZHP = pd.DataFrame(conc)
ZHP.rename(columns={0:'y_pred'}, inplace=True)

kot = ZHP['y_pred'].values
kot = kot.astype('float32')
kot.dtype
Out[35]:
dtype('float32')
In [36]:
y = df_testS['CO_GT'].values
y = y.astype('float32')
y.dtype
Out[36]:
dtype('float32')
In [37]:
PZU = pd.DataFrame({'y': y, 'y_pred': kot })
PZU.dtypes
Out[37]:
y         float32
y_pred    float32
dtype: object
In [38]:
def R_squared(y, y_pred):
    
  residual = tf.reduce_sum(tf.square(tf.subtract(y,y_pred)))
  total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
  r2 = tf.subtract(1.0, tf.div(residual, total))
  return r2
In [39]:
residual = tf.reduce_sum(tf.square(tf.subtract(y,kot)))
total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
r2 = tf.subtract(1.0, tf.div(residual, total))
r2
Out[39]:
<tf.Tensor 'Sub_5:0' shape=() dtype=float32>
In [40]:
sess = tf.Session()
a = sess.run(r2)
print('R Square parameter: ',a)
R Square parameter:  1.0
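
For context when reading this comparison: an exact least-squares fit with an intercept gives the same predictions, and so the same R2, whether or not the features are standardized. Standardization mainly changes how easily an iterative optimizer reaches that solution. The sketch below illustrates this with `np.linalg.lstsq` on synthetic data. It does not use the TensorFlow estimator, and all names and values in it are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300
# Hypothetical features on very different scales, plus a noisy linear target.
X = np.column_stack([rng.normal(1000.0, 200.0, n), rng.normal(0.5, 0.1, n)])
y = 0.003 * X[:, 0] + 4.0 * X[:, 1] + rng.normal(0.0, 0.3, n)

def fit_predict(features, target):
    """Ordinary least squares with an intercept, solved exactly."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

X_std = (X - X.mean(axis=0)) / X.std(axis=0)

r2_raw = r_squared(y, fit_predict(X, y))
r2_std = r_squared(y, fit_predict(X_std, y))
print(r2_raw, r2_std)
print(np.isclose(r2_raw, r2_std))
```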