We continue learning how to build multiple linear regression models. This time we will build the model with the TensorFlow library. As before, the data file AirQ_filled2.csv comes from previous episodes of this series.
import tensorflow as tf
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled2.csv', usecols=['CO(GT)','PT08.S1(CO)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH', 'AH'
,'Month','Weekday','Hours'])
df.head(3)
Step 1: Convert Data
We convert the numeric variables into the format TensorFlow expects. TensorFlow provides a conversion method for continuous variables: tf.feature_column.numeric_column().
First we rename the columns to drop the parentheses, then we split them into independent variables (features) and a dependent variable (label).
df.columns
df.columns = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
df.dtypes
FEATURES = ['PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
LABEL = 'CO_GT'
PKS = [tf.feature_column.numeric_column(k) for k in FEATURES]
PKS
Step 2: Defining the estimator
TensorFlow will automatically create a directory called “Air” in your working directory. You must use this path to access TensorBoard. The estimator is given the feature columns that describe the independent variables.
estimator = tf.estimator.LinearRegressor(
    feature_columns=PKS,
    model_dir="Air")
To tell TensorFlow how to feed the model, you can use pandas_input_fn. Its 5 main parameters are:
- x: feature data
- y: label data
- batch_size: batch size. Default 128
- num_epochs: number of epochs. Default 1
- shuffle: whether to shuffle the data. Default None
def get_input_fn(data_set, num_epochs=None, n_batch=128, shuffle=True):
    return tf.estimator.inputs.pandas_input_fn(
        x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),
        y=pd.Series(data_set[LABEL].values),
        batch_size=n_batch,
        num_epochs=num_epochs,
        shuffle=shuffle)
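To make the roles of batch_size, num_epochs and shuffle more concrete, here is a minimal NumPy sketch of how an input function could hand out batches. It is only an illustration and not how pandas_input_fn works internally; the name iterate_batches and the seed parameter are hypothetical.

```python
import numpy as np

def iterate_batches(x, y, batch_size=128, num_epochs=1, shuffle=False, seed=0):
    """Yield (x_batch, y_batch) pairs; num_epochs=None repeats forever."""
    rng = np.random.default_rng(seed)
    n = len(x)
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        # Visit the rows in random order when shuffle is on, else in file order.
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield x[idx], y[idx]
        epoch += 1

# Toy data: 10 rows, 2 features.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)
batch_sizes = [len(yb) for _, yb in iterate_batches(x, y, batch_size=4, num_epochs=2)]
print(batch_sizes)  # [4, 4, 2, 4, 4, 2]
```

With num_epochs=None the generator in this sketch never ends, so the caller has to decide when to stop.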
Step 3: Model training
- To feed the model you can use the function created above: get_input_fn.
- Then you instruct the model to iterate 1000 times.
- Note that the number of epochs (num_epochs) is left unset, that is, None.
- It is better to set the number of epochs to None and define the number of iterations (steps) instead.
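A quick back-of-the-envelope calculation shows how a fixed number of steps translates into passes over the data. The training-set size of 8000 rows below is an illustrative assumption, not the size of our data set, and the helper name epochs_covered is hypothetical.

```python
def epochs_covered(steps, batch_size, n_rows):
    """Return how many passes over n_rows are made in `steps` batches."""
    return steps * batch_size / n_rows

# 1000 steps with batches of 128 rows process 128,000 examples in total.
print(epochs_covered(steps=1000, batch_size=128, n_rows=8000))  # 16.0
```

To get the figure for the real data, pass len(df_train) as n_rows once the training set has been created.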
To test the model, we must divide the data set into a test set and a training set.
df_train=df.sample(frac=0.8,random_state=200)
df_test=df.drop(df_train.index)
print(df_train.shape, df_test.shape)
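The sample/drop pattern can be checked on a toy frame. This is a small sketch with made-up data (the column name "value" is hypothetical) that confirms the two parts are disjoint and together cover every row.

```python
import pandas as pd

# Toy frame standing in for df.
toy = pd.DataFrame({"value": range(10)})

toy_train = toy.sample(frac=0.8, random_state=200)
toy_test = toy.drop(toy_train.index)

# Every row lands in exactly one of the two sets.
print(len(toy_train), len(toy_test))  # 8 2
print(toy_train.index.intersection(toy_test.index).empty)  # True
```

Note that drop works on index labels, so this pattern assumes the index is unique. If the frame was built by concatenating files, calling reset_index(drop=True) first is a safe precaution.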
estimator.train(input_fn=get_input_fn(df_train,
                                      num_epochs=None,
                                      n_batch=128,
                                      shuffle=False),
                steps=1000)
Step 4: Model evaluation
To evaluate the model on the test set, use the following code:
ev = estimator.evaluate(
    input_fn=get_input_fn(df_test,
                          num_epochs=1,
                          n_batch=356,
                          shuffle=False))
Print the loss using the code below:
loss_score = ev["loss"]
print("Loss: {0:f}".format(loss_score))
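When reading this number, keep in mind that, as far as I know, the canned TensorFlow 1.x estimators report "loss" as the squared error summed within each batch and then averaged over batches, while the per-example mean is reported under "average_loss". Treat that as an assumption and check it against the keys of ev. The NumPy sketch below, with made-up numbers and hypothetical helper names, shows why the batch-summed figure depends on the batch size while the mean squared error does not.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Per-example mean of the squared errors."""
    sq_err = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    return float(sq_err.mean())

def batch_summed_loss(y_true, y_pred, batch_size):
    """Mean over batches of the per-batch sum of squared errors."""
    sq_err = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    sums = [sq_err[i:i + batch_size].sum() for i in range(0, len(sq_err), batch_size)]
    return float(np.mean(sums))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.5, 2.5, 3.5])

print(mean_squared_error(y_true, y_pred))    # 0.25
print(batch_summed_loss(y_true, y_pred, 2))  # 0.5
print(batch_summed_loss(y_true, y_pred, 4))  # 1.0
```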
Calculation of R Square parameter using Tensorflow
I make predictions on the test set.
y = estimator.predict(
    input_fn=get_input_fn(df_test,
                          num_epochs=1,
                          n_batch=256,
                          shuffle=False))
import itertools
predictions = list(p["predictions"] for p in itertools.islice(y, 1871))
#print("Predictions: {}".format(str(predictions)))
predictions
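The hard-coded 1871 is presumably the number of rows in the test set. The small sketch below uses a stand-in generator to show two ways of avoiding the hard-coded value. fake_predict is a hypothetical stand-in for estimator.predict, and the sketch assumes that, as with num_epochs=1, the generator ends after one pass over the data.

```python
import itertools

def fake_predict(n_rows):
    """Hypothetical stand-in for estimator.predict: yields one dict per row."""
    for i in range(n_rows):
        yield {"predictions": [float(i)]}

n_test = 5  # in the tutorial this would be len(df_test)

# Option 1: slice by the known number of test rows.
preds_a = [p["predictions"] for p in itertools.islice(fake_predict(n_test), n_test)]

# Option 2: the generator is finite, so simply exhaust it.
preds_b = [p["predictions"] for p in fake_predict(n_test)]

print(preds_a == preds_b, len(preds_a))  # True 5
```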
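The model returned its results as a generator y, which was collected into the list predictions. I now stack this list into a NumPy array and extract the predicted values as a float32 array.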
import numpy as np
conc = np.vstack(predictions)
conc
ZHP = pd.DataFrame(conc)
ZHP.rename(columns={0:'y_pred'}, inplace=True)
kot = ZHP['y_pred'].values
kot = kot.astype('float32')
kot.dtype
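Each prediction comes back as a one-element array, so stacking them gives a single column of shape (n, 1). A minimal NumPy sketch with made-up numbers shows the shapes involved, together with a shorter route to the same 1-D float32 array that skips the intermediate DataFrame.

```python
import numpy as np

# Made-up stand-ins for the per-row prediction arrays.
predictions = [np.array([1.5], dtype=np.float32),
               np.array([2.0], dtype=np.float32),
               np.array([0.5], dtype=np.float32)]

conc = np.vstack(predictions)          # shape (3, 1): one row per prediction
flat = conc.ravel().astype("float32")  # shape (3,): same values as a 1-D array

print(conc.shape, flat.shape, flat.dtype)  # (3, 1) (3,) float32
```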
Now I create an array of the real y values from the test set.
y = df_test['CO_GT'].values
y = y.astype('float32')
y.dtype
Now I create a dataframe with y-real and y-predicted variables.
PZU = pd.DataFrame({'y': y, 'y_pred': kot })
PZU.dtypes
def R_squared(y, y_pred):
    residual = tf.reduce_sum(tf.square(tf.subtract(y, y_pred)))
    total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
    r2 = tf.subtract(1.0, tf.div(residual, total))
    return r2
To use this function, both variables must have the same data type.
y.dtype
kot.dtype
residual = tf.reduce_sum(tf.square(tf.subtract(y,kot)))
total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
r2 = tf.subtract(1.0, tf.div(residual, total))
r2
sess = tf.Session()
a = sess.run(r2)
print('R Square parameter: ',a)
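As a cross-check that does not need a session, the same formula can be written in NumPy. This is a sketch with made-up numbers, and r_squared_np is a hypothetical helper name.

```python
import numpy as np

def r_squared_np(y, y_pred):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    y = np.asarray(y, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    residual = np.sum((y - y_pred) ** 2)
    total = np.sum((y - y.mean()) ** 2)
    return 1.0 - residual / total

y_true = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared_np(y_true, y_hat), 4))  # 0.98
```

On the real data, r_squared_np(y, kot) should agree with the TensorFlow result up to floating-point rounding.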
Calculation of R Square parameter using Pandas
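PZU.head(5)
Point 1. We calculate the squared difference between the empirical values y and the predicted values y_pred
PZU['SSE'] = (PZU['y'] - PZU['y_pred'])**2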
PZU.head(3)
Point 2. We calculate the average empirical value of y
PZU['ave_y'] = PZU['y'].mean()
PZU.head(3)
Point 3. We calculate the squared difference between the empirical values y and the average of the empirical values y
PZU['SST'] = (PZU['y'] - PZU['ave_y'])**2
PZU.head(3)
Point 4. We calculate the difference between the sum of SST and the sum of SSE
Sum_SST = PZU['SST'].sum()
print('Sum_SST :',Sum_SST)
Sum_SSE = PZU['SSE'].sum()
print('Sum_SSE :',Sum_SSE)
SSR = Sum_SST - Sum_SSE
Point 5. We calculate the R Square parameter
r2 = SSR/Sum_SST
print('R Square parameter: ',r2)
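Both calculations should give the same value because, with SSR defined as the difference between the sums of SST and SSE as in Point 4, the ratio SSR/SST is algebraically identical to the formula used in the TensorFlow version:

```latex
R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST},
\qquad
SSE = \sum_i (y_i - \hat{y}_i)^2,
\quad
SST = \sum_i (y_i - \bar{y})^2
```

Here y_i are the empirical values, \hat{y}_i the predicted values and \bar{y} the average of the empirical values.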