EN231220191405
Theoretical explanation
Standardizing data for machine learning models means transforming the raw data so that each column has a mean of 0 and a standard deviation of 1. The column mean is subtracted from every value in the column, and the result is then divided by the column's standard deviation. The process is applied to each column separately.
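As a minimal sketch of the transformation described above, z = (x - mean) / std can be computed column by column with NumPy. The small array `x` is made up for illustration and is not taken from the air-quality file. NumPy's `std` uses the population formula by default (`ddof=0`), which is also the convention documented for `StandardScaler`.

```python
import numpy as np

# Hypothetical toy data: 4 rows, 2 columns (not from the air-quality file).
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Column-wise mean and population standard deviation (ddof=0).
col_mean = x.mean(axis=0)
col_std = x.std(axis=0)

# Standardize each column separately: subtract its mean, divide by its std.
z = (x - col_mean) / col_std

print(z.mean(axis=0))  # close to 0 for every column
print(z.std(axis=0))   # close to 1 for every column
```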
import tensorflow as tf
import pandas as pd
from sklearn import model_selection
import numpy as np
df = pd.read_csv('c:/TF/AirQ_filled2.csv', usecols=['CO(GT)','PT08.S1(CO)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH', 'AH'
,'Month','Weekday','Hours'])
df.head(3)
In the previous post I built a linear regression model that reached an R-squared (R2) of 90%:
http://sigmaquality.pl/python/linear-regression-3/
I rename the columns so that they can be used with the TensorFlow model.
df.columns = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
Standardization of data in Sklearn
df.head(2)
Matrix created from the table above.
a = np.array(df)
a
The average of the columns is:
np.mean(a, axis=0)
The standard deviation of the columns is:
np.std(a, axis=0)
We rescale the data so that every column has a mean of 0 and a standard deviation of 1. Note that this changes only the location and scale of each column, not the shape of its distribution.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the matrix and transform it in a single step
k = scaler.fit_transform(a)
k
The mean and standard deviation of the columns after standardization.
print("standard deviation: ",np.std(k, axis=0))
#k=k.astype(int)
print()
print("mean: ",np.mean(k, axis=0))
We prepare the standardized data for TensorFlow
conc = np.vstack(k)
conc
SKS = pd.DataFrame(conc)
SKS.columns = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
SKS.head(3)
Now we add the target value, which has not been standardized.
Now that we have standardized data, we can create a new Tensorflow linear regression model
Tensorflow linear regression model without standardization
We take the original data without standardization
Step 1. Divide the data into training and test sets
df_train=df.sample(frac=0.8,random_state=200)
df_test=df.drop(df_train.index)
print(df_train.shape, df_test.shape)
df_train.head(3)
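The split above relies on `sample` drawing 80% of the rows and `drop` removing exactly those rows by index. A small sketch of the same pattern on a made-up frame (the name `toy` and its values are hypothetical) shows that the two parts are disjoint and together cover all rows:

```python
import pandas as pd

# Hypothetical toy frame standing in for the air-quality data.
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# Draw 80% of the rows at random; the fixed seed makes the split repeatable.
toy_train = toy.sample(frac=0.8, random_state=200)

# Everything that was not sampled becomes the test set.
toy_test = toy.drop(toy_train.index)

print(toy_train.shape, toy_test.shape)  # (8, 2) (2, 2)
```

This pattern assumes the index is unique. With duplicated index labels, `drop` would remove every row that shares a sampled label.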
Step 2. Convert the data to TensorFlow format
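# CO_GT is the label, so it is left out of the feature columns
# (keeping it would let the model read the answer from its inputs)
COL = ['PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']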
features = [tf.feature_column.numeric_column(k) for k in COL]
features
Step 3. Tensorflow Linear Regression Estimator
Model directory: train_Wojtek
model = tf.estimator.LinearRegressor(model_dir="train_Wojtek", feature_columns=features)
Step 4. Define the input function that feeds the model and the target variable
FEATURES = ['CO_GT','PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
LABEL= 'CO_GT'
def get_input_fn(data_set, num_epochs=None, n_batch=128, shuffle=True):
    return tf.estimator.inputs.pandas_input_fn(
        x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),
        y=pd.Series(data_set[LABEL].values),
        batch_size=n_batch,
        num_epochs=num_epochs,
        shuffle=shuffle)
Step 5. Training the model
model.train(input_fn=get_input_fn(df_train,
num_epochs=None,
n_batch = 128,
shuffle=False),
steps=1000)
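With `num_epochs=None` the input function keeps supplying data until `steps` is reached, so the amount of training is determined by `steps` and `n_batch`. The rough calculation below estimates how many passes over the training set that implies. The row count is a hypothetical placeholder and the helper `approx_epochs` is not part of TensorFlow; substitute `df_train.shape[0]` for the real figure.

```python
def approx_epochs(steps, batch_size, n_rows):
    """Approximate number of passes over the data made during training."""
    return steps * batch_size / n_rows

# Hypothetical training-set size; replace with df_train.shape[0].
n_rows_train = 8000

print(approx_epochs(steps=1000, batch_size=128, n_rows=n_rows_train))  # 16.0
```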
Step 6. Model assessment
ev = model.evaluate(
input_fn=get_input_fn(df_test,
num_epochs=1,
n_batch = 356,
shuffle=False))
Calculation of R2
I make a prediction on a test set.
y = model.predict(
input_fn=get_input_fn(df_test,
num_epochs=1,
n_batch = 256,
shuffle=False))
import itertools
predictions = list(p["predictions"] for p in itertools.islice(y, 3000))
#print("Predictions: {}".format(str(predictions)))
I convert the result to a DataFrame
import numpy as np
conc = np.vstack(predictions)
ZHP = pd.DataFrame(conc)
ZHP.rename(columns={0:'y_pred'}, inplace=True)
kot = ZHP['y_pred'].values
kot = kot.astype('float32')
kot.dtype
I make the data types of the predicted (theoretical) and observed (empirical) values consistent
y = df_test['CO_GT'].values
y = y.astype('float32')
y.dtype
PZU = pd.DataFrame({'y': y, 'y_pred': kot })
PZU.dtypes
def R_squared(y, y_pred):
    # R2 = 1 - (residual sum of squares / total sum of squares)
    residual = tf.reduce_sum(tf.square(tf.subtract(y, y_pred)))
    total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
    r2 = tf.subtract(1.0, tf.div(residual, total))
    return r2
r2 = R_squared(y, kot)
r2
sess = tf.Session()
a = sess.run(r2)
print('R Square parameter: ',a)
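As a cross-check on the TensorFlow computation, the same R2 formula (1 minus the residual sum of squares divided by the total sum of squares) can be written in plain NumPy. The helper name `r_squared_np` and the sample values are assumptions made for this sketch.

```python
import numpy as np

def r_squared_np(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, computed with NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared_np(y_true, y_true))                     # 1.0 (perfect prediction)
print(r_squared_np(y_true, np.full(4, y_true.mean())))  # 0.0 (always predicting the mean)
```

Applied to the arrays from this post, `r_squared_np(y, kot)` should agree with the TensorFlow value up to floating-point precision.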
Tensorflow linear regression model with standardization
We now use the standardized data.
We standardize only the independent variables, because we want the result of the linear regression model to stay easy to read: the predictions remain in the original units of CO_GT.
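One way to see why leaving the target unscaled keeps the result readable: for a linear model, standardizing a feature only re-expresses the coefficients, while the prediction itself stays in the units of the target. The symbols below (b_0 for the intercept, b_j for the coefficients, mu_j and sigma_j for the mean and standard deviation of feature x_j) are notation introduced for this sketch.

```latex
\begin{aligned}
z_j &= \frac{x_j - \mu_j}{\sigma_j}
  \quad\Longrightarrow\quad x_j = \mu_j + \sigma_j z_j, \\
\hat{y} &= b_0 + \sum_j b_j x_j
  = \Bigl(b_0 + \sum_j b_j \mu_j\Bigr) + \sum_j \bigl(b_j \sigma_j\bigr)\, z_j .
\end{aligned}
```

A model fitted on the standardized features can therefore represent exactly the same predictions, with intercept b_0 + sum_j b_j mu_j and coefficients b_j sigma_j.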
Step 1. Divide the data into training and test sets
del SKS['CO_GT']
WKD = pd.concat([df['CO_GT'], SKS], axis=1, sort=False)
WKD.head(3)
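The step above keeps the original target next to the standardized features. A compact sketch of the same idea on a made-up frame follows; the column names `target`, `f1` and `f2` are hypothetical, with `target` playing the role of CO_GT.

```python
import pandas as pd

# Hypothetical toy frame: 'target' plays the role of CO_GT.
toy = pd.DataFrame({"target": [1.0, 2.0, 3.0, 4.0],
                    "f1": [10.0, 20.0, 30.0, 40.0],
                    "f2": [5.0, 5.5, 6.5, 7.0]})

feature_cols = [c for c in toy.columns if c != "target"]

# Standardize the feature columns only (ddof=0 matches np.std's default).
feats = toy[feature_cols]
scaled = (feats - feats.mean()) / feats.std(ddof=0)

# Put the untouched target back next to the scaled features.
# concat with axis=1 aligns rows on the index, so both parts must share it.
toy_scaled = pd.concat([toy["target"], scaled], axis=1)
print(toy_scaled)
```

Because `pd.concat(..., axis=1)` aligns on the index, the pattern works here only because both parts carry the same default integer index. If one part had been filtered or re-indexed, the rows could be mismatched or padded with NaN.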
df_trainS=WKD.sample(frac=0.8,random_state=200)
df_testS=WKD.drop(df_trainS.index)
print(df_trainS.shape, df_testS.shape)
df_trainS.head(3)
Step 2. Convert the data to TensorFlow format
# CO_GT is the label, so it is left out of the feature columns
FOL = ['PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
PCK = [tf.feature_column.numeric_column(k) for k in FOL]
PCK
Step 3. Tensorflow Linear Regression Estimator
Model directory: train_Wojtek7
model = tf.estimator.LinearRegressor(model_dir="train_Wojtek7", feature_columns=PCK)
Step 4. Define the input function that feeds the model and the target variable (the get_input_fn function and LABEL defined earlier are reused)
Step 5. Training the model
model.train(input_fn=get_input_fn(df_trainS,
num_epochs=None,
n_batch = 128,
shuffle=False),
steps=1000)
Step 6. Model assessment
ev = model.evaluate(
input_fn=get_input_fn(df_testS,
num_epochs=1,
n_batch = 356,
shuffle=False))
Calculation of R2
I make a prediction on a test set.
y = model.predict(
input_fn=get_input_fn(df_testS,
num_epochs=1,
n_batch = 256,
shuffle=False))
import itertools
predictions = list(p["predictions"] for p in itertools.islice(y, 3000))
import numpy as np
conc = np.vstack(predictions)
ZHP = pd.DataFrame(conc)
ZHP.rename(columns={0:'y_pred'}, inplace=True)
kot = ZHP['y_pred'].values
kot = kot.astype('float32')
kot.dtype
y = df_testS['CO_GT'].values
y = y.astype('float32')
y.dtype
PZU = pd.DataFrame({'y': y, 'y_pred': kot })
PZU.dtypes
def R_squared(y, y_pred):
    # R2 = 1 - (residual sum of squares / total sum of squares)
    residual = tf.reduce_sum(tf.square(tf.subtract(y, y_pred)))
    total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
    r2 = tf.subtract(1.0, tf.div(residual, total))
    return r2
r2 = R_squared(y, kot)
r2
sess = tf.Session()
a = sess.run(r2)
print('R Square parameter: ',a)
